The Institute for Operations Research and the Management Sciences (INFORMS) launched the Certified Analytics Professional (CAP®) program in 2013. A key component is an exam covering a variety of topics in predictive analytics, data mining, statistics, and operations research. A practice exam was published at the same time; questions from that practice exam (edited for clarity and correctness) are shown below in italics, together with the correct answers, brief commentary, and pointers to the courses at Statistics.com that cover the relevant topic.
- A multiple linear regression was built to try to predict customer expenditures based on 200 independent variables (behavioral and demographic). 10,000 rows of data were fed into a stepwise regression, each row representing one customer. 1,000 customers were male, and 9,000 customers were female. The final model had an adjusted R-squared of 0.27 and seven independent variables. Increasing the number of rows of data to 100,000 and rerunning the stepwise regression will most likely:
a) have no impact upon the adjusted R-squared. R-squared is a measure of the extent to which variation in the target variable is explained by the regression model. Basing the model on more information from the same source as the original information is unlikely to result in the model having more explanatory power. This topic is covered in Regression.
b) increase the impact of the male customers.
c) change the heteroskedasticity of the residuals in a favorable manner.
d) decrease the number of independent variables in the final model.
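As a quick illustration (not part of the exam), the point behind answer (a) can be checked with a small simulation: fit the same seven-predictor model on 10,000 and then 100,000 rows drawn from one fixed process, and compare adjusted R-squared. All numbers here are made up for the sketch.

```python
import numpy as np

def adjusted_r2(n_rows, k=7, seed=0):
    """Fit OLS with k predictors on n_rows drawn from one fixed process
    and return the adjusted R-squared."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, k))
    # Hypothetical signal: only part of the variance is explainable,
    # mirroring the question's modest adjusted R-squared
    y = X[:, :3].sum(axis=1) + rng.normal(scale=3.0, size=n_rows)
    Xd = np.column_stack([np.ones(n_rows), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r2 = 1 - ((y - Xd @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - k - 1)

small = adjusted_r2(10_000, seed=1)
big = adjusted_r2(100_000, seed=2)
print(round(small, 3), round(big, 3))  # both hover near the same value
```

More rows shrink the sampling error of the estimate, but they cannot add explanatory power that isn't in the data-generating process.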
- A clothing company wants to use analytics to decide which customers to send a promotional catalogue in order to attain a targeted response rate. Which of the following techniques would be the most appropriate to use for making this decision?
a) Integer programming
b) Logistic regression The key phrase here is "response rate." This rate is essentially a proportion derived from a binary target variable (respond/no-respond). Logistic regression models a target binary variable with predictor variables (categorical or continuous). None of the other techniques mentioned does this. This topic is covered in Predictive Analytics 1, and in greater depth in Logistic Regression.
c) Analysis of variance
d) Linear regression
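To make the logistic-regression answer concrete, here is a minimal sketch on made-up data (two hypothetical predictors, "past spend" and "catalogue opens"), fit by plain gradient descent so nothing beyond numpy is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
# Hypothetical customer predictors
X = rng.normal(size=(n, 2))
# True (assumed) model: logit of response probability
p = 1 / (1 + np.exp(-(X @ np.array([1.0, 0.5]) - 1.0)))
y = rng.random(n) < p              # binary respond / no-respond outcome

# Fit logistic regression by gradient descent on the log-loss
Xd = np.column_stack([np.ones(n), X])
beta = np.zeros(3)
for _ in range(2_000):
    pred = 1 / (1 + np.exp(-(Xd @ beta)))
    beta -= 0.1 * Xd.T @ (pred - y) / n

# Predicted response probabilities rank customers for the mailing
scores = 1 / (1 + np.exp(-(Xd @ beta)))
print(beta.round(2))
```

The fitted probabilities are exactly what you need to hit a targeted response rate: mail the customers whose predicted probability exceeds the cutoff.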
- Which of the following is an effective optimization method?
a) Analysis of variance (ANOVA)
b) Generalized linear regression model (GLM)
c) Box-Jenkins Method (ARIMA)
d) Mixed integer programming (MIP) All the other methods are statistical analysis methods. This topic is covered in Integer and Nonlinear Programming and Network Flow.
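For intuition on what a MIP does (choose the best *integer* decision subject to constraints), here is a toy product-mix problem small enough to solve by brute-force enumeration; a real MIP would use a solver, and all the numbers below are invented.

```python
from itertools import product

# Toy problem: integer quantities of chairs and tables,
# maximize profit (30 per chair, 50 per table)
# subject to a 20 machine-hour budget (2 h/chair, 5 h/table)
budget = 20
best = max(
    ((c, t) for c, t in product(range(11), range(5))
     if 2 * c + 5 * t <= budget),
    key=lambda q: 30 * q[0] + 50 * q[1],
)
print(best, 30 * best[0] + 50 * best[1])  # (10, 0) 300
```

Chairs earn more profit per machine-hour (15 vs 10), so the optimum spends the whole budget on chairs; an LP relaxation followed by rounding would not in general give the right integer answer, which is why MIP is its own technique.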
- A box and whisker plot for a dataset will MOST clearly show:
a) the difference between the second quartile and the median.
b) the 90% confidence interval around the mean.
c) where the [actual-predicted] error value is not zero.
d) if the data is skewed and, if so, in which direction. The skew will show up in the "whiskers" - if one is longer than the other, the data are skewed. This topic is covered in Statistics 1.
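The whisker logic is easy to verify numerically: compute the box and whiskers for a right-skewed sample and compare whisker lengths. This is just an illustrative sketch on simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=10_000)   # right-skewed sample

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
# Whisker ends as in a standard box plot: the most extreme points
# within 1.5 * IQR of the box
lo = data[data >= q1 - 1.5 * iqr].min()
hi = data[data <= q3 + 1.5 * iqr].max()

upper_whisker = hi - q3
lower_whisker = q1 - lo
print(upper_whisker > lower_whisker)  # True: longer upper whisker => right skew
```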
- Which of the following statements is true of modeling a multi-server checkout line?
a) A queuing model can be used to estimate service rates.
b) A queuing model can be used to estimate average arrivals.
c) Variability in arrival and service times will tend to play a critical role in congestion. Service rates and average arrivals are data inputs to a queuing model, not outputs. A Poisson distribution is typically used to model a queue, so it is relevant. Variability in arrival and service times will indeed make a big difference in congestion models (more variability requires larger volume to produce stable service times). This topic is covered in Risk Simulation and Queuing.
d) Poisson distributions are not relevant.
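To see how much variability matters, here is a minimal single-server queue simulation (a Lindley-recursion sketch with invented rates, not exam material): two service-time distributions with the *same mean*, one fixed and one exponential.

```python
import random

def avg_wait(service, n=100_000, seed=42):
    """Average customer wait in a single-server queue via the Lindley
    recursion; exponential inter-arrivals with mean 1.0."""
    rng = random.Random(seed)
    wait, total = 0.0, 0.0
    for _ in range(n):
        s = service(rng)                 # service time for this customer
        a = rng.expovariate(1.0)         # next inter-arrival time
        wait = max(0.0, wait + s - a)
        total += wait
    return total / n

fixed = avg_wait(lambda r: 0.8)                       # constant service, mean 0.8
noisy = avg_wait(lambda r: r.expovariate(1 / 0.8))    # exponential, same mean
print(round(fixed, 2), round(noisy, 2))
```

With identical utilization (80%), the variable-service queue waits roughly twice as long on average, which is exactly the "critical role in congestion" the answer refers to.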
- A company is considering designing a new automobile. Their options are a design based on current gasoline engine technology or a government proposed "Green" technology. You are a government official whose job is to encourage automakers to adopt the "Green" technology. You cannot provide funding for development costs, but you can provide a subsidy for every car sold. The development costs and the wholesale price, in thousands of dollars, of the cars are shown in the table below:
| (numbers in $ thousands) | Gasoline | "Green" |
| --- | --- | --- |
| Wholesale Price/vehicle | 25 | 40 |
| Variable Cost/vehicle | 15 | 35 |
| Fixed Cost | 100,000 | 200,000 |
How large a subsidy per vehicle sold will be required, assuming there will be enough demand to motivate the switch?
a) Greater than $5000 - Green vehicles cost $20,000 more to produce per car, and yield $15,000 more in revenue per car, for a net of -$5000 per car. However, there are also more fixed costs to be recouped, so the subsidy must be greater than $5000. No advanced statistical methods are required for this problem! However, Statistics 1, Statistics 2, & Statistics 3 give you good practice in calculations like this.
b) Less than $5000
c) Cannot be determined
d) Equal to $5000
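The arithmetic behind answer (a), written out:

```python
# Per-vehicle economics (in $ thousands), from the table above
gas_margin = 25 - 15     # gasoline contribution: $10k per vehicle
green_margin = 40 - 35   # green contribution: $5k per vehicle
gap = gas_margin - green_margin
print(gap)  # 5 -> green needs a $5k/vehicle subsidy just to match gasoline,
            # and its fixed cost is $100,000k higher, so the subsidy
            # must be strictly greater than $5,000
```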
- A furniture maker would like to determine the most profitable mix of items to produce. There are well-known budgetary constraints. Each piece of furniture is made of a predetermined amount of material with known costs, and demand is known. Which of the following analytical techniques is the most appropriate one to solve this problem?
a) Optimization - Regression, data mining and forecasting all contribute to estimation and prediction. However, all the variables are known here - they do not need to be predicted or estimated. This is an optimization problem. This topic is covered in Optimization - Linear Programming.
b) Multiple regression
c) Data mining
d) Forecasting
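Since a 2-variable linear program's optimum always sits at a corner of the feasible region, a tiny product-mix LP can be solved by enumerating vertices; a real problem would use a solver, and every number below is hypothetical.

```python
from itertools import combinations

# Constraints in the form a*c + b*t <= rhs (c = chairs, t = tables):
cons = [
    (4, 10, 400),   # wood budget: 4 board-ft/chair, 10/table, 400 total
    (1, 0, 60),     # chair demand cap
    (0, 1, 30),     # table demand cap
    (-1, 0, 0),     # c >= 0
    (0, -1, 0),     # t >= 0
]

def intersect(c1, c2):
    """Intersection of two constraint boundary lines (Cramer's rule)."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

feasible = [
    p for p in (intersect(c1, c2) for c1, c2 in combinations(cons, 2))
    if p and all(a * p[0] + b * p[1] <= r + 1e-9 for a, b, r in cons)
]
# Profit: $10/chair, $30/table
best = max(feasible, key=lambda p: 10 * p[0] + 30 * p[1])
print(best, 10 * best[0] + 30 * best[1])  # (25.0, 30.0) -> $1150
```

Because all costs, demands, and budgets are known constants, nothing needs to be predicted; the problem is purely one of optimization.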
- You have simulated the NPV of a decision. It ranges between - $10 million and +$10 million. To best present the likelihood of possible outcomes, you should:
a) present a single NPV estimate to avoid confusion.
b) present a histogram to show likelihood of various NPV ranges. A single NPV would mislead (it would conceal the variability). Arbitrarily removing outliers would also mislead (sometimes outliers are the most interesting and important cases). Relaxing constraints should only be done if the constraints merit relaxation or to perform what-if analysis, not to faithfully present the likelihood of possible outcomes. This topic is covered in Risk Simulation and Queuing.
c) trim all outliers to present the most balanced diagram.
d) relax constraints associated with extreme points in the simulation.
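A histogram of simulated NPVs takes only a couple of lines; the sketch below uses invented simulation output spanning roughly -$10M to +$10M and prints a quick text version rather than relying on a plotting library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical simulated NPVs in $ millions
npv = rng.normal(loc=1.0, scale=4.0, size=10_000).clip(-10, 10)

counts, edges = np.histogram(npv, bins=10, range=(-10, 10))
for c, lo in zip(counts, edges):      # zip stops at the 10 left edges
    print(f"{lo:6.0f}M: {'#' * (c // 100)}")
```

The shape of the bars, not any single summary number, is what communicates the likelihood of the various outcomes.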
- A company ships products from a single dock at their warehouse. The time to load shipments depends on the experience of the crew, products being shipped and weather. The company thinks there is significant unmet demand for their products and would like to build another dock in order to meet this demand. They ask you to build a model and determine if the revenue from the additional products sold will cover the cost of the second dock within two years of it becoming operational. Which of the following is the MOST appropriate modeling approach?
a) Optimization because it is a transportation problem.
b) Optimization because the company’s objective is to maximize profit and capacity at the dock is a limited resource.
c) Forecasting because you can determine the throughput at the dock, calculate the net revenue and compare this with the cost of the new dock.
d) Discrete event simulation because there is a sequence of discrete random events through time. The key here is the limited nature of your charge - to determine revenue from the new dock. Since the company believes demand exceeds current capacity, the question is really how fast goods can flow through the system, which depends on the discrete random events (crew experience, products shipped, weather). This topic is covered in Risk Simulation and Queuing.
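A crude discrete-event sketch makes the point: simulate trucks arriving at random and docks loading them with variable durations, then compare throughput with one dock versus two. All rates here are invented for illustration.

```python
import random

def shipments_loaded(n_docks, horizon=10_000.0, seed=7):
    """Count shipments a warehouse completes by `horizon` (hours),
    given random arrivals and variable loading times."""
    rng = random.Random(seed)
    free_at = [0.0] * n_docks          # time each dock next becomes free
    t, loaded = 0.0, 0
    while True:
        t += rng.expovariate(1 / 2.0)  # next truck arrives ~every 2 hours
        if t > horizon:
            return loaded
        i = min(range(n_docks), key=lambda d: free_at[d])
        start = max(t, free_at[i])     # truck may have to wait for the dock
        free_at[i] = start + rng.uniform(1.0, 5.0)  # loading time varies
        if free_at[i] <= horizon:
            loaded += 1

one, two = shipments_loaded(1), shipments_loaded(2)
print(one, two)  # the second dock raises throughput once the first saturates
```

Multiply the extra shipments by net revenue per shipment and compare against the dock's cost: that is the two-year payback question the model is meant to answer.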
- A project seeks to build a predictive data-mining model of customer profitability based upon a series of independent variables including customer transaction history, demographics, and externally purchased credit-scoring information. There are currently 100,000 unique customers available for use in building the predictive model. Which of the following strategies would reflect the BEST allocation of these 100,000 customer data points?
a) Use 70,000 randomly selected data points when building the model, and hold the remaining 30,000 out as a test dataset. Predictive data-mining models typically build a model on one subset of the data, and evaluate it on another "holdout" subset. This topic is covered in Predictive Analytics 1.
b) Use all 100,000 data points when building the model.
c) Build four separate models and randomly partition the data into 4 separate datasets with 25,000 data points each.
d) Use 1,000 randomly selected data points when building the model.
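The 70/30 random partition in answer (a) is a one-liner in practice; a numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
idx = rng.permutation(n)          # shuffle the 100,000 customer indices
train_idx = idx[:70_000]          # 70% used to fit the model
test_idx = idx[70_000:]           # 30% held out for honest evaluation
print(len(train_idx), len(test_idx))  # 70000 30000
```

The held-out 30,000 customers give an unbiased estimate of how the model will perform on customers it has never seen.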
- Conjoint analysis in market research applications can:
a) give its best estimates of customer preference structure based on in-depth interviews with a small number of carefully chosen subjects.
b) only trade off relative importance to customers of features with similar scales.
c) allow calculation of relative importance of varying features and attributes to customers. Conjoint analysis is not limited to small samples, does not require similar scales, and has no set limit on attributes & levels (though practical considerations of implementation may so dictate). This topic is covered in Choice Modeling.
d) only trade off among a limited number of attributes and levels.
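At its core, conjoint analysis estimates part-worths by regressing ratings on dummy-coded attributes. The sketch below invents a tiny two-attribute example (brand and price level) in which brand is, by construction, twice as important as price.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# Hypothetical product profiles: brand (A vs B), price (low vs high)
brand_b = rng.integers(0, 2, n)
price_hi = rng.integers(0, 2, n)
# Assumed true preferences: brand worth +2, high price worth -1
rating = 5 + 2.0 * brand_b - 1.0 * price_hi + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), brand_b, price_hi])
partworths, *_ = np.linalg.lstsq(X, rating, rcond=None)
print(partworths.round(1))  # approximately [5, 2, -1]
```

The fitted coefficients are the part-worths, and their relative magnitudes give the relative importance of each attribute, regardless of the attributes' original scales.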
- One of the main advantages of tree-based models and neural networks is that they:
a) are easy to interpret, use, and explain.
b) build models with higher R-squared than other regression techniques.
c) reveal interactions without having to explicitly build them into the model. Trees and neural nets pick up interactions automatically, but neural nets are not easy to explain, so (a) is wrong. No one technique consistently produces higher goodness-of-fit statistics; that is a function of the data and the model. The various data mining methods do not differ greatly in their susceptibility to missing data. This topic is covered in Predictive Analytics 1, Predictive Analytics 2, and Predictive Analytics 3.
d) can be modeled even when there is a significant amount of missing data.
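A pure-interaction example shows why. The target below is the XOR of two inputs (invented data): a linear regression sees no main effects at all, while the nested splits a depth-2 tree would learn recover the pattern exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
y = (x1 ^ x2).astype(float)       # pure interaction: XOR of the inputs

# Linear regression: near-zero R-squared, since neither main effect helps
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = 1 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# The split logic a depth-2 tree would learn: split on x1, then on x2
pred = np.where(x1 == 0, np.where(x2 == 0, 0, 1), np.where(x2 == 0, 1, 0))
accuracy = (pred == y).mean()
print(round(r2, 3), accuracy)  # tiny R-squared vs perfect tree accuracy
```

To capture this with regression you would have to add an x1*x2 interaction term by hand; the tree finds it on its own.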
- The monthly profit made by a clothing manufacturer is proportional to the monthly demand, up to a maximum demand of 1000 units, which corresponds to the plant producing at full capacity. (Any excess demand over 1000 units will be satisfied by some other manufacturer, and hence yield no additional profit.) The monthly demand is uncertain, but the average demand is reliably estimated at 1000 units. At this level of demand the monthly profit is $3,000,000. Which of the following statements must be true of the expected monthly profit, P?
a) P can have any positive value.
b) P is possibly greater than $3,000,000.
c) P is equal to $3,000,000.
d) P is less than $3,000,000. $3,000,000 is the monthly profit when the plant produces at its maximum capacity of 1000 units. Since average demand is exactly 1000 units, some months will see demand below 1000, reducing profit, while months with demand above 1000 yield no extra profit because of the capacity cap. The average is therefore pulled below $3,000,000. No advanced statistical methods are required for this problem! However, Statistics 1, Statistics 2, & Statistics 3 give you good practice in calculations like this.
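This asymmetry is easy to demonstrate by simulation; the demand distribution below is invented (Poisson with mean exactly 1000), since the question specifies only the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical demand with mean exactly 1000 units/month
demand = rng.poisson(1000, size=100_000)
profit_per_unit = 3_000_000 / 1000
# Profit is capped at capacity: demand above 1000 earns nothing extra
profit = profit_per_unit * np.minimum(demand, 1000)
print(round(profit.mean()))  # strictly below 3,000,000
```

This is Jensen's inequality at work: profit is a concave (capped) function of demand, so expected profit falls below the profit at expected demand.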
- After building a predictive model and testing it on new data, an under-prediction by a forecasting system can be detected by its:
c) mean absolute deviation.
d) mean squared error.
- All times in the decision tree below are given in hours. What is the expected travel time (in hours) of the optimal (minimum travel time) decision?
d) 7.0 The steps are (1) compute the expected travel time of each decision (fly or drive) by probability-weighting the times along its branches, (2) choose the decision with the lower expected time (drive), and (3) report the expected travel time for that choice. This topic is covered in Risk Simulation and Queuing.
- A segmentation of customers who shop at a retail store may be performed using which of the following methods?
a) Monte Carlo Markov Chain and ANOVA
b) Clustering, factor and control charts
c) Decision tree and recursive function analyses
d) Clustering and decision tree. Clustering, an unsupervised method, has long been used as a segmentation technique. The machine learning technique of decision trees, also known as classification and regression trees (CART), also produces rules that can be used to divide customers into segments. CART is supervised learning. These topics are covered in Predictive Analytics 1, Predictive Analytics 2, Predictive Analytics 3.
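A bare-bones k-means sketch shows the clustering half of the answer; the customer data (visits per month, average basket size) and segment structure below are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical customer segments: (visits/month, avg basket in $)
group_a = rng.normal([2, 30], [0.5, 5], size=(100, 2))
group_b = rng.normal([10, 80], [1.0, 8], size=(100, 2))
X = np.vstack([group_a, group_b])

# Plain k-means, k=2: assign points to the nearest centroid, re-average.
# Initialize from two data points (one happens to lie in each group).
centroids = X[[0, 150]].copy()
for _ in range(20):
    labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(centroids.round(0))  # one centroid per segment
```

The recovered centroids describe the segments ("light, small-basket" vs "frequent, big-basket" shoppers); a decision tree would instead produce explicit rules separating them, using a labeled outcome.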
- When analyzing responses of a survey of why people like a certain restaurant, factor analysis could reduce the dimension in which of the following ways?
a) Collapse several survey questions regarding food taste, health value, ingredients and consistency into one general unobserved "food quality" variable. Answer (b) describes clustering, not factor analysis. Answers (c) and (d) do not describe anything informative. Note that factor analysis is also used for "feature reduction" (variable reduction) in data mining. This topic is covered in Factor Analysis and Predictive Analytics 3.
b) Condense similar survey respondent answers into clusters of like-minded customers for market segment analysis.
c) Reduce the variability of individual subject ratings by centering each respondent’s ratings around his or her average rating.
d) Decrease variability by analyzing inter-rater reliability on the question items before offering the survey to a wide number of respondents.
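The "collapse into one latent variable" idea can be sketched with an eigen-decomposition of the item correlation matrix (the principal-component cousin of factor analysis). The survey data below is simulated: four items driven by one hidden "food quality" factor, plus one unrelated item.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
food_quality = rng.normal(size=n)       # unobserved latent factor
items = np.column_stack([
    food_quality + rng.normal(0, 0.5, n),   # taste
    food_quality + rng.normal(0, 0.5, n),   # health value
    food_quality + rng.normal(0, 0.5, n),   # ingredients
    food_quality + rng.normal(0, 0.5, n),   # consistency
    rng.normal(size=n),                     # parking (unrelated item)
])

corr = np.corrcoef(items, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
loadings = eigvecs[:, -1]                # leading factor's loadings
print(loadings.round(2))  # four food items load together; parking does not
```

The four food-related items load heavily on the leading factor while the unrelated item does not, which is exactly the dimension reduction described in answer (a).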
Note: The CAP® practice exam also includes a few "generalist" or "soft skills" questions. These are shown below just for your information.
- Which of the following best describes the data and information flow within an organization?
a) Information assurance
b) Information strategy
c) Information mapping
d) Information architecture
- In the kickoff meeting with a client for a new project, which of the following is the MOST important information to obtain?
a) Timeline and implementation plan
b) Analytical model to use
c) Business issue and project goal
d) Available budget
- An analytics professional is responsible for maintaining a simulation model that is used to determine the staffing levels required for a specific operational business process. Assuming that the operational team always uses the number of staff determined by the model, which of the following is the most important maintenance activity?
a) Ensure that all of the model input data items are available when needed.
b) Determine if there has been a change in model accuracy over time.
c) Ensure that all users are reviewing the model results in a timely fashion.
d) Determine that the model's reports are understood by the users.