Analytics - Quiz

A 10-question quiz drawing from various analytics areas.

1. In fitting a model to predict whether a person viewing an ecommerce web site will click on a particular link, a certain company drew the training data from web logs of the browsing records of prior visitors. Various variables were found to be useful in predicting the target, including a binary variable indicating whether or not the person made a purchase. How should that variable be handled:
a. It should be included as a predictor due to its likely predictive power.
b. It should be excluded since it is uncorrelated with the target variable.
c. It should be excluded since it will not be available in new data.
d. It should be included, but only in models that rely on binary input variables.

2. A dataset has 2000 records and 50 variables with 6% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. What is the approximate probability that a given record will be removed?
a. 50
b. 0.06
c. 0.03
d. 0.95

3. Two predictive models have been fit to data with a binary target variable, using the 100+ predictor variables that are available. One is a logistic regression model, the other is a neural net using the maximum number of variables, layers, nodes and cycles permitted by the software. Which of the following is/are true: (i) The logistic regression model is less likely to overfit the training data; (ii) The difference in accuracy between training and validation data is likely to be greater for the neural net than the logistic regression; (iii) A simpler neural net may perform worse on the training data, but better on the validation data
a. All of the above
b. i and ii only
c. ii only
d. i and iii only

4. A business wishes to segment its customers into a small number of groups so that it can effectively target marketing efforts at different customer types.
Which of the following correctly describes the process and the order of the steps:
a. This is a predictive modeling task: select variables, decide number of clusters, apply clustering method(s), normalize the data, describe the clusters.
b. This is an unsupervised learning task: select variables, normalize the data, apply clustering method(s), decide number of clusters, denormalize the data, describe the clusters.
c. This is a clustering task: select variables, apply clustering method(s), normalize the data, decide number of clusters, denormalize the data, describe the clusters.
d. This is an unsupervised learning task: decide number of clusters, apply clustering method(s), select variables, normalize the data, describe the clusters.

5. Souvenir sales at a beach resort in Queensland, Australia are shown in this figure as raw data and as transformed data. Choose the statement, below, that is most accurate:
a. The appropriate function to represent the trend relationship between demand (Y) and time (X) is a linear one, since the logarithmic transformation of the y-axis produces an approximate linear relationship with respect to trend, and annual seasonality effe
b. The appropriate model for these data is an annual seasonal one, since there are annual sales peaks and there is no trend involved.
c. The lower figure is a logarithmic transformation of the upper figure, and it shows that the appropriate model for the relationship between Y (demand) and X (time) is an exponential one, with seasonal effects.
d. The lower figure is an exponential transformation of the data in the upper figure, and its purpose is to account for the seasonal effects.

6. The attached figure is a set of association rules derived from transactional data on cosmetic purchases. Which of the following statements is most accurate:
a.
The underlying data are in the form of a count matrix (columns = products, rows = customers, cells = number of purchases over time), and Support (c) indicates the percentage of all customers who purchase the Consequent items.
b. The underlying data are in the form of a count matrix (columns = products, rows = customers, cells = number of purchases over time), and Support (c) indicates the percentage of all transactions in which the Consequent item is purchased.
c. The underlying data are a binary matrix (columns = products, rows = transactions, cells = 0/1 for purchase or no purchase), and the lift ratio, applied to transactions, = P(Consequent
d. The 4th rule is interpreted as follows: "If Brushes are purchased, the probability is 0.5636 that Bronzer + Concealer + Nail Polish will be purchased in a subsequent transaction."

7. Consider two different text mining tasks: (i) Mining the "contact us" submissions from a web site, to predict purchase/no-purchase, (ii) Mining internal email correspondence in a natural resource company to determine relevance to an environmental enforcement action. Think carefully about the process of preparing the text for predictive modeling, and the scenario involved. Which of the following is most true:
a. Normalizing all email addresses (and replacement with a single term denoting "emailtoken") would probably be appropriate in case (i) but not case (ii)
b. The email addresses in case (ii) will probably occur with very low and roughly equal frequency and not be meaningful for prediction.
c. Some email addresses in case (i) will probably occur with highly unequal frequency, and merit stemming
d. Developing concepts will be important in case (i) for the purpose of extracting meaning from individual documents.

8. In considering the use of logistic regression, neural networks, and classification & regression trees for prediction and classification, (choose the best answer)
a.
Logistic regression is best at capturing a linear relationship for predicting continuous data outcomes, while both neural nets and classification & regression trees excel at capturing interactions in the predictors.
b. Both neural nets and classification & regression trees produce "black box" models, while logistic regression requires more computation time.
c. Unlike neural nets, both logistic regression and classification & regression trees produce models that help explain the effect of predictor variables on the target.
d. Logistic regression is computationally efficient, while both neural nets and classification produce decision rules that are easily explained to non-statisticians.

9. A direct response advertising firm, in a test of a popup web offer presented to all visitors, gets a response rate of 1.5% with no predictive model applied. It develops a logistic regression model to estimate the probability that visitors will respond. In validating the model on a holdout sample, it gets a lift of 2 on the top decile. Which of the following is true?
a. The predictive model will effectively lift the response probability for the average visitor by 50%.
b. The predictive model will increase the response probability for the average visitor by 100%.
c. Those 10% of the visitors predicted as most likely to respond will respond at an average rate of 3%.
d. Those 2% of the visitors predicted as most likely to respond will respond at an average rate of 10%.
e. The popup offer will yield a full (100%) response rate if limited to the top 1.5%.

10. A political consultant wants to predict how individual voters will vote, and has data on whether the voter has voted in the past 10 years' worth of primary and general elections, data on 100+ demographic attributes of the neighborhood in which the voter lives, as well as purchased data on 200+ consumer spending variables.
Which of the following would NOT be useful in dealing with the issues of dimension reduction and feature selection:
a. Correlation analysis
b. Principal components analysis
c. Replacing some raw variables with derived variables
d. Using a neural net with fewer layers than normal
e. Variable elimination based on domain knowledge
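
For readers unfamiliar with the term, the sketch below illustrates one simple form of the correlation analysis named in question 10: scanning variables in order and setting aside any that are nearly collinear with one already kept. It is only a minimal illustration under stated assumptions. The function names (pearson, drop_correlated), the 0.9 threshold, and the synthetic columns are all hypothetical and do not come from the quiz.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already-kept column is below threshold.

    `columns` maps a variable name to its list of values; the 0.9 cutoff is an
    illustrative assumption, not a recommended setting.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Synthetic data (hypothetical): one variable is an exact linear copy of another.
random.seed(0)
income = [random.gauss(50, 10) for _ in range(200)]
data = {
    "income": income,
    "income_scaled": [2 * v + 1 for v in income],  # redundant with "income"
    "age": [random.gauss(40, 12) for _ in range(200)],
}

selected = drop_correlated(data)
print(selected)
```

The order in which columns are scanned decides which member of a correlated pair survives, so in practice the choice is usually guided by domain knowledge rather than by position in the dataset.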