Machine Learning in Business
John C. Hull
Chapter 3. Supervised Learning: Linear Regression
Copyright John C. Hull 2019

Linear Regression

Linear regression is a very popular tool because, once you have made the assumption that the model is linear, you do not need a huge amount of data.
In ML we refer to the constant term as the bias and the coefficients as weights.

Linear Regression continued

Assume n observations and m features. The model is

Y = a + b1 X1 + b2 X2 + … + bm Xm + e

The standard approach is to choose a and the bi to minimize the mean squared error (mse):

mse = (1/n) Σ_{j=1}^{n} (Yj − a − b1 X1,j − b2 X2,j − … − bm Xm,j)²
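For one feature, this minimization has a familiar closed form — b = cov(X, Y)/var(X) and a = mean(Y) − b·mean(X). A minimal stand-alone sketch (an illustration added here, not code from the book; the data are made up):

```python
# Minimal sketch: fit Y = a + b*X by minimizing the mean squared error.
# For one feature the mse-minimizing solution is:
#   b = cov(X, Y) / var(X),  a = mean(Y) - b * mean(X)

def fit_simple_regression(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var = sum((x - x_bar) ** 2 for x in xs) / n
    b = cov / var          # weight
    a = y_bar - b * x_bar  # bias (the constant term)
    return a, b

# Data generated from Y = 10 + 2X exactly, so the fit recovers a = 10, b = 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0 + 2.0 * x for x in xs]
a, b = fit_simple_regression(xs, ys)
```

With m features the same idea is written in matrix form and solved by inverting a matrix, as the next slide notes.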

This can be done analytically by inverting a matrix. When the number of features is very large, it may be more computationally efficient to use numerical (gradient descent) methods.

Gradient Descent (brief description; more details in Chapter 6)

The objective is to minimize a function by changing its parameters. The steps are as follows:
1. Choose starting values for the parameters
2. Find the steepest slope: i.e., the direction in which the parameters have to be changed to reduce the objective function by the greatest amount
3. Take a step down the valley in the direction of the steepest slope
4. Repeat steps 2 and 3
5. Continue until you reach the bottom of the valley
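The steps above can be sketched for the one-feature model Y = a + bX (a minimal stand-alone illustration; the data and learning rate are made up):

```python
# Minimal sketch of the gradient-descent steps above, applied to
# minimizing mse for Y = a + b*X (one feature).

def mse_gradient(a, b, xs, ys):
    # Partial derivatives of mse = (1/n) * sum((y - a - b*x)^2)
    n = len(xs)
    da = sum(-2.0 * (y - a - b * x) for x, y in zip(xs, ys)) / n
    db = sum(-2.0 * x * (y - a - b * x) for x, y in zip(xs, ys)) / n
    return da, db

def gradient_descent(xs, ys, learning_rate=0.02, steps=20000):
    a, b = 0.0, 0.0                             # 1. starting values
    for _ in range(steps):                      # 4./5. repeat until converged
        da, db = mse_gradient(a, b, xs, ys)     # 2. find the steepest slope
        a -= learning_rate * da                 # 3. step down the valley
        b -= learning_rate * db
    return a, b

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 + 4.0 * x for x in xs]  # data on the exact line Y = 3 + 4X
a, b = gradient_descent(xs, ys)
```

The step size (learning rate) must be small enough for the procedure to converge; Chapter 6 gives more details.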

Categorical Features

Categorical features are features where there are a number of non-numerical alternatives.
We can define a variable for each alternative. The variable equals 1 if the alternative is true and zero otherwise.
But sometimes we do not have to do this because there is a natural ordering of the alternatives, e.g.:
- small = 1, medium = 2, large = 3
- assist. prof = 1, assoc. prof = 2, full prof = 3

Regularization

Linear regression can over-fit, particularly when there are a large number of correlated features. Results for the validation set may then not be as good as for the training set.
Regularization is a way of avoiding over-fitting and reducing the number of features. Alternatives:
- Ridge
- Lasso
- Elastic net
We must first normalize feature values.
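The dummy-variable construction for categorical features described above can be sketched as follows (a minimal illustration added here; the example categories are made up):

```python
# Minimal sketch: turn a categorical feature into 0/1 dummy variables,
# one per alternative, as described under Categorical Features.

def make_dummies(values):
    # One column per distinct alternative, sorted for determinism.
    alternatives = sorted(set(values))
    return {alt: [1 if v == alt else 0 for v in values]
            for alt in alternatives}

colors = ["red", "green", "red", "blue"]
dummies = make_dummies(colors)
```

For ordered categories such as small/medium/large, a single numeric code (1, 2, 3) can be used instead, as the slide notes.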

Ridge Regression (analytic solution)

Reduce the magnitude of the regression coefficients by choosing a parameter λ and minimizing

mse + λ Σ_{i=1}^{m} bi²

What happens as λ increases?

Lasso Regression (must use gradient descent)

Similar to ridge regression except that we minimize

mse + λ Σ_{i=1}^{m} |bi|

This has the effect of completely eliminating the less important factors.

Elastic Net Regression (must use gradient descent)
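For a single feature, the ridge objective has a simple analytic solution that answers the question above: b = cov(X, Y)/(var(X) + λ), so the weight shrinks toward zero as λ increases. A minimal sketch of that solution (my own illustration of the objective above, not the book's code):

```python
# Minimal sketch: ridge regression with one feature, minimizing
#   mse + lambda * b^2
# Setting the derivatives to zero gives
#   b = cov(X, Y) / (var(X) + lambda),  a = mean(Y) - b * mean(X),
# so increasing lambda shrinks the weight b toward zero.

def fit_ridge(xs, ys, lam):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var = sum((x - x_bar) ** 2 for x in xs) / n
    b = cov / (var + lam)
    a = y_bar - b * x_bar
    return a, b

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-6.0, -3.0, 0.0, 3.0, 6.0]     # exactly Y = 3X
_, b0 = fit_ridge(xs, ys, lam=0.0)   # ordinary regression: b = 3
_, b1 = fit_ridge(xs, ys, lam=1.0)   # shrunk weight
```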

Middle ground between Ridge and Lasso. Minimize

mse + λ1 Σ_{i=1}^{m} bi² + λ2 Σ_{i=1}^{m} |bi|

Baby Example (from Chapter 1)

| Age (years) | Salary ($) |
|---|---|
| 25 | 135,000 |
| 55 | 260,000 |
| 27 | 105,000 |
| 35 | 220,000 |
| 60 | 240,000 |
| 65 | 265,000 |
| 45 | 270,000 |
| 40 | 300,000 |
| 50 | 265,000 |
| 30 | 105,000 |
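The z-score normalization applied to this example's features can be sketched as follows; each value is replaced by (value − mean)/(standard deviation). Using the sample standard deviation reproduces the entries of the z-score normalization slide (about −1.290 for age 25):

```python
import statistics

# Minimal sketch: z-score normalization of the Age feature above.
# Each value is replaced by (value - mean) / standard deviation.

ages = [25, 55, 27, 35, 60, 65, 45, 40, 50, 30]

mean = statistics.mean(ages)   # 43.2
sd = statistics.stdev(ages)    # sample standard deviation
z_scores = [(age - mean) / sd for age in ages]
```

The same transformation is applied separately to each feature (here X, X², …, X⁵) before regularization.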

Baby Example continued

We apply regularization to the polynomial model

Y = a + b1 X + b2 X² + b3 X³ + b4 X⁴ + b5 X⁵

where Y is salary ($'000) and X is age, so that the powers of X are treated as five features.

Z-score normalization

| Observ. | X | X² | X³ | X⁴ | X⁵ |
|---|---|---|---|---|---|
| 1 | −1.290 | −1.128 | −0.988 | −0.874 | −0.782 |
| 2 | 0.836 | 0.778 | 0.693 | 0.592 | 0.486 |
| 3 | −1.148 | −1.046 | −0.943 | −0.850 | −0.770 |
| 4 | −0.581 | −0.652 | −0.684 | −0.688 | −0.672 |
| 5 | 1.191 | 1.235 | 1.247 | 1.230 | 1.191 |
| 6 | 1.545 | 1.731 | 1.901 | 2.048 | 2.174 |
| 7 | 0.128 | −0.016 | −0.146 | −0.253 | −0.333 |
| 8 | −0.227 | −0.354 | −0.449 | −0.511 | −0.544 |
| 9 | 0.482 | 0.361 | 0.232 | 0.107 | −0.004 |
| 10 | −0.936 | −0.910 | −0.861 | −0.803 | −0.745 |

Ridge Results (λ = 0.02 is similar to the quadratic model)

| λ | a | b1 | b2 | b3 | b4 | b5 |
|---|---|---|---|---|---|---|
| 0 | 216.5 | 32,623 | 135,403 | 215,493 | 155,315 | 42,559 |
| 0.02 | 216.5 | 97.8 | 36.6 | 8.5 | 35.0 | 44.6 |
| 0.10 | 216.5 | 56.5 | 28.1 | 3.7 | 15.1 | 28.4 |

Lasso Results (λ = 1 is similar to the quadratic model)

| λ | a | b1 | b2 | b3 | b4 | b5 |
|---|---|---|---|---|---|---|
| 0 | 216.5 | 32,623 | 135,403 | 215,493 | 155,315 | 42,559 |
| 0.02 | 216.5 | 646.4 | 2,046.6 | 0.0 | 3,351.0 | 2,007.9 |
| 0.1 | 216.5 | 355.4 | 0.0 | 494.8 | 0.0 | 196.5 |
| 1 | 216.5 | 147.4 | 0.0 | 0.0 | 99.3 | 0.0 |

Elastic Net Results: λ1 = 0.02, λ2 = 1

Y = 216.5 + 96.7X + 21.1X² − 26.0X⁴ − 45.5X⁵

(the X³ term has been eliminated)

Iowa House Price Case Study

The objective is to predict the prices of houses in Iowa from their features.
There are 800 observations in the training set, 600 in the validation set, and 508 in the test set.

Iowa House Price Results (no regularization)

Two categorical variables were included. There is a natural ordering for basement quality; 25 dummy variables were created for neighborhood.

| Feature | Weight |
|---|---|
| Lot area (squ ft) | 0.07 |
| Overall quality (scale from 1 to 10) | 0.21 |
| Overall condition (scale from 1 to 10) | 0.10 |
| Year built | 0.16 |
| Year remodeled | 0.03 |
| Basement finished squ ft | 0.09 |
| Basement unfinished squ ft | 0.03 |
| Total basement squ ft | 0.14 |
| 1st floor squ ft | 0.15 |
| 2nd floor squ ft | 0.13 |
| Living area | 0.12 |
| Number of half bathrooms | 0.02 |
| Number of bedrooms | 0.08 |
| Total rooms above grade | 0.08 |
| Number of fireplaces | 0.03 |
| Parking spaces in garage | 0.04 |
| Garage area (squ ft) | 0.05 |
| Wood deck (squ ft) | 0.02 |
| Open porch (squ ft) | 0.03 |
| Enclosed porch (squ ft) | 0.01 |
| Neighborhood (25 alternatives) | 0.05 to 0.16 |
| Basement quality (6 levels, natural ordering) | 0.01 |

Ridge Results for the validation set

[Figure: % variance unexplained plotted against λ for λ from 0 to 0.4; the percentage unexplained ranges between about 11.2% and 13.0%.]

Lasso Results for the validation set

[Figure: % variance unexplained plotted against λ for λ from 0 to 0.1; the percentage unexplained ranges between about 10% and 14%.]

Non-zero weights for Lasso when λ = 0.1 (overall quality and total living area were most important)

| Feature | Weight |
|---|---|
| Lot area (square feet) | 0.04 |
| Overall quality (scale from 1 to 10) | 0.30 |
| Year built | 0.05 |
| Year remodeled | 0.06 |
| Finished basement (square feet) | 0.11 |
| Total basement (square feet) | 0.11 |
| First floor (square feet) | 0.03 |
| Living area (square feet) | 0.29 |
| Number of fireplaces | 0.02 |
| Parking spaces in garage | 0.03 |
| Garage area (square feet) | 0.07 |
| Neighborhoods (3 out of 25 non-zero) | 0.01, 0.02, and 0.08 |
| Basement quality | 0.02 |

Summary of Iowa House Price Results

- With no regularization, correlation between features leads to some negative weights that we would expect to be positive
- The improvement from Ridge is modest
- Lasso leads to a much bigger improvement in this case
- Elastic net gives results similar to Lasso in this case
- The mean squared error for the test set for Lasso with λ = 0.1 is 14.4% of the variance, so that 85.6% of the variance is explained

Logistic Regression

The objective is to classify observations into a positive outcome and a negative outcome using data on features.
The probability of a positive outcome is assumed to be a sigmoid function:

Q = 1 / (1 + e^(−Y))

where Y is related linearly to the values of the features:

Y = a + b1 X1 + b2 X2 + … + bm Xm
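The sigmoid mapping from Y to a probability Q can be sketched as follows (a minimal illustration added here; the helper names are my own):

```python
import math

# Minimal sketch: the sigmoid function Q = 1 / (1 + e^(-Y)),
# where Y = a + b1*X1 + ... + bm*Xm is linear in the features.

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def probability_positive(bias, weights, features):
    y = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(y)

q_mid = sigmoid(0.0)     # 0.5: Y = 0 maps to a 50% probability
q_high = sigmoid(10.0)   # close to 1
q_low = sigmoid(-10.0)   # close to 0
```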

Regularization can also be used with logistic regression.

The Sigmoid Function

[Figure: the sigmoid function Q plotted against Y for Y from −10 to 10; Q rises from 0 toward 1, passing through 0.5 at Y = 0.]

Maximum Likelihood Estimation

We use the training set to maximize

Σ_{positive outcomes} ln(Q) + Σ_{negative outcomes} ln(1 − Q)

This cannot be maximized analytically, but we can use a gradient ascent algorithm.

Lending Club Case Study

The data consist of loans that were made and whether they proved to be good or defaulted. (A restriction is that we do not have data for loans that were never made.)
We use only four features:
- Home ownership (rent vs. own)
- Income
- Debt to income
- Credit score
The training set has 8,695 observations (7,196 good loans and 1,499 defaulting loans). The test set has 5,916 observations (4,858 good loans and 1,058 defaulting loans).
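Maximum likelihood estimation by gradient ascent can be sketched on a tiny made-up data set (my own illustration of the objective above, not the Lending Club fit):

```python
import math

# Minimal sketch: logistic regression fitted by gradient ascent on the
# log-likelihood  sum(ln Q for positives) + sum(ln(1 - Q) for negatives),
# with one feature and a tiny made-up data set.

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def log_likelihood(a, b, xs, labels):
    total = 0.0
    for x, label in zip(xs, labels):
        q = sigmoid(a + b * x)
        total += math.log(q) if label == 1 else math.log(1.0 - q)
    return total

def fit_logistic(xs, labels, learning_rate=0.1, steps=2000):
    a, b = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood is sum(label - q) for a
        # and sum((label - q) * x) for b; step uphill (ascent).
        da = sum(lbl - sigmoid(a + b * x) for x, lbl in zip(xs, labels))
        db = sum((lbl - sigmoid(a + b * x)) * x for x, lbl in zip(xs, labels))
        a += learning_rate * da
        b += learning_rate * db
    return a, b

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]      # positives have larger x
before = log_likelihood(0.0, 0.0, xs, labels)
a, b = fit_logistic(xs, labels)
after = log_likelihood(a, b, xs, labels)
```

Each ascent step increases the log-likelihood, so `after` is well above `before`, and the fitted model assigns high Q to the positive observations.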

The Data

| Home ownership (1 = owns, 0 = rents) | Income ($'000) | Debt to income (%) | Credit score | 1 = Good, 0 = Default |
|---|---|---|---|---|
| 1 | 44.304 | 18.47 | 690 | 0 |
| 1 | 136.000 | 20.63 | 670 | 1 |
| 0 | 38.500 | 33.73 | 660 | 0 |
| 1 | 88.000 | 5.32 | 660 | 1 |
| … | … | … | … | … |

Results for Lending Club Training Set

X1 = Home ownership
X2 = Income
X3 = Debt to income ratio
X4 = Credit score

Decision Criterion

- The data set is imbalanced, with more good loans than defaulting loans
- There are procedures for creating a balanced data set
- With a balanced data set we could classify an observation as positive if Q > 0.5 and negative otherwise
- However, this does not consider the cost of misclassifying a bad loan and the lost profit from misclassifying a good loan
- A better approach is to investigate different thresholds, Z:
  - If Q > Z we accept the loan
  - If Q ≤ Z we reject the loan

Test Set Results

Z = 0.75:

| | Predict no default | Predict default |
|---|---|---|
| Outcome positive (no default) | 77.59% | 4.53% |
| Outcome negative (default) | 16.26% | 1.62% |

Z = 0.80:

| | Predict no default | Predict default |
|---|---|---|
| Outcome positive (no default) | 55.34% | 26.77% |
| Outcome negative (default) | 9.75% | 8.13% |

Z = 0.85:

| | Predict no default | Predict default |
|---|---|---|
| Outcome positive (no default) | 28.65% | 53.47% |
| Outcome negative (default) | 3.74% | 14.15% |

Definitions

The confusion matrix:

| | Predict positive outcome | Predict negative outcome |
|---|---|---|
| Outcome positive | TP | FN |
| Outcome negative | FP | TN |

Test Set Ratios for different Z values

| | Z = 0.75 | Z = 0.80 | Z = 0.85 |
|---|---|---|---|
| Accuracy | 79.21% | 63.47% | 42.80% |
| True Positive Rate | 94.48% | 67.39% | 34.89% |
| True Negative Rate | 9.07% | 45.46% | 79.11% |
| False Positive Rate | 90.93% | 54.54% | 20.89% |
| Precision | 82.67% | 85.02% | 88.47% |

As we change the Z criterion we get a receiver operating characteristic (ROC) curve

[Figure: ROC curve, plotting the true positive rate against the false positive rate, both from 0 to 1.]

Area Under Curve (AUC)

The area under the ROC curve is a popular way of summarizing the predictive ability of a model for a binary variable:
- When AUC = 1 the model is perfect
- When AUC = 0.5 the model has no predictive ability
- When AUC = 0 the model is 100% wrong

Choosing Z

The value of Z can be based on:
- The expected profit from a loan that is good, P
- The expected loss from a loan that defaults, L
We need to maximize P × TP − L × FP.
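The ratios and the threshold choice can be illustrated with the confusion-matrix percentages from the test-set results (the profit P and loss L values below are made up for illustration):

```python
# Minimal sketch: ratios from the confusion matrix, using the Z = 0.75
# test-set percentages (TP, FN, FP, TN as % of all observations).

tp, fn, fp, tn = 77.59, 4.53, 16.26, 1.62

accuracy = (tp + tn) / (tp + fn + fp + tn)
true_positive_rate = tp / (tp + fn)
precision = tp / (tp + fp)

# Choosing Z: with expected profit P per good loan accepted and expected
# loss L per defaulting loan accepted, each threshold is scored by
# P*TP - L*FP.  P and L here are made-up illustrative values.
P, L = 1.0, 3.0
matrices = {
    0.75: (77.59, 4.53, 16.26, 1.62),
    0.80: (55.34, 26.77, 9.75, 8.13),
    0.85: (28.65, 53.47, 3.74, 14.15),
}
profits = {z: P * m[0] - L * m[2] for z, m in matrices.items()}
best_z = max(profits, key=profits.get)
```

The computed accuracy, true positive rate, and precision reproduce the Z = 0.75 column of the ratios table.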

A Simple Alternative to Regression: k-nearest neighbors

- Normalize the data
- Measure the distance in n-dimensional space of the new data from the data for which there are labels (i.e., known outcomes). The distance of a point with feature values xi from a point with feature values yi is sqrt(Σi (xi − yi)²)
- Choose the k closest data items and average their labels

For example, if you are forecasting car sales in a certain area and the three nearest neighbors for GDP growth and interest rates give sales of 5.2, 5.4, and 5.6 million units, the forecast would be the average of these, or 5.4 million units. If you are forecasting whether a loan will default and, of the five nearest neighbors, four defaulted and one was a good loan, you would forecast that the loan will default.
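The procedure above, including the car-sales example, can be sketched as follows (the feature points are made up; the three nearest neighbors have sales of 5.2, 5.4, and 5.6 million units):

```python
import math

# Minimal sketch of k-nearest-neighbors prediction as described above:
# find the k closest labeled points (features already normalized) and
# average their labels.

def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_points, train_labels, new_point, k):
    # Sort labeled points by distance to the new point and average the
    # labels of the k closest (for classification this is the fraction
    # of positive neighbors).
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pl: euclidean_distance(pl[0], new_point))
    nearest_labels = [label for _, label in ranked[:k]]
    return sum(nearest_labels) / k

# Made-up (GDP growth, interest rate) points; the three nearest to the
# query have sales 5.2, 5.4 and 5.6, so the forecast is their average.
points = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (2.0, 2.0), (3.0, 1.0)]
sales = [5.2, 5.4, 5.6, 9.0, 1.0]
forecast = knn_predict(points, sales, (0.1, 0.1), k=3)
```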