Machine Learning in Business
John C. Hull

Chapter 3
Supervised Learning: Linear Regression

Linear Regression
Linear regression is a very popular tool because, once you have made the assumption that the model is linear, you do not need a huge amount of data.
In machine learning we refer to the constant term as the bias and the coefficients as weights.

Linear Regression continued
Assume n observations and m features. The model is
$$Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m + \varepsilon$$
The standard approach is to choose a and the b_i to minimize the mean squared error (mse):
$$\text{mse} = \frac{1}{n} \sum_{j=1}^{n} \left( Y_j - a - b_1 X_{1,j} - b_2 X_{2,j} - \cdots - b_m X_{m,j} \right)^2$$
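As an illustration (not part of the original slides), the bias and weights that minimize the mse can be obtained directly with NumPy's least-squares solver; the data below are made up:

```python
import numpy as np

# Made-up data: n = 6 observations, m = 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, 17.1, 16.9])

# Prepend a column of ones so the bias a is estimated along with the weights
X1 = np.column_stack([np.ones(len(X)), X])

# Solve min ||X1 @ coeffs - y||^2, which is equivalent to minimizing the mse
coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
a, b = coeffs[0], coeffs[1:]
print(a, b, np.mean((y - X1 @ coeffs) ** 2))
```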

The minimization can be done analytically by inverting a matrix. When the number of features is very large, it may be more computationally efficient to use numerical (gradient descent) methods.

Gradient Descent (brief description; more details in Chapter 6)
The objective is to minimize a function by changing its parameters. The steps, illustrated in the sketch below, are as follows:
1. Choose starting values for the parameters.
2. Find the steepest slope: i.e., the direction in which the parameters have to be changed to reduce the objective function by the greatest amount.
3. Take a step down the valley in the direction of the steepest slope.
4. Repeat steps 2 and 3.
5. Continue until you reach the bottom of the valley.
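A minimal sketch of these steps applied to the mse objective (the data, learning rate, and step count are all invented for illustration):

```python
import numpy as np

def gradient_descent_mse(X, y, lr=0.01, steps=5000):
    """Minimize the mse over the bias a and weights b by gradient descent."""
    n, m = X.shape
    a, b = 0.0, np.zeros(m)             # step 1: starting values
    for _ in range(steps):              # steps 4 and 5: repeat until converged
        err = a + X @ b - y             # prediction errors
        grad_a = 2.0 * err.mean()       # step 2: slope of mse with respect to a
        grad_b = 2.0 * (X.T @ err) / n  # ... and with respect to b
        a -= lr * grad_a                # step 3: move down the steepest slope
        b -= lr * grad_b
    return a, b

# Made-up data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
y = np.array([4.0, 3.0, 6.0, 8.0, 9.0])
print(gradient_descent_mse(X, y))       # approximately (1.0, [2.0, 3.0])
```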

Categorical Features
Categorical features are features for which there are a number of non-numerical alternatives.
We can define a variable for each alternative. The variable equals 1 if the alternative is true and 0 otherwise.
But sometimes we do not have to do this because there is a natural ordering of the alternatives (both encodings are sketched in the code after the next slide), e.g.:
small = 1, medium = 2, large = 3
assist. prof = 1, assoc. prof = 2, full prof = 3

Regularization
Linear regression can over-fit, particularly when there is a large number of correlated features. The results for the validation set may then not be as good as those for the training set.
Regularization is a way of avoiding over-fitting and reducing the number of features. The alternatives are:
Ridge
Lasso
Elastic net
We must first normalize the feature values.
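A hedged sketch of the two encodings using pandas (the column and category names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "neighborhood": ["North", "East", "North", "West"],
})

# Natural ordering: map the categories directly to numbers
df["size_num"] = df["size"].map({"small": 1, "medium": 2, "large": 3})

# No natural ordering: create one 0/1 dummy variable per alternative
dummies = pd.get_dummies(df["neighborhood"], prefix="nbhd")
print(pd.concat([df, dummies], axis=1))
```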

Ridge Regression (analytic solution)
Reduce the magnitude of the regression coefficients by choosing a parameter λ and minimizing
$$\text{mse} + \lambda \sum_{i=1}^{m} b_i^2$$
What happens as λ increases?

Lasso Regression (must use gradient descent)
Similar to ridge regression except that we minimize
$$\text{mse} + \lambda \sum_{i=1}^{m} |b_i|$$
This has the effect of completely eliminating the less important features.
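As an aside (not from the slides), scikit-learn implements both penalties, but its alpha parameter is scaled differently from the λ above: Ridge penalizes the total squared error rather than the mse, and Lasso works with half the mse, so alpha must be rescaled to match. A sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))   # made-up normalized features
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.standard_normal(50)

lam, n = 0.1, len(y)
# Ridge minimizes ||y - Xb||^2 + alpha*sum(b^2) = n*mse + alpha*sum(b^2),
# so alpha = n*lam corresponds to the objective mse + lam*sum(b^2)
ridge = Ridge(alpha=n * lam).fit(X, y)
# Lasso minimizes mse/2 + alpha*sum(|b|), so alpha = lam/2
lasso = Lasso(alpha=lam / 2).fit(X, y)
print(ridge.coef_)   # every weight is shrunk toward zero
print(lasso.coef_)   # the less important weights are exactly zero
```

Elastic net, introduced next, combines both penalties (in scikit-learn, ElasticNet with an l1_ratio parameter).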

Elastic Net Regression (must use gradient descent)
A middle ground between ridge and lasso. Minimize
$$\text{mse} + \lambda_1 \sum_{i=1}^{m} b_i^2 + \lambda_2 \sum_{i=1}^{m} |b_i|$$

Baby Example (from Chapter 1)

Age (years)    Salary ($)
25             135,000
55             260,000
27             105,000
35             220,000
60             240,000
65             265,000
45             270,000
40             300,000
50             265,000
30             105,000

Baby Example continued
We apply regularization to the model
$$Y = a + b_1 X + b_2 X^2 + b_3 X^3 + b_4 X^4 + b_5 X^5$$
where Y is salary ($'000) and X is age. The five powers of X are treated as separate features, each normalized as shown below.

Z-score normalization

Observ.      X        X2       X3       X4       X5
 1        -1.290   -1.128   -0.988   -0.874   -0.782
 2         0.836    0.778    0.693    0.592    0.486
 3        -1.148   -1.046   -0.943   -0.850   -0.770
 4        -0.581   -0.652   -0.684   -0.688   -0.672
 5         1.191    1.235    1.247    1.230    1.191
 6         1.545    1.731    1.901    2.048    2.174
 7         0.128   -0.016   -0.146   -0.253   -0.333
 8        -0.227   -0.354   -0.449   -0.511   -0.544
 9         0.482    0.361    0.232    0.107   -0.004
10        -0.936   -0.910   -0.861   -0.803   -0.745
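This table can be reproduced in a few lines (a sketch; pandas computes the sample standard deviation, with divisor n - 1, which is what the values above reflect):

```python
import pandas as pd

ages = [25, 55, 27, 35, 60, 65, 45, 40, 50, 30]
df = pd.DataFrame({f"X{k}": [age ** k for age in ages] for k in range(1, 6)})

# Z-score normalization: subtract each column's mean, divide by its std dev
z = (df - df.mean()) / df.std()
print(z.round(3))
```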

Ridge Results (λ = 0.02 is similar to the quadratic model)

λ        a        b1         b2        b3         b4        b5
0      216.5   32,623   -135,403   215,493   -155,315    42,559
0.02   216.5     97.8       36.6      -8.5      -35.0     -44.6
0.10   216.5     56.5       28.1       3.7      -15.1     -28.4

Lasso Results (λ = 1 is similar to the quadratic model)

λ        a        b1         b2        b3         b4         b5
0      216.5   32,623   -135,403   215,493   -155,315     42,559
0.02   216.5    646.4   -2,046.6       0.0    3,351.0   -2,007.9
0.1    216.5    355.4        0.0    -494.8        0.0      196.5
1      216.5    147.4        0.0       0.0      -99.3        0.0

Elastic Net Results (λ1 = 0.02, λ2 = 1)
$$Y = 216.5 + 96.7 X_1 + 21.1 X_2 - 26.0 X_4 - 45.5 X_5$$
(X3 has been eliminated.)

Iowa House Price Case Study
The objective is to predict the prices of houses in Iowa from their features.
There are 1,800 observations in the training set, 600 in the validation set, and 508 in the test set.

Iowa House Price Results (no regularization)
Two categorical variables were included. There is a natural ordering for basement quality; 25 dummy variables were created for neighborhood.

Feature                                         Weight
Lot area (sq ft)                                0.07
Overall quality (scale from 1 to 10)            0.21
Overall condition (scale from 1 to 10)          0.10
Year built                                      0.16
Year remodeled                                  0.03
Basement finished sq ft                         0.09
Basement unfinished sq ft                       0.03
Total basement sq ft                            0.14
1st floor sq ft                                 0.15
2nd floor sq ft                                 0.13
Living area                                     0.12
Number of half bathrooms                        0.02
Number of bedrooms                              0.08
Total rooms above grade                         0.08
Number of fireplaces                            0.03
Parking spaces in garage                        0.04
Garage area (sq ft)                             0.05
Wood deck (sq ft)                               0.02
Open porch (sq ft)                              0.03
Enclosed porch (sq ft)                          0.01
Neighborhood (25 alternatives)                  0.05 to 0.16
Basement quality (natural ordering, 6 levels)   0.01

Ridge Results for validation set
[Figure: percentage of variance unexplained for the validation set (scale 11.2% to 13.0%) plotted against λ from 0 to 0.4.]

Lasso Results for validation set
[Figure: percentage of variance unexplained for the validation set (scale 0.10 to 0.14) plotted against λ from 0 to 0.1.]

Non-zero weights for Lasso when λ = 0.1 (overall quality and total living area were most important)

Feature                                   Weight
Lot area (square feet)                    0.04
Overall quality (scale from 1 to 10)      0.30
Year built                                0.05
Year remodeled                            0.06
Finished basement (square feet)           0.11
Total basement (square feet)              0.11
First floor (square feet)                 0.03
Living area (square feet)                 0.29
Number of fireplaces                      0.02
Parking spaces in garage                  0.03
Garage area (square feet)                 0.07
Neighborhoods (3 out of 25 non-zero)      0.01, 0.02, and 0.08
Basement quality                          0.02

Summary of Iowa House Price Results
With no regularization, correlation between features leads to some negative weights that we would expect to be positive.
The improvement from ridge is modest.
Lasso leads to a much bigger improvement in this case.
Elastic net is similar to lasso in this case.
For the test set, lasso with λ = 0.1 leaves 14.4% of the variance unexplained, so that 85.6% of the variance is explained.

Logistic Regression
The objective is to classify observations into a positive outcome and a negative outcome using data on features.
The probability of a positive outcome is assumed to be a sigmoid function:
$$Q = \frac{1}{1 + e^{-Y}}$$
where Y is related linearly to the values of the features:
$$Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m$$

Regularization can be used, just as for linear regression.

The Sigmoid Function
[Figure: Q plotted against Y for -10 ≤ Y ≤ 10; the curve rises from 0 to 1, passing through Q = 0.5 at Y = 0.]

Maximum Likelihood Estimation
We use the training set to maximize
$$\sum_{\text{positive outcomes}} \ln Q + \sum_{\text{negative outcomes}} \ln(1 - Q)$$
This cannot be maximized analytically, but we can use a gradient ascent algorithm (see the sketch after the next slide).

Lending Club Case Study
The data consist of loans that were made and whether they proved to be good or defaulted. (A restriction is that we do not have data for loans that were never made.)
We use only four features:
Home ownership (rent vs. own)
Income
Debt to income
Credit score
The training set has 8,695 observations (7,196 good loans and 1,499 defaulting loans). The test set has 5,916 observations (4,858 good loans and 1,058 defaulting loans).
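A minimal sketch of this maximum likelihood estimation by gradient ascent, on made-up data (in practice a library optimizer would be used):

```python
import numpy as np

def fit_logistic(X, labels, lr=0.1, steps=10000):
    """Maximize sum(ln Q) over positives plus sum(ln(1 - Q)) over negatives."""
    n, m = X.shape
    a, b = 0.0, np.zeros(m)
    for _ in range(steps):
        q = 1.0 / (1.0 + np.exp(-(a + X @ b)))   # sigmoid probabilities
        # The gradient of the log-likelihood involves the residuals labels - q
        a += lr * (labels - q).mean()            # gradient ASCENT: move uphill
        b += lr * X.T @ (labels - q) / n
    return a, b

# Made-up normalized features and 0/1 outcomes
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
labels = (X @ np.array([2.0, -1.0]) + rng.standard_normal(200) > 0).astype(float)
print(fit_logistic(X, labels))
```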

The Data

Home Ownership          Income     Debt to       Credit    1 = Good,
(1 = owns, 0 = rents)   ($'000)    Income (%)    score     0 = Default
1                       44.304     18.47         690       0
1                       136.000    20.63         670       1
0                       38.500     33.73         660       0
1                       88.000     5.32          660       1
...                     ...        ...           ...       ...

Results for Lending Club Training Set
X1 = Home ownership
X2 = Income
X3 = Debt to income ratio
X4 = Credit score
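As a sketch (not from the slides), a logistic regression can be fitted to data in this format with scikit-learn; the column names are invented, and only the four sample rows above are used, whereas real estimation would use the full training set:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The four sample rows shown above (the full training set has 8,695 rows)
data = pd.DataFrame({
    "home_ownership": [1, 1, 0, 1],
    "income": [44.304, 136.000, 38.500, 88.000],
    "debt_to_income": [18.47, 20.63, 33.73, 5.32],
    "credit_score": [690, 670, 660, 660],
    "good_loan": [0, 1, 0, 1],
})

X = data.drop(columns="good_loan")
X = (X - X.mean()) / X.std()         # normalize the features first
model = LogisticRegression().fit(X, data["good_loan"])
print(model.predict_proba(X)[:, 1])  # Q = estimated probability loan is good
```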

Decision Criterion
The data set is imbalanced, with more good loans than defaulting loans.
There are procedures for creating a balanced data set.
With a balanced data set we could classify an observation as positive if Q > 0.5 and negative otherwise.
However, this does not consider the cost of misclassifying a bad loan and the lost profit from misclassifying a good loan.
A better approach is to investigate different thresholds, Z:
If Q > Z we accept the loan.
If Q ≤ Z we reject the loan.

Test Set Results

Z = 0.75:
                                 Predict no default    Predict default
Outcome positive (no default)    77.59%                4.53%
Outcome negative (default)       16.26%                1.62%

Z = 0.80:
                                 Predict no default    Predict default
Outcome positive (no default)    55.34%                26.77%
Outcome negative (default)       9.75%                 8.13%

Z = 0.85:
                                 Predict no default    Predict default
Outcome positive (no default)    28.65%                53.47%
Outcome negative (default)       3.74%                 14.15%

Definitions
The confusion matrix:
                    Predict positive outcome    Predict negative outcome
Outcome positive    TP                          FN
Outcome negative    FP                          TN

Test Set Ratios for different Z values

                       Z = 0.75    Z = 0.80    Z = 0.85
Accuracy               79.21%      63.47%      42.80%
True Positive Rate     94.48%      67.39%      34.89%
True Negative Rate     9.07%       45.46%      79.11%
False Positive Rate    90.93%      54.54%      20.89%
Precision              82.67%      85.02%      88.47%
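Each ratio follows directly from the confusion-matrix entries; a sketch using the Z = 0.75 percentages above:

```python
# Confusion-matrix entries for Z = 0.75, as percentages of the test set
TP, FN = 77.59, 4.53   # outcome positive (no default)
FP, TN = 16.26, 1.62   # outcome negative (default)

accuracy = (TP + TN) / (TP + FN + FP + TN)   # 0.7921
true_positive_rate = TP / (TP + FN)          # 0.9448
true_negative_rate = TN / (TN + FP)          # 0.0907
false_positive_rate = FP / (FP + TN)         # 0.9093
precision = TP / (TP + FP)                   # 0.8267
print(accuracy, true_positive_rate, precision)
```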

As we change the criterion Z we get an ROC (receiver operating characteristic) curve
[Figure: the ROC curve, with the true positive rate plotted against the false positive rate, both running from 0 to 1.]

Area Under Curve (AUC)
The area under the ROC curve is a popular way of summarizing the ability of a model to predict a binary variable:
When AUC = 1 the model is perfect.
When AUC = 0.5 the model has no predictive ability.
When AUC = 0 the model is 100% wrong.

Choosing Z
The value of Z can be based on:
the expected profit from a loan that is good, P
the expected loss from a loan that defaults, L
We choose Z to maximize $P \cdot \text{TP} - L \cdot \text{FP}$ (see the sketch below).
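The following sketch (made-up scores and outcomes; P and L are hypothetical values) computes the AUC with scikit-learn and scans thresholds for the Z that maximizes the expected profit:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=500)       # made-up 0/1 outcomes
q = 0.3 * labels + 0.7 * rng.random(500)    # made-up scores in [0, 1]

print(roc_auc_score(labels, q))             # area under the ROC curve

# Choose Z to maximize P*TP - L*FP (TP and FP here are counts of loans)
P, L = 1.0, 4.0                             # hypothetical profit and loss
best_z = max(np.linspace(0.01, 0.99, 99),
             key=lambda z: P * np.sum((q > z) & (labels == 1))
                         - L * np.sum((q > z) & (labels == 0)))
print(best_z)
```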

A Simple Alternative to Regression: k-Nearest Neighbors
Normalize the data.
Measure the distance in feature space between the new data point and each data item for which there are labels (i.e., known outcomes). The distance between a point with feature values $x_i$ and a point with feature values $y_i$ is
$$\sqrt{\sum_i (x_i - y_i)^2}$$
Choose the k closest data items and average their labels.
For example, if you are forecasting car sales in a certain area from GDP growth and interest rates, and the three nearest neighbors give sales of 5.2, 5.4, and 5.6 million units, the forecast would be the average of these, or 5.4 million units. If you are forecasting whether a loan will default, and four of the five nearest neighbors defaulted while one was a good loan, you would forecast that the loan will default.
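A sketch of the car-sales forecast with scikit-learn (all feature and sales values are invented):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up normalized (GDP growth, interest rate) pairs and car sales (millions)
X = np.array([[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8],
              [1.1, -0.6], [0.0, 0.1], [-0.9, 1.2]])
sales = np.array([5.2, 5.6, 4.8, 6.1, 5.4, 4.3])

# Forecast = average label of the k = 3 nearest neighbors (Euclidean distance)
knn = KNeighborsRegressor(n_neighbors=3).fit(X, sales)
print(knn.predict([[0.1, 0.0]]))
```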
