INTRODUCTION TO MACHINE LEARNING LEARNING Agent has made observations (data) Now must make sense of it (hypotheses) Hypotheses alone may be important (e.g., in basic science) For inference (e.g., forecasting) To take sensible actions (decision making) A basic component of economics, social and hard sciences, engineering, LAST TIME Learning using statistical models relating unknown hypothesis to observed data 3 techniques

Maximum likelihood Maximum a posteriori Bayesian inference A REVIEW HW4 asks you to learn parameters of a Bayes Net (Nave Bayes) Suppose we wish to estimate P(C|A,B) The unknown parameters in the conditional probability distribution are coin flips P(C|AB) A,B q1 A,B

q2 A,B q3 A,B q4 A REVIEW Example data ML estimates P(C|AB) A,B q1=3/8 A,B

q2=1/11 A,B q3=4/11 A,B q4=6/13 A B C # obs 1 1 1 3

1 1 0 5 1 0 1 1 1 0 0

10 0 1 1 4 0 1 0 7 0 0 1

6 0 0 0 7 MAXIMUM A POSTERIORI Example data MAP Estimates assuming virtual counts a=b=2 virtual counts 2H,2T A,B A,B A,B A,B

P(C|AB) q1=(3+2)/(8+4) q2=(1+2)/ (11+4) q3=(4+2)/ (11+4) q4=(6+2)/ (13+4) A B C # obs 1 1 1

3 1 1 0 5 1 0 1 1 1 0 0

10 0 1 1 4 0 1 0 7 0 0

1 6 0 0 0 7 TOPICS IN MACHINE LEARNING Tasks & settings Classification Ranking Clustering Regression Decision-making Supervised Unsupervised Semi-supervised

Active Reinforcement learning Techniques Bayesian learning Decision trees Neural networks Support vector machines Boosting Case-based reasoning Dimensionality reduction Applications Document retrieval Document classification Data mining Computer vision Scientific discovery Robotics WHAT IS LEARNING?

Mostly generalization from experience: Our experience of the world is specific, yet we are able to formulate general theories that account for the past and predict the future M.R. Genesereth and N.J. Nilsson, in Logical Foundations of AI, 1987 Concepts, heuristics, policies Supervised vs. un-supervised learning SUPERVISED LEARNING Agent is given a training set of input/output pairs (x,y), with y=f(x) Task: build a model that will allow it to predict f(x) for a new x UNSUPERVISED LEARNING

Agent is given a training set of data points x Task: learn patterns in the data (e.g., clusters) REINFORCEMENT LEARNING Agent acts sequentially in the real world, chooses actions a1,,an, receives reward R Must decide which actions were most responsible for R DEDUCTIVE VS. INDUCTIVE REASONING Deductive reasoning: General

rules (e.g., logic) to specific examples Inductive reasoning: Specific examples to general rules INDUCTIVE LEARNING Basic form: learn a function from examples f is the unknown target function An example is a pair (x, f(x)) Problem: find a hypothesis h such that h f given a training Instance set of examples D

of supervised learning Classification task: f {0,1,,C} (usually C=1) Regression task: f reals INDUCTIVE LEARNING METHOD Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: INDUCTIVE LEARNING METHOD Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples)

E.g., curve fitting: INDUCTIVE LEARNING METHOD Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: INDUCTIVE LEARNING METHOD Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: INDUCTIVE LEARNING METHOD Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting:

INDUCTIVE LEARNING METHOD Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: h=D is a trivial, but perhaps uninteresting solution (caching) OVERFITTING An issue that can arise in all learning tasks where the hypothesis too closely matches the known data Essentially

we incorporate the noise of the data into our prediction strategy for unseen data this is BAD! Extreme slides cases shown on previous two CLASSIFICATION TASK The target function f(x) takes on values True and False A example is positive if f is True, else it is negative

The set X of all examples is the example set The training set is a subset of X a small one! LOGIC-BASED INDUCTIVE LEARNING Here, examples (x, f(x)) take on discrete values LOGIC-BASED INDUCTIVE LEARNING Here, examples (x, f(x)) take on discrete values Concept Note that the training set does not say whether an observable predicate is pertinent or not REWARDED CARD EXAMPLE

Deck of cards, with each card designated by [r,s], its rank and suit, and some cards rewarded Background knowledge KB: ((r=1) v v (r=10)) NUM(r) ((r=J) v (r=Q) v (r=K)) FACE(r) ((s=S) v (s=C)) BLACK(s) ((s=D) v (s=H)) RED(s) Training set D: REWARD([4,C]) REWARD([7,C]) REWARD([2,S]) REWARD([5,H]) REWARD([J,S]) REWARDED CARD EXAMPLE

Deck of cards, with each card designated by [r,s], its rank and suit, and some cards rewarded Background knowledge KB: ((r=1) v v (r=10)) NUM(r) ((r=J) v (r=Q) v (r=K)) FACE(r) ((s=S) v (s=C)) BLACK(s) There are several possible ((s=D) v (s=H)) RED(s) inductive hypotheses Training set D: REWARD([4,C]) REWARD([7,C]) REWARD([2,S]) REWARD([5,H]) REWARD([J,S]) Possible inductive hypothesis: h (NUM(r) BLACK(s) REWARD([r,s])) LEARNING A LOGICAL PREDICATE (CONCEPT CLASSIFIER)

Set E of objects (e.g., cards) Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD) Example: CONCEPT describes the precondition of an action, e.g., Unstack(C,A) E is the set of states CONCEPT(x) HANDEMPTYx, BLOCK(C) x, BLOCK(A) x, CLEAR(C) x, ON(C,A) x Learning CONCEPT is a step toward learning an action description LEARNING A LOGICAL PREDICATE (CONCEPT CLASSIFIER)

Set E of objects (e.g., cards) Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD) Observable predicates A(x), B(X), (e.g., NUM, RED) Training set: values of CONCEPT for some combinations of values of the observable predicates LEARNING A LOGICAL PREDICATE (CONCEPT CLASSIFIER) Set E of objects (e.g., cards) Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)

Observable predicates A(x), B(X), (e.g., NUM, RED) Training set: values of CONCEPT for some combinations of values of the observable predicates Find a representation of CONCEPT in the form: CONCEPT(x) S(A,B, ) where S(A,B,) is a sentence built with the observable predicates, e.g.: CONCEPT(x) A(x) (B(x) v C(x)) HYPOTHESIS SPACE An hypothesis is any sentence of the form: CONCEPT(x) S(A,B, ) where S(A,B,) is a sentence built using the observable predicates The set of all hypotheses is called the hypothesis space H

An hypothesis h agrees with an example if it gives the correct value of CONCEPT INDUCTIVE LEARNING SCHEME Training set D - + - + + - + +

+ + - + - + - + + - Inductive hypothesis h + Example set X

Hypothesis space H {[A, B, , CONCEPT]} {[CONCEPT(x) S(A,B, )]} SIZE OF HYPOTHESIS SPACE n observable predicates 2n entries in truth table defining CONCEPT and each entry can be filled with True or False In the absence of any restriction (bias), there are n 22 hypotheses to choose from n = 6 2x1019 hypotheses! MULTIPLE INDUCTIVE HYPOTHESES

Deck of cards, with each card designated by [r,s], its rank and suit, and some cards rewarded Background knowledge KB: ((r=1) v v (r=10)) NUM(r) ((r=J ) v (r=Q) v (r=K)) FACE(r) ((s=S) v (s=C)) BLACK(s) ((s=D) v (s=H)) RED(s) Training set D: REWARD([4,C]) REWARD([7,C]) REWARD([2,S]) REWARD([5,H]) REWARD([J ,S]) h1 NUM(r) BLACK(s) REWARD([r,s]) h2 BLACK(s) (r=J) ) REWARD([r,s]) h3 ([r,s]=[4,C]) ([r,s]=[7,C]) [r,s]=[2,S]) REWARD([r,s]) h4 ([r,s]=[5,H]) ([r,s]=[J) ,S]) REWARD([r,s]) agree with all the examples in the training set MULTIPLE INDUCTIVE HYPOTHESES Deck of cards, with each card designated by [r,s], its rank and suit, and some cards rewarded

Background knowledge KB: ((r=1) v v (r=10)) NUM(r) ((r=J ) v (r=Q) v (r=K)) FACE(r) ((s=S) v (s=C)) BLACK(s) ((s=D) v (s=H)) RED(s) Training set D: REWARD([4,C]) REWARD([7,C]) REWARD([2,S]) REWARD([5,H]) REWARD([J ,S]) Need for a system of preferences called an inductive bias to compare possible hypotheses h1 NUM(r) BLACK(s) REWARD([r,s]) h2 BLACK(s) (r=J) ) REWARD([r,s]) h3 ([r,s]=[4,C]) ([r,s]=[7,C]) [r,s]=[2,S]) REWARD([r,s]) h4 ([r,s]=[5,H]) ([r,s]=[J) ,S]) REWARD([r,s]) agree with all the examples in the training set NOTION OF CAPACITY

It refers to the ability of a machine to learn any training set without error A machine with too much capacity is like a botanist with photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything he has seen before A machine with too little capacity is like the botanists lazy brother, who declares that if its green, its a tree Good generalization can only be achieved when the right balance is struck between the accuracy attained on the training set and the capacity of the machine KEEP-IT-SIMPLE (KIS) BIAS

Examples Use much fewer observable predicates than the training set Constrain the learnt predicate, e.g., to use only high-level observable predicates such as NUM, FACE, BLACK, and RED and/or to have simple syntax Motivation If an hypothesis is too complex it is not worth learning it (data caching does the job as well) There are much fewer simple hypotheses than complex ones, hence the hypothesis space is KEEP-IT-SIMPLE (KIS) BIAS

Examples Use much fewer observable predicates than the training set Constrain the learnt predicate, e.g., to use only high-level observable predicates such as NUM, FACE, BLACK, and RED and/or to have simple syntax Einstein: A theory must be as simple as possible, but not simpler than this Motivation If an hypothesis is too complex it is not worth learning it (data caching does the job as well)

There are much fewer simple hypotheses than complex ones, hence the hypothesis space is KEEP-IT-SIMPLE (KIS) BIAS Examples Use much fewer observable predicates than the Iftraining the biasset allows only sentences S that are conjunctions k << npredicate, predicates picked from Constrain theoflearnt e.g., to use only high-level observable predicates

such asof NUM, the n observable predicates, then the size BLACK, and RED and/or to have simple k HFACE, is O(n ) syntax Motivation

If an hypothesis is too complex it is not worth learning it (data caching does the job as well) There are much fewer simple hypotheses than complex ones, hence the hypothesis space is SUPERVISED LEARNING FLOW CHART Datapoints Target function Training set Unknown concept we want to approximate Hypothesis space Choice of learning algorithm Learner Observations we

have seen Better quantities to assess performance Inductive Hypothesis Test set Prediction Observations we will see in the future CAPACITY IS NOT THE ONLY CRITERION Accuracy on training set isnt the best measure of performance Training set D

- + - + + - + + + + - + - +

+ + - - Example set X + Test Learn Hypothesis space H GENERALIZATION ERROR A hypothesis h is said to generalize well if it achieves low error on all examples in X

- + - + + + + + - + - Test -

+ + Learn + + - - Example set X + Hypothesis space H OVERFITTING (RESTATED)

If we have a situation where our training error continues to decrease While our generalization error on the other hand begins to increase We can safely say our model has fallen victim to overfitting What does this have to do with capacity? ASSESSING PERFORMANCE OF A LEARNING ALGORITHM Samples from X are typically unavailable Take out some of the training set

Train on the remaining training set Test on the excluded instances Cross-validation CROSS-VALIDATION Split original set of examples, train Examples D - + - + - - - + +

+ + - + - + + Train + + + Hypothesis space H

CROSS-VALIDATION Evaluate hypothesis on testing set Testing set - - - + + - + + +

+ - + Hypothesis space H CROSS-VALIDATION Evaluate hypothesis on testing set Testing set - + + + +

- + Test + - + - Hypothesis space H CROSS-VALIDATION Compare true concept against prediction 9/13 correct Testing set

- + ++ ++ -- -+ ++ ++ +- -+ -++ --

Hypothesis space H TENNIS EXAMPLE Evaluate learning algorithm PlayTennis = S(Temperature,Wind) TENNIS EXAMPLE Evaluate learning algorithm PlayTennis = S(Temperature,Wind) Trained hypothesis PlayTennis = (T=Mild or Cool) (W=Weak) Training errors = 3/10 Testing errors = 4/4 TENNIS EXAMPLE Evaluate learning algorithm

PlayTennis = S(Temperature,Wind) Trained hypothesis PlayTennis = (T=Mild or Cool) Training errors = 3/10 Testing errors = 1/4 TENNIS EXAMPLE Evaluate learning algorithm PlayTennis = S(Temperature,Wind) Trained hypothesis PlayTennis = (T=Mild or Cool) Training errors = 3/10 Testing errors = 2/4 HOW TO CONSTRUCT A BETTER LEARNER? Ideas?

TEN(ISH) COMMANDMENTS OF MACHINE LEARNING Thou shalt not: Train on examples in the testing set Form assumptions by peeking at the testing set, then formulating inductive bias Overfit the training data REINFORCEMENT LEARNING Enough talk, lets see some machine learning in action! Klondike Solitaire Klondike Solitaire

Difficulties: size of state space; large number of stochastic elements; variability of utility of actions Strategy: Reinforcement Learning reward long-term learning Representation State: want to find a way to effectively reduce/collapse state space Action: need to find way of incorporating

key components of an action and its potential results Algorithm SARSA: State1-Action1-Reward-State22 Action ( , ) = ( , ) + [ + ( , ) ( , ) ] Different learning policies vary in: State representation Action representation

Reward function Results Win Pct. vs. Learning Time 4.5 4 3.5 3 Random P1' P4' P5' P6' Win % 2.5 2

1.5 1 0.5 0 0 200 400 600 Games played (learning time) 800 1000

1200 Clear Winner Emerges Win Pct. vs. Learning Time 3 2.5 Win % 2 1.5 P4' 1 0.5 0 0

200 400 600(learning time) Games played - Slow, but steady progress 800 1000 1200 Keep Playing... Win Percentage vs. Learning Time Long-term play 3.2 3.1

Win Rate 3 Trial 1 2 3 4 5 2.9 2.8 2.7 2.6 2.5 0

250000 500000 750000 1000000 Games Played - Seems to converge, BUT: average win rates for last 10,000 games were: 3.13, 3.08, 3.4, 3, 3.2 3 !!! NEXT TIME Decision tree learning Read R&N 18.1-3