
ACL 2008: Semi-supervised Learning Tutorial
John Blitzer and Xiaojin Zhu
http://ssl-acl08.wikidot.com

What is semi-supervised learning (SSL)? Labeled data (entity classification): ". . . says Mr. Cooper, vice president of Firing Line Inc., a Philadelphia gun shop." Lots more unlabeled data: ". . . Yahoo's own Jerry Yang is right . . ."; ". . . the details of Obama's San Francisco mis-adventure . . ."

Labels: person, location, organization. Can we build a better model from both labeled and unlabeled data?

Who else has worked on SSL? Canonical NLP problems: tagging (Haghighi and Klein 2006); chunking, NER (Ando & Zhang 2005); parsing (McClosky & Charniak 2006).

Outside the classic NLP canon: entity-attribute extraction (Bellare et al. 2007), sentiment analysis (Goldberg & Zhu 2006), link spam detection (Zhou et al. 2007). Your problem?

Anti-SSL arguments: practice. "If a problem is important, we'll find the time / money / linguists to label more data." [parse tree from the Penn Chinese Treebank: "The national track & field championships concluded"] But the Penn Chinese Treebank took 2 years to annotate 4000 sentences. I want to parse the Baidu Zhidao question-answer database. Who's going to annotate it for me?

Anti-SSL arguments: theory. But Tom Cover said (Castelli & Cover 1996): under a specific generative model, labeled samples are exponentially more useful than unlabeled ones. The semi-supervised models in this tutorial make different assumptions than C&C (1996).

Today we'll also discuss new, positive theoretical results in semi-supervised learning.

Why semi-supervised learning? "I have a good idea, but I can't afford to label lots of data!" "I have lots of labeled data, but I have even more unlabeled data." SSL: it's not just for small amounts of labeled data anymore! Domain adaptation: "I have labeled data from one domain, but I want a model for a different domain."

Goals of this tutorial: 1) Cover the most common classes of semi-supervised learning algorithms.

2) For each major class, give examples of where it has been used for NLP. 3) Give you the ability to know which type of algorithm is right for your problem. 4) Suggest advice for avoiding pitfalls in semi-supervised learning.

Overview: 1) Bootstrapping (50 minutes): co-training; latent variables with linguistic side information. 2) Graph regularization (45 minutes). 3) Structural learning (55 minutes):

entity recognition, domain adaptation, and theoretical analysis.

Some notation. [the notation slide's formulas did not survive extraction]

Bootstrapping: outline. The general bootstrapping procedure; co-training and co-boosting; applications to entity classification and entity-attribute extraction; SSL with latent variables, prototype learning, and applications.

Bootstrapping

On labeled data, minimize error. On unlabeled data, minimize a proxy for error derived from the current model. Most semi-supervised learning models in NLP:
1) Train a model on labeled data.
2) Repeat until converged:
   a) Label unlabeled data with the current model.
   b) Retrain the model on the unlabeled data.

Back to named entities: a Naïve Bayes model. Features (left, middle, right words); label: Person. Parameters are estimated from counts.
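To make the loop concrete, here is a minimal self-training sketch in the spirit of the procedure above. The scikit-learn Naïve Bayes classifier, dense count matrices, and the confidence threshold are illustrative assumptions, not part of the tutorial.

```python
# A minimal self-training (bootstrapping) sketch. Assumes dense count
# matrices X_lab, X_unlab and integer labels y_lab; threshold and
# classifier choice are illustrative.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    clf = MultinomialNB()
    clf.fit(X_lab, y_lab)                      # 1) train on labeled data
    for _ in range(max_rounds):                # 2) repeat until converged
        if len(X_unlab) == 0:
            break
        probs = clf.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():                # nothing confident left: stop
            break
        # a) label confident unlabeled examples with the current model
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, probs[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]
        clf.fit(X_lab, y_lab)                  # b) retrain on augmented data
    return clf
```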

Bootstrapping step:
Data: "says Mr. Cooper, vice president" → estimate parameters.
Label unlabeled data: "Mr. Balmer has already faxed".
Retrain the model on "Mr. Balmer has already faxed".

Update action: label "Balmer" as Person.

Bootstrapping folk wisdom. Bootstrapping works better for generative models than for discriminative models: discriminative models can overfit some features, while generative models are forced to assign probability mass to all features with some count. Bootstrapping works better when the naïve Bayes assumption is stronger: "Mr." is not predictive of "Balmer" if we know the entity is a person.

Two views and co-training: make the bootstrapping folk wisdom explicit. There are two views of a problem; assume each view is sufficient to do good classification. Named entity classification (NEC) has 2 views: context vs. content ("says Mr. Cooper, a vice president of . . .").

General co-training procedure: on labeled data, maximize accuracy; on unlabeled data, constrain models from different views to agree with one another. With multiple views, any supervised learning algorithm can be co-trained.
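A minimal two-view co-training sketch in the style of Blum & Mitchell (1998), not the tutorial's specific algorithm. The classifier choice, per-round growth size, and averaging of the two views' posteriors are illustrative assumptions.

```python
# Two-view co-training sketch: X1_*/X2_* are dense feature matrices for
# the two views (e.g. context vs. content); choices here are illustrative.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, per_round=5):
    for _ in range(rounds):
        c1 = MultinomialNB().fit(X1_l, y_l)    # view 1 model
        c2 = MultinomialNB().fit(X2_l, y_l)    # view 2 model
        if len(X1_u) == 0:
            break
        # each view nominates the unlabeled examples it is most sure about
        picks = set()
        for clf, X_u in ((c1, X1_u), (c2, X2_u)):
            conf = clf.predict_proba(X_u).max(axis=1)
            picks.update(int(i) for i in np.argsort(-conf)[:per_round])
        picks = sorted(picks)
        # pseudo-label with the average of the two views' posteriors
        # (one simple way to make the views agree)
        pseudo = ((c1.predict_proba(X1_u[picks]) +
                   c2.predict_proba(X2_u[picks])) / 2).argmax(axis=1)
        X1_l = np.vstack([X1_l, X1_u[picks]])
        X2_l = np.vstack([X2_l, X2_u[picks]])
        y_l = np.concatenate([y_l, pseudo])
        keep = np.setdiff1d(np.arange(len(X1_u)), picks)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return c1, c2
```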

Co-boosting for named entity classification (Collins and Singer 1999). A brief review of supervised boosting: boosting runs for t = 1 . . . T rounds. On round t, we choose a base model $h_t$ and weight $\alpha_t$. For NLP, the model at round t identifies the presence of a particular feature and guesses or abstains. Final model: $f_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$. Boosting objective: holding the current model (steps 1 . . . t-1) fixed, minimize the exp loss $\sum_i \exp(-y_i f(x_i))$, an upper bound on the 0-1 loss.

Co-boosting objective (superscripts index the view; subscripts index the round of boosting). In the style of Collins and Singer (1999), the objective is a view-1 loss plus a view-2 loss, coupled on unlabeled data through each view's pseudo-labels:

$\sum_{i \in L} e^{-y_i f_t^1(x_i^1)} + \sum_{i \in L} e^{-y_i f_t^2(x_i^2)} + \sum_{i \in U} e^{-\operatorname{sign}(f_t^2(x_i^2))\, f_t^1(x_i^1)} + \sum_{i \in U} e^{-\operatorname{sign}(f_t^1(x_i^1))\, f_t^2(x_i^2)}$

Unlabeled co-regularizer: [3D plot: scores of the individual ensembles (x- and y-axes) vs. the co-regularizer term (z-axis); score magnitude is important for disagreement, but not important for agreement]

Co-boosting updates: optimize each view separately. Set hypothesis $h_t$ and weight $\alpha_t$ to minimize the view-2 loss; similarly for view 1. Each greedy update is guaranteed to decrease one view of the objective.

Basic co-boosting walk-through. Labeled: "Mr. Balmer has already faxed". Unlabeled: "says Mr. Smith, vice president of", "Adam Smith wrote The Wealth of Nations".

Co-boosting step: update the context view.

Data: "Mr. Balmer has already faxed". Label unlabeled data: "says Mr. Smith, vice president" → label: Person.
Update the content view. Data: "says Mr. Smith, vice president" (Person). Label unlabeled data: "Adam Smith wrote The Wealth of Nations".
Update the context view. Update action:

"Adam Smith wrote . . ." → label: Person.

Co-boosting NEC results. Data: 90,000 unlabeled named entities. Seeds: location names (New York, California, U.S.); person context (Mr.); organization names (I.B.M., Microsoft); organization context (Incorporated). Create labeled data using the seeds as rules: whenever I see "Mr. ____", label it as a person. Results: baseline (most frequent class) 45%; co-boosting 91%.

Entity-attribute extraction (Bellare et al. 2008). Entities: companies, countries, people. Attributes: C.E.O., market capitalization, border, prime minister, age, employer, address. Extracting entity-attribute pairs from sentences, e.g. "[The]_L [population]_x [of]_M [China]_y [exceeds]_R . . .", where L, M, R = context and

x, y = content.

Data and learning. Input: a seed list of entities and attributes. 2 views: context and content. Training: co-training of decision lists and a self-trained MaxEnt classifier. Problem: no negative instances. Solution: treat all unlabeled instances as negative, then re-label the most confident positive instances.

Examples of learned attributes: countries and attributes; companies and attributes. [the specific example pairs did not survive extraction]

Where can co-boosting go wrong? Co-boosting enforces agreement on unlabeled data; if only one view can classify correctly, this causes errors. Co-boosting step: update the context view.

Data: "Mr. Balmer has already faxed". Label unlabeled data: "says Mr. Cooper, vice president" → label: Person. Update the content view. Data: "says Mr. Cooper, vice president" (Person). Label unlabeled data: "Cooper Tires spokesman John" → label: Person (an error: here only the context view could have classified correctly).

SSL with latent variables: maximize likelihood, treating the labels of unlabeled data as hidden.

[graphical model: the label Y generates left, middle, and right word features LW, MW, RW for each entity] Labeled data gives us the basic label-feature structure; maximum likelihood (MLE) via EM fills in the gaps.
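A sketch of the EM idea for this setting, with semi-supervised Naïve Bayes standing in for the tutorial's model: labeled responsibilities are clamped, unlabeled ones are filled in as posteriors. Smoothing and iteration count are illustrative assumptions.

```python
# Semi-supervised Naive Bayes via EM: labels of unlabeled data are hidden.
# X_l, X_u are dense count matrices; y_l is an integer label vector.
import numpy as np

def em_naive_bayes(X_l, y_l, X_u, n_classes=2, iters=20, alpha=1.0):
    Y_l = np.eye(n_classes)[y_l]                 # one-hot labeled responsibilities
    R_u = np.full((len(X_u), n_classes), 1.0 / n_classes)
    X = np.vstack([X_l, X_u])
    for _ in range(iters):
        R = np.vstack([Y_l, R_u])                # labeled rows stay clamped
        # M-step: class priors and per-class feature probabilities
        prior = R.sum(axis=0) / R.sum()
        counts = R.T @ X + alpha                 # add-alpha smoothing
        theta = counts / counts.sum(axis=1, keepdims=True)
        # E-step: posterior class responsibilities for unlabeled data
        log_p = np.log(prior) + X_u @ np.log(theta).T
        log_p -= log_p.max(axis=1, keepdims=True)
        R_u = np.exp(log_p)
        R_u /= R_u.sum(axis=1, keepdims=True)
    return prior, theta
```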

Where can MLE go wrong? It is unclear when likelihood and error are related: Collins & Singer (1999) report co-boosting at 92% vs. EM at 83%; see also Mark Johnson, "Why doesn't EM find good HMM POS-taggers?", EMNLP 2007.

How can we fix MLE? Good solutions are high-likelihood, even if they're not maximum likelihood. Coming up: constrain solutions to be consistent with linguistic intuition.

Prototype-driven learning (Haghighi and Klein 2006). Standard SSL: labeled data plus unlabeled data. Prototype learning (part of speech): unlabeled training data plus

prototypes:
NN: president, percent
VBD: said, was, had
JJ: new, last, other

Each instance is partially labeled; prototypes force representative instances to be consistent.

Using prototypes in HMMs:

Example: "Cooper, vice president of . . ." with features MW=president, LW=vice, suffix=dent. EM algorithm: constrained forward-backward. Haghighi and Klein (2006) use Markov random fields.

Incorporating distributional similarity. Represent each word by its bigram counts with the most frequent words on the left & right, e.g. for "president": LW=vice: 0.1, LW=the: 0.02, . . . , RW=of: 0.13, RW=said: 0.05, . . . Then compute a k-dimensional representation via SVD.
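A toy sketch of this representation; the counts, words, and choice of k are made up for illustration.

```python
# Distributional similarity: rows are words, columns are left/right bigram
# contexts with frequent words; a truncated SVD gives a k-dim embedding.
import numpy as np

# toy word-by-context matrix (cols: LW=vice, LW=the, RW=of, RW=said)
C = np.array([
    [10., 2., 13., 5.],    # president
    [ 9., 3., 11., 4.],    # chairman
    [ 0., 8.,  1., 0.],    # yellow
])
C = C / C.sum(axis=1, keepdims=True)       # relative frequencies

U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
embed = U[:, :k] * S[:k]                   # k-dimensional word vectors

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(embed[0], embed[1]))             # president ~ chairman: high
print(cos(embed[0], embed[2]))             # president ~ yellow: low
```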

Similarity between a word and a prototype is then computed in this low-dimensional space. We'll see a similar idea when discussing structural learning.

Results: part-of-speech tagging. Prototype examples (3 prototypes per tag):
NN: president; IN: of; JJ: new; VBD: said; NNS: shares; DET: the; CC: and; TO: to; CD: million; NNP: Mr.; PUNC: .; VBP: are

Results: BASE 46.4%, PROTO 67.7%, PROTO+SIM 80.5%.

Results: classified ads. Goal: segment housing advertisements.

Fields: Size, Restrict, Terms, Location. Example ad: "Remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, and schools. Paid water and garbage. No dogs allowed."

Prototype examples and results:

LOCATION: near, shopping; TERMS: paid, utilities; SIZE: large, spacious; RESTRICT: dogs, smoking. Similarity is computed from the bag of words in the current sentence.
Results: BASE 46.4%, PROTO 53.7%, PROTO+SIM 71.5%.

Comments on bootstrapping: easy to write down and optimize; hard to predict failure cases.

Co-training encodes assumptions as two-view agreement. Prototype learning enforces linguistically consistent constraints on unsupervised solutions. Co-training doesn't always succeed (see the structural learning section). Prototype learning needs good SIM features to perform well.

Entropy and bootstrapping. Haffari & Sarkar (2007), "Analysis of Semi-supervised Learning with the Yarowsky Algorithm": variants of the Yarowsky algorithm minimize the entropy of p(y | x) on unlabeled data. Other empirical work has looked at minimizing entropy directly.

But entropy is not error, and there is little recent theoretical work connecting entropy & error.

More bootstrapping work. McClosky & Charniak (2006), "Effective Self-training for Parsing": self-trained Charniak parser on WSJ & NANC. Aria Haghighi's prototype sequence toolkit: http://code.google.com/p/prototype-sequence-toolkit/. Mann & McCallum (2007), "Expectation Regularization": similar to prototype learning, but develops a regularization framework for conditional random fields.

Graph-based semi-supervised learning. From items to graphs; basic graph-based algorithms:

mincut, label propagation and the harmonic function, manifold regularization; advanced graphs: dissimilarities, directed graphs.

Text classification, easy example: two classes, astronomy vs. travel. Document = a 0-1 bag-of-words vector; cosine similarity. Easy, by

word overlap: x1 = "bright asteroid", y1 = astronomy; x2 = "yellowstone denali", y2 = travel; x3 = "asteroid comet"?; x4 = "camp yellowstone"?

Hard example: x1 = "bright asteroid", y1 = astronomy; x2 = "yellowstone denali", y2 = travel; x3 = "zodiac"?; x4 = "airport bike"? No word overlap; zero cosine similarity. Pretend you don't know English.

Hard example: [figure: documents x1-x4 connected to the words they contain (asteroid, bright, comet, zodiac, airport, bike, yellowstone, denali); the labeled and unlabeled documents share no words]

Unlabeled data comes to the rescue: [figure: additional unlabeled documents x5-x9 contain overlapping words and connect the previously disconnected documents]

Intuition: 1. Some unlabeled documents are similar to the labeled documents → same label. 2. Some other unlabeled documents are similar to the above unlabeled

documents → same label. 3. And so on, ad infinitum. We will formalize this with graphs.

The graph: nodes; weighted, undirected edges; large weight ↔ similar. [figure: graph over documents d1-d4; d1, d2 have known labels, and we want to know the labels of d3, d4]

Transduction: predict labels only for the given unlabeled nodes. Induction: predict labels for items not in the graph. How to create a graph: empirically, the following works well. 1. Compute the distance between i, j. 2. For each i, connect it to its k nearest neighbors (k very small, but still connecting the graph). 3. Optionally put weights on (only) those edges, e.g. a Gaussian kernel $w_{ij} = \exp(-d(i,j)^2 / \sigma^2)$. 4. Tune $\sigma$.
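A minimal implementation of this recipe; the Gaussian weighting and default parameter values are illustrative choices.

```python
# kNN graph construction following the recipe above.
import numpy as np

def knn_graph(X, k=3, sigma=1.0):
    n = len(X)
    # pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]       # skip self at index 0
        W[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / sigma ** 2)
    return np.maximum(W, W.T)                  # symmetrize: undirected graph
```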

Mincut (st-cut). Mincut example: subjectivity. Task: classify each sentence in a document as objective or subjective (Pang & Lee, ACL 2004), with NB/SVM for isolated sentence classification. Subjective data (y=1): movie review snippets ("bold, imaginative, and impossible to resist"). Objective data (y=0): IMDB. But there is more. Key observation: sentences next to each other tend to have the same label. Add two special labeled nodes (source, sink).

Every sentence connects to both source and sink. [figure: chain of sentence nodes between source and sink]

Some issues with mincut: there can be multiple, equally minimal cuts that label nodes differently, and the cut gives no classification confidence. These are addressed by harmonic functions and label propagation.

Harmonic function: an electric network interpretation. Edges are resistors with $R_{ij} = 1/w_{ij}$; connect the labeled nodes to +1 volt (label 1) and 0 volts (label 0). The voltage at each unlabeled node is the harmonic-function value. Label propagation: with the graph Laplacian $L = D - W$ (where $D_{ii} = \sum_j w_{ij}$), the closed-form solution is $f_u = -L_{uu}^{-1} L_{ul} y_l$.
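A direct implementation of this closed form, assuming the first l rows/columns of the weight matrix correspond to the labeled nodes.

```python
# Harmonic-function solution on a weighted graph W (dense, symmetric).
# Node order assumption: the first l nodes are labeled, with labels y_l.
import numpy as np

def harmonic(W, y_l):
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    L_uu = L[l:, l:]
    L_ul = L[l:, :l]
    f_u = np.linalg.solve(L_uu, -L_ul @ np.asarray(y_l, dtype=float))
    return f_u                                 # soft labels for unlabeled nodes
```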

Harmonic example 1: word sense disambiguation (WSD) from context, e.g. "interest", "line" (Niu, Ji, Tan, ACL 2005). x_i: context of the ambiguous word; features: POS, words, collocations. d_ij: cosine similarity or JS-divergence; w_ij: kNN graph. Labeled data: a few x_i are tagged with their word sense. [figure: SENSEVAL-3 accuracy as a function of the percent of data labeled]

Harmonic example 2: sentiment. Predict a rating (0-3) from movie reviews (Goldberg & Zhu, NAACL 2006 workshop).

x_i: movie reviews; w_ij: cosine similarity between the positive-sentence-percentage (PSP) vectors of x_i, x_j; the PSP classifier is trained on snippet data (Pang & Lee, ACL 2005). [figure: accuracy for different graphs]

Some issues with the harmonic function. It fixes the given labels y_l: what if some labels are wrong?

It also cannot easily handle new test items (transductive, not inductive: add test items to the graph and recompute). Manifold regularization addresses these issues.

Manifold regularization.
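For reference, the manifold regularization objective in its standard form (Belkin, Niyogi & Sindhwani); V is a loss function, the first two terms are ordinary regularized supervised learning, and the last term penalizes non-smoothness of the vector of predictions f over the graph (normalization constants vary by presentation):

$$\min_{f}\; \frac{1}{l}\sum_{i=1}^{l} V\big(f(x_i),\, y_i\big) \;+\; \gamma_A \lVert f\rVert_K^2 \;+\; \gamma_I\, \mathbf{f}^{\top} L\, \mathbf{f}$$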

Manifold example: text classification (Sindhwani, Niyogi, Belkin, ICML 2005). x_i: mac/windows, TF-IDF; w_ij: weighted kNN graph; l = 50, u = 1411, test = 485.

Advanced topics. So far, edges denote symmetric similarity: larger weights → more similar labels. What if we have dissimilarity knowledge ("two items probably have different labels")? What if the relation is asymmetric (i is "related to" j, but j is not always "related to" i)?

Dissimilarity: political view classification (Goldberg, Zhu, Wright, AISTATS 2007).
> deshrubinator: "You were the one who thought it should be investigated last week."
Dixie: "No I didn't, and I made it clear. You are insane! YOU are the one with NO ****ING RESPECT FOR DEMOCRACY!"
They disagree → different classes. Indicators: quoting, "!?", all caps (internet shouting), etc.

Dissimilarity. Recall that similarity between i, j is encoded by the penalty $w_{ij}(f_i - f_j)^2$. Wrong ways to encode dissimilarity: a small $w$ expresses no preference, not dissimilarity; a negative

$w$ makes the optimization nasty. One solution (also see Tong & Jin, AAAI 2007): penalize dissimilarity edges with a term such as $w_{ij}(f_i + f_j)^2$ for ±1 labels, so the overall regularizer combines dissimilarity and similarity terms.

Directed graphs: spam vs. good webpage classification (Zhou, Burges, Tao, AIRWeb 2007). Hyperlinks are the graph edges; a few webpages are manually labeled. Directed hyperlink edges matter:

[figure: two configurations of hyperlinks between a page X and surrounding spam pages; depending on the direction of the edges, X is "more likely spam" in one configuration and "may be good" in the other]

One can define an analogous directed-graph Laplacian and apply manifold regularization.

Caution. Advantages of graph-based methods: clear intuition, elegant math; they perform well if the graph fits the task. Disadvantages: they perform poorly if the graph is bad, being sensitive to graph structure and edge weights, and usually we do not know in advance which will happen!

Structural learning: outline. The structural learning algorithm; application to named entity recognition;

domain adaptation with structural correspondence learning; the relationship between structural and two-view learning.

Structural learning (Ando and Zhang, 2005): use unlabeled data to constrain the structure of the hypothesis space. Given a target problem (entity classification), design auxiliary problems that look like the target problem but can be trained using unlabeled data. Regularize the target-problem hypothesis to be close to the auxiliary-problem hypothesis space. What are auxiliary problems?

2 criteria for auxiliary problems: 1) look like the target problem; 2) can be trained from unlabeled data. Named entity classification: predict the presence or absence of left / middle / right words (e.g. left: Mr., President; middle: Thursday, John,

York; right: Corp., Inc., said).

Auxiliary problems for sentiment classification. Example review: "Running with Scissors: A Memoir. Title: Horrible book, horrible. This book was horrible. I read half of it, suffering from a headache the entire time, and eventually i lit it on fire. One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes. This book wasted my life." Labels: positive / negative. Auxiliary problems: presence or absence of frequent words and bigrams, e.g. dont_waste, horrible, suffering.

Auxiliary problem hypothesis space. Consider linear, binary

auxiliary predictors $\operatorname{sign}(w_i \cdot x)$, with weight vector $w_i$ for auxiliary problem i. Given a new hypothesis weight vector $w$, how far is it from the span of the auxiliary weight vectors?

Two steps of structural learning. Step 1: use unlabeled data and the auxiliary problems to learn a representation $\theta$, an approximation to the subspace in which the auxiliary weight vectors lie. Features:

[figure: a sparse, high-dimensional 0/1 feature vector is mapped by $\theta$ to a dense low-dimensional representation (e.g. 0.3, -1.0, . . . , 0.7, -2.1), whose weights are learned from labeled data] Step 2: use labeled data to learn weights for the new representation.

Unlabeled step: train auxiliary predictors. For each unlabeled instance, create a binary presence / absence label, e.g.: (1) "The book is so repetitive that I found myself yelling. I will definitely not buy another."

(2) "An excellent book. Once again, another wonderful novel from Grisham." Binary problem: does "not buy" appear here? Mask the pivot features and predict them using the other features. Train n linear predictors, one for each binary problem. The auxiliary weight vectors give us clues about the conditional covariance structure of the features.

Unlabeled step: dimensionality reduction. The predictors give n new features; the value of the i-th feature is the

propensity to see "not buy" in the same document. We want a low-dimensional representation, since many pivot predictors give similar information (horrible, terrible, awful). Compute the SVD of the matrix of auxiliary weight vectors and use the top left singular vectors as $\theta$.

Step 2: labeled training. Use $\theta$ to regularize the labeled objective: with the original, high-dimensional weight vector

$w$ and a low-dimensional weight vector $v$ for the learned features, minimize something of the form $\sum_i L\big(w \cdot x_i + v \cdot (\theta x_i),\, y_i\big) + \lambda \lVert w\rVert^2$: only the high-dimensional features have a quadratic regularization term.

Comparison to prototype similarity: structural learning uses predictor (weight-vector) space rather than counts, and the similarity is learned rather than fixed.
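Putting the two steps together, a condensed sketch under simplifying assumptions: dense 0/1 matrices, hypothetical variable names (X_unlab, X_lab, y_lab, pivot_idx), off-the-shelf linear models, and feature augmentation standing in for the tutorial's regularization scheme.

```python
# Structural-learning sketch: (1) train masked pivot predictors on
# unlabeled data and take the SVD of their weights; (2) train on labeled
# data with the learned low-dimensional features appended.
import numpy as np
from sklearn.linear_model import SGDClassifier, LogisticRegression

def scl_representation(X_unlab, pivot_idx, k=50):
    W = []
    X_masked = X_unlab.copy()
    X_masked[:, pivot_idx] = 0              # mask pivots: predict from the rest
    for p in pivot_idx:
        y = X_unlab[:, p]                   # presence/absence of pivot p
        clf = SGDClassifier(loss="modified_huber", alpha=1e-4)
        clf.fit(X_masked, y)
        W.append(clf.coef_.ravel())
    W = np.array(W).T                       # features x pivots
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k]                         # theta: features -> k dimensions

# Step 2 (illustrative): augment labeled features with the projection.
theta = scl_representation(X_unlab, pivot_idx, k=50)
X_aug = np.hstack([X_lab, X_lab @ theta])
model = LogisticRegression(max_iter=1000).fit(X_aug, y_lab)
```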

Results: named entity recognition. Data: CoNLL 2003 shared task. Labeled: 204 thousand tokens of Reuters news data; annotations: person, location, organization, miscellaneous. Unlabeled: 30 million words of Reuters news data.

A glance at some of the rows of θ:
Row 4: Ltd, Inc, Plc, International, Association, Group
Row 9: PCT, N/A, Nil, Dec, BLN, Avg, Year-on-Year
Row 11: San, New, France, European, Japan
Row 15: Peter, Sir, Charles, Jose, Paul, Lee

Numerical results (F-measure):
Model        10k tokens   204k tokens
Baseline     72.8         85.4
Co-training  73.1         85.4
Structural   81.3         89.3

Note the large difference between co-training here and co-boosting (Collins & Singer 1999): this task is entity recognition, not classification, and we must improve over a supervised baseline.

Domain adaptation with structural learning.

Blitzer et al. (2006): Structural Correspondence Learning (SCL). Blitzer et al. (2007): for sentiment, books & kitchen appliances.

Books: "Running with Scissors: A Memoir. Title: Horrible book, horrible. This book was horrible. I read half of it, suffering from a headache the entire time, and eventually i lit it on fire. One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes. This book wasted my life."

Kitchen: "Avante Deep Fryer, Chrome & Black. Title: lid does not work well... I love the way the Tefal deep fryer cooks, however, I am returning my second one due to a defective lid closure. The lid may close initially, but after a few uses it no longer stays closed. I will not be purchasing this one again."

[error increase when adapting across these domains: 13% → 26%]

Pivot features. Pivot features are features which are shared across domains.
Unlabeled kitchen contexts: "Do not buy the Shark portable steamer. Trigger mechanism is defective."; "the very nice lady assured me that I must have a defective set. What a disappointment!"; "Maybe mine was defective. The directions were unclear."
Unlabeled books contexts: "The book is so repetitive that I found myself yelling. I will definitely not buy another."; "A disappointment. Ender was talked about for <#> pages altogether."; "it's unclear. It's repetitive and boring."

Use the presence of pivot features as auxiliary problems. Choosing pivot features: mutual information (a sketch follows below). Pivot selection (SCL): select the top features by shared counts. Pivot selection (SCL-MI): select the top features in two passes: (1) filter a feature if its minimum count across the two domains is < k; (2) select the top filtered features by mutual information with the source-domain labels.

Books-kitchen example. In SCL but not SCL-MI: book, one, so, all, very, about, they, like, good, when. In SCL-MI but not SCL: a_must, a_wonderful, loved_it, weak, dont_waste, awful, highly_recommended, and_easy.
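The promised sketch of SCL-MI pivot selection; the count threshold, helper names, and the simple plug-in MI estimate are illustrative assumptions.

```python
# SCL-MI pivot selection: filter by minimum count in both domains, then
# rank by mutual information with the source-domain label.
import numpy as np

def mutual_info(feature_col, labels):
    # plug-in MI between a binary feature and binary labels
    mi = 0.0
    for fv in (0, 1):
        for lv in (0, 1):
            p_joint = np.mean((feature_col == fv) & (labels == lv))
            p_f = np.mean(feature_col == fv)
            p_l = np.mean(labels == lv)
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (p_f * p_l))
    return mi

def select_pivots(X_src, y_src, X_tgt, n_pivots=1000, min_count=5):
    counts_src = X_src.sum(axis=0)
    counts_tgt = X_tgt.sum(axis=0)
    # (1) filter: a pivot must be frequent in BOTH domains
    candidates = np.where((counts_src >= min_count) &
                          (counts_tgt >= min_count))[0]
    # (2) rank the survivors by MI with the source label
    scored = sorted(candidates, key=lambda j: -mutual_info(X_src[:, j], y_src))
    return scored[:n_pivots]
```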

Sentiment classification data: product reviews from Amazon.com (books, DVDs, kitchen appliances, electronics); 2000 labeled reviews from each domain and 3000-6000 unlabeled reviews. Binary classification problem: positive if 4 stars or more, negative if 2 or less. Features: unigrams & bigrams. Pivots: SCL & SCL-MI. At train time: minimize the Huberized hinge loss (Zhang, 2004).

Visualizing the learned projection (books & kitchen), negative vs. positive:

[figure: features projected onto the learned dimensions, arranged from negative to positive, for books (engaging, plot, <#>_pages, predictable, fascinating, must_read, grisham) and kitchen (poorly_designed, the_plastic, awkward_to, espresso, are_perfect, leaking, years_now, a_breeze)]

Empirical results: books & DVDs. [bar chart: accuracy when adapting to books (D->B, E->B, K->B) and to DVDs (B->D, E->D, K->D) for baseline, SCL, and SCL-MI; accuracies range from roughly 66% to 82%]
Baseline loss due to adaptation: 7.6%. SCL-MI loss due to adaptation: 0.7%. On average, SCL-MI reduces error due to adaptation by 36%.

Structural learning: why does it work? Good auxiliary problems = good representation.
Structural learning vs. co-training: structural learning separates unsupervised and supervised learning, which leads to a more stable solution.
Structural learning vs. graph regularization:

use structural learning when auxiliary problems are obvious, but a graph is not.

Understanding structural learning: goals. Develop a relationship between structural learning and multi-view learning; discuss assumptions under which structural learning can perform well; give a bound on the error of structural learning under these assumptions.

Structural and multi-view learning. [figure: context pivots (LW=Mr., RW=said, RW=corp.) and orthography pivots (Brown, Microsoft, Smith) connect orthography features (Balmer, Smith, Yahoo, General Electric) with context features (RW=expounded, LW=Senator, RW=LLC, LW=the)]

Canonical correlation analysis (CCA; Hotelling, 1936). [figure: correlated words such as Mr., said, Smith, Microsoft mapped into a shared space]

Correlated features from different views are mapped to similar areas of the space.

Structural learning and CCA: some changes to structural learning: (1) minimize squared loss for the auxiliary predictors; (2) block the SVD by view: train auxiliary predictors for view 1 using features from view 2, and vice versa.

CCA and semi-supervised learning. Kakade and Foster (2007), "Multi-view regression via canonical correlation analysis". Assume: a predictor based on either single view alone is almost as good as a predictor based on both views jointly.

Contrast with co-training: K&F don't assume independence. Semi-supervised learning procedure: minimize training error using the transformed inputs, and regularize based on the amount of correlation.

A bound on squared error under CCA (main theorem of Kakade & Foster 2007): the expected error of the learned, transformed predictor is bounded by the expected error of the best model, plus a term depending on how good a single view is compared to the joint model (the assumption), the number of training examples, and the amount of correlation.

When can structural learning break? Hard-to-define auxiliary problems. Dependency parsing: how to define auxiliary problems for an edge? MT alignment: how to define auxiliary problems for a pair of words?

Combining real-valued & binary features is also hard: the original features are high-dimensional and sparse while the learned features are low-dimensional and dense, which creates scaling and optimization difficulties.

Other work on structural learning. Scott Miller et al. (2004), "Name Tagging with Word Clusters and Discriminative Training": hierarchical clustering, not structural learning; the representation easily combines with binary features.

Rie Ando, Mark Dredze, and Tong Zhang (2005), "TREC 2005 Genomics Track Experiments at IBM Watson": applying structural learning to information retrieval. Ariadna Quattoni, Michael Collins, and Trevor Darrell (CVPR 2007), "Learning Visual Representations using Images with Captions".

SSL summary. Bootstrapping: easy to write down, hard to analyze. Graph-based regularization: works best when the graph encodes information not easily represented in normal feature vectors.

Structural learning: with good auxiliary problems, it can improve even with lots of training data, but it is difficult to combine with standard feature vectors.

Two take-away messages: 1) Semi-supervised learning yields good results for small amounts of labeled data. 2) "I have lots of labeled data" is not an excuse not to use semi-supervised techniques.

http://ssl-acl08.wikidot.com
