Part of Speech Tagging with MaxEnt Re-ranked Hidden Markov Model

Brian Highfill

Part of Speech Tagging
  • Train a model on a set of hand-tagged sentences
  • Find the best sequence of POS tags for a new sentence
  • Generative models: Hidden Markov Model (HMM)
  • Discriminative models: Maximum Entropy Markov Model (MEMM)

Brown Corpus
  • ~57,000 tagged sentences
  • 87 tags (reduced to 45 for Penn TreeBank tagging)
  • ~300 tags including compound tags
  • Example: that_DT fire's_NN+BEZ too_QL big_JJ ._.
  • Compound tag: fire's = fire_NN + is_BEZ

Hidden Markov Models
  • Example tagging: that = DT (singular determiner), fire's = NN+BEZ (common noun + is), too = QL (qualifier), big = JJ (adjective)
  • Set of hidden states (POS tags)
  • Set of observations (word tokens)
  • Each observation depends ONLY on the current tag

HMM parameters
  • Transition probabilities: P(t_i | t_0 ... t_{i-1}) = P(t_i | t_{i-1})
  • Observation probabilities: P(w_i | t_0 ... t_n, w_0 ... w_n) = P(w_i | t_i)
  • Initial tag distribution: P(t_0)

HMM Best Tag Sequence
  • For an HMM, the Viterbi algorithm finds the most probable tagging for a new sentence
  • For re-ranking later, we want not just the single best tagging but the k best taggings for each sentence
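The three HMM parameter sets and the Viterbi search described above can be sketched as follows. This is a minimal illustration and not the presenter's implementation: the function names `train_hmm` and `viterbi`, the unsmoothed relative-frequency estimates, and the two-sentence toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Estimate P(t_0), P(t_i | t_{i-1}) and P(w_i | t_i) by relative frequency (no smoothing)."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            emit[tag][word] += 1
            if prev is None:
                init[tag] += 1          # first tag of the sentence
            else:
                trans[prev][tag] += 1   # tag bigram
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (normalize(init),
            {t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

def logp(p):
    """Log probability, mapping zero to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words, init, trans, emit):
    """Return the single most probable tag sequence for words."""
    tags = list(emit)
    # best[i][t] = (log prob of the best path ending in tag t at word i, previous tag)
    best = [{t: (logp(init.get(t, 0)) + logp(emit[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = logp(emit[t].get(words[i], 0))
            column[t] = max(
                (best[i - 1][p][0] + logp(trans.get(p, {}).get(t, 0)) + e, p)
                for p in tags)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

# Toy corpus (hypothetical, far smaller than the Brown Corpus).
corpus = [
    [("that", "DT"), ("fire", "NN"), ("is", "BEZ"), ("big", "JJ")],
    [("that", "DT"), ("dog", "NN"), ("is", "BEZ"), ("too", "QL"), ("big", "JJ")],
]
init, trans, emit = train_hmm(corpus)
tagging = viterbi(["that", "dog", "is", "big"], init, trans, emit)
```

A real tagger would need smoothing or an unknown-word model, since any unseen word receives zero probability under these estimates.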

HMM Beam Search
  • Step 1: Enumerate all possible tags for the first word
  • Step 2: Evaluate each tagging using the trained HMM and keep only the k best one-word taggings
  • Step 3: For each of the k taggings from the previous step, enumerate all possible tags for the second word
  • Step 4: Evaluate each two-word tagging and discard all but the k best
  • Repeat for all words in the sentence

MaxEnt Re-ranking
  • After beam search, we have the k best taggings for our sentence
  • Use a trained MaxEnt model to select the most probable sequence of tags
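The beam search steps and the re-ranking step can be sketched together as below. This is only a sketch under stated assumptions: `beam_search`, `rerank`, and `toy_score` are hypothetical names, the HMM parameters are hand-written toy values, and `toy_score` is a stand-in for a trained MaxEnt model, which the slides do not specify in detail.

```python
import math

def logp(p):
    """Log probability, mapping zero to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def beam_search(words, init, trans, emit, k):
    """Return up to k (log_prob, tag_sequence) pairs, best first."""
    tags = list(emit)
    # Steps 1-2: score every tag for the first word and keep the k best.
    beam = [(logp(init.get(t, 0)) + logp(emit[t].get(words[0], 0)), [t])
            for t in tags]
    beam = sorted(beam, key=lambda h: h[0], reverse=True)[:k]
    for word in words[1:]:
        # Step 3: extend each surviving tagging with every possible tag.
        candidates = [
            (score
             + logp(trans.get(seq[-1], {}).get(t, 0))
             + logp(emit[t].get(word, 0)),
             seq + [t])
            for score, seq in beam
            for t in tags
        ]
        # Step 4: discard all but the k best extended taggings.
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:k]
    return beam

def rerank(words, hypotheses, score_fn):
    """Pick the tag sequence that the re-ranking model scores highest."""
    return max(hypotheses, key=lambda h: score_fn(words, h[1]))[1]

# Hand-written toy HMM parameters (hypothetical values).
init = {"DT": 0.8, "NN": 0.2}
trans = {"DT": {"NN": 1.0}, "NN": {"VBZ": 0.6, "NN": 0.4}}
emit = {"DT": {"that": 1.0},
        "NN": {"fire": 0.6, "burns": 0.4},
        "VBZ": {"burns": 1.0}}

words = ["that", "fire", "burns"]
k_best = beam_search(words, init, trans, emit, k=2)

def toy_score(words, tags):
    """Stand-in for a trained MaxEnt scorer: prefers taggings that end in NN."""
    return 1.0 if tags[-1] == "NN" else 0.0

chosen = rerank(words, k_best, toy_score)
```

Here the HMM ranks DT NN VBZ first, while the stand-in scorer selects the second hypothesis, which shows how re-ranking can overturn the HMM's top choice.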

Results
Feature
  • Current word
  • Previous tag
  • Word contains a numeral
  • Word ends in -ing, -ness, -ity, -ed, -able, -s, -ion, -al, -ive, -ly
  • Word is capitalized
  • Word is hyphenated
  • Word is all uppercase
  • Word is all uppercase with a numeral
  • Word is capitalized and a word ending in Co. or Inc. is found within 3 words ahead

Results
Accuracy
  • Baseline most frequent class tagger: 73.41% (24%)
  • HMM Viterbi tagger: 92.96% (32.76% on )

[Figure: accuracy plotted against Beam Search Width (K), K = 1 to 5; one axis is labeled Known Word Accuracy]
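The feature list above could be computed per token roughly as follows. This is a sketch, not the presenter's code: the function name `extract_features`, the feature strings, and the exact reading of each feature (for example, treating "all uppercase" as Python's `str.isupper()`) are assumptions.

```python
SUFFIXES = ("ing", "ness", "ity", "ed", "able", "s", "ion", "al", "ive", "ly")

def extract_features(words, i, prev_tag):
    """Return the set of active binary features for the word at position i."""
    word = words[i]
    has_numeral = any(ch.isdigit() for ch in word)
    capitalized = word[:1].isupper()

    features = {f"word={word}", f"prev_tag={prev_tag}"}
    if has_numeral:
        features.add("has_numeral")
    for suffix in SUFFIXES:
        if word.lower().endswith(suffix):
            features.add(f"suffix=-{suffix}")
    if capitalized:
        features.add("capitalized")
    if "-" in word:
        features.add("hyphenated")
    if word.isupper():
        features.add("all_upper")
        if has_numeral:
            features.add("all_upper_with_numeral")
    # Capitalized word with "Co." or "Inc." among the next three tokens.
    if capitalized and any(w.endswith(("Co.", "Inc.")) for w in words[i + 1:i + 4]):
        features.add("capitalized_near_company")
    return features

sentence = ["Acme", "Widget", "Co.", "is", "hiring"]
first = extract_features(sentence, 0, "<s>")
last = extract_features(sentence, 4, "BEZ")
```

A MaxEnt model would assign one weight per (feature, tag) pair; the start-of-sentence marker `<s>` used for the first word is also an assumption.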
