Classifying Noun Countability Using Google Ngrams

N-gram Search Engine on Wikipedia
Satoshi Sekine (NYU), Kapil Dalwani (JHU)
July 30th, 2009
Lexical Knowledge from Ngrams

Hammer: Fast and multifunctional n-gram search engine
  Search ngram: FAST
  INPUT: token, POS, chunk, NE
  OUTPUT: frequency to text

Characteristics
  Search up to 7-grams with wildcards
  Multi-level input
    Token, POS, chunk, NE, combinations
    NOT, OR for POS, chunk, NE
  Multi-level output
    Token, POS, chunk, NE
    Document information
    Original sentences, KWIC, ngram
  Display
    Show the results in the order of frequency
  Running environment
    Single CPU, PC-Linux, 400MB process, 500GB disk

Demo
  http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2

Available for you
  Web system
    At NYU: http://nlp.cs.nyu.edu/nsearch
    At JHU?
  USB hard drive

Implementation: Overview
  Steps
    1. Search candidates
    2. Filtering
    3. Display
  Components
    Search request
    N-gram data
    Inverted index for n-gram data
    POS, chunk, NE for N-gram data
    Suffix array for text
    Wikipedia text
    Wikipedia POS, chunk, NE

Implementation: Overview
  1. Search candidates

From n-gram to Inverted Index
  Example: 3-grams

  Ngram ID  Position=1  Position=2  Position=3
  1         A           B           C
  2         A           B           B
  3         B           A           C

  Posting list
  A pos=1   1 2
  A pos=2   3
  B pos=1   3
  B pos=2   1 2
  B pos=3   2
  C pos=3   1 3

Posting list
  Wide variation of posting list size (in 7-gram: 1.27B)
    #EOS# (100,906,888), "," (55,644,989), the (33,762,672)
    conscipcuous, consiety, Mizuk (1)
  3 types for faster speed and smaller index size
    Bitmap (freq > 1%)
      #EOS#: 1.27B bits (bitmap) <-> 3.2B bits (list)
    List of ngramID
      C pos=3: 1 3
    Encoded into pointer (freq=1)
      C pos=3: 5
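The positional posting lists and the three storage types can be illustrated with a minimal Python sketch. It is not the Hammer implementation: the function names are hypothetical, and the 32-bit width of a stored n-gram ID is an assumption, chosen because it reproduces the slide's 3.2B-bit figure for #EOS#.

```python
from collections import defaultdict

# The three example 3-grams from the slide, keyed by n-gram ID.
NGRAMS = {1: ("A", "B", "C"), 2: ("A", "B", "B"), 3: ("B", "A", "C")}


def build_positional_index(ngrams):
    """Map (token, position) to the sorted list of n-gram IDs containing it."""
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):
        for position, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, position)].append(ngram_id)
    return dict(index)


def choose_representation(freq, total_ngrams):
    """Pick one of the three posting-list types named on the slide."""
    if freq == 1:
        return "pointer"  # the single n-gram ID is stored in the pointer itself
    if freq > 0.01 * total_ngrams:
        return "bitmap"   # one bit per n-gram in the collection
    return "list"         # explicit list of n-gram IDs


index = build_positional_index(NGRAMS)
for key in sorted(index):
    print(key, index[key])

# Size comparison for #EOS# in the 7-gram index.
TOTAL_7GRAMS = 1_270_000_000   # 1.27B 7-grams, so a bitmap needs 1.27B bits
EOS_FREQ = 100_906_888         # posting-list length of #EOS#
ID_BITS = 32                   # assumed width of one stored n-gram ID
print("representation:", choose_representation(EOS_FREQ, TOTAL_7GRAMS))
print("bitmap bits:", TOTAL_7GRAMS)
print("list bits:", EOS_FREQ * ID_BITS)
```

Under the 32-bit assumption, the explicit list for #EOS# needs about 3.2B bits against 1.27B bits for the bitmap, which matches the comparison on the slide.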

Search
  Given an n-gram request (A B C)
    Get posting lists for A, B and C
    Search intersections of posting lists
  Use look ahead to speed up the search
    Look ahead size = Sqrt(size of posting list)
    Moffat and Zobel (1996)
  Example
    4 33 34 55 76 80 89 92 99
    4 SKIP 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
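The intersection step with look ahead can be sketched as follows. This is a simplified illustration in the spirit of the skipping idea the slide cites, not the engine's actual code: the function names are hypothetical, and the skips are computed on the fly over a plain Python list rather than stored in the index.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted ID lists, looking ahead in the longer one."""
    if len(a) > len(b):
        a, b = b, a
    skip = max(1, math.isqrt(len(b)))  # look ahead size = sqrt(list size)
    result = []
    j = 0
    for target in a:
        # Jump a whole block while the entry one skip ahead is not past target.
        while j + skip < len(b) and b[j + skip] <= target:
            j += skip
        # Finish with a short linear scan inside the block.
        while j < len(b) and b[j] < target:
            j += 1
        if j == len(b):
            break
        if b[j] == target:
            result.append(target)
    return result


def intersect_all(posting_lists):
    """Intersect several posting lists, starting from the shortest."""
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_with_skips(result, other)
    return result


# The two example lists from the slide.
short_list = [4, 33, 34, 55, 76, 80, 89, 92, 99]
long_list = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
print(intersect_with_skips(short_list, long_list))

# Request (A B C) on the 3-gram example: A pos=1, B pos=2, C pos=3.
print(intersect_all([[1, 2], [1, 2], [1, 3]]))
```

With 16 entries in the longer list the look ahead size is 4, so the search for 33 jumps from 4 directly to 22 before scanning forward.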

Implementation: Overview
  2. Filtering

Filtering
  Not all candidate ngramIDs match the request
    Example figure labels: A, Freq=123, NN, VB, Freq=5, Freq=10, B, PERSON, LOC
  We need frequency and sentence information for matched n-grams
  POS, chunk and NE information is represented as IDs
    Reduce the index by more than 200GB
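A rough sketch of the filtering step, under stated assumptions: the slide does not describe the record layout, so the `Variant` class, its fields, and the toy frequencies below are hypothetical. The idea shown is only that candidates found through the token index are checked against the POS constraints of the request, with a set of allowed tags standing in for OR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One annotated reading of a candidate n-gram (hypothetical layout)."""
    ngram_id: int
    pos_tags: tuple
    freq: int


def filter_candidates(variants, pos_request):
    """Keep variants whose POS tags satisfy the request.

    pos_request has one entry per position: None for no constraint,
    or a set of allowed tags (any member matches, i.e. OR).
    """
    kept = []
    for variant in variants:
        if len(variant.pos_tags) != len(pos_request):
            continue
        if all(allowed is None or tag in allowed
               for tag, allowed in zip(variant.pos_tags, pos_request)):
            kept.append(variant)
    # Most frequent first, matching the display order of the engine.
    return sorted(kept, key=lambda v: v.freq, reverse=True)


# Toy candidates with made-up frequencies, for illustration only.
candidates = [
    Variant(ngram_id=1, pos_tags=("NN", "VB"), freq=5),
    Variant(ngram_id=1, pos_tags=("VB", "NN"), freq=10),
    Variant(ngram_id=2, pos_tags=("NN", "NN"), freq=3),
]
matches = filter_candidates(candidates, [{"NN"}, None])
for match in matches:
    print(match)
```

Chunk and NE constraints could be handled the same way by adding parallel tag tuples to the record.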

Implementation: Overview
  3. Display

Display
  N-grams are displayed in descending order of frequency
    N-gram IDs are ordered by frequency
  Sentences are searched using the suffix array
  POS, chunk, NE are displayed with sentence, KWIC, ngram
  Doc ID and Wikipedia title (and possibly other document features) are displayed with sentences and KWIC

Size of data
  Text: 1.7 G words, 200M sentences, 2.4M articles
  Total: 530GB
  Components: Suffix array for text; N-gram data; Inverted index for n-gram data; POS, chunk, NE for N-gram data; Wikipedia text; Wikipedia POS, chunk, NE; Others
  Component sizes (chart labels, in slide order): 108 GB, 8 GB, 260 GB, 8 GB, 40 GB, 100 GB, 6 GB
  Ngram counts: 1: 8M, 2: 93M, 3: 377M, 4: 733M, 5: 1.00B, 6: 1.17B, 7: 1.27B

Future Work
  Other information (ex: parse, coref, relation, genre, discourse)
  Longer n-gram
  Compress index, dictionary
  Ease the indexing load
    Now we need a big memory machine
    Distributing indexing
  Union operation for tokens

Available for you
  Web demo
    At NYU: http://nlp.cs.nyu.edu/nsearch
    At JHU?
  USB hard drive
