Classifying Noun Countability Using Google Ngrams

Classifying Noun Countability Using Google Ngrams

N-gram Search Engine on Wikipedia Satoshi Sekine (NYU) Kapil Dalwani (JHU) July 30th, 2009 Lexical Knowledge from Ngrams 1 Hammer : Fast and multifunctional n-gram search engine Search ngram: FAST ng r am s INPUT: token, POS, chunk, NE July 30th, 2009 OUTPUT: frequency to text

Lexical Knowledge from Ngrams 22 Characteristics Search up to 7 grams with wildcards Multi-level input Token, POS, chunk, NE, combinations NOT, OR for POS, chunk, NE Multi-level output Token, POS, chunk, NE document information Original sentences, KWIC, ngram Display Show the results in the order of frequency Running Environment Single CPU, PC-Linux, 400MB process, 500GB disk July 30th, 2009 Lexical Knowledge from Ngrams 33

Demo http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2 July 30th, 2009 Lexical Knowledge from Ngrams 4 Available for you Web system At NYU http://nlp.cs.nyu.edu/nsearch At JHU? USB Hard drive July 30th, 2009 Lexical Knowledge from Ngrams 5 Implementation: Overview 3. Display

2. Filtering 1. Search candidates N-gram data Search request Inverted index for n-gram data Suffix array for text Wikipedia text POS, chunk, NE for N-gram data Wikipedia POS, chunk, NE July 30th, 2009 Lexical Knowledge from Ngrams 6 Implementation: Overview

1. Search candidates N-gram data Search request Inverted index for n-gram data Suffix array for text Wikipedia text POS, chunk, NE for N-gram data Wikipedia POS, chunk, NE July 30th, 2009 Lexical Knowledge from Ngrams 7 From n-gram to Inverted Index Example: 3-grams Ngram ID

Position=1 Position=2 Position=3 1 A B C 2 A B B 3 B A C

Posting list A pos=1 1 A pos=2 3 B pos=1 3 B pos=2 1 B pos=3 2 C pos=3 1 July 30th, 2009 2

2 3 Lexical Knowledge from Ngrams 8 Posting list Wide variation of posting list size (in 7-gram: 1.27B) #EOS# (100,906,888), , (55,644,989), the (33,762,672) conscipcuous, consiety, Mizuk, (1) 3 types for faster speed and smaller index size Bitmap (freq >1%) 1 0 0 :#EOS# 1.27B bits (bitmap) <-> 3.2B bits (list) 0 1 1

0 1 0 0 0 0 1 0 0 1 List of ngramID C pos=3 1 3 Encoded into pointer (freq=1) C pos=3 5

July 30th, 2009 Lexical Knowledge from Ngrams 9 Search Given an n-gram request (A B C) Get posting lists for A, B and C Search intersections of posting lists Use look ahead to speed up the search Look ahead size = Sqrt(size of posting list) Moffat and Zobel (1996) 4 33 34 55 76 80 89 92 99 4 SKIP 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98 July 30th, 2009 Lexical Knowledge from Ngrams

10 Implementation: Overview 2. Filtering 1 Search candidates . Search request N-gram data Inverted index for n-gram data Suffix array for text Wikipedia text POS, chunk, NE for N-gram data Wikipedia POS, chunk, NE July 30th, 2009 Lexical Knowledge from Ngrams

11 Filtering Not all candidate ngramIDs match the request A Freq=123 NN VB Freq=5 Freq=10 B PERSON LOC We need frequency, sentence information to matched n-grams POS, chunk and NE information is presented as ID Reduce the index more than 200GB July 30th, 2009 Lexical Knowledge from Ngrams

12 Implementation: Overview 3. Display 2. Filtering 1. Search candidates N-gram data Search request Inverted index for n-gram data Suffix array for text Wikipedia text POS, chunk, NE for N-gram data Wikipedia POS, chunk, NE July 30th, 2009 Lexical Knowledge from Ngrams

13 Display N-gram will be displayed in the descending order of frequency N-gram ID is ordered by the frequency Sentences are searched using suffix array POS, chunk, NE are displayed with sentence, KWIC, ngram Doc ID, title of Wikipedia (and possible features of doc) is displayed with sentences and KWIC July 30th, 2009 Lexical Knowledge from Ngrams 14 Size of data Text 1.7 G words 200M sentences 2.4M articles Total 530GB

108 GB 8 GB 260 GB Suffix array For text N-gram data 8 GB Ngram 1: 8M 2: 93M 3: 377M 4: 733M 5: 1.00B 6: 1.17B 7: 1.27B Inverted index for n-gram data 40 GB 100 GB POS, chunk, NE for N-gram data

Others July 30th, 2009 Lexical Knowledge from Ngrams Wikipedia text 6 GB Wikipedia POS, chunk, NE 15 Future Work Other information (ex: parse, coref, relation, genre, discourse) Longer n-gram Compress index, dictionary Ease the indexing load Now we need a big memory machine Distributing indexing Union operation for tokens July 30th, 2009 Lexical Knowledge from

Ngrams 16 Available for you Web demo At NYU http://nlp.cs.nyu.edu/nsearch At JHU? USB Hard drive July 30th, 2009 Lexical Knowledge from Ngrams 17

Recently Viewed Presentations

  • Europe and Modernity 1750-1850 - University of Warwick

    Europe and Modernity 1750-1850 - University of Warwick

    Mark Philp. The quarrel between the ancients and the moderns. ... Richard Price: A Discourse on the Love of our Country (1790) After sharing in the benefits of one Revolution, I have been spared to be a witness to two...
  • New Jersey Center for Teaching and Learning Progressive

    New Jersey Center for Teaching and Learning Progressive

    A sample is considered random (or unbiased) when every possible sample of the same size has an equal chance of being selected. If a sample is biased, then information obtained from it may not be reliable. Example: To find out...
  • Cross-Layer Adaptation for QualityAware and Energy-Efficient Next Generation

    Cross-Layer Adaptation for QualityAware and Energy-Efficient Next Generation

    Histogram for probability distribution Group profiled cycles Use profiling window of n jobs with cycles [Cmin, Cmax] Partition profiling window into r equal-sized groups (Cmin = b0 < b1 <…<br=Cmax) Let ni be number of cycle usage that falls into...
  • Folie 1 - PROJECT CONSULT

    Folie 1 - PROJECT CONSULT

    IT Trends 201x: Gartner. By 2012, 20 percent of businesses will own no IT assets. By 2012, India-centric IT services companies will represent 20 percent of the leading cloud aggregators in the market (through cloud service offerings).
  • Paragraph Development - Winthrop

    Paragraph Development - Winthrop

    The third feature is the Big Old Tree. This tree stands two hundred feet tall and is probably about six hundred years old. These three landmarks are awe-inspiring, thus making my hometown the famous place that it is. *Now contains...
  • Government Blockchain Association (GBA) Healthcare Specialist ...

    Government Blockchain Association (GBA) Healthcare Specialist ...

    Blockchain Emerging as a Leader. PwC Global 2018 survey of . 600 executives from 15 territories. 84% say their organizations are involved with blockchain
  • Presented at EDAMBA summer school, Sorze (France) 23

    Presented at EDAMBA summer school, Sorze (France) 23

    Data Sourcing, Statistical Processing and Time Series Analysis. Presented at EDAMBA summer school, Soréze (France) 23 July - 27 July 2009. An Example from Research into Hedge Fund Investments
  • Présentation PowerPoint - doc-developpement-durable.org

    Présentation PowerPoint - doc-developpement-durable.org

    Projet d'écovillage pilote 2. Culture et agriculture biologiques. Les bases de l'agriculture biologique sont : l' amélioration de la fertilité du sol . par des moyens naturels (le semi direct, le compostage, l'agroforesterie, os, sang, cendre, mycorhizes …).