Scaleable Entity Resolution

Scaleable Entity Resolution

Scaleable Entity Resolution Kunho Kim Why We Need Entity Resolution? Why We Need Entity Resolution? Why We Need Entity Resolution? Why We Need Entity Resolution? Why We Need Entity Resolution? Entity Resolution (ER) Problem of identifying, matching, and grouping same name entities from a single collection or multiple ones in data Why is it important? Real world databases are often made up of data from multiple sources Unique identifier does not always exist Example

Finding a persons medical records from multiple hospital records Matching same products for a comparative online shopping service Entity Resolution (ER) Disambiguation Let Di={pi1, pi2, , pin} be a database Di, where each record pim has set of values of the attribute set Ti = {ti1, ti2, , tik}. Find an assignment function :DDi Ei where Ei is the set of real name entity, and (ppix) = (ppiy) if and only if pix and piy refer to the same entity Record linkage Let Dj={pj1, pj2, , pjm} is another database with the attribute set Tj = {tj1, tj2, , tjl} Find a matching function :DDiDj {0,1} where (ppix, pjy) = 1 if matches and 0 if not matches, and the result set R = {(ppix, pjy) | (ppix, pjy) = 1 ER in Scholarly Databases Several problems to solve Author name disambiguation Publication, author profile record linkage Serves as an important preprocessing step

Process author related queries Better biblometric analysis Scalability is the key issue Challenges Scalability is the key issue time complexity is O(n2) Database # of papers # of author mentions # of author mentions / papers Process time CiteSeerX 10,130,097

31,996,749 3.159 1 week Pubmed 24,358,073 87,976,808 3.612 4 weeks Web of Science 45,261,744 162,569,706

3.592 16 weeks (estimate) 25,000,000 20,000,000 15,000,000 10,000,000 Total # of papers in Pubmed (cumulative) 5,000,000 - 19501955196019651970197519801985199019952000200520102015 Entity Resolution Pipeline Database 1 Database

Preprocessing Preprocessing Blocking (Indexing) Blocks Pairwise Classification Blocking (Indexing) Blocks Preprocessing Clustering Database 2

Disambiguation Record Linkage Pairwise Classification Step 1: Preprocessing Normalize the representation of each attribute Example:D & and, Dr Doctor Parse some attributes if necessary Example:D Full name First + middle + last name Remove punctiations, diacritics Domain specific, differs from database to database Typically done with lookup tables and regular expressions Step 2: Blocking Essential step to make the algorithm scale Separates all data into small blocks with blocking key(ps)

Robert B Schwartz Simon D Schwartz Sherri M Schwartz Seth Schwartz R + Schwartz S + Schwartz Simon F Schwartz Tony Schwartz T + Schwartz

Step 3: Pairwise Classification Classifies whether each pair of records is from the same entity or not Classification features Set of string distances for record attributes (pexact, Jaccard, edit, Jaro-Winkler, Soundex etc.) Classification methods Rule-based heuristics Using machine learning classifiers Nave Bayes and support vector machine(pSVM) [Han et al. 2004] Online active SVM [Huang et al. 06] pLSA and LDA [Song et al. 07] Random Forest [Treeratpituk and Giles 09] [Khabsa et al. 2015] [Kim et al. 2016] Graphical Approaches [Bhattacharya and Getoor 07][Fan et al. 11][Hermansson et al. 13] Limitations on scalability

Improvement with user feedback [Godoi et al. 13] String Distance Metrics Jaccard :D Edit :D minimum number of operations required to transform S1 to S2 Jaro-Winkler Jaro Distance nm: number of matches (chars no farther than half length of longer string -1) nt : number of transpositions Jaro-Winkler Distance lprefix : length of common prefix (up to 4 chars) p : scale factor(0.1) String Distance Metrics Soundex :D give credit for phonetically similar strings Retain first letter Remove a, e, i, o, u, y, h, w Replace similar consonants to a same number (puse at most 3)

b, f, p, v 1 c, g, j, k, q, s, x, z 2 d, t 3 l4 m, n 5 r6 String Distance Metrics Step 4: Clustering Form clustered entities based on the pairwise classification results Clustering Methods

Agglomerative (pHierarchical) clustering [Mann and Yarowsky 03] K-spectral clustering [Han et al. 05] Density based clustering (pDBSCAN) [Huang et al. 06] [Khabsa et al. 15] Graphical approach using Markov Chain Monte Carlo(pMCMC) [Wick et al. 12] Applications Inventor Name Disambiguation [JCDL 16] [IJCAI-SBD 16] Financial Entity Record Linkage [SIGMOD-DSMM 16] Inventor Name Disambiguation Name Charles P. Spaulding Charles Spaulding Charles D. Spaulding Carl P. Spaulding Charles A. Spaulding Charles Anthony Spaulding Carl P. Spaulding Charles Spaulding

Name Title Title High temperature end fitting Tub Call center monitoring system Absolute encoder using multiphase analog signals Shower surround Tub/shower surround Charles A. Spaulding Shower surround Charles Anthony Spaulding Tub/shower surround

Charles Spaulding Tub Name Charles P. Spaulding Charles Spaulding Title High temperature end fitting End fitting for hoses Incremental encoder End fitting for hoses Name Carl P. Spaulding Title Absolute encoder using multiphase analog signals

Incremental encoder Carl P. Spaulding USPTO PatentsView Patent search tool serviced by United States Patent and Trademark Office(pUSPTO) Inventor name disambiguation challenge in 2015 to disambiguate all inventor records Raw data is publicly available via the competitions web page http:D//www.dev.patentsview.org/workshop Raw data contains all published US patent grants from 1976 to 2014 Total 5.4M patents, 12.3M inventor mentions Overview of the Process Preprocessing :D remove punctuation, diacritics Blocking :D First name initial + Last Name Pairwise classification :D Random Forest Clustering :D DBSCAN Parallelization with GNU Parallel Feature set Category

Inventor Affiliation Co-author Assignee Group Title Subcategory First name Middle name Last name Suffix Order City State Country Last name Last name Group Subgroup

Title Features Exact, Jaro-Winkler, Soundex Exact, Jaro-Winkler, Soundex Exact, Jaro-Winkler, Soundex Exact Order comparision Exact, Jaro-Winkler, Soundex Exact Exact # of name shared, IDF, Jaccard Exact, Jaro-Winkler, Soundex Exact Exact # of term shared Pairwise Classifier Selection Pairwise classifier is trained to distinguish whether each pair of inventor records is the same person or not Tested supervised classifiers with proposed feature set Mixture of two training datasets

4-fold cross validation Method Precision Recall F1 Nave Bayes 0.9246 0.9527 0.9384 Logistic Regression 0.9481 0.9877

0.9470 SVM 0.9613 0.9958 0.9782 Decision Tree 0.9781 0.9798 0.9789 Conditional Inference Tree 0.9821

0.9879 0.9850 Random Forest 0.9839 0.9946 0.9892 Clustering: DBSCAN Randomly select a point, expand based on the density Clustering: DBSCAN Randomly select a point, expand based on the density Evaluation USPTO PatentView Competition 2 Training sets

Mixture :D random mixture of IS and E&S dataset Common Characteristics :D subsampled E&S according to match characteristics of the USPTO database 5 Test Sets ALS, ALS Common :D inventors from Association of Medical Colleges(pAAMC) faculty IS :D Israeli inventors E&S :D patent records of engineers and scientists Phase2 :D random mixtures of above Calculate pairwise precision/recall/F1 Intel Xeon [email protected], 12 cores, 40 GB memory Total process time :D 6.5 hours Pairwise Precision / Recall / F1 Results

Test Set ALS ALS Common IS E&S Phase2 Training Set Precisio n Recall F1 Mixture 0.9963

0.9790 0.9786 Common 0.9960 0.9848 0.9904 Mixture 0.9841 0.9796 0.9818 Common

0.9820 0.9916 Mixture 0.9989 Common Test Set F1(Ours) F1(Winner) ALS 0.9904 0.9879

0.9868 ALS Common 0.9868 0.9815 0.9813 0.9900 IS 0.9900 0.9783 0.9989 0.9813

0.9900 E&S 0.9902 0.9835 Mixture 0.9992 0.9805 0.9898 Phase2 0.9837 0.9826

Common 0.9995 0.9810 0.9902 Average(pstddev.) 0.98820.0029 0.98270.0035 Mixture 0.9912 0.9760 0.9836

Common 0.9916 0.9759 0.9837 Detailed results on 5 test sets Comparison with the winner P Value: 0.03125 Financial Entity Record Linkage Financial Entity Identification and Information Integration (pFEIII) Challenge Identify matching entities across two of the databases FFIEC :D Federal Financial Institution Examination Council LEI :D Legal Entity Identifiers SEC :D Securities and Exchange Commission Limited set of shared attributes Name, address, city, state, zip

Process Overview Database 1 Preprocessing Blocking (Indexing) Preprocessing Blocks Exact Match Yes Match Database 2 No Pairwise

Classification Yes Match No Not Match Preprocessing Different name and address formats among the data sources Apply rules below to clean names and addresses Each rule is applied using a regular expression Rule Remove dots Remove article Abbreviation to full form & and Remove postfix company Remove postfix /

Example U.S. Bank US Bank The First First Corp. Corporation B&W B and W Trust Company Trust Bank /TA Bank Rules to clean the entity name Rule Remove dots Unify direction term & and Abbreviation to full form Example P.O. Box PO Box N North

M&T M and T Rd Road Rules to clean the entity address Blocking Compare only record pairs that can potentially match Prefix articles, capitalized or not (pe.g. The, A, An, a) are ignored Heuristic:D First word of the entity name + state Name State Blocking Key Value The First Bank PA First_PA First State Bank IL

First_IL First Bank First_PA PA Features Use common attributes among financial databases to generate features Entity name, street address, city, state, zip code Category Features Name Jaro-Winkler, Jaccard Address

Jaro-Winkler, Jaccard City Jaro-Winkler, Exact State Exact Zip Exact Results Task FFIEC LEI FFIEC SEC

Training Set Precision Recall F1 LEI 99.16% 95.77% 97.44% LEI + SEC 97.71% 94.56%

96.11% SEC 87.84% 84.78% 86.28% LEI + SEC 86.78% 85.65% 86.21% Best:D 99.24% / 96.37% / 97.44% Best:D 92.82% / 85.65% / 88.38% Recent / On-going Researches

Improve blocking for better scalability [Kim et al. 17] Typical blocking on author name disambiguation uses simple heuristic (pFirst name initial + Last name) Can we train the blocking function given the labeled data? Approach Use sequential covering algorithm to train blocking function Can learn disjunctive normal form (pDNF) and conjunctive normal form (pCNF) with same algorithm, thanks to De Morgans Laws Showed CNF is better for blocking on AND problems empty values on many attributes Ensuring disjointness of each blocks with proposed disjoint CNF blocking Improve pairwise classification with deep neural networks Can we automatically learn feature representation instead of manual feature engineering? Improvement of the classification result itself PubmedSeer: Disambiguated Author Search on Pubmed A specialty search engine for disambiguated pubmed authors Author name disambiguation serves as an important pre-processing step for variety of problems Processing author-related query Calculating author-related statistics

Studying relationship between authors Researchers use Authority explorer (phttp:D//abel.lis.illinois.edu/cgi-bin/ exporter/search.pl) , which was built around 2010 Built with Elasticsearch + django PubmedSeer API : Author Name Disambiguation API for Pubmed Author name disambiguation (pAND) is studied for a long time (pTorvik and Smalheiser 2009, Levin et al. 2012, Liu et al. 2014, Khabsa et al. 2014, Kim et al. 2016) , but limited access of the disambiguated data is available for users Few publicly available codes :D Early version of CiteSeerX, AMiner Some scholarly search engines provide author search module CiteSeerX, Google Scholar, DBLP, Semantic Scholar, Limitation :D hard to extract data Some organizations are registering researchers and give unique ID ORCID :D 4.3M IDs, 1.7M with records High quality but not complete SCOPUS, ResearcherID PubmedSeer API : Author Name

Disambiguation API for Pubmed Goal :D Provide appropriate web service for users to use the disambiguated author data easily

Recently Viewed Presentations

  • HF for the Next Millennium - .\|/. H a y s e e d N e t w o r ...

    HF for the Next Millennium - .\|/. H a y s e e d N e t w o r ...

    HF for the Next Millennium ... you can see what changes are made with the Digital Twin PBT. You can also see what filter you have selected, and the mode you are using. ... Floating Point DSP Digital IF Filters...
  • Values, goals and well-being in emerging adulthood

    Values, goals and well-being in emerging adulthood

    Kathryn Williams and Joseph Ciarrochi. Some questions about values. Are some values 'healthier' than others? Idiosyncratic theories. Universal theories. Humanistic theories. What if some values aresubject to social pressure?
  • LSLAP Training:

    LSLAP Training:

    LSLAP Manual Chapter. from the LSLAP Website. The Interview:During the Interview. Take charge of the interview - begin by explaining how our program works and our limitations and requirements. Complete the forms - waiver/file review, statistics, copy ID, give client...
  • Writing Papers Lirong Xia Graduate Skills Seminar, 2019

    Writing Papers Lirong Xia Graduate Skills Seminar, 2019

    Writing is a skill you keep getting better at. The first draft of anything is sh*t---Ernest Hemingway. If you don't get accepted … DON'T GIVE UP. my first independent work was rejected for 7 times
  • Operational Risk Management

    Operational Risk Management

    Operational Risk Management for Airport Emergency Planning Martinez Jacobs Airports Fire Chief Risk Management Vision Create Airport Operational Environment in which every employee is trained and motivated to personally manage risk in their daily duties to ensure operational readiness Accept...
  • Building Buddies with a Bee Bop - flmusiced.org

    Building Buddies with a Bee Bop - flmusiced.org

    Building Buddies with a Bee Bop. Meeting the Needs of the Young Child With Autism Through Music. What is Autism Spectrum Disorder? As defined by Autism Speaks, Autism spectrum disorder (ASD) and autism are both general terms for a group...
  • Daoyin, Qigong and Healing - Chris Low

    Daoyin, Qigong and Healing - Chris Low

    Mythos & Logos. Before looking at these, I would ask you to compare information derived from the ECG traces of 2 study participants: Taken over the same time period, the average heartbeat interval [LOGOS] can be precisely determined for both....
  • Fully Alive! - Weebly

    Fully Alive! - Weebly

    Fully Alive - Grade 8. Theme One - Created and Loved by God. A Psalm is a sacred song or poem that praises God. Psalms are found in the . Old Testament