Information Extraction and Natural Language Processing ...

Keynote address: Stefan Schulz, Medical University of Graz (Austria), purl.org/steschu
Annotating clinical narratives with SNOMED CT: The thorny way towards interoperability of clinical routine data

"Classical" AI workflow
[Figure sequence: Data Acquisition, D, Representation, Reasoning, Output. Successive builds branch the workflow into two paths A and B: first at the reasoning step (Reasoning A / Output A, Reasoning B / Output B), then at the representation step (Representation A, Representation B), and finally at data acquisition (Data Acquisition A yielding DA, Data Acquisition B yielding DB).]

Data reliability / Data interoperability
[Figure: Data Acquisition A yields DA and Data Acquisition B yields DB. Reliability and interoperability are high when DA = DB and low when DA and DB differ.]
[Figure: the same scale for an unstructured representation (placeholder text) that Interpretation A and Interpretation B turn into the structured representations DA and DB: high when DA = DB, low when they differ.]

Focus of the talk
Structured extracts from unstructured clinical data: reliability and interoperability
- Empirical study on inter-annotator agreement
- Analysis of examples for inter-annotator disagreement
- Mechanisms to improve agreement
- better data reliability
- better interoperability
- better training data
- better gold standards

Annotating clinical narratives with SNOMED CT

[Figure: coding vs. annotation. Coding: observed phenomena (configurations) are mapped to a vocabulary, with metadata. Annotation: a symbolic representation (placeholder text) is mapped to vocabulary symbols (configurations), with metadata.]

SNOMED CT
- Huge clinical reference terminology
- representable as OWL EL
- (quasi-)ontological definitional and qualifying axioms
- eHealth standard, maintained by transnational SDO
- multiple hierarchies
- ~300,000 "concepts"
- preferred terms and synonyms in several languages
- covers disorders, procedures, body parts, substances, devices, organisms, qualities

Annotation: Sources of complexity
Clinical narrative
- sequence of tokens
- syntactic structures
- relations at various levels
- Compactness
- Agrammaticality
- Short forms
- Implicit contexts
Map
- best text span to annotate?
- Naive or analytic annotation?
SNOMED CT
Ontology
- entities, codes
- relations
- logical constructors
- axioms
Terminology
- preferred terms
- synonyms
- definitions
- Ill-defined concepts
- Similar concepts
- Pre-coordination vs. postcoordination
- Complex annotations (> 1 concept)
- Degree of formality?

Examples
Clinical text: "... the duodenum. The mucosa is ..."
SNOMED CT concepts (FSNs): 'Duodenal structure (body structure)'; for the mucosa, 'Duodenal mucous membrane structure (body structure)' or 'Mucous membrane structure (body structure)'?

Clinical text: "Hemorrhagic shock after RTA"
SNOMED CT concepts (FSNs): 'Traffic accident on public road (event)'? 'Traffic accident on public road (event)', 'Renal tubular acidosis (disorder)'? 'Traffic accident on public road (event)' or 'Renal tubular acidosis (disorder)'?

Clinical text: "travel history of", "suspected dengue"
SNOMED CT concepts (FSNs): 'Suspected dengue (situation)', or 'Suspected (qualifier value)' with 'Dengue (disorder)'?

Coding / Annotation guidelines
Examples:
1. German coding guidelines for ICD and OPS: 171 pages
   http://www.dkgev.de/media/file/21502.Deutsche_Kodierrichtlinien_Version_2016.pdf
2. Using SNOMED CT in CDA models: 147 pages
   http://www.snomed.org/resource/resource/249
3. CHEMDNER-patents: annotation of chemical entities in patent corpus: annotation manual, 30 pages
   http://www.biocreative.org/media/store/files/2015/cemp_patent_guidelines_v1.pdf
4. CRAFT Concept Annotation guidelines: 47 pages
   http://bionlp-corpora.sourceforge.net/CRAFT/guidelines/CRAFT_concept_annotation_guidelines.pdf
5. Gene Ontology Annotation conventions: 7 pages
   http://geneontology.org/page/go-annotation-conventions
Complex rule sets, requiring intensive training

Annotation experiments in ASSESS-CT

ASSESS-CT
- EU project on the fitness of purpose of SNOMED CT as a core reference terminology for the EU: www.assess-ct.eu
- Feb 2015 to Jul 2016
- Scrutinising clinical, technical, financial, and organisational aspects of reference terminology introduction
- Summary of results: brochure published, scientific papers to appear
  http://assess-ct.eu/fileadmin/assess_ct/final_brochure/assessct_final_brochure.pdf

Annotation of clinical narratives
- Comparing SNOMED CT vs. UMLS derived terminology
Resources
- Parallel corpus: 60 clinical text snippets from 6 languages, high diversity
- For each language: 2 annotators * 40 samples, 20 snippets annotated twice
- Annotators trained by webinars, follow annotation guideline (10 pages), e.g.
- chunking into noun phrases
- annotation of chunks by sets of codes
- give preference to maximally pre-coordinated codes
- understanding text and assign maximally specific codes

Text sample:
Nitroglycerin pump spray as required
Amantadine bds
Allopurinol 300 tablet every other day (last dose on 20091130)
Mefenamic acid 500 mg up to 3x daily for pain in conjunction with simultaneous administration of a drug to protect the stomach e. g. Pantoprazole 40mg.
Torasemide bds
Melperone 50 mg p. m.
Annotation (code sets in slide order):
387404004;385074009;225761000
372763006;229799001
387135004;385055001;225760004
260528009
387185008;258684004;229798009;22253000
79970003;416118004;373517009;69695003
395821003;258684004
318034005;229799001
442519006;258684004;422133006

Text sample:
Intact teeth are in the mouth. Fractures are visible on the medians of Mandible and Maxilla the fragments are dislocated. Normal mucous membranes in mouth pharynx and on the larynx. Hyoid and thyroid cartilage are intact. Fragmental fractures of the two upper vertebrae of the cervical spine. Otherwise the cervical spine is intact. Oesophagus as well as trachea are torn at the lower end of the neck.
Annotation (code sets in slide order):
11163003;245543004;123851003
263172003;263156006;123735002
17621005;33044003;71248005
21387005;52940008;11163003
13321001;207984009;207983003
122494005;11163003
262793000;282459005;261122009;123958008

Principal quantitative results (English)
Text annotations, English; values with [95% CI]

                                                      SNOMED CT        Alternative
Concept coverage                                      .86 [.82-.88]    .88 [.86-.91]
Term coverage                                         .68 [.64-.70]    .73 [.69-.76]
Inter annotator agreement (Krippendorff's Alpha)      .37 [.33-.41]    .36 [.32-.40]
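
The agreement figures above are Krippendorff's Alpha values. How ASSESS-CT set up the computation (units, distance metric, handling of multi-code annotations) is not stated on the slides, so the following is only a minimal sketch under assumptions: each text chunk is one unit, each annotator's whole code set for a chunk is treated as a single nominal label, and chunks annotated by only one coder are ignored. The function name and the toy data are illustrative and do not come from the study.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per unit (here: text chunk) holding the label each coder
    assigned; None means the coder did not annotate that unit.
    """
    # Coincidence matrix: every ordered pair of labels given to the same unit
    # by different coders contributes 1 / (m - 1), m = labels on that unit.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two labels is not pairable
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one pairable unit")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


# Toy data (assumption, not study data): [annotator 1, annotator 2] per chunk,
# each label being the set of FSNs chosen for that chunk.
toy_units = [
    [frozenset({"Duodenal structure (body structure)"}),
     frozenset({"Duodenal structure (body structure)"})],
    [frozenset({"Duodenal mucous membrane structure (body structure)"}),
     frozenset({"Mucous membrane structure (body structure)"})],
    [frozenset({"Traffic accident on public road (event)"}),
     frozenset({"Traffic accident on public road (event)"})],
    [frozenset({"Suspected dengue (situation)"}),
     frozenset({"Suspected (qualifier value)", "Dengue (disorder)"})],
    [frozenset({"Renal tubular acidosis (disorder)"}), None],
]
print(round(krippendorff_alpha_nominal(toy_units), 3))
```

Treating a whole code set as one label is strict: a pre-coordinated code and its post-coordinated equivalent count as full disagreement. A set-based distance between code sets would be a softer alternative, at the cost of a further modelling assumption.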

Krippendorff, Klaus (2013). Content analysis: An introduction to its methodology, 3rd edition. Thousand Oaks, CA: Sage.

Agreement map: text annotations (English)
[Figure: agreement maps for SNOMED CT and the UMLS subset. Green: agreement; yellow: only annotated by one coder; red: disagreement.]

Systematic error analysis
- Creation of gold standard for SNOMED CT
- 20 English text samples annotated twice
- 208 NPs
- Analysis of English SNOMED CT annotations by two additional terminology experts
- Consensus finding, according to pre-established annotation guidelines
- Inspection, analysis and classification of text annotation disagreements
- Presentation of some disagreement cases for SNOMED CT

Reasons for disagreement

Human issues

Lack of domain knowledge / carelessness
Tokens: "IV"
Annotator #1: 'Structure of abductor hallucis muscle (body structure)'
Annotator #2: 'Abducens nerve structure (body structure)'
Gold standard: 'Abducens nerve structure (body structure)'

Retrieval error (synonym not recognised)
Tokens: "Glibenclamide"
Annotator #1: 'Glyburide (substance)'
Annotator #2: (none)
Gold standard: 'Glyburide (substance)'

Non-compliance with annotation rules

Ontology issues (I)

Polysemy ("dot categories")*
Tokens: "Lymphoma"
Annotator #1: 'Malignant lymphoma (disorder)'
Annotator #2: 'Malignant lymphoma category (morphologic abnormality)'
Gold standard: 'Malignant lymphoma (disorder)'

"Pseudo-polysemy", incomplete definitions
Tokens: "Former Smoker"
Annotator #1: 'In the past (qualifier value)', 'Smoker (finding)'
Annotator #2: 'History of (contextual qualifier) (qualifier value)', 'Smoker (finding)'
Gold standard: 'Ex-smoker (finding)'

*Alexandra Arapinis, Laure Vieu: A plea for complex categories in ontologies. Applied Ontology 10(3-4): 285-296 (2015)

Ontological issues (II)

Normal findings, incomplete definitions
Tokens: "Motor: normal bulk and tone"
Annotator #1: 'Skeletal muscle structure (body structure)', 'Normal (qualifier value)'
Annotator #2: 'Muscle finding (finding)', 'Normal (qualifier value)'
Gold standard: 'Skeletal muscle normal (finding)'

Fuzziness of qualifiers
Tokens: "Significant bleeding"
Annotator #1: 'Significant (qualifier value)', 'Bleeding (finding)'
Annotator #2: 'Severe (severity modifier) (qualifier value)', 'Bleeding (finding)'
Gold standard: 'Moderate (severity modifier) (qualifier value)', 'Bleeding (finding)'

Interface term (synonym) issues
Tokens: "Blood extravasation"
Annotator #1: 'Blood (substance)', 'Extravasation (morphologic abnormality)'
Annotator #2: 'Hemorrhage (morphologic abnormality)'
Gold standard: 'Hemorrhage (morphologic abnormality)'
Callout: "extravasation of blood"

Tokens: "anxious"
Annotator #1: 'Anxiety (finding)'
Annotator #2: 'Worried (finding)'
Gold standard: 'Anxiety (finding)'
Callout: "anxious cognitions"

Language issues
Ellipsis / anaphora
- "Cold and wind are provoking factors." (provoking factors for angina)
- "These ailments have substantially increased since October 2013" (weakness)
- "No surface irregularities" (breast)
- "Significant bleeding" (intestinal bleeding)
Ambiguity of short forms
- "IV" (intravenous? Fourth intracranial nerve?)
Co-ordination
- "normal factors 5, 9, 10, and 11"
Scope of negation
- "no tremor, rigidity or bradykinesia"
Addressed by annotation guideline. Manageable by human annotators. Known challenges for NLP systems.

Prevention and remediation of annotation disagreements

Prevention: annotation processes
- Training with continuous feedback

- Early detection of inter annotator disagreement triggers guideline enforcement / guideline revision
- Tooling
- Optimised concept retrieval (fuzzy, substring, synonyms)
- Guideline enforcement by appropriate tools
- Postcoordination support (complex syntactic expressions instead of grouping of concepts)
- Anti-patterns, e.g. avoid unrelated primitive concepts (?)

Prevention: improve terminology structure
- Fill gaps
- equivalence axioms (reasoning)
- Self-explaining labels (FSNs), especially for qualifiers
- Scope notes / text definitions where necessary
- Manage polysemy
- Flag navigational and modifier concepts
- Strengthen ontological foundations
- Upper-level ontology alignment
- Clear division between domain entities and information entities

- Overhaul problematic subhierarchies, especially qualifiers

Prevention: improve content maintenance
- Analysis of real data to support terminology maintenance process
- Harvest notorious disagreements between text passages and annotations from clinical datasets
- Compare concept frequency and concept co-occurrence between comparable institutions and users to detect imbalances
- Stimulate community processes for ontology-guided content evolution: crowdsourcing of interface terms by languages, dialects, specialties, user groups (separation of interface terminologies from reference terminologies is one of the ASSESS-CT recommendations)

Remediation of annotation disagreements
Exploit ontological dependencies / implications

Concept A: 'Mast cell neoplasm (disorder)'
Concept B: 'Mast cell neoplasm (morphologic abnormality)'
Dependency: A subclassOf AssociatedMorphology some B

Concept A: 'Isosorbide dinitrate (product)'
Concept B: 'Isosorbide dinitrate (substance)'
Dependency: A subclassOf HasActiveIngredient some B

Concept A: 'Palpation (procedure)'
Concept B: 'Palpation - action (qualifier value)'
Dependency: A subclassOf Method some B

Concept A: 'Blood pressure taking (procedure)'
Concept B: 'Blood pressure (observable entity)'
Dependency: A subclassOf hasOutcome some B

Concept A: 'Increased size (finding)'
Concept B: 'Increased (qualifier value)'
Dependency: A subclassOf isBearerOf some B

Concept A: 'Finding of heart rate (finding)'
Concept B: 'Heart rate (observable entity)'
Dependency: A subclassOf Interprets some B

Experiment
Gold standard expansion:
- Step 1: include concepts linked by attributive relations: A subclassOf Rel some B
- Step 2: include additional first-level taxonomic relations: A subclassOf B

Language of text sample: English

Gold standard expansion    F measure
no expansion               0.28
expansion step 1           0.28
expansion step 2           0.29

Only insignificant improvement, possibly due to missing relations in SNOMED CT, e.g. haemorrhage - blood

Conclusion (I)
- Low inter-annotator agreement limits successful use of clinical terminologies / ontologies for manual annotation scenarios, for benchmarking of NLP-based annotations, and for optimised training data for ML
- Structured data essential for many intelligent systems, but unreliable information extracted from clinical narratives raises patient safety issues when used for decision support

Conclusion (II)
Prevention of disagreements
- Education, tooling, guideline support
- Terminology content improvement: labelling, scope notes, ontological clarity, full definitions, community processes
- High coverage interface terminologies
Remediation of disagreements
- So far no clear evidence of ontology-based resolution of agreement issues
- Big data approaches?

Conclusion (III)
R & D required:
- "Learning systems" for improvement of terminology content / structure / tooling. Clinical "big data" underused resource
- Harmonization of annotation guideline creation and validation efforts
- Formulate and enforce good quality criteria for clinical terminologies used as annotation vocabularies

- Better ontological underpinning of clinical terminologies
- Ontologically founded patterns for recurring clinical documentation tasks: information extraction rather than concept mapping*

*Martínez-Costa C et al. Semantic enrichment of clinical models towards semantic interoperability. JAMIA 2015 May;22(3):565-76

Thanks for your attention
Slides will be accessible via purl.org/steschu

Acknowledgements: ASSESS CT team: Jose Antonio Miñarro-Giménez, Catalina Martínez-Costa, Daniel Karlsson, Kirstine Rosenbeck Gøeg, Kornél Markó, Benny Van Bruwaene, Ronald Cornet, Marie-Christine Jaulent, Päivi Hämäläinen, Heike Dewenter, Reza Fathollah Nejad, Sylvia Thun, Veli Stroetmann, Dipak Kalra

Contact: [email protected]

Vibhu Agarwal, Tanya Podchiyska, Juan M. Banda, Veena Goel, Tiffany I. Leung, Evan P. Minty, Timothy E. Sweeney, Elsie Gyang, Nigam H. Shah: Learning statistical models of phenotypes using noisy labeled training data. JAMIA 23(6): 1166-1173 (2016)
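
The gold standard expansion experiment (Step 1: attributive relations, Step 2: first-level taxonomic parents) can be pictured with a small sketch. The slides do not say how matches against the expanded gold standard were scored, so the scoring here is an assumption: an annotated concept counts as correct if it lies in the expansion of some gold concept, and a gold concept counts as found if the annotator chose it or one of its expansions. The function names, the dictionaries, and the choice to expand only the original gold concepts are hypothetical; FSNs stand in for concept identifiers.

```python
def expand(concept, attribute_targets, parents, steps):
    """Return the concept plus what the requested expansion steps add."""
    expanded = {concept}
    if steps >= 1:
        # Step 1: concepts B with "A subclassOf Rel some B"
        expanded |= attribute_targets.get(concept, set())
    if steps >= 2:
        # Step 2: first-level taxonomic parents B with "A subclassOf B"
        expanded |= parents.get(concept, set())
    return expanded


def f_measure(annotated, gold, attribute_targets, parents, steps=0):
    """F measure of one annotation against an (optionally expanded) gold set."""
    if not annotated or not gold:
        return 0.0
    expansions = {g: expand(g, attribute_targets, parents, steps) for g in gold}
    accepted = set().union(*expansions.values())
    precision = len(annotated & accepted) / len(annotated)
    recall = sum(1 for g in gold if annotated & expansions[g]) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy relation tables (assumption): only the AssociatedMorphology example
# from the dependency list is filled in; PARENTS is left empty on purpose.
ATTRIBUTE_TARGETS = {
    "Mast cell neoplasm (disorder)": {
        "Mast cell neoplasm (morphologic abnormality)"
    },
}
PARENTS = {}

gold = {"Mast cell neoplasm (disorder)"}
annotated = {"Mast cell neoplasm (morphologic abnormality)"}

for steps in (0, 1, 2):
    print(steps, f_measure(annotated, gold, ATTRIBUTE_TARGETS, PARENTS, steps))
```

In this toy case the annotation is rejected without expansion and accepted from step 1 onwards. The near-constant F measure reported in the experiment suggests that such directly linked pairs were rare among the real disagreements.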
