Slide 1 - kafkas.edu.tr

Big Data Analytics in USA and Turkey

Motivation
Can we learn from the past to become better in the future?
Healthcare data is becoming more complex.
In 2012, worldwide digital healthcare data was estimated at 500 petabytes, and it is expected to reach 25,000 petabytes in 2020.
Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327.

Organization of this Tutorial
Introduction
Motivating Examples
Sources and Techniques for Big Data in Healthcare
    Structured EHR Data
    Unstructured Clinical Notes
    Medical Imaging Data
    Genetic Data
    Other Data (Epidemiology & Behavioral)
Final Thoughts and Conclusion

INTRODUCTION

Definition of Big Data

Big data is a collection of large and complex data sets that are difficult to process using common database management tools or traditional data processing applications.
According to zdnet.com, big data refers to the tools, processes, and procedures that allow an organization to create, manipulate, and manage very large data sets and storage facilities.
Big data is not just about size. It finds insights from complex, noisy, heterogeneous, longitudinal, and voluminous data, and it aims to answer questions that were previously unanswered.
The challenges include capturing, storing, searching, sharing, and analyzing.

The four dimensions (Vs) of Big Data
[Figure: BIG DATA surrounded by Volume, Variety, Velocity, and Veracity]

Volume of data
ProteomicsDB [8] covers 92% (18,097 of 19,629) of the known human genes annotated in the Swiss-Prot database and has a data volume of 5.17 TB.
Data from millions of patients have already been collected and stored in an electronic format, and these accumulated data could potentially enhance health-care services and increase research opportunities [10,11].
The Visible Human Project has archived 39 GB of female datasets [12].

Variety
Big data covers a variety of data types and structures. For example, sequencing technologies produce omics data systematically at almost all levels of cellular components, from genomics, proteomics, and metabolomics to protein interaction and phenomics [13]. Much of the data is unstructured [14] (e.g., notes from EHRs [15,16], clinical trial results [17,18], medical images [19], and medical sensors), which provides many opportunities and a unique challenge to formulate new investigations.

Velocity
Velocity refers to the speed of producing and processing data. The new generation of sequencing technologies enables the production of billions of DNA sequence data each day at a relatively low cost. Because faster speeds are required for gene sequencing [1,20], big data technologies will be tailored to process data as fast as they are produced.

Veracity
Veracity is important for big data because, for example, personal health records may contain typographical errors, abbreviations, and cryptic notes. Ambulatory measurements are sometimes taken in less reliable, uncontrolled environments than clinical data, which are collected by trained practitioners. The use of spontaneous unmanaged data, such as those from social media, can lead to wrong predictions because the data context is not always known. Furthermore, sources are often biased toward people who are young, internet savvy, and expressive online.

Value
Last but not least, real value to both patients and healthcare systems can only be realized if the challenges of analyzing big data are addressed in a coherent fashion. Many of the underlying principles of big data have been explored by the research community for years in other domains. Nevertheless, new theories and approaches are needed for analyzing big health data.
Total projected healthcare spending in the UK will make up 6.4% of the gross domestic product (GDP) by 2021, while the projected healthcare share of GDP in the United States is expected to reach 19.9% by 2022 [4].

Big Data Technologies
In most of the cases reported, multiple technologies were used together, such as artificial intelligence (AI) along with Hadoop [24] and data mining tools.

Parallel computing
In recent years, novel parallel computing models, such as MapReduce [25] by Google, have been proposed for a new big data infrastructure. More recently, an open-source MapReduce package called Hadoop [24] was released by Apache for distributed data management. The Hadoop Distributed File System (HDFS) supports concurrent data access across clustered machines. Hadoop-based services can also be viewed as cloud-computing platforms, which allow for centralized data storage as well as remote access across the Internet.
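To make the MapReduce model mentioned above concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) that counts diagnosis codes across patient records. It is an illustration only: the record layout and function names are assumptions made for this example, and it does not use Hadoop or any distributed runtime.

```python
from collections import defaultdict

def map_phase(record):
    """Emit one (code, 1) pair per diagnosis code in a patient record."""
    for code in record["codes"]:
        yield code, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

# Hypothetical toy records; a real cluster would split these across machines.
records = [
    {"patient": "p1", "codes": ["250.0", "401.9"]},
    {"patient": "p2", "codes": ["250.0"]},
    {"patient": "p3", "codes": ["428.0", "401.9", "250.0"]},
]
counts = run_job(records)
print(counts)
```

Because each map call touches one record and each reduce call touches one key, both phases can in principle run in parallel on different machines, which is the property frameworks such as Hadoop exploit.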

Cloud computing
Cloud computing is a novel model for sharing configurable computational resources over the network [26] and can serve as an infrastructure, platform, and/or software for providing an integrated solution.

Reasons for Growing Complexity/Abundance of Healthcare Data
Standard medical practice is moving from relatively ad-hoc and subjective decision making to evidence-based healthcare.
More incentives to professionals/hospitals to use EHR technology.
Additional Data Sources
    Development of new technologies such as capturing devices, sensors, and mobile applications.
    Collection of genomic information became cheaper.
    Patient social communications in digital forms are increasing.
    More medical knowledge/discoveries are being accumulated.

Big Data Challenges in Healthcare
Inferring knowledge from complex heterogeneous patient sources.
Leveraging the patient/data correlations in longitudinal records.
Understanding unstructured clinical notes in the right context.
Efficiently handling large volumes of medical imaging data and extracting potentially useful information and biomarkers.
Analyzing genomic data is a computationally intensive task, and combining it with standard clinical data adds additional layers of complexity.
Capturing the patient's behavioral data through several sensors, as well as their various social interactions and communications.

Overall Goals of Big Data Analytics in Healthcare
[Figure: Big Data Analytics draws on Electronic Health Records, Genomic, Behavioral, and Public Health data to produce Evidence + Insights, leading to lower costs and improved outcomes through smarter decisions]
Take advantage of the massive amounts of data and provide the right intervention to the right patient at the right time.
Personalized care to the patient.
Potentially benefit all the components of a healthcare system, i.e., provider, payer, patient, and management.

Purpose of this Presentation
Two-fold objectives:
    Introduce data mining researchers to the sources available and the possible challenges and techniques associated with using big data in the healthcare domain.
    Introduce healthcare analysts and practitioners to the advancements in the computing field to effectively handle and make inferences from voluminous and heterogeneous healthcare data.
The ultimate goal is to bridge the data mining and medical informatics communities to foster interdisciplinary work between the two communities.
PS: Due to the broad nature of the topic, the primary emphasis will be on introducing healthcare data repositories, challenges, and concepts to data scientists. Not much focus will be on describing the details of any particular techniques and/or solutions.

Disclaimers
Being a recent and growing topic, there might be several other resources that are not covered here.
The presentation is biased more towards the data scientist's perspective and less towards the healthcare management or healthcare provider's perspective.
Some of the website links provided might become obsolete in the future.
Since this topic contains a wide variety of problems, there might be some aspects of healthcare that are not covered in the presentation.

MOTIVATING EXAMPLES

EXAMPLE 1: Heritage Health Prize
http://www.heritagehealthprize.com

Over $30 billion was spent on unnecessary hospital admissions.
Goals:
    Identify patients at high risk and ensure they get the treatment they need.
    Develop algorithms to predict the number of days a patient will spend in a hospital in the next year.
Outcomes:
    Health care providers can develop new strategies to care for patients before it is too late, which reduces the number of unnecessary hospitalizations.
    Improving the health of patients while decreasing the costs of care.
    Winning solutions use a combination of several predictive models.

EXAMPLE 2: Penalties for Poor Care - 30-Day Readmissions
Hospitalizations account for more than 30% of the $2 trillion annual cost of healthcare in the United States. Around 20% of all hospital admissions occur within 30 days of a previous discharge.
Readmissions are not only expensive but also potentially harmful, and most importantly, they are often preventable.
Medicare penalizes hospitals that have high rates of readmissions among patients with heart failure, heart attack, and pneumonia.
Identifying patients at risk of readmission can guide efficient resource utilization and can potentially save millions of healthcare dollars each year.
Effectively making predictions from such complex hospitalization data will require the development of novel advanced analytical models.

EXAMPLE 3: White House unveils BRAIN Initiative
The US President unveiled a bold new $100 million research initiative designed to revolutionize our understanding of the human brain: the BRAIN (Brain Research through Advancing Innovative Neurotechnologies) Initiative.
Find new ways to treat, cure, and even prevent brain disorders, such as Alzheimer's disease, epilepsy, and traumatic brain injury.
"Every dollar we invested to map the human genome returned $140 to our economy... Today, our scientists are mapping the human brain to unlock the answers to Alzheimer's." -- President Barack Obama, 2013 State of the Union.
"advances in "Big Data" that are necessary to analyze the huge amounts of information that will be generated; and increased understanding of how thoughts, emotions, actions and memories are represented in the brain." : NSF
Joint effort by NSF, NIH, DARPA, and other private partners.
http://www.whitehouse.gov/infographics/brain-initiative

EXAMPLE 4: GE Head Health Challenge
Challenge 1: Methods for Diagnosis and Prognosis of Mild Traumatic Brain Injuries.
Challenge 2: The Mechanics of Injury: Innovative Approaches For Preventing And Identifying Brain Injuries.
In Challenge 1, GE and the NFL will award up to $10M for two types of solutions: Algorithms and Analytical Tools, and Biomarkers and other technologies.
A total of $60M in funding over a period of 4 years.

Healthcare Continuum
Sarkar, Indra Neil. "Biomedical informatics and translational medicine." Journal of Translational Medicine 8.1 (2010): 22.

Data Collection and Analysis
Effectively integrating and efficiently analyzing various forms of healthcare data over a period of time can answer many of the impending healthcare problems.
Jensen, Peter B., Lars J. Jensen, and Søren Brunak. "Mining electronic health records: towards better research applications and clinical care." Nature Reviews Genetics (2012).

Organization of this Presentation
Introduction
Motivating Examples
Sources and Techniques for Big Data in Healthcare
    Structured EHR Data
    Unstructured Clinical Notes
    Medical Imaging Data
    Genetic Data
    Other Data (Epidemiology & Behavioral)
Final Thoughts and Conclusion

SOURCES AND TECHNIQUES FOR BIG DATA IN HEALTHCARE

Outline
Electronic Health Records (EHR) data
Healthcare Analytic Platform
Resources

ELECTRONIC HEALTH RECORDS (EHR) DATA

Data
Health data
    Clinical data
        Structured EHR
        Unstructured EHR
        Medical images
    Genomic data
        DNA sequences
    Behavior data
        Social network data
        Mobility sensor data

Billing data - ICD codes
ICD stands for International Classification of Diseases.
ICD is a hierarchical terminology of diseases, signs, symptoms, and procedure codes maintained by the World Health Organization (WHO).
In the US, most people use ICD-9, and the rest of the world uses ICD-10.
Pros: Universally available
Cons: Medium recall and medium precision for characterizing patients
(250) Diabetes mellitus
    (250.0) Diabetes mellitus without mention of complication
    (250.1) Diabetes with ketoacidosis
    (250.2) Diabetes with hyperosmolarity
    (250.3) Diabetes with other coma
    (250.4) Diabetes with renal manifestations
    (250.5) Diabetes with ophthalmic manifestations
    (250.6) Diabetes with neurological manifestations
    (250.7) Diabetes with peripheral circulatory disorders
    (250.8) Diabetes with other specified manifestations
    (250.9) Diabetes with unspecified complication

Billing data - CPT codes
CPT stands for Current Procedural Terminology, created by the American Medical Association.
CPT is used for billing purposes for clinical services.
Pros: High precision
Cons: Low recall
Codes for Evaluation and Management: 99201-99499
    (99201 - 99215) office/other outpatient services
    (99217 - 99220) hospital observation services
    (99221 - 99239) hospital inpatient services
    (99241 - 99255) consultations
    (99281 - 99288) emergency dept services
    (99291 - 99292) critical care services

Lab results
The standard code for labs is Logical Observation Identifiers Names and Codes (LOINC).
Challenges for labs
    Many lab systems still use local dictionaries to encode labs.
    Different labs use diverse numeric scales.
    Values often need to be mapped to normal, low, or high ranges in order to be useful for analytics.
    Missing data: not all patients have all labs.
    The ordering of a lab test can itself be predictive; for example, a BNP order indicates a high likelihood of heart failure.

Time                   Lab    Value
1996-03-15 12:50:00.0  CO2    29.0
1996-03-15 12:50:00.0  BUN    16.0
1996-03-15 12:50:00.0  HDL-C  37.0
1996-03-15 12:50:00.0  K      4.5
1996-03-15 12:50:00.0  Cl     102.0
1996-03-15 12:50:00.0  Gluc   86.0
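The step of mapping raw lab values to low, normal, or high can be sketched as a simple lookup against reference ranges. The ranges below are illustrative placeholders chosen for this example, not clinical guidance and not taken from the slides; real ranges depend on the lab, the units, and the patient population.

```python
# Illustrative reference ranges (low, high); assumptions for this sketch only.
REFERENCE_RANGES = {
    "K": (3.5, 5.0),
    "Gluc": (70.0, 100.0),
    "HDL-C": (40.0, 60.0),
}

def flag_lab(name, value, ranges=REFERENCE_RANGES):
    """Map a numeric lab value to 'low', 'normal', or 'high'.

    Returns 'missing' when no value was recorded and 'unknown' when
    no reference range is available for that lab.
    """
    if value is None:
        return "missing"
    if name not in ranges:
        return "unknown"
    low, high = ranges[name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

# Three of the rows from the lab table above, plus a lab with no recorded value.
labs = [("K", 4.5), ("Gluc", 86.0), ("HDL-C", 37.0), ("BUN", None)]
flags = {name: flag_lab(name, value) for name, value in labs}
print(flags)
```

Keeping "missing" as its own category, rather than imputing a value, preserves the signal that a test was never ordered or recorded.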

Medication
The standard code is the National Drug Code (NDC) by the Food and Drug Administration (FDA), which gives a unique identifier for each drug.
    Not used universally by EHR systems.
    Too specific: drugs with the same ingredients but different brands have different NDCs.
RxNorm: a normalized naming system for generic and branded drugs by the National Library of Medicine.
Medication data can vary across EHR systems.
    It can be in both structured and unstructured forms.
Availability and completeness of medication data vary.
    Inpatient medication data are complete, but outpatient medication data are not.
    Medication data usually store only prescriptions, and we cannot be sure whether patients actually filled those prescriptions.

Clinical notes
Clinical notes contain a rich and diverse source of information.
Challenges for handling clinical notes
    Ungrammatical, short phrases
    Abbreviations
    Misspellings
    Semi-structured information
        Copy-paste from other structured sources: lab results, vital signs
        Structured template: SOAP notes (Subjective, Objective, Assessment, Plan)

Summary of common EHR data
ICD
    Availability: High. Recall: Medium. Precision: Medium. Format: Structured.
    Pros: Easy to work with, a good approximation of disease status.
    Cons: Disease code often used for screening, therefore disease might not be there.
CPT
    Availability: High. Recall: Poor. Precision: High. Format: Structured.
    Pros: Easy to work with, high precision.
    Cons: Missing data.
Lab
    Availability: High. Recall: Medium. Precision: High. Format: Mostly structured.
    Pros: High data validity.
    Cons: Data normalization and ranges.
Medication
    Availability: Medium. Recall: Inpatient: High, Outpatient: Variable. Precision: Inpatient: High, Outpatient: Variable. Format: Structured and unstructured.
    Pros: High data validity.
    Cons: Prescribed, not necessarily taken.
Clinical notes
    Availability: Medium. Recall: Medium. Precision: Medium high. Format: Unstructured.
    Pros: More details about doctors' thoughts.
    Cons: Difficult to process.
Joshua C. Denny. Chapter 13: Mining Electronic Health Records in the Genomics Era. PLoS Comput Biol. 2012 December; 8(12).

Analytic Platform
Large-scale Healthcare Analytic Platform
[Figure: Healthcare Analytics pipeline. Information Extraction (Structured EHR: feature extraction; Unstructured EHR: context), Feature Selection (patient representation, feature selection), Predictive Modeling (classification, regression, patient similarity)]

CLINICAL TEXT MINING

Text Mining in Healthcare
Text mining
    Information Extraction
        Named Entity Recognition
    Information Retrieval
Clinical text vs. biomedical text
    Biomedical text: medical literature (well-written medical text)
    Clinical text: written by clinicians in clinical settings
Meystre et al. Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. IMIA 2008
Zweigenbaum et al. Frontiers of biomedical text mining: current progress. Briefings in Bioinformatics. Vol 8. No 5. 358-375
Cohen and Hersh. A survey of current work in biomedical text mining. Briefings in Bioinformatics. Vol 6. No 1. 57-71.

Auto-Coding: Extracting Codes from Clinical Text
Problem
    Automatically assign diagnosis codes to clinical text
Significance
    The cost is approximately $25 billion per year in the US
Available Data
    Medical NLP Challenges from 2007
    Subsections from radiology reports: clinical history and impression
Potential Evaluation Metric
    F-measure = 2PR/(P+R), where P is precision and R is recall.
Example References
    Aronson et al. From indexing the biomedical literature to coding clinical text: experience with MTI and machine learning approaches. BioNLP 2007
    Crammer et al. Automatic Code Assignment to Medical Text. BioNLP 2007
    Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. JAMIA. 2004:392-402
    Pakhomov SV, Buntrock JD, Chute CG. Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. JAMIA 2006:516-25.
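As a small worked example of the F-measure defined above, the sketch below scores a set of predicted diagnosis codes against a gold-standard set for one document. The codes and the function name are made up for illustration; the 2007 challenge may have aggregated scores across documents differently.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F-measure) for two collections of codes."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical example: the coder assigns three codes, two of which are correct.
p, r, f = precision_recall_f({"786.2", "486", "780.6"}, {"786.2", "486"})
print(round(p, 3), round(r, 3), round(f, 3))
```

Here P = 2/3 and R = 1, so F = 2(2/3)(1)/(2/3 + 1) = 0.8: the harmonic mean sits closer to the lower of the two values than an arithmetic mean would.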

Context Analysis - Negation
Negation: e.g., "...denies chest pain"
NegExpander [1] achieves 93% precision on mammographic reports.
NegEx [2] uses regular expressions and achieves 94.5% specificity and 77.8% sensitivity.
NegFinder [3] uses UMLS and regular expressions, and achieves 97.7% specificity and 95.3% sensitivity when analyzing surgical notes and discharge summaries.
A hybrid approach [4] uses regular expressions and grammatical parsing and achieves 92.6% sensitivity and 99.8% specificity.
1. Aronow DB, Fangfang F, Croft WB. Ad hoc classification of radiology reports. JAMIA 1999:393-411
2. Chapman et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. JBI 2001:301-10.
3. Mutalik PG, et al. Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. JAMIA 2001:598-609.
4. Huang Y, Lowe HJ. A novel hybrid approach to automated negation detection in clinical radiology reports. JAMIA 2007
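To illustrate the general idea behind regular-expression negation detection, here is a deliberately simplified sketch. It is not the NegEx algorithm: the trigger list, the fixed token window, and the function name are assumptions chosen for this example, and the published systems handle many more triggers, post-concept negations, and scope terminators.

```python
import re

# Toy trigger list; real systems use much larger curated lists.
NEGATION_TRIGGERS = ["denies", "no", "without", "negative for"]
_TRIGGER_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in NEGATION_TRIGGERS) + r")\b",
    re.IGNORECASE,
)

def is_negated(sentence, concept, window=5):
    """Return True if a trigger occurs at most `window` tokens before the concept."""
    lowered = sentence.lower()
    position = lowered.find(concept.lower())
    if position < 0:
        return False  # concept not mentioned at all
    preceding = lowered[:position]
    for match in _TRIGGER_RE.finditer(preceding):
        tokens_between = preceding[match.end():].split()
        if len(tokens_between) <= window:
            return True
    return False

print(is_negated("Patient denies chest pain", "chest pain"))
print(is_negated("Patient reports chest pain", "chest pain"))
```

The window stands in for the scope of a negation; a trigger that appears after the concept, as in "chest pain, no fever", does not negate "chest pain" in this sketch.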

Context Analysis - Temporality
Temporality: e.g., "fracture of the tibia 2 years ago"
TimeText [1] can detect temporal relations with 93.2% recall and 96.9% precision on 14 discharge summaries.
ConText [2] is an extension of NegEx, which identifies
    negations (negated, affirmed),
    temporality (historical, recent, hypothetical),
    experiencer (patient, other).
[Figure: scope of the context, running from a trigger over the clinical concepts to a termination]
1. Zhou et al. The Evaluation of a Temporal Reasoning System in Processing Clinical Discharge Summaries. JAMIA 2007.
2. Chapman W, Chu D, Dowling JN. ConText: An Algorithm for Identifying Contextual Features from Clinical Text. BioNLP 2007

CASE 1: CASE BASED RETRIEVAL
Sondhi P, Sun J, Zhai C, Sorrentino R, Kohn MS. Leveraging medical thesauri and physician feedback for improving medical literature retrieval for case queries. JAMIA. 2012

Input: Case Query
Patient with smoking habit and weight loss. The frontal and lateral chest X rays show a mass in the posterior segment of the right upper lobe as well as a right hilar enlargement and obliteration of the right paratracheal stripe. On the chest CT the contours of the mass are lobulated with heterogeneous enhancement. Enlarged mediastinal and hilar lymph nodes are present.

Goal: Find Relevant Research Articles to a Query
Additional Related Information: Disease MeSH

Challenge 1: Query Weighting
[The case query is shown again with its key terms highlighted]
Queries are long
Not all words are useful
IDF does not reflect importance
Semantics decide weight

Method: Semantic Query Weighting
Identify important UMLS semantic types based on their definition.
Assign higher weights to query words under these types.
Included UMLS semantic types: Disease or syndrome; Body part, organ, or organ component; Sign or symptom; Finding; Acquired abnormality; Congenital abnormality; Mental or behavioral dysfunction; Neoplasm; Pharmacologic substance; Individual behavior
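The weighting idea above can be sketched as a lookup from query terms to semantic types followed by a boost for the important types. Everything concrete here is an assumption for illustration: the tiny lexicon stands in for a real UMLS lookup, and the base and boost values are arbitrary rather than the weights used in the cited work.

```python
# Subset of the included semantic types listed on the slide.
IMPORTANT_TYPES = {
    "Disease or syndrome",
    "Sign or symptom",
    "Finding",
    "Neoplasm",
    "Body part, organ, or organ component",
    "Individual behavior",
}

# Hypothetical term-to-type lexicon; a real system would query UMLS.
TOY_LEXICON = {
    "smoking": "Individual behavior",
    "weight loss": "Sign or symptom",
    "mass": "Finding",
    "right upper lobe": "Body part, organ, or organ component",
}

def weight_query(terms, lexicon=TOY_LEXICON, base=1.0, boost=2.0):
    """Give a higher weight to terms whose semantic type is considered important."""
    return {
        term: boost if lexicon.get(term) in IMPORTANT_TYPES else base
        for term in terms
    }

weights = weight_query(["patient", "smoking", "weight loss", "show", "mass"])
print(weights)
```

Terms absent from the lexicon keep the base weight, so generic words such as "patient" or "show" contribute less to the ranking than clinically meaningful terms.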

Challenge 2: Vocabulary Gap
[The case query is shown again with its key terms highlighted]
Matching variants: "x ray", "x-rays", "x rays"
Matching synonyms: "CT" or "x rays"
Knowledge gap

Method: Additional Query Keywords
Female patient, 25 years old, with fatigue and a swallowing disorder (dysphagia worsening during a meal). The frontal chest X-ray shows opacity with clear contours in contact with the right heart border. Right hilar structures are visible through the mass. The lateral X-ray confirms the presence of a mass in the anterior mediastinum. On CT images, the mass has a relatively homogeneous tissue density.
Additional keywords: Thymoma, Lymphoma, Dysphagia, Esophageal obstruction, Myasthenia gravis, Fatiguability, Ptosis
Asked physicians to provide additional keywords
    Adding them with low weight helps
    Any potential diagnosis keywords help greatly
    Gives us insights into better query formulation

Challenge 3: Pseudo-Feedback
General vocabulary gap solution: apply pseudo-relevance feedback
What if very few of the top N are relevant?
    No idea which keywords to pick up

Method: Medical Subject Heading (MeSH) Feedback
Any case-related query usually relates to only a handful of conditions
How to guess the condition of the query?
    Select MeSH terms from the top N=10 ranked documents
    Select MeSH terms covering most query keywords
    Use them for feedback

Method: MeSH Feedback
[Figure sequence: a ranked list of five documents is filtered against a list of selected MeSH terms and then re-ranked]
Filtration list: Lung Neoplasms, Bronchitis

Initial rank  MeSH term        Action
Doc 1         Lung Neoplasms   Leave unchanged
Doc 2         Bronchitis       Leave unchanged
Doc 3         Cystic Fibrosis  Reduce weight
Doc 4         Lung Neoplasms   Leave unchanged
Doc 5         Hepatitis        Reduce weight

Re-ranked order: Doc 1, Doc 2, Doc 4, Doc 3, Doc 5

Retrieval Results
Data + Knowledge helps!
[Figure: retrieval results, with the best performing run marked]
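The re-ranking step in the MeSH feedback example can be sketched as follows: documents whose MeSH terms appear in the filtration list keep their score, and all others are down-weighted before sorting. The initial scores and the penalty factor are assumptions invented for this sketch; the slides show only the ordering, and the filtration list is taken as given rather than derived from keyword coverage.

```python
def mesh_feedback(ranked, filtration, penalty=0.5):
    """Re-rank documents after down-weighting those outside the filtration list.

    ranked: list of (doc_id, score, set_of_mesh_terms), best first.
    filtration: set of MeSH terms judged to match the query's condition.
    """
    rescored = [
        (doc_id, score if mesh & filtration else score * penalty)
        for doc_id, score, mesh in ranked
    ]
    # sorted() is stable, so ties keep their original relative order.
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Hypothetical retrieval scores for the five documents in the example.
ranked = [
    ("Doc 1", 5.0, {"Lung Neoplasms"}),
    ("Doc 2", 4.0, {"Bronchitis"}),
    ("Doc 3", 3.0, {"Cystic Fibrosis"}),
    ("Doc 4", 2.0, {"Lung Neoplasms"}),
    ("Doc 5", 1.0, {"Hepatitis"}),
]
reranked = mesh_feedback(ranked, {"Lung Neoplasms", "Bronchitis"})
order = [doc_id for doc_id, _ in reranked]
print(order)
```

With these made-up scores, Doc 3 drops below Doc 4, reproducing the re-ranked order Doc 1, Doc 2, Doc 4, Doc 3, Doc 5.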

CASE 2: HEART FAILURE SIGNS AND SYMPTOMS
Roy J. Byrd, Steven R. Steinhubl, Jimeng Sun, Shahram Ebadollahi, Walter F. Stewart. Automatic identification of heart failure diagnostic criteria, using text analysis of clinical notes from electronic health records. International Journal of Medical Informatics 2013

Framingham HF Signs and Symptoms
Framingham criteria for HF* are signs and symptoms that are documented even at primary care visits.
* McKee PA, Castelli WP, McNamara PM, Kannel WB. The natural history of congestive heart failure: the Framingham study. N Engl J Med. 1971;285(26):1441-6.

Natural Language Processing (NLP) Pipeline
Criteria extraction comes from the sentence level.
Encounter label comes from the entire note.
Roy J. Byrd, Steven R. Steinhubl, Jimeng Sun, Shahram Ebadollahi, Walter F. Stewart. Automatic identification of heart failure diagnostic criteria, using text analysis of clinical notes from electronic health records. International Journal of Medical Informatics 2013

Performance on Encounter Level on Test Set
[Figure: recall, precision, and F-score (0 to 1) for the machine-learning method and the rule-based method, reported overall and for affirmed and denied criteria]
Machine-learning method: decision tree
Rule-based method: grammars constructed by computational linguists
Manually constructed rules are more accurate but take more effort to construct than rules learned automatically with a decision tree.

Potential Impact on Evidence-based Therapies
[Figure: timeline from no symptoms to Framingham symptoms to clinical diagnosis, with the interval before clinical diagnosis marked as the opportunity for early intervention; bar chart (0% to 70%) for the periods preceding Framingham diagnosis, after Framingham diagnosis, and after clinical diagnosis]
3,168 patients, all eventually diagnosed with HF
Applying text mining to extract Framingham symptoms can help trigger early intervention.
Vijayakrishnan R, Steinhubl SR, Sun J, et al. Potential impact of predictive models for early detection of heart failure on the initiation of evidence-based therapies. J Am Coll Cardiol. 2012;59(13s1):E949-E949.

Analytic Platform
[Figure: healthcare analytics pipeline: information extraction, feature selection, predictive modeling]

KNOWLEDGE+DATA FEATURE SELECTION
Jimeng Sun, Jianying Hu, Dijun Luo, Marianthi Markatou, Fei Wang, Shahram Ebadollahi, Steven E. Steinhubl, Zahra Daar, Walter F. Stewart. Combining Knowledge and Data Driven Insights for Identifying Risk Factors using Electronic Health Records. AMIA (2012).

Combining Knowledge- and Data-driven Risk Factors
[Figure: on the knowledge side, a knowledge base feeds risk factor gathering to produce knowledge risk factors; on the data side, clinical data feed data processing to produce potential risk factors; risk factor augmentation combines both into combined risk factors for the target condition]

Risk Factor Augmentation
Model accuracy: the selected risk factors are highly predictive of the target condition.
    Sparse feature selection through L1 regularization
Minimal correlations:
    Between data-driven risk factors and knowledge-driven risk factors
    Among the data-driven risk factors
[Objective terms labeled on the slide: model error, correlation among data-driven features, correlation between data- and knowledge-driven features, sparse penalty]
Dijun Luo, Fei Wang, Jimeng Sun, Marianthi Markatou, Jianying Hu, Shahram Ebadollahi. SOR: Scalable Orthogonal Regression for Low-Redundancy Feature Selection and its Healthcare Applications. SDM12

Prediction Results using Selected Features
[Figure: AUC (0.5 to 0.8) versus number of features (0 to 600), starting from the knowledge features (CAD, +diabetes, +Hypertension, all) and then adding +50, +100, +150, and +200 data-driven features]
AUC significantly improves as complementary data-driven risk factors are added to the existing knowledge-based risk factors.
A significant AUC increase occurs when the first 50 data-driven features are added.
Jimeng Sun, Jianying Hu, Dijun Luo, Marianthi Markatou, Fei Wang, Shahram Ebadollahi, Steven E. Steinhubl, Zahra Daar, Walter F. Stewart. Combining Knowledge and Data Driven Insights for Identifying Risk Factors using Electronic Health Records. AMIA (2012).

Clinical Validation of Data-driven Features
9 out of 10 are considered relevant to HF.
The data-driven features are complementary to the existing knowledge-driven features.

Analytic Platform

[Figure: healthcare analytics pipeline: information extraction, feature selection, predictive modeling]

PREDICTIVE MODEL

Anatomy of Clinical Predictive Model
Prediction Models
    Continuous outcome: Regression
    Categorical outcome: Classification
        Logistic regression
    Survival outcome
        Cox Proportional Hazard Regression
    Patient Similarity
Case study: Heart failure onset prediction

PATIENT SIMILARITY

Intuition of Patient Similarity
[Figure: a doctor runs a similarity search to find patients similar to a given patient]

Summary on Patient Similarity
Patient similarity learns a customized distance metric for a specific clinical context [1].
Extension 1: Composite distance integration (Comdi) [2]
    How to jointly learn a distance by multiple parties without data sharing?
Extension 2: Interactive metric update (iMet) [3]
    How to interactively update an existing distance measure?
1. Jimeng Sun, Fei Wang, Jianying Hu, Shahram Ebadollahi: Supervised patient similarity measure of heterogeneous patient records. SIGKDD Explorations 14(1): 16-24 (2012)
2. Fei Wang, Jimeng Sun, Shahram Ebadollahi: Integrating Distance Metrics Learned from Multiple Experts and its Application in Inter-Patient Similarity Assessment. SDM 2011: 59-70
3. Fei Wang, Jimeng Sun, Jianying Hu, Shahram Ebadollahi: iMet: Interactive Metric Learning in Healthcare Applications. SDM 2011: 944-955

CASE STUDY: HEART FAILURE PREDICTION

Motivations for Early Detection of Heart Failure
Heart failure (HF) is a complex disease
Huge societal burden
For payers
    Reduce cost and hospitalization
    Improve the existing clinical guidance of HF prevention
For providers
    Slow or potentially reverse disease progress
    Improve quality of life, reduce mortality

Predictive Modeling Study Design
Goal: Classify HF cases against control patients
Population: 50,625 patients (Geisinger Clinic PCPs)
Cases: 4,644 case patients

Controls: 45,981, matched on age, gender, and clinic

Predictive Modeling Setup
[Figure: timeline showing the observation window, then the index date, then the prediction window, then the diagnosis date]
We define
    Diagnosis date and index date
    Prediction and observation windows
Features are constructed from the observation window and are used to predict HF onset after the prediction window.

Features
We construct over 20K features of different types.
Through feature selection and generalization, we arrive at the following predictive features:

Feature type   Cardinality  Predictive features
Diagnosis      -            ventricular hypertrophy, angina, atrial fibrillation, MI, COPD
Demographics   11           age, race, gender, smoking status
Framingham     15           rales, cardiomegaly, acute pulmonary edema, HJReflex, ankle edema, nocturnal cough, DOExertion, hepatomegaly, pleural effusion
Lab            1,264        eGFR, LVEF, albumin, glucose, cholesterol, creatinine, cardiomegaly, heart rate, hemoglobin
Medication     3,922        antihypertensive, lipid-lowering, CCB, ACEI, ARB, beta blocker, diuretic, digitalis, antiarrhythmic
Vital          6            blood pressure and heart rate

Prediction Performance on Different Prediction Windows
[Figure: HF onset, AUC (0.5 to 1) versus prediction window (0 to 720 days), with observation window = 720 days]
Setting: observation window = 720 days, classifier = random forest, evaluation mechanism = 10-fold cross-validation repeated 10 times
Observation: AUC slowly decreases as the prediction window increases.

Prediction Performance on Different Observation Windows
[Figure: HF onset, AUC (0.5 to 1) versus observation window (30 to 900 days), with prediction window = 180 days]
Setting: prediction window = 180 days, classifier = random forest, evaluation mechanism = 10-fold cross-validation
Observation: AUC increases as the observation window increases, i.e., more data over a longer period of time leads to better performance of the predictive model.
Combined features performed the best at observation window = 720 days.

Analytic Platform
[Figure: healthcare analytics pipeline: information extraction, feature selection, predictive modeling]

EHR 90 RESOURCES 91 Unstructured Clinical Data Dataset i2b2 Informatics for Integrating Biology & the Bedside Computational Medicine center Link

https://www.i2b2.org/NLP/D at aSets/Main.php Description Clinical notes used for clinical NLP challenges 2006 Deidentification and Smoking Challenge 2008 Obesity Challenge 2009 Medication Challenge 2010 Relations Challenge 2011 Co-reference Challenge Classifying Clinical Free Text Using Natural http://computationalmedicine. Language Processing org/challenge/previous 92 Structured EHR Dataset Link

Description Texas Hospital Inpatient Discharge http://www.dshs.state.tx.us/thcic/ ho spitals/Inpatientpudf.shtm Patient: hospital location, admission type/source, claims, admit day, age, icd9 codes + surgical codes Framingham Health Care Data Set http://www.framinghamheartstudy.o Medicare Basic Stand Alone Claim Public Use Files

Genetic dataset for cardiovascular disease rg/share/index.html Inpatient, skilled nursing facility, outpatient, http://resdac.advantagelabs.com/c home health agency, hospice, carrier, durable medical equipment, prescription drug event, ms-data/files/bsa-puf and chronic conditions on an aggregate level http://www.virec.research.va.gov VHA Medical SAS Datasets /M edSAS/Overview.htm Patient care encounters primarily for Veterans: inpatient/outpatient data from VHA facilities Discharge data from 1051 hospitals in 45 states with diagnosis, procedures,

status, demographics, cost, length of stay Discharge data for licensed general acute http://www.oshpd.ca.gov/HID/Pro hospital in CA with demographic, CA Patient Discharge Data du diagnostic and treatment information, cts/PatDischargeData/PublicDataS disposition, total charges et/index.html Nationwide Inpatient Sample http://www.hcupus.ahrq.gov/nisoverview.js p ICU data including demographics, http://mimic.physionet.org/database diagnosis, clinical measurements, lab MIMIC II Clinical Database .html

results, interventions, notes Thanks to Prof. Joydeep Ghosh from UT Austin for providing this information 93 Software MetaMap maps biomedical text to UMLS metathesaurus Developed by NLM for parsing medical article not clinical notes http://metamap.nlm.nih.gov/ cTAKES: clinical Text Analysis and Knowledge Extraction System Using Unstructured Information Management Architecture (UIMA) framework and OpenNLP toolkit http://ctakes.apache.org/ 94 Organization of this Tutorial Introduction Motivating Examples Sources and Techniques for Big Data in

Healthcare Structured EHR Data Unstructured Clinical Notes Medical Imaging Data Genetic Data Other Data (Epidemiology & Behavioral) Final Thoughts and Conclusion 95 MEDICAL IMAGE DATA 96 Image Data is Big !!! By 2019, the average hospital will have two-thirds of a petabyte (665 terabytes) of patient data, 80% of which will be unstructured image data like CT scans and Xrays. Medical Imaging archives

are increasing by 20%-40% PACS (Picture Archival & Communication Systems) system is used for storage and retrieval of the images. Image Source: http://medcitynews.com/2013/03/the-body-in-bytes-medical-images-as-a-source-of-healthcare-big-data-infographic/ 97 Popular Imaging Modalities in Healthcare Domain Computed Tomography (CT) Positron Emission Tomography (PET) Magnetic Resonance Imaging (MRI)

The main challenge with image data is that it is not only huge but also high-dimensional and complex. Extracting the important and relevant features is a daunting task. Many research works have applied image features to retrieve the most relevant images for a given query.
Image Source: Wikipedia

Medical Image Retrieval System
[Figure: retrieval pipeline. Training phase: biomedical image database, feature extraction, algorithms for learning or similarity computations, final trained models. Testing phase: query image, retrieval system, query results, performance evaluation (precision-recall).]

Content-based Image Retrieval
Two components:
- Image features/descriptors: bridging the gap between the visual content and its numerical representation. These representations are designed to encode color and texture properties of the image, the spatial layout of objects, and various geometric shape characteristics of perceptually coherent structures.
- Assessment of similarities between image features, based on mathematical analyses that compare descriptors across different images. Vector affinity measures such as Euclidean distance, Mahalanobis distance, KL divergence and Earth Mover's distance are among the most widely used.
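As a rough illustration of the second component, the sketch below implements three of the affinity measures named above on plain Python lists and uses one of them to rank a toy database. The function names, the dictionary layout of the database, and the restriction of the Mahalanobis distance to a diagonal covariance matrix are simplifying assumptions made for this example; Earth Mover's distance is omitted because it needs an optimization step.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mahalanobis_diagonal(a, b, variances):
    """Mahalanobis distance assuming a diagonal covariance matrix,
    given as one variance per feature (a simplification of the general form)."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))


def kl_divergence(p, q, eps=1e-12):
    """KL divergence between two histograms, normalized to sum to 1.
    eps guards against division by zero when q has empty bins."""
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / p_sum, q_i / q_sum
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def rank_images(query, database, distance=euclidean):
    """Return image ids ordered from most to least similar to the query.
    database maps an image id to its descriptor vector (hypothetical layout)."""
    return sorted(database, key=lambda image_id: distance(query, database[image_id]))


toy_database = {"a": [0.0, 0.0], "b": [10.0, 10.0], "c": [1.0, 1.0]}
ranking = rank_images([0.9, 0.9], toy_database)
```

Note that KL divergence is asymmetric, so swapping the query and database histograms generally changes the ranking; the other two measures are symmetric.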

Medical Image Features
Photometric features exploit color and texture cues and are derived directly from raw pixel intensities.
Geometric features use cues such as edges, contours, joints, polylines, and polygonal regions. A suitable shape representation must be extracted from the pixel intensity information by region-of-interest detection, segmentation, and grouping. Due to these difficulties, geometric features are not widely used.
Akgül, Ceyhun Burak, et al. "Content-based image retrieval in radiology: current status and future directions." Journal of Digital Imaging 24.2 (2011): 208-222.
Müller, Henning, et al. "A review of content-based image retrieval systems in medical applications-clinical benefits and future directions." International Journal of Medical Informatics 73.1 (2004): 1-24.

ImageCLEF Data
ImageCLEF (launched in 2003) aims to provide an evaluation forum for the cross-language annotation and retrieval of images.
Statistics of this database: more than 300,000 images (in JPEG format); total size > 300 GB; contains PET, CT, MRI, and ultrasound images.
Three tasks: modality classification, image-based retrieval, case-based retrieval.
Medical image database available at http://www.imageclef.org/2013/medical

Modality Classification Task
Modality is one of the most important filters that clinicians would like to be able to limit their search by.

Image-based and Case-based Querying
Image-based retrieval: this is the classic medical retrieval task, similar to Query by Image Example. Given the query image, find the most similar images.
Case-based retrieval: this is a more complex task that is closer to the clinical workflow. A case description, with patient demographics, limited symptoms and test results including imaging studies, is provided (but not the final diagnosis). The goal is to retrieve cases, including images, that best suit the provided case description.
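Results of either retrieval task are usually scored with precision and recall, the evaluation named in the retrieval system diagram. The sketch below computes both at a cutoff k for a single query; the function name and the toy identifiers are made up for illustration and are not part of the ImageCLEF tooling.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall over the top-k retrieved items for one query.

    ranked_ids: retrieved ids, best match first.
    relevant_ids: set of ids judged relevant for the query.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Toy example: one of the top two results is relevant, out of three relevant items.
precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=2)
```

Sweeping k from 1 to the length of the ranking gives the points of a precision-recall curve.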

Challenges with Image Data
Extracting informative features.
Selecting relevant features: sparse methods* and dimensionality reduction techniques.
Integrating image data with other available data:
- Early fusion: vector-based integration
- Intermediate fusion: multiple kernel learning
- Late fusion: ensembling results from individual modalities
*Jieping Ye's SDM 2010 tutorial on sparse methods: http://www.public.asu.edu/~jye02/Tutorial/Sparse-SDM10.pdf

Publicly Available Medical Image Repositories
Image database name | Modalities | No. of patients | No. of images | Size of data | Notes/Applications | Download link
Cancer Imaging Archive Database | CT, DX, CR | 1010 | 244,527 | 241 GB | Lesion detection and classification, accelerated diagnostic image decision, quantitative image assessment of drug response | https://public.cancerimagingarchive.net/ncia/dataBasketDisplay.jsf
Digital Mammography Database | DX | 2620 | 9,428 | 211 GB | Research in development of computer algorithms to aid in screening | http://marathon.csee.usf.edu/Mammography/Database.html
Public Lung Image Database | CT | 119 | 28,227 | 28 GB | Identifying lung cancer by screening images | https://eddie.via.cornell.edu/crpf.html
ImageCLEF Database | PET, CT, MRI, US | unknown | 306,549 | 316 GB | Modality classification, visual image annotation, scientific multimedia data management | http://www.imageclef.org/2013/medical
MS Lesion Segmentation | MRI | 41 | 145 | 36 GB | Develop and compare 3D MS lesion segmentation techniques | http://www.ia.unc.edu/MSseg/download.php
ADNI Database | MRI, PET | 2851 | 67,871 | 16 GB | Define the progression of Alzheimer's disease | http://adni.loni.ucla.edu/datasamples/acscess-data/

GENETIC DATA

Genetic Data
The human genome is made up of DNA, which consists of four different chemical building blocks (called bases and abbreviated A, T, C, and G).
It contains 3 billion pairs of bases, and the particular order of As, Ts, Cs, and Gs is extremely important.
The size of a single human genome is about 3 GB.
Thanks to the Human Genome Project (1990-2003): the goal was to determine the complete sequence of the 3 billion DNA subunits (bases). The total cost was around $3 billion.

Genetic Data
Whole genome sequencing data is currently being annotated, and not many analytics have been applied so far since the data is relatively new.
Several genome repositories are publicly available, e.g. http://aws.amazon.com/1000genomes/
It costs around $5000 to get a complete genome. It is still in the research phase and is heavily used in cancer biology.
In this presentation, we will focus on Genome-Wide Association Studies (GWAS).
- GWAS is more relevant to healthcare practice. Some clinical trials have already started using GWAS.
- Most of the computing literature (in terms of analytics) is available for GWAS. It is still at a rudimentary stage for whole genome sequences.

Genome-Wide Association Studies (GWAS)
Genome-wide association studies (GWAS) are used to identify common genetic factors that influence health and disease.
These studies normally compare the DNA of two groups of participants: people with the disease (cases) and similar people without it (controls). (One million loci)
Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence differs between individuals. SNPs occur every 100 to 300 bases along the 3-billion-base human genome.

Epistasis Modeling
For simple Mendelian diseases, single SNPs can explain the phenotype very well.
For complex traits, the relationship between genotype and phenotype is inadequately described by the marginal effects of individual SNPs.
Increasing empirical evidence suggests that interactions among loci contribute broadly to complex traits.
The difficulty in detecting SNP pair interactions is the heavy computational burden. To detect pairwise interactions from 500,000 SNPs genotyped in thousands of samples, a total of 1.25 × 10^11 statistical tests are needed (500,000 × 499,999 / 2 pairs).

Epistasis Detection Methods
Exhaustive: enumerates all K-locus interactions among SNPs. Efficient implementations mostly aim at reducing computation by eliminating unnecessary calculations.
Non-exhaustive:
- Stochastic: randomized search. Performance degrades as the number of SNPs increases.
- Heuristic: greedy methods that do not guarantee an optimal solution.
Shang, Junliang, et al. "Performance analysis of novel methods for detecting epistasis." BMC Bioinformatics 12.1 (2011): 475.

Sparse Methods for SNP data analysis
Successful identification of SNPs strongly predictive of disease promises a better understanding of the biological mechanisms underlying the disease.
Sparse linear methods have been used to fit the genotype data and obtain a selected set of SNPs.
Minimizing the squared loss function (L) of N individuals and p variables (SNPs) is used for linear regression and is defined as

$$L(\beta_0, \beta) = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^{T} \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

where $x_i \in \mathbb{R}^p$ are the inputs for the ith sample, $y \in \mathbb{R}^N$ is the N-vector of outputs, $\beta_0 \in \mathbb{R}$ is the intercept, $\beta \in \mathbb{R}^p$ is a p-vector of model weights, and $\lambda$ is a user-specified penalty.
Efficient implementations that scale to genome-wide data are available, e.g. the SparSNP package: http://bioinformatics.research.nicta.com.au/software/sparsnp/
Wu, Tong Tong, et al. "Genome-wide association analysis by lasso penalized logistic regression." Bioinformatics 25.6 (2009): 714-721.

Public Resources for Genetic (SNP) Data
The Wellcome Trust Case Control Consortium (WTCCC) is a group of 50 research groups across the UK, established in 2005. Available at http://www.wtccc.org.uk/
- Seven different diseases: bipolar disorder (1868 individuals), coronary heart disease (1926 individuals), Crohn's disease (1748 individuals), hypertension (1952 individuals), rheumatoid arthritis (1860 individuals), type I diabetes (1963 individuals) and type II diabetes (1924 individuals).
- Around 3,000 healthy controls are shared across these disorders.
- The individuals were genotyped using an Affymetrix chip, yielding approximately 500K SNPs.
The database of Genotypes and Phenotypes (dbGaP) is maintained by the National Center for Biotechnology Information (NCBI) at NIH. Available at http://www.ncbi.nlm.nih.gov/gap

BEHAVIORAL AND PUBLIC HEALTH DATA

Epidemiology Data
The Surveillance, Epidemiology, and End Results Program (SEER) at NIH.

Publishes cancer incidence and survival data from population-based cancer registries covering approximately 28% of the population of the US.
Collected over the past 40 years (starting from January 1973 until now).
Contains a total of 7.7M cases, and >350,000 cases are added each year.
Collects data on patient demographics, tumor site, tumor morphology and stage at diagnosis, first course of treatment, and follow-up for vital status.
Usage:
- Widely used for understanding disparities related to race, age, and gender.
- Can be overlaid with other sources of data (such as water/air pollution, climate, socio-economic data) to identify any correlations.
- Cannot be used for predictive analysis; it is mostly used for studying trends.
The SEER database is available at http://seer.cancer.gov/

Social Media can Sense Public Health !!
During infectious disease outbreaks, data collected through health institutions and official reporting structures may not be available for weeks, hindering early epidemiologic assessment. Social media can provide it in near real-time.
- Twitter messaging correlated with a cholera outbreak
- Google Flu Trends correlated with influenza outbreaks
Dugas, Andrea Freyer, et al. "Google Flu Trends: correlation with emergency department influenza rates and crowding metrics." Clinical Infectious Diseases 54.4 (2012): 463-469.
Chunara, Rumi, Jason R. Andrews, and John S. Brownstein. "Social and news media enable estimation of epidemiological patterns early in the 2010 Haitian cholera outbreak." American Journal of Tropical Medicine and Hygiene 86.1 (2012): 39.
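One simple way to quantify how closely a social media signal tracks reported cases is the Pearson correlation between two weekly time series. The sketch below is only an illustration of that calculation: the weekly counts are synthetic numbers invented for the example, not data from the studies cited above, and the cited papers may use different statistics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic weekly counts, invented purely for illustration.
weekly_posts = [12, 30, 55, 80, 60, 25]   # outbreak-related posts per week
weekly_cases = [3, 9, 20, 31, 22, 8]      # officially reported cases per week
r = pearson(weekly_posts, weekly_cases)
```

A high correlation on its own does not show that the online signal leads the official counts; checking that would require comparing the series at different time lags.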

Social Networks for Patients
PatientsLikeMe [1] is a patient network and online data sharing platform started in 2006; it now has more than 200,000 patients and is tracking 1,500 diseases.
OBJECTIVE: "Given my status, what is the best outcome I can hope to achieve, and how do I get there?"
People connect with others who have the same disease or condition, track and share their own experiences, see what treatments have helped other patients like them, gain insights and identify any patterns.
Patients provide data on their conditions, treatment history, side effects, hospitalizations, symptoms, disease-specific functional scores, weight, mood, quality of life and more on an ongoing basis.
The network also provides access to patients for future clinical trials.
[1] http://www.patientslikeme.com/

Home Monitoring and Sensing Technologies
Advancements in sensing technology are critical for developing effective and efficient home-monitoring systems.
Sensing devices can provide several types of data in real time.

Activity Recognition using Cell Phone Accelerometers
Kwapisz, Jennifer R., Gary M. Weiss, and Samuel A. Moore. "Activity recognition using cell phone accelerometers." ACM SIGKDD Explorations Newsletter 12.2 (2011): 74-82.
Rashidi, Parisa, et al. "Discovering activities to recognize and track in a smart environment." IEEE Transactions on Knowledge and Data Engineering 23.4 (2011): 527-539.

Public Health and Behavior Data Repositories
Dataset | Link | Description
Behavioral Risk Factor Surveillance System (BRFSS) | http://www.cdc.gov/brfss/technical_infodata/index.htm | Healthcare survey data: smoking, alcohol, lifestyle (diet, exercise), major diseases (diabetes, cancer), mental illness
Ohio Hospital Inpatient/Outpatient Data | http://publicapps.odh.ohio.gov/pwh/PWHMain.aspx?q=021813114232 | Hospital: number of discharges, transfers, length of stay, admissions, number of patients with specific procedure codes
US Mortality Data | http://www.cdc.gov/nchs/data_access/cmf.htm | Mortality information at the county level
Human Mortality Database | http://www.mortality.org/ | Birth, death, population size by country
Utah Public Health Database | http://ibis.health.utah.gov/query | Summary statistics for mortality, charges, discharges, length of stay on a county-level basis
Dartmouth Atlas of Health Care | http://www.dartmouthatlas.org/tools/downloads.aspx | Post-discharge events, chronically ill care, surgical discharge rate
Thanks to Prof. Joydeep Ghosh from UT Austin for providing this information.

CONCLUDING REMARKS

Final Thoughts
Big data could save the health care industry up to $450 billion, but other things are important too.
- Right living: Patients should take more active steps to improve their health.
- Right care: Developing a coordinated approach to care in which all caregivers have access to the same information.
- Right provider: Any professionals who treat patients must have strong performance records and be capable of achieving the best outcomes.
- Right value: Improving value while simultaneously improving care quality.
- Right innovation: Identifying new approaches to health-care delivery.
"Stakeholders will only benefit from big data if they take a more holistic, patient-centered approach to value, one that focuses equally on health-care spending and treatment outcomes."
McKinsey report available at: http://www.mckinsey.com/insights/health_systems/the_big-data_revolution_in_us_health_care

Conclusion
Big data analytics is a promising direction that is still in its infancy in the healthcare domain.
Healthcare is a data-rich domain. As more and more data is collected, there will be increasing demand for big data analytics.
Unraveling the complexities of big data can provide many insights for making the right decisions at the right time for patients.
Efficiently utilizing the colossal healthcare data repositories can yield some immediate returns in terms of patient outcomes and lower care costs.
Data of increasing complexity keeps emerging in healthcare, leading to more opportunities for big data analytics.

Acknowledgements
Funding Sources:
National Science Foundation
National Institutes of Health
Susan G. Komen for the Cure
Delphinus Medical Technologies
IBM Research

Public health information
This section focuses on four areas: (1) infectious disease surveillance, (2) population health management, (3) mental health management, and (4) chronic disease management.

Infectious disease surveillance

Big data has been used for global infectious disease surveillance. One system provides real-time risk monitoring on a map, and its authors point out that machine learning and crowdsourcing have opened new possibilities for developing a continually updated atlas for disease monitoring. Online social media combined with epidemiological information is a valuable new data source for facilitating public health surveillance. In one demonstration of social media for disease monitoring, 553,186,016 tweets were collected, and more than 9,800 with HIV risk-related keywords (eg, sexual behaviors and drug use) and geographic annotations were extracted. A prevalence analysis found a significant positive correlation (P < 0.01) between HIV-related tweets and HIV cases, illustrating the importance of social media.

Population health management
The independent association of a patient's main diagnosis (MD) and underlying cause of death (UCD) was analyzed. The MD was identified by ICD-10 code, while the UCD was extracted from a death registry. If the MD and UCD were different events, those events were treated as independent. Using health insurance data, information on 421,460 deceased patients was extracted for 2008 to 2009. The results show that 8.5% of in-hospital deaths and 19.5% of out-of-hospital deaths were independent events and that independent death was more common in elderly patients. The results demonstrate that large-scale data analysis can be used to effectively analyze the association of medical events.

Mental health management
Messages posted on social media could be used to screen for and potentially detect depression. The analysis builds on previous research into the association between depressive disorders and repetitive thoughts/ruminating behavior. Big data analytics tools play an important role in this work by mining hidden behavioral and emotional patterns in messages, or tweets, posted on Twitter. Within these tweets, a disease-related emotion pattern, which is a previously hidden symptom, may be detected. Future research could delve deeper into the conversations of depressed users to understand more about their hidden emotions and sentiments. The model's effectiveness was evaluated on a dataset of 89,840 patients, and the results show an overall accuracy of 82.35% across all conditions.

Successful applications of personalized medicine in cancer
Successful applications of personalized medicine in cancer include three drugs that have been identified and used in specific groups of patients. Patients with melanoma and the BRAF mutation V600E can be treated with dabrafenib [52], patients with breast cancer and the amplification or overexpression of the gene encoding Her2/Neu can be treated with a targeted therapy using trastuzumab [53], and different types of tumor that contain the fusion protein BCR-ABL can be treated with imatinib [54].

Chronic disease management
The Cardiovascular Health in Ambulatory Care Research Team (CANHEART) is a unique, population-based observational research initiative aimed at measuring and improving cardiovascular health and the quality of ambulatory cardiovascular care provided in Ontario, Canada. It included data from 9.8 million Ontario adults aged ≥20 years. Data were assembled by linking multiple databases, such as electronic surveys, health administration, clinical, laboratory, drug, and electronic medical record databases, using encoded personal identifiers. Follow-up clinical events were collected through record linkages to comprehensive hospitalization, emergency department, and vital statistics administrative databases.

Conclusion
(1) Integrating different sources of information enables clinicians to depict a new view of patient care processes that considers a patient's holistic health status, from genome to behavior; (2) the availability of novel mobile health technologies facilitates real-time data gathering with more accuracy; (3) the implementation of distributed platforms enables data archiving and analysis, which will be further developed for decision support; and (4) the inclusion of geographical and environmental information may further increase the ability to interpret gathered data and extract new knowledge.

While big data holds significant promise for improving health care, there are several common challenges facing all four fields in using big data technology; the most significant problem is the integration of various databases. For example, the VHA's database, VISTA, is not a single system; it is a set of 128 interlinked systems. This becomes even more complicated when databases contain different data types (eg, integrating an imaging database or a laboratory test results database into existing systems), thereby limiting a system's ability to make queries against all databases to acquire all patient data. The lack of standardization for laboratory protocols and values also creates challenges for data integration.

For example, image data can suffer from technological batch effects when they come from different laboratories under different protocols. Efforts are made to normalize data when there is a batch effect; this may be easier for image data, but it is intrinsically more difficult to normalize laboratory test data. Security and privacy concerns also remain as hurdles to big data integration and usage in all four fields, and thus, secure platforms

with better communication standards and protocols are greatly needed.

Examples of companies and institutions that provide solutions to generate, interpret and visualize combined omics and health clinical data
Company or institution | Type of solution | Website
Appistry | High-performance big data platform that combines self-organizing computational storage with optimized and distributed high-performance computing to provide secure, HIPAA-compliant, accurate on-demand analysis of omics data in association with clinical information | http://www.appistry.com
Beijing Genome Institute | This solution serves as a solid foundation for large-scale bioinformatics processing. The computing platform is an integrated service comprising versatile software and powerful hardware applied to life sciences | http://www.genomics.cn/en
CLC Bio | Utilizes proprietary algorithms, based on published methods, to accelerate data calculations and achieve remarkable improvements in big data analytics | http://www.clcbio.com
Context Matters | Provides a comprehensive tool that empowers pharmaceutical and biotechnology companies to make better strategic decisions using web-based applications, an easy-to-use interface and visualization tools to deal with complex data sets | http://www.contextmattersinc.com
DNAnexus | Provides solutions for NGS by using cloud computing infrastructure with scalable systems and advanced bioinformatics in a web-based platform to solve data management and the challenges in analysis that are common in unified systems | http://www.dnanexus.com
Genome International Corporation | Genome International Corporation (GIC) is a research-driven company that provides innovative bioinformatics products and custom research solutions for corporate, government, and academic laboratories in life sciences | http://www.genome.com
GNS Healthcare | A big data analytics company that has developed a scalable approach to deal with big data solutions that could be applied across the healthcare industry | http://www.gnshealthcare.com
NextBio | Big data technology that enables users to systematically integrate and interpret public and proprietary molecular data and clinical information from individual patients, population studies and model organisms, applying omics data in useful ways both in research and in the clinic | http://www.nextbio.com
Pathfinder | Develops customized software applications, providing solutions in different sectors, including healthcare and omics, offering technologies that enable business breakthroughs and competitive advantages | http://www.pathfindersoftware.com

Examples of big corporations offering solutions and pipelines to store, analyze and deal with complex biomedical information
Company | Solution(s) | Website
Amazon Web Services | Provides the necessary computing environment, including CPUs, storage, memory (RAM), networking, and operating system, for a hardware infrastructure as a service in the biomedical and scientific fields | http://aws.amazon.com
Cisco Healthcare Solutions | Offers different types of solution for the life sciences, including specific hardware and cloud computing for reliable and highly secure health data communication and sharing across the healthcare community | http://www.cisco.com/web/strategy/healthcare/index.html
DELL Healthcare Solutions | Connects researchers to the right technology and processes to create information-driven healthcare and accelerate innovation in life sciences with electronic medical record (EMR) solutions | http://www.dell.com/Learn/us/en/70/healthcare-solutions?c=us&l=en&s=hea
GE Healthcare Life Sciences | Provides expertise and tools for a wide range of applications, including basic research of cells and proteins, drug discovery research, as well as tools to support large-scale manufacturing of biopharmaceuticals | http://www3.gehealthcare.com/en/Global_Gateway
IBM Healthcare and Life Sciences | Provides healthcare solutions, technology and consulting that enable organizations to achieve greater efficiency within their operations, and to collaborate to improve outcomes and integrate with new partners for a more sustainable, personalized and patient-centric system | http://www-935.ibm.com/industries/healthcare
Intel Healthcare | Currently builds frameworks with governments, healthcare organizations, and technology innovators worldwide to build the health IT tools and services of tomorrow by combining different types of health information | http://www.intel.com/healthcare
Microsoft Life Sciences | Provides innovative, world-class technologies to help customers nurture innovation, improve decision-making and streamline operations | http://www.microsoft.com/health/en-us/solutions/Pages/life-sciences.aspx
Oracle Life Sciences | Delivers key functionalities built for pharmaceutical, biotechnology, clinical and medical device enterprises. Oracle maximizes the chances of discovering and bringing to market products that will help in treating specific diseases | http://www.oracle.com/us/industries/life-sciences/overview/index.html

Examples of companies that offer personalized genetics and omics solutions
Company | Applications and/or services | Website
23andme | A DNA analysis service providing information and educational tools for individuals to learn and explore their DNA through personal genomics | http://www.23andme.com
Counsyl | Offers tests for gene mutations and variations in more than 100 inherited rare genetic disorders using a DNA biochip designed specifically to test for these disorders | http://www.counsyl.com
Foundation Medicine | A molecular information company at the forefront of bringing comprehensive cancer genomic analytics to routine clinical care | http://www.foundationmedicine.com
Knome | Analyzes whole-genome data using software-based tests to simultaneously examine and compare many genes, gene networks and genomes as well as integrate other forms of molecular and nonmolecular data | http://www.knome.com
Pathway Genomics | Incorporates customized and scientifically validated technologies to generate personalized reports, which address a variety of medical issues, including an individual's propensity to develop certain diseases | http://www.pathway.com
Personalis | A genome-scale diagnostics services company pioneering genome-guided medicine focused on producing the most accurate genetic sequence data from each sample, using data analytics and proprietary content to draw accurate and reliable biomedical interpretations | http://www.personalis.com


Thank You Questions and Comments
