Data Science and Intro Stat
Kari Lock Morgan
Assistant Professor of Statistics
Penn State University
ECOTS, May 2018

Data Science in Intro
How should intro stat adapt in an era of abundant availability and use of data?

Data Science and Intro Stat
[Diagram labels: Computing; Statistics (Concepts, Methods, Theory); Domain knowledge]
Data Science?
Intro Stat? (computing and domain knowledge: simple, as needed to make sense of data)

How to Adapt?
Focus on making sense of data!

Focus on Making Sense of Data
[Formulas shown on slide; symbols lost in extraction]

How to Adapt?
Focus on making sense of data!
What kind?!?

Data Collection
Classical statistics: Design, randomness
  Ask a question
  Collect (small) data to answer it
  Inference!
Data science?
  Obtain available (big) data
  See what it tells you
  Inference?

Data Quality vs Quantity
Which provides a better (MSE) estimate?
a) A simple random sample of n = 100
b) A non-random sample of n = 50 million (!) (say, from the US population of 320 million), with a correlation of 0.05 (relatively small) between x and the probability of inclusion
The small random sample!!!
Meng, X.L. (2016). Discussion of "Perils and potentials of self-selected entry to epidemiological studies and surveys," Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), 319-376.
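
The comparison above can be checked with a small simulation. The sketch below is not from the talk. It assumes a scaled-down hypothetical population of 6,400 (rather than 320 million), keeps the same sampling fraction (50/320), sets the correlation between x and the probability of inclusion to about 0.05, and estimates the MSE of each sample mean. The function name `simulate_mse` and all parameter values are illustrative assumptions.

```python
import random
import statistics


def simulate_mse(pop_size=6400, frac_big=50 / 320, rho=0.05,
                 n_srs=100, reps=300, seed=1):
    """Estimate the MSE of the sample mean under two sampling schemes."""
    rng = random.Random(seed)

    # Hypothetical population: normal draws, standardized so that the
    # population mean is exactly 0 and the population sd is exactly 1.
    raw = [rng.gauss(0, 1) for _ in range(pop_size)]
    mu, sd = statistics.fmean(raw), statistics.pstdev(raw)
    x = [(v - mu) / sd for v in raw]
    true_mean = 0.0

    # Assumed inclusion model: probability linear in x, with the slope
    # chosen so that corr(inclusion indicator, x) is approximately rho.
    slope = rho * (frac_big * (1 - frac_big)) ** 0.5
    probs = [min(1.0, max(0.0, frac_big + slope * v)) for v in x]

    sq_err_srs, sq_err_big = [], []
    for _ in range(reps):
        # a) simple random sample of size n_srs, without replacement
        srs = rng.sample(x, n_srs)
        sq_err_srs.append((statistics.fmean(srs) - true_mean) ** 2)
        # b) large non-random sample: each unit self-selects with probs[i]
        big = [v for v, p in zip(x, probs) if rng.random() < p]
        sq_err_big.append((statistics.fmean(big) - true_mean) ** 2)

    return statistics.fmean(sq_err_srs), statistics.fmean(sq_err_big)


mse_srs, mse_big = simulate_mse()
print(f"Simple random sample (n = 100) MSE: {mse_srs:.4f}")
print(f"Large non-random sample MSE:        {mse_big:.4f}")
```

Under these assumptions the non-random sample is roughly ten times larger than the random one, yet its mean typically has the larger MSE, because the bias induced by self-selection does not shrink as the sample grows.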

Data Quality over Quantity
For population inference, small random sample beats large biased sample
For causality, small randomized experiment beats large observational study
(Statistics beats data science?)
Design (randomness) remains important
Inference remains important!
  How far might the estimate be from the truth?
  Is the effect more than might be seen by chance?

But
Random sampling/assignment is hard!
Non-random data are EVERYWHERE!!!
For intro stat to remain relevant, we have to acknowledge and embrace the abundance of available data.
AP Stat theme 2 (of 4): Data must be collected according to a well-developed plan if valid information is to be obtained.

How to Adapt?
Focus on making sense of data!
Keep some design and inference
Do more with available data

Design and Inference
Random sampling and assignment
Inferential concepts
  sampling variability
  interval estimation
  hypothesis testing
How can we cover this more efficiently?
Simulation-based inference: more for less
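
As one illustration of how simulation-based inference covers interval estimation and hypothesis testing with very little machinery, here is a minimal sketch of a bootstrap percentile interval and a randomization test for a difference in means. It is not taken from the talk: the function names, the number of resamples, and the two small data sets are hypothetical, chosen only to make the example runnable.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.fmean, reps=2000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample with replacement, recompute."""
    rng = random.Random(seed)
    n = len(data)
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(reps))
    lo_idx = int((1 - level) / 2 * reps)
    hi_idx = int((1 + level) / 2 * reps) - 1
    return boot[lo_idx], boot[hi_idx]


def randomization_test(group_a, group_b, reps=2000, seed=0):
    """Two-sided randomization test for a difference in means."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        # Reassign group labels at random, as if the null were true.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return observed, (extreme + 1) / (reps + 1)


# Hypothetical data, for illustration only.
group_a = [12, 15, 14, 16, 18, 17, 15, 16]
group_b = [10, 11, 9, 12, 13, 10, 11, 12]

ci_low, ci_high = bootstrap_ci(group_a)
observed, p_value = randomization_test(group_a, group_b)
print(f"Bootstrap 95% interval for mean of group A: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Observed difference in means: {observed:.3f}, p-value: {p_value:.4f}")
```

Both functions rest on the same idea, resampling the data and recomputing the statistic, so sampling variability, intervals, and tests can share a single framework.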
What can we cut?
Let's prioritize the good stuff!

Available Data
Acknowledge that not all data come from question -> design -> inference
Data quality and limitations (e.g. sampling bias, confounding, missing data)
Inferential cautions (e.g. multiple testing, sample size, non-random)
Multivariable thinking
Highlight the abundance, diversity, and omnipresence of data

One Way to Start
www.gapminder.org/tools/
www.gapminder.org/data/

How to Adapt?
Focus on making sense of data!
Keep some design and inference
Do more with available data
Emphasize the overlap

Emphasize Overlap
EDA, especially data visualization
Choice of graph/stat/parameter/method
Modeling
Interpretation and communication
Context, background, real conclusions
Technology

Technology in Intro Stat
Use technology in a way that
  engages students
  eliminates tedious work
  excites students
  enhances conceptual understanding
  empowers students to make sense of data
Extendable or easy?

Data Science in Intro
Let's think about how to keep intro stat relevant in an era of data science!
My opinion:
  Focus on making sense of data
  Acknowledge that not all data analysis is question -> purposeful design -> inference
  But that the above remains valuable!
  Emphasize the overlap
What do you all think?!?
www.tricider.com/brainstorming/3R3ZmK3a02l