Physical Science & Engineering - EUDAT

Physical Sciences & Engineering
Chair: Johannes Reetz, MPCDF - Max Planck Society
Rapporteur: Leon du Toit, University of Oslo
www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065

Physical Sciences and Engineering Parallel Track

Session 1: Data Pilot presentations (15:00)
  • 15:00 Introduction to track & objectives, Johannes Reetz, MPCDF - MPS
  • 15:05 NoMaD, Raphael Ritz, Max Planck Society
  • 15:20 SIMCODE-DS Data Pilot, Matteo Nori, Bologna University
  • 15:30 Tokamak data pilot, Alys Brett & David Muir, Culham Centre for Fusion Energy
  • 15:40 TURBASE-DNS Data Pilot, Fabio Bonaccorso, University of Rome Tor Vergata & INFN
  • 15:50 NFFA-EUROPE Data Pilot, Stefano Cozzini, CNR-IOM
  • 16:00 Direct simulation data of turbulent flows Data Pilot, Javier Jimenez & Alberto Vela-Martin, U. Politécnica Madrid
  • 16:10 Discussion
  • 16:30 Networking Coffee

Session 2: Data management challenges
Discussion facilitator: Claudio Cacciari, CINECA
Challenges: Data Repository vs. LT Data sharing
Objective: discuss common data (management, stewardship) challenges
Expected Outcome:
  • A series of insights into a variety of approaches and viewpoints between related communities
  • A set of (new) common needs where EUDAT could play a role

[Figure: EUDAT data domains. Published data domain: linking publications to digital objects; discovery of digital objects. Registered data domain: stage digital objects; register digital objects. Labels: data objects, data entities.]

Live Data repository vs. Long Term data sharing

A Data Repository for live data
  • Data gets updated during its life cycle
  • Metadata and provenance information gets updated
  • Collections get extended
  • Research collaborations need shared access to live, unregistered data, e.g. a Dropbox variant. Is this enough?

An archiving system for LT data sharing
  • Static data
  • Curation, data publication, certification

Can't we have a single system for all such types of data?
What is needed, what can be managed, what can be afforded?
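The contrast drawn on this slide, a live object that keeps changing versus a static object kept for long-term sharing, can be pictured with a toy model. This is only a sketch under assumptions and does not describe any EUDAT service: the names `LiveObject`, `RegisteredObject` and `register`, and the use of a SHA-256 digest as a stand-in for a persistent identifier, are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RegisteredObject:
    """Static snapshot (hypothetical): content and metadata can no longer change."""
    content: bytes
    metadata: Tuple[Tuple[str, str], ...]
    checksum: str  # assumed stand-in for a persistent identifier


@dataclass
class LiveObject:
    """Working copy (hypothetical): data and metadata change during the life cycle."""
    content: bytes = b""
    metadata: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)  # minimal provenance log

    def update(self, content: bytes, note: str) -> None:
        # Overwrite the data and record why it changed.
        self.content = content
        self.history.append(note)

    def register(self) -> RegisteredObject:
        # Freeze the current state; later updates do not affect the snapshot.
        digest = hashlib.sha256(self.content).hexdigest()
        frozen_meta = tuple(sorted(self.metadata.items()))
        return RegisteredObject(self.content, frozen_meta, digest)


live = LiveObject()
live.update(b"run 1", "first ingest")
snapshot = live.register()
live.update(b"run 2", "correction")
print(snapshot.content)  # b'run 1'
print(live.history)      # ['first ingest', 'correction']
```

In this sketch a single class hierarchy covers both kinds of data, but the costs raised on the slide (curation, certification, storage) are not modelled at all.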

Live Data repository vs. Long Term data sharing
Sharing & LT preservation
  • We are looking for ideas, sharing ideas, finding ideas
  • Sharing raw data
  • Publication of data: data becomes valuable to other communities after it has been published
  • The published paper is metadata; when people start reusing the data they collect more metadata that is not available in the paper from the author. This should be fed back into the metadata store, at the risk of having too much
  • Discoverability is not such a big problem within small communities, which are informed about their own activities; across communities is where the problem is
  • Who takes the costs of storage and curation if large data sets are being long-term archived?
  • Curation (selection vs management) is difficult in the sense that it is censorship (selectivity); who is entitled to do this? Custodian role: knowledgeable contact person; scalability concerns
  • Finer-grained definition of the custodian role: stages of responsibility, issues of LT, knowledge transfer
  • We should rely on AI and machine learning: agents to help scale this
  • Problem solving for now: data depositors should specify LT storage parameters (policy, e.g. lifetime, setting the starting point)
  • Data protection, privacy and sensitive information: different legal requirements; respond to changing demands on data creators, e.g. legitimate reasons for processing, qualified open access; managing consent implies system design decisions, e.g. for PII data
  • Our systems should accommodate versioning, data corrections, provenance
  • Often data do not speak for themselves and only become useful when combined with code; this brings software maintenance and executables. How does this relate to EUDAT? Where are the lines drawn?

  • Funders should address who pays for the custodians
  • Related to the data+code combo: sometimes data capture methods necessitate software to reconstruct the data in order to make it analysable
  • Client-side software is always relevant; therefore, software maintenance is always present
  • We need to store sufficient information in order to interpret the data; define different levels; collections should contain pointers to software or other necessary tools
  • Capturing workflow (provenance) requires the execution tools to be capable of generating the relevant metadata
  • Should align incentives so scientists have reasons to provide the info needed for useful LT preservation
  • Mitigate risk of knowledge loss by gathering as much metadata as possible
  • Consider the interrelatedness of practices vs technologies

Live Data repository vs. Long Term data sharing
Live Repository (workspace)
  • We should rely on the users/communities to control the community-specific data management
  • Domain and problem specificity leads to very heterogeneous data, making a common live data repository difficult to deal with as a service provider
  • Usage policy: trying to reduce dimensionality is a goal for them (defining metadata is part of this effort); tools that can help with this would be useful
  • How to deal with data that needs to be post-processed on ingest?
  • Want to access live data via APIs: people mostly want the latest version; discourage people from downloading, so nobody uses the data files

  • _Really_ large scale is out of scope; we are in the mid scale of data
  • The service provider should be clear about current capabilities and future plans regarding the scale of data
  • One could present structured data without knowing too much about the domain, e.g. via visualisation services. Q: Are these tools already available?

Live Data repository vs. Long Term data sharing
  • Use APIs to abstract and get rid of heterogeneity
  • Metadata enables solutions here; communities need to provide metadata standards if they want automated solutions; the amount of useful automation is proportional to the quality and standardisation of the metadata
  • Need a community model for metadata and interfaces; communities need help to develop standards
  • RDA can help guide development of standards
  • Problem is that reaching an understanding between large communities to standardise metadata takes a _very_ long time; always evolving, several versions, no silver bullet
  • Domain-specific metadata standards are also important; need to give knowledge to the researchers; we should also use existing ones; the creators should make tools to make this easier
  • Usage metadata: tracking users having agreed to TOUs. Support for this? Provide examples of TOUs? Manage access via TOUs?
  • Are metadata schemas in the registered domain fixed?

  • In the past people did take a lot of care with data and metadata but it came to nothing; we need to have high requirements for long-term preservation for it to be useful

Session 3: Physical Sciences & Engineering
Live Data repository vs. Long Term data sharing
Results and conclusion

Long-term preservation aspects
  • LT preservation for sharing ideas, finding ideas, preserving ideas
  • Sharing raw data
  • Upon reuse of LT data, more metadata is collected beyond what the author provided; feeding it back into the metadata store risks an inflation of metadata and annotations
  • Curation is difficult: risk of censorship (selectivity); who is entitled to this custodian role? The custodian role needs to be defined in detail. Scalability?
  • We could increasingly rely on AI and machine learning techniques; intelligent agents can perhaps help to reduce the deluge of data and metadata prior to preservation
  • Data depositors should specify LT preservation parameters (intentions at ingest time): policy, retention time, setting the starting point
  • Handling sensitive information: it is necessary to log the data providers' consent to use their data and to cope with the variety of legal requirements; this implies system design decisions
  • Systems should accommodate versioning, data corrections, provenance
  • LT preserved data remains useful only when linked to the preserved code
  • Collections should contain pointers to software, execution environments and workflows
  • Capturing workflows (provenance) requires the (workflow) systems to be capable of generating the relevant metadata

Live Repository
  • We should rely on the users/communities to control the community-specific data management
  • Want to access live data via APIs
  • Need a community model for metadata and interfaces; communities need help to develop standards
  • RDA can help guide development of standards
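Two of the conclusions above, that depositors state their preservation intentions at ingest time and that collections carry pointers to software and workflows, can be pictured as a small deposit record. The following is only an illustrative sketch: the class names `PreservationPolicy` and `Deposit`, their fields, and the example values are hypothetical and do not correspond to any EUDAT schema or service.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PreservationPolicy:
    """Intentions stated by the depositor at ingest time (hypothetical fields)."""
    start: date           # starting point of the retention period
    retention_years: int  # how long the data should be kept

    def expires(self) -> date:
        # Same month and day, retention_years later.
        # 29 February falls back to 28 February in non-leap years.
        target_year = self.start.year + self.retention_years
        try:
            return self.start.replace(year=target_year)
        except ValueError:
            return self.start.replace(year=target_year, day=28)


@dataclass
class Deposit:
    """A collection entry with pointers to the context needed to interpret it."""
    title: str
    policy: PreservationPolicy
    software: List[str] = field(default_factory=list)  # pointers to code or tools
    workflow: Optional[str] = None                      # pointer to workflow/provenance
    sensitive: bool = False
    consent_logged: bool = False

    def missing_context(self) -> List[str]:
        # List what a custodian would have to ask the depositor for.
        problems = []
        if not self.software:
            problems.append("no pointer to software")
        if self.workflow is None:
            problems.append("no pointer to workflow")
        if self.sensitive and not self.consent_logged:
            problems.append("sensitive data without logged consent")
        return problems


policy = PreservationPolicy(start=date(2020, 1, 1), retention_years=10)
deposit = Deposit(title="example simulation run", policy=policy)
print(policy.expires())           # 2030-01-01
print(deposit.missing_context())  # ['no pointer to software', 'no pointer to workflow']
```

A check like `missing_context` only reports gaps; whether such gaps should block ingest is a policy decision that the notes leave open.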
