Automated Metadata, Provenance Cataloging and Navigable Interfaces:

Ensuring the Usefulness of Extreme-Scale Data

Project Objectives

Data from large-scale experiments and extreme-scale computing is expensive to produce and may be used for high-consequence applications. However, it is not the mere existence of data that is important, but our ability to make use of it. The goal of this research is to create a data model, infrastructure, and a set of tools that support data tracking, cataloging, and integration across a broad scientific domain. Our system would document workflow and data provenance in the widest sense, enabling us to answer the questions "who, what, when, how, and why" for each data element; provide information about the connections and dependencies between the data elements; and allow human or automatic annotation of any data element.
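As a rough illustration of the kind of record such a data model might hold, the sketch below stores the "who, what, when, how, and why" of a single data element, its links to upstream elements, and any annotations. It is a minimal sketch, not the project's actual schema: the class names, field names, and all sample values (`ProvenanceRecord`, `derived_from`, `qc-bot`, and so on) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Annotation:
    author: str  # a person or an automated tool (hypothetical field)
    text: str


@dataclass
class ProvenanceRecord:
    element_id: str
    who: str        # person or process that produced the element
    what: str       # short description of the element
    when: datetime  # creation time
    how: str        # code, instrument, or procedure used
    why: str        # purpose of producing the element
    derived_from: list[str] = field(default_factory=list)  # upstream element ids
    annotations: list[Annotation] = field(default_factory=list)

    def annotate(self, author: str, text: str) -> None:
        """Attach a human or automatic annotation to this element."""
        self.annotations.append(Annotation(author, text))


# Illustrative values only; none of these names come from the project.
record = ProvenanceRecord(
    element_id="run-0002/profile",
    who="analysis-user",
    what="fitted profile",
    when=datetime(2024, 1, 1, tzinfo=timezone.utc),
    how="fit_code v1.0",
    why="input for a transport simulation",
    derived_from=["run-0001/raw-signal"],
)
record.annotate("qc-bot", "fit converged")
```

Keeping the dependency links as plain element identifiers, rather than nested objects, is one possible design choice: it lets records be stored independently and joined into a graph later.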

Project Approach

The project combines research on integrated metadata, provenance, and ontology storage with research on user interfaces, including graphical navigation, to build tools for real-world science. These tools would be demonstrated in large national and international fusion sciences collaborations, from which user experience would be collected and lessons learned tabulated to provide feedback for improved design. We believe that rapid prototyping and testing by real users on real problems at scale is crucial; solutions to "toy" problems do not provide the opportunity for useful feedback. While we will use Fusion Energy Sciences as a test bed, our conceptual framework and data model will be quite general and will not contain specific references to the fusion domain. We expect that what we develop will be applicable to many, if not most, areas of science. Although the equations solved by simulations differ from field to field, the basic flow of information, the need to document workflow and provenance, and the need for traceable results are common to all. Similar common needs exist for experimental data. Our work will in effect create a modern "scientific notebook" for computational and experimental science. Furthermore, all fields of science struggle to integrate information from simulation and experiment and to extract knowledge from the confrontation between the two; our goal is to aid this task as well.

Project Elements

There are four main elements to the proposed research. The first will investigate the primitives and annotation language required to document provenance data automatically. The second will investigate the best approaches and technologies for integrating metadata, provenance, and workflow documentation, including methods for representing graphs, ontologies, and taxonomies. The third will research user interfaces, including graphical navigation: it will investigate the best methods for displaying, navigating, and interacting with the metadata catalog. A critical part of this element will be to provide users with tools to explore data relationships interactively. The fourth element involves deployment and testing. This project will use agile development methods, carrying out rapid prototyping, deployment, and testing aimed at getting feedback from real users working on real problems at scale. It is envisioned that research in the first three areas will feed rapidly into the deployment task, producing feedback that can influence the research within the time scale of this project.
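To make the idea of representing provenance as a graph and exploring data relationships more concrete, the sketch below stores "derived from" links in an adjacency map and walks them breadth-first to list everything a given element depends on. This is only one possible representation, shown under stated assumptions: the element names and the `lineage` function are hypothetical and are not part of the proposed system.

```python
from collections import deque

# Hypothetical catalog: each data element maps to the elements it was derived from.
derived_from = {
    "paper-figure": ["simulation-output", "fitted-profile"],
    "simulation-output": ["fitted-profile", "input-deck"],
    "fitted-profile": ["raw-signal"],
    "input-deck": [],
    "raw-signal": [],
}


def lineage(element, graph):
    """Return every upstream element of `element`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(element, []))
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue  # already reached through another path
        seen.append(parent)
        queue.extend(graph.get(parent, []))
    return seen


ancestors = lineage("paper-figure", derived_from)
# ancestors == ["simulation-output", "fitted-profile", "input-deck", "raw-signal"]
```

A navigable interface could issue this kind of query when a user selects an element, then draw the returned elements as nodes upstream of the selection.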

Project Team

To carry out the proposed effort, a closely coordinated, multi-institutional team from General Atomics (GA), Lawrence Berkeley National Laboratory (LBNL), and the Massachusetts Institute of Technology (MIT) will be formed, consisting of researchers from computer science and fusion science. Collectively, the institutions and individuals involved have a broad range of experience and expertise in all technical areas relevant to this project. The team has a strong history of working collaboratively and of providing tools and services for scientific research. Finally, the team has extensive connections in the physical and computer science communities, including both domestic and international groups eager to employ the products of the proposed research.