  Technical Details
 

Collaboratory for Multi-scale Chemical Science

Project Lead: Larry Rahn

Institutional Points-Of-Contact:

Sandia National Laboratories: Larry Rahn
Pacific Northwest National Laboratory: Brett Didier
Argonne National Laboratory: Al Wagner
Lawrence Livermore National Laboratory: William Pitz
Los Alamos National Laboratory: David Montoya
National Institute of Standards and Technology: Thomas C. Allison
Massachusetts Institute of Technology: William Green
University of California at Berkeley: Michael Frenklach
Introduction

Rapid advances in computational hardware and software along with innovative experimental techniques are revolutionizing the rate at which chemical science research can produce the new information necessary to advance combustion technology, straining the traditional methods of communication through peer-reviewed literature and static databases. The Collaboratory for the Multi-scale Chemical Sciences (CMCS) Pilot Project brings together leaders in scientific research and technological development across multiple DOE laboratories, other government laboratories and academic institutions to develop an informatics-based approach to synthesizing multi-scale information to create knowledge in the chemical sciences.

CMCS is using advanced collaboration and metadata-based data management technologies to develop a Chemical Sciences portal providing support for distributed research, community communications, and data discovery, management, and annotation capabilities. The portal assists in documenting and browsing data pedigree and in communicating cross-scale dependencies between data produced at one scale and the results of computations using it at the next. A variety of standards-based mechanisms for extracting metadata from files, translating between schema, converting data formats, and integrating external applications are designed to minimize the work required to adopt CMCS capabilities.

The CMCS project also involves a set of efforts to demonstrate interactions between the portal and key national chemistry resources (data and software) and to support pilot groups using cutting-edge chemical informatics techniques. These efforts span from quantum mechanics to fluid flow and, together, are designed to demonstrate the potential of the CMCS infrastructure to qualitatively and quantitatively change the way chemical knowledge is produced. If successful, CMCS will significantly enhance the coordination of research efforts across related sub-disciplines in the chemical sciences, focus research at one scale on obtaining or refining values critical in the next, reduce work performed using limited or outdated values, and enhance the ability of the chemistry community to meet national research challenges.


Excerpts from the CMCS Proposal to the DOE National Collaboratories Program

Background and Significance

Multi-scale Dependencies - The Need for Collaboration

The area of chemical sciences is representative of many DOE programs in that it addresses complex multi-scale phenomena. The situation is similar in earth system studies, fusion research, high-energy physics, biology, and other areas of science - an understanding of environment and device scale phenomena requires more than simply applying one type of computation, with increased computing power, across scales. Different physical phenomena dominate system dynamics at these different scales, leading to a variety of models and experiments relevant in the different regimes. Information from one regime is used as input for the next, essentially "bootstrapping" from the atomistic to the device level. One of the major bottlenecks in such a multi-scale research enterprise is the passing of information from one level to the next in a consistent and validated manner.

The scientific process described above leads to a data- and model-centric view of the communications between sub-disciplines working at different scales. Data at one level is analyzed to develop a model that produces data used in turn by another, repeatedly across the range of scales and types of chemical information required. However, in this process more than just the raw data values need to be communicated. Confidence in a value's accuracy, its uncertainty, dependencies on other data, etc. must all be considered when using it in further computational and experimental research. In the direction of decreasing length and time scales, information about the sensitivity of models to particular data may place a premium on very accurate values for certain fundamental quantities. Enabling the rich bi-directional exchange of both data and metadata between scales is a critical issue in making progress.

Traditionally, this information flow has been accomplished through the research literature and, more recently, through databases of chemical values. Discovery of new information in these sources is a manual process. Further, the information is fragmented. Determining whether results presented in a paper depend on obsolete values from a different regime may require searching through several papers and databases. These factors make communication difficult and time-consuming and increase the likelihood of redundant and irrelevant research.

Across the domains represented in DOE's Scientific Discovery through Advanced Computing (SciDAC) program, communication of expertise and the flow of information between sub-disciplines targeting different physical regimes will be as critical as increasing computational capabilities within a domain in effectively and efficiently producing practical results from basic research. Current manual approaches to coordinating multi-scale research cannot themselves scale to the amounts of data that will be generated through SciDAC and to the level of effectiveness and efficiency required to tackle national science issues in a cost-effective manner. The multi-scale communications challenges facing SciDAC researchers are not discipline specific. Thus, a solution to these issues in the chemical sciences will provide a model for multi-scale science that can guide efforts in other domains.

An Informatics Approach to Multi-scale Science

To overcome current barriers to collaboration and knowledge transfer among researchers working at different scales, a number of enhancements must be made to the information technology infrastructure of the community:

  • A collaboration infrastructure is required to enable real-time and asynchronous collaborative development of standards for data and metadata description, inter-scale scientific communication, geographically distributed disciplinary collaboration, and project management.

  • Tools now used to generate and analyze data at each scale must be modified to enable generation and storage of the required metadata in a format that allows interoperability with other tools and collaboratory functions, and must be made available for use by geographically distributed collaborators.

  • Repositories are required to store chemical sciences data and metadata in a way that preserves data integrity and allows web access.

  • New tools are required to search and query metadata, and to retrieve data across all scales, disciplines, and locations. These tools should be available via an integrated user-customizable interface or portal.

The complexities of managing information within such an infrastructure are daunting, and the creation, communication, and use of the additional information could quickly become unwieldy. However, recent technological advances, in particular the development of the extensible markup language (XML) [1] for defining machine and human readable metadata based on standard schema, have significantly reduced the barriers to creating such a comprehensive informatics environment.

We propose a Collaboratory for Multi-scale Chemical Science (CMCS) focusing on combustion research that will demonstrate that an integrated multi-scale approach to scientific and engineering research is not only possible but can produce significant benefits in harnessing research to address real-world issues. The field of combustion is critical to the DOE mission for clean and efficient energy, and the DOE has ongoing investments in research across the full range of relevant scales and disciplines. The CMCS will bring an integrated, informatics-based approach to combustion research that enhances and begins to automate the flow of information between sub-disciplines.

CMCS efforts to develop tools supporting the multi-scale analysis of chemical systems for combustion will be directly applicable to other communities in the chemical sciences and related fields. Furthermore, CMCS will provide a model for multi-scale science that can be replicated in other domains.

Combustion: A Multi-scale Chemical Sciences Pilot


Figure 1. Combustion modeling requires the integration
of scientific knowledge over a large range of scales.

Fossil fuel energy supply and fossil-fueled combustion systems are the cornerstones of the industrial and commercial sectors of the U.S. economy, accounting for 85% of the energy consumed in the United States each year. In the private sector the combustion of fossil fuels provides a level of comfort and mobility for U.S. citizens that is unrivaled in the world. Fossil fuels continue to be inexpensive and the supply of fossil fuels remains stable, although heavy dependence on foreign sources has led to major economic and societal dislocations in the past twenty-five years and threatens to again in the future. Also, recent changes in international environmental mandates for lower CO2 emissions have emerged as strong drivers for increased combustion efficiency. Despite continuing investments in alternative energy sources, the importance of hydrocarbon fuels, as they relate to the economy and quality of life in the United States, is unlikely to change in the foreseeable future.

The advancement of the DOE mission for efficient, low-impact energy sources and utilization relies upon continued significant advances in fundamental chemical sciences and the effective use of the knowledge accompanying these advances across a broad range of disciplines and scales. This challenge is exemplified in the development of predictive computational models for realistic combustion devices. Combustion modeling requires the integration of computational physical and chemical models that span space and time scales from atomistic processes to those of the physical combustion device itself, as illustrated in Figure 1.

Combustion systems involve three-dimensional, time-dependent, chemically reacting turbulent flows that may include multiphase effects with liquid droplets and solid particles in complex physical configurations. Against this fluid-dynamical backdrop, chemical reactions occur that determine the energy production in the system, as well as the emissions that are produced. For complex fuels, the chemistry involves hundreds to thousands of chemical species participating in thousands of reactions. These chemical reactions occur in an environment that is defined by both thermal conduction and radiation. Reaction rates as a function of temperature and pressure are determined experimentally and by a number of methods using data from quantum mechanical computations. The collaborative creation, discovery, and exchange of information across all of these scales and disciplines are required [2] to meet DOE's mission requirements.

Architecture

The description above provides a brief glimpse of the inherent complexity that must currently be managed manually by combustion researchers. CMCS does not seek to eliminate this complexity, but to provide community-wide capabilities for managing it. As shown in Figure 2, CMCS will provide the majority of its capabilities through a web-based Multi-scale Chemical Science (MCS) portal. This portal will provide a customizable access point for accessing community knowledge bases, interacting with other members of the multi-scale chemistry community, performing and documenting experiments, and supporting the multiple paths of information flow necessary to integrate these activities. These capabilities rely on an underlying Grid [23] infrastructure including a sophisticated metadata and annotation management subsystem. Standard chemical information interchange schema combined with metadata/data translation mechanisms enable loose federation of underlying group and community data stores and global tracking of information such as data pedigrees. Chemistry domain applications, modified or extended to understand the community-developed schema, help automate the collection of and directly exploit data pedigree and other newly available data annotations.


Figure 2. Architecture diagram for CMCS showing portal integration of domain applications
and data resources from across the multi-scale chemistry community.

MCS Community Portal

The multi-scale chemical science (MCS) portal will be developed as the focal point of the collaboratory. As shown in Figure 3, the portal will provide a broad range of capabilities. Bringing these capabilities together across chemistry sub-domains will provide convenience while helping researchers think and act in the larger combustion context. For the purposes of discussion, portal capabilities are divided into groups related to accessing community knowledge resources, interacting within the community, and generating new knowledge through research projects. These activity-based groupings are clearly not orthogonal and our design does not make any such distinctions. Rather, the design prescribes the use of common services across all portal tools to enable facile communications between them, with the goal of enhancing researchers' ability to organize and move information through the various activities beyond what is currently possible.


Figure 3. Prototype Design of Portal for Multi-scale Chemical Science.

Figure 3 shows details of the portal's overall design. Users will be able to create profiles that define their preferred view(s) of the portal. Profiles can be used to aggregate users into communities of interest such as "Turbulent reacting flow model development" or "XML schema for reduced chemical mechanisms". Security is integrated into the portal architecture, providing authentication, authorization control, and encryption capabilities that can be passed to underlying resources.

The MCS portal will be customizable so that it supports the needs of the user and the user's workflow. It will allow users to emphasize information relevant to a subset of chemical scales or to emphasize data submission and search capabilities, tools for organizing and understanding relationships in the data, or the capabilities for interaction with colleagues while helping maintain contextual awareness of the overall community. Users will have access to portal configuration tools to create custom views and save them in personal, group, or community profiles.

A variety of web-based technologies exist for the development of such a portal. The WorkTools system [24], developed at the University of Michigan, is currently in use within the Space Physics and Aeronomy Research Collaboratory (SPARC) as a means of accessing data and colleagues. It allows researchers to select a set of tools such as chat boxes and live instrument data feeds and position them as desired within a persistent web page. Similarly, NCSA's OPIE system [25] allows surfers to arrange tools within a browser window through the same click-and-drag operations used to arrange windows on a computer desktop. Within the DOE community, work has been done to allow Grid resources to be accessed through portals, as exemplified by the NPACI HotPage site [26], and additional work to develop standards for portal layout via a PortalML language [27] is anticipated within SciDAC. Commercial tools for portal development also exist, targeted at companies wishing to create "My Yahoo" style enterprise portals for employees and customers. While each of these systems has strengths and weaknesses, many could provide the basic capabilities needed within MCS. In developing the portal, we will evaluate existing and emerging portal development environments, select one, and adapt it as needed to meet MCS requirements, potentially in collaboration with the environment's developers.

Once the basic infrastructure is designed, we will begin to integrate specific MCS capabilities. Since significant technology development will be required to achieve the desired levels of functionality and integration, we will follow an incremental approach to developing the overall portal, using it to enable access to prototype capabilities very early in the project. Over time, more integrated and feature-rich versions of the individual tools will be developed and made available through the portal. To prioritize efforts and ensure overall system usability, specific use cases based on the CMCS guiding scenario will be developed early in the project and periodically updated to reflect the growing understanding of research.

System Security

Due to its central role in CMCS, the portal is a logical place to coordinate system security. We envision the portal providing single sign-on capabilities across the tools represented. The wide range of tools to be integrated and the bi-directional flow of information between private and public repositories proposed in CMCS (as detailed in subsequent sections) make system security a challenging issue. Obtaining sufficient expertise with security technologies and in the design and deployment of secure services in distributed environments was a primary consideration in the selection of CMCS development team members.

Existing public key infrastructure (PKI) technologies can be leveraged to provide the basic authentication, authorization, encryption, and non-repudiation services required in CMCS. Because the portal will be web-based, standard web mechanisms for authenticating users, including using public-key certificates, can be applied. Web browsers and servers currently incorporate the technology necessary to request and validate user credentials and to set up secure socket layer (SSL) encrypted communications. The Grid Security Infrastructure (GSI), implemented, in collaboration with others, by the Distributed Systems Laboratory (DSL) at Argonne National Laboratory [28], is a PKI-based system that includes delegation capabilities, allowing a globally trusted public-key identity to be used to securely obtain a local credential in a non-PKI system, e.g. a UNIX username/password. A SciDAC proposal, "Security and Policy for Group Collaboration" [29], led by ANL, offers advancements in GSI that will be useful in later stages of CMCS development. In CMCS, GSI would allow an MCS portal user's PKI credentials to be used to automatically obtain credentials for back-end systems such as databases. The advanced attribute-based Akenti system, which is used within the DOE2000 Diesel Collaboratory project, is intended to provide scalable security services in highly distributed, collaborative, multi-institutional environments. The work described in LBNL's "Distributed Security Architectures: Middleware for Distributed Computing" submission to SciDAC [30], which proposes to integrate GSI credentials and Akenti and also to integrate Akenti access control with the Distributed Authoring and Versioning (DAV) extension to the HyperText Transfer Protocol (HTTP), would provide significant benefits to CMCS. (The significance of DAV in the CMCS is discussed later in the proposal.) Finally, digital signature and timestamping services are also becoming readily available through software development kits and Internet services.

Thus, it should be possible to assemble a strong basic security infrastructure for the MCS portal from existing components through careful integration and deployment efforts. However, two aspects of CMCS may require advances to meet community needs. The first deals with limiting the amount of resources that can be used on the user's behalf when the portal delegates the user's credentials to a back-end or external system. Similarly, the length of time over which such delegation is allowed to occur may also need to be limited. Appropriate policies for various CMCS use scenarios will need to be defined and technologies to implement them must be developed. A variety of approaches can be used to provide incremental capabilities in this area, and we anticipate the development of general limited delegation capabilities within the Grid community. We will seek to collaborate with any such efforts by providing requirements and testing the systems developed. The second aspect of CMCS security requirements that will need to be refined during the project relates to the capabilities provided for attaching private annotations to publicly available data. The general solution of providing fine-grained access controls on each such relationship is clearly too cumbersome. Defining an appropriate level of access granularity to balance ease-of-use with user requirements will require experimentation during the project lifecycle.
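
The two delegation limits discussed above (a resource budget and a time window) can be illustrated with a minimal sketch. It is not part of GSI or of any CMCS design; the class name, field names, and the choice of CPU hours as the limited resource are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DelegatedCredential:
    """Hypothetical record of a credential the portal holds on a user's behalf."""
    user: str
    issued_at: float        # seconds since the epoch
    max_age_s: float        # policy: how long the delegation stays valid
    cpu_hours_limit: float  # policy: resource budget usable on the user's behalf
    cpu_hours_used: float = 0.0

    def authorize(self, cpu_hours: float, now: float) -> bool:
        """Approve a request only if it is within both the time and resource limits."""
        if now - self.issued_at > self.max_age_s:
            return False  # the delegation has expired
        if self.cpu_hours_used + cpu_hours > self.cpu_hours_limit:
            return False  # the request would exceed the resource budget
        self.cpu_hours_used += cpu_hours
        return True


cred = DelegatedCredential("alice", issued_at=0.0, max_age_s=3600.0, cpu_hours_limit=10.0)
first = cred.authorize(6.0, now=100.0)    # within both limits
second = cred.authorize(6.0, now=200.0)   # would exceed the 10 CPU-hour budget
third = cred.authorize(1.0, now=4000.0)   # past the one-hour lifetime
```

Only the first request is approved. A real policy would presumably be enforced by the back-end system rather than by the portal alone, which is one reason the text anticipates general limited-delegation support from the Grid community.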

Knowledge Management

CMCS will provide a unifying interface to combustion-related chemical science information. This information is currently scattered in public and private databases, flat files, and in the scientific literature, and is organized primarily along sub-disciplinary lines. Information is locked in a variety of formats and provided with varying degrees of validation and context. The NIST databases [31] have been widely praised by the combustion community for their organizing influence, but the fraction of data contained therein is small. In CMCS, we propose to leverage the NIST and other databases, and provide added value to working researchers in a number of directions:

  • Providing a single location from which to access the wide range of data needed for combustion research

  • Reducing or eliminating the difference in access protocols and procedures between resources

  • Developing community standards for representing information that crosses domain boundaries

  • Using these definitions, combined with translation capabilities, to allow federated searches across multiple databases

  • Using similar mechanisms to automate the reformatting of information retrieved from databases for use in research applications and vice versa

  • Developing machine-processable representations of data pedigree and dependency relationships

  • Providing tools for manual and automated traversal of pedigree and dependency relationships to support group and community data validation efforts and exploration of sensitivity analyses

  • Allowing arbitrary annotations and comments to be attached to data values to support community expertise building and consensus formation activities

The underlying architecture proposed to address these issues, and the implementation plans for providing search, retrieval, and submission capabilities and for managing and exploring data pedigree information are detailed in the following subsections. Additional capabilities related to generation and exploitation of data annotations, and to the refinement and validation of information through community processes are discussed in sections 1.3.6 and 1.3.7 respectively. An initial exploration of the type of architecture proposed here, and a further discussion of its benefits, is reported in a draft paper in Appendix D.

Metadata/Annotation Management Overview

Metadata is commonly defined as information 'about' data values and data sets. However, such a definition is very dependent on one's perspective. For example, whether a chemical formula is metadata about a molecular geometry, or whether geometry and other information such as the heat of formation are metadata about a chemical formula is a matter of perspective. Such differences of opinion, once encoded in software, are an endless source of barriers to cross-scale collaboration. Within this proposal, we modify this definition to equate the term metadata with data values that have meaning across domains. In the example above, the chemical formula is metadata because it has meaning for both quantum- and thermo-chemists. Heat of formation would not be considered metadata until we expand scope and realize that both thermochemists and kineticists ascribe meaning to it.

Although this shows that our definition continues to have some context-dependence, it is more powerful than the original. By our definition, data is opaque and meaning-free outside a sub-discipline and, as a corollary, efforts to standardize formats and meanings between collaborators to support inter-scale search capabilities, application interoperability, etc. can be confined to metadata. Further, the system architecture can treat data as opaque as well and no restrictions need be placed on its format. In contrast, because metadata must be understood and manipulated, it must be formatted in a way that exposes its meaning in machine-comprehensible form. An important consequence of this bifurcation is that it minimizes the effort required to allow two parties to collaborate - no changes are required to any applications, and no agreements need be reached about the meaning of terms, except those directly concerning the values that will be exchanged.
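
The split between opaque data and shared metadata can be pictured with a short sketch. The record layout, field names, and content-type labels below are hypothetical and are not taken from the CMCS or SAM designs; the point is only that search touches the metadata and never parses the payload.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """Opaque payload plus metadata that has meaning across domains."""
    payload: bytes     # never parsed by the infrastructure
    content_type: str  # opaque type label for the payload
    metadata: dict = field(default_factory=dict)  # machine-readable, cross-domain values


def find_by_metadata(objects, key, value):
    """Search looks only at metadata; payloads stay untouched."""
    return [obj for obj in objects if obj.metadata.get(key) == value]


geometry = DataObject(
    payload=b"...quantum chemistry output in any format...",
    content_type="application/x-hypothetical-qc-output",
    metadata={"chemical_formula": "CH4"},
)
table = DataObject(
    payload=b"...thermochemistry table in another format...",
    content_type="application/x-hypothetical-thermo-table",
    metadata={"chemical_formula": "CH4"},
)
other = DataObject(
    payload=b"...",
    content_type="text/plain",
    metadata={"chemical_formula": "H2O"},
)

matches = find_by_metadata([geometry, table, other], "chemical_formula", "CH4")
```

Both CH4 objects are found even though their payload formats differ, because the two producers only had to agree on the one metadata term they exchange.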

Such an architecture is at the heart of the DOE2000 electronic notebooks and is the reason they can be easily extended to handle new annotation types; the complexities of handling the annotation data can be confined to the components that create and render the annotation. No translation of the annotation data is required by the notebook and the base notebook system only assigns meaning, and has code to manipulate, metadata such as author name, creation date, data type, digital signatures, etc. In defining the CMCS architecture, we looked to exploit the architecture explored in DOE2000 electronic notebooks while taking advantage of technologies that have matured since their inception. The directions being proposed for the evolution of electronic notebooks and the broader concept of metadata-based system design proposed within the "Scientific Annotation Middleware (SAM)" [32] submission to SciDAC have been a key influence.

The basis for both CMCS and SAM architectures is the formatting of metadata using XML [1]. XML is a powerful language for encoding the definition of technical terms in a human- and machine-readable form. XML's expressive power, together with the availability of technologies for manipulating it (authoring, parsing, validating, translating, etc.), has made it a de facto standard for information exchange in new systems.

Since both efforts will leverage the Web-based DAV protocol [9], CMCS anticipates being able to take advantage of work done under the SAM proposal. DAV is an Internet Engineering Task Force (IETF) standard set of extensions to the HTTP/1.1 protocol to support basic data management over the web including storage and retrieval of typed, opaque data files/objects, content locking, hierarchical collections and annotation of the data with arbitrary metadata [9]. It defines the formatting of metadata in properties consisting of XML key:value pairs and provides operations for creating, removing, and querying them. The extensible DAV Searching and Locating (DASL) protocol [33] adds methods for server-side search capabilities. It provides a basic search grammar and can be extended with additional grammars, e.g. XML Query.
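
As a rough illustration of DAV properties, the sketch below builds the XML body of a PROPPATCH request, the DAV method for setting and removing properties. The `propertyupdate`, `set`, and `prop` elements in the `DAV:` namespace come from the DAV specification; the `urn:example:cmcs` namespace and the property names are invented for this example and are not a CMCS schema. No request is sent.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
CMCS_NS = "urn:example:cmcs"  # hypothetical namespace, not a published CMCS schema


def build_proppatch(properties: dict) -> bytes:
    """Build a PROPPATCH request body that sets the given properties."""
    ET.register_namespace("D", DAV_NS)
    ET.register_namespace("c", CMCS_NS)
    update = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_el = ET.SubElement(update, f"{{{DAV_NS}}}set")
    prop = ET.SubElement(set_el, f"{{{DAV_NS}}}prop")
    for name, value in properties.items():
        # Each property is one XML element: the tag is the key, the text is the value.
        ET.SubElement(prop, f"{{{CMCS_NS}}}{name}").text = value
    return ET.tostring(update, encoding="utf-8")


body = build_proppatch({"chemical-formula": "CH4", "author": "A. Researcher"})
print(body.decode("utf-8"))
```

A client would send this body with the HTTP method PROPPATCH to the URL of the data object; the payload stored at that URL is not touched.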

A layered set of services built on top of DAV/DASL that provide successively more specialized capabilities has been outlined in the SAM proposal. The brief description here highlights functionality directly relevant for CMCS. The Metadata Management Services (MMS) will support simple federation of DAV servers, allowing propagation of storage and retrieval requests and queries down through a hierarchy of servers. It will also provide mechanisms for registering metadata generation tools that can parse data of specified types and generate new DAV properties. The SAM Semantic Services (SS) adds a standard for representing semantic relationships between DAV objects. A Notebook Services (NS) layer defines records management capabilities in terms of specific semantic relationships and property definitions. A set of interface components, including a graphical relationship browser, and programming interfaces provides access to the services. A SAM-based notebook will be built from these components that could be leveraged by the CMCS to develop novel functionality.
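
The idea of registering metadata generation tools by data type might look like the following sketch. The registry, decorator, and property names are assumptions for illustration and do not describe the SAM interfaces. The example tool reads an XYZ-format geometry file, whose first line is the atom count and whose second line is a comment; the coordinates shown are illustrative only.

```python
from typing import Callable, Dict

# Maps a data type label to a function that derives properties from raw bytes.
_generators: Dict[str, Callable[[bytes], Dict[str, str]]] = {}


def register_generator(data_type: str):
    """Decorator that registers a metadata generation tool for one data type."""
    def decorator(func):
        _generators[data_type] = func
        return func
    return decorator


@register_generator("chemical/x-xyz")
def xyz_properties(raw: bytes) -> Dict[str, str]:
    """Read the atom count and comment line from an XYZ-format geometry file."""
    lines = raw.decode("utf-8").splitlines()
    return {"atom-count": lines[0].strip(), "comment": lines[1].strip()}


def generate_properties(data_type: str, raw: bytes) -> Dict[str, str]:
    """Return generated properties, or nothing if no tool is registered for the type."""
    generator = _generators.get(data_type)
    return generator(raw) if generator else {}


water = b"3\nwater\nO 0.0 0.0 0.1\nH 0.0 0.8 -0.5\nH 0.0 -0.8 -0.5\n"
props = generate_properties("chemical/x-xyz", water)
```

Data of an unregistered type is simply stored without generated properties, which matches the treatment of data as opaque.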

Community Standard Schemas

This high-level description of the proposed CMCS metadata management infrastructure provides the context necessary to discuss the tasks that will be undertaken in CMCS to provide community knowledge management capabilities. The critical initial steps are the standardization of definitions for relevant chemical information that will be exchanged by CMCS researchers and the formal representation of these definitions in XML. Such representations are referred to as schema. CMCS researchers will engage their respective chemical science communities in these schema development efforts through the developing MCS portal and traditional community forums. Although we do not wish to under-represent the effort that will be required, it should be noted that overall scope is relatively small. CMCS does not require a single, global schema that represents all of chemical knowledge. Only information that will be exposed as metadata need be defined. Further, schema can be evolved within the system and variants can be accommodated. Thus, for example, if the stakeholders for two community resources wish to define metadata quantities beyond the minimal set on which agreement can be reached, annotations can be supported in both dialects. If complete agreement can be reached at a later date, a translator can be registered with the system to allow queries expressed in the new schema to match metadata defined with the old standards.
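
A registered translator of the kind described above can be as simple as a mapping between property names. The sketch below assumes flat key/value metadata and invented property names in both dialects; real schema translation (for example with XSLT) could be considerably richer.

```python
# Hypothetical property names in an "old" and a "new" schema dialect.
OLD_TO_NEW = {"formula": "chemical-formula", "dHf": "heat-of-formation"}


def translate(metadata: dict, mapping: dict) -> dict:
    """Rename keys found in the mapping; pass unknown keys through unchanged."""
    return {mapping.get(key, key): value for key, value in metadata.items()}


def matches(query: dict, metadata: dict, mapping: dict) -> bool:
    """Check a new-schema query against metadata stored in the old schema."""
    translated = translate(metadata, mapping)
    return all(translated.get(key) == value for key, value in query.items())


old_record = {"formula": "CH4", "source": "group-notebook"}
hit = matches({"chemical-formula": "CH4"}, old_record, OLD_TO_NEW)
miss = matches({"chemical-formula": "C2H6"}, old_record, OLD_TO_NEW)
```

The stored record is never rewritten; translation happens at query time, so annotations in both dialects can coexist.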

As shown in Figure 2, we anticipate the definition of five potentially overlapping schema within CMCS that define the information that must be exchanged across the boundaries between the five chemistry sub-domains in the guiding scenario as well as a general schema for representing dependency data. The inter-domain schema will most likely leverage nascent community efforts such as the Chemical Markup Language (CML) [15], which defines a set of basic molecular properties such as chemical formula and molecular geometry. Efforts at PNNL and NIST to define quantum chemistry, and thermochemistry and kinetics information, respectively, can also be leveraged. If significant overlap is observed in the developing schema, it may be possible to pull common elements into a broader chemistry schema. Since there are no limitations in DAV and XML in using multiple schemas to define properties on a single object, any such adjustments can be made with little impact on CMCS scope.

Since the definition of dependency relationships is not chemistry-specific, it may be possible to form a broad collaboration amongst problem solving environment developers to define the concepts necessary to track data pedigree and map the sensitivity relationships that relate the uncertainty in one quantity to the uncertainty in derived quantities. Such a common schema would enable re-use of tools for traversing such relationships in multiple projects.
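
Because pedigree is a domain-neutral "derived from" relationship, a traversal tool needs no chemistry knowledge. The following sketch uses a plain dictionary as the dependency graph; the item identifiers are hypothetical and the representation is an assumption, not a proposed schema.

```python
# Each key is a data item; its list names the items it was derived from.
# All identifiers here are hypothetical.
DERIVED_FROM = {
    "flame-simulation": ["reaction-mechanism"],
    "reaction-mechanism": ["rate-constant-A", "rate-constant-B"],
    "rate-constant-A": ["heat-of-formation-X"],
    "rate-constant-B": ["heat-of-formation-X", "heat-of-formation-Y"],
}


def pedigree(item: str, graph: dict) -> set:
    """Return every item that `item` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(item, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def affected_by(source: str, graph: dict) -> set:
    """Return every item whose pedigree includes `source`."""
    return {item for item in graph if source in pedigree(item, graph)}


upstream = pedigree("rate-constant-A", DERIVED_FROM)
downstream = affected_by("heat-of-formation-Y", DERIVED_FROM)
```

The second call answers the question raised earlier in the proposal: which results depend on a value that has since been revised.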

Resource Wrapping and Registration / Metadata-enabled Data Repositories

DAV is quickly gaining in popularity. Major applications including Microsoft Office and Oracle's Internet File System ship with DAV support. Database and application framework vendors are in various stages of offering support for developing DAV views of an underlying relational database. Public domain DAV servers exist that support flat file data repositories, and servers based on the open-source MySQL relational database are expected. Thus, we expect that developing basic DAV interfaces for CMCS community databases will be a relatively straightforward task that can be completed early in the project. NIST databases of thermochemical and kinetics data will be initial targets for this work. It is less clear how quickly support for DASL and query grammars such as XML Query will become available and it may be necessary to implement some basic functionality in these areas to achieve CMCS goals.

While the number of community data resources that will be wrapped as part of the CMCS pilot project is limited, it should be well within the scope of individual researchers and institutions to develop wrappers for additional databases. The translation and metadata generation capabilities available through SAM should simplify interaction with resources that have already been converted to XML using non-CMCS schema. We will actively pursue such integration activities during the project lifecycle with the aim of building momentum for the long-term support of CMCS capabilities within the community.

Once resources are DAV-enabled, a mechanism is needed to make the portal knowledge-management capabilities aware of them. The registration process should be relatively straightforward, involving the specification of the resource's uniform resource locator (URL) and any schema translations that should be applied when accessing it through the MCS portal. For submission operations, and for non-public resources, interaction between the MCS portal and resource security systems may involve more complexity and require configuration of GSI-based credential delegation. We will initially develop a form-based resource registration capability accessible through the portal that will support the simple case. We will investigate the practicality of providing web-based configuration for the more general case involving security system integration, and work with the CMCS community to determine whether some options, such as the selection of particular schema translations, should be exposed on a per-user basis through the general portal customization mechanism. As described in more detail in later sections, we intend to allow personal and group information in notebooks to be registered as searchable resources as well.
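The information captured by the simple registration case can be sketched as a small record plus an in-memory registry. This is an illustrative data structure only: the class names, field names, and example URLs are hypothetical, and a real portal would persist the records and handle the security-integrated case separately.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class ResourceRegistration:
    """What a form-based registration might collect (hypothetical fields)."""
    url: str
    schema_translations: list = field(default_factory=list)
    needs_credential_delegation: bool = False


class ResourceRegistry:
    """In-memory registry keyed by resource URL."""

    def __init__(self):
        self._resources = {}

    def register(self, registration):
        parts = urlparse(registration.url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute HTTP(S) URL: {registration.url!r}")
        self._resources[registration.url] = registration
        return registration

    def simple_case(self):
        """Resources that need no security-system integration."""
        return [r for r in self._resources.values()
                if not r.needs_credential_delegation]


registry = ResourceRegistry()
registry.register(ResourceRegistration(
    url="http://example.org/dav/kinetics/",           # hypothetical URL
    schema_translations=["kinetics-to-interchange"],  # hypothetical name
))
registry.register(ResourceRegistration(
    url="https://example.org/dav/private-notebook/",  # hypothetical URL
    needs_credential_delegation=True,
))
print([r.url for r in registry.simple_case()])
```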

Multi-scale Integrated Search

The multi-scale search capability within the MCS portal will provide researchers with access to the registered data stores in terms of the community-defined interchange schema, regardless of the format of the resources themselves. The types of anticipated queries range from simple searches for all measured or calculated values of a molecular property to searches for contact information for groups who have calculated properties that have significant effects on the uncertainty in a derived reaction rate. The latter, which might be used to initiate a discussion about obtaining updated values of the more fundamental properties, would be a laborious process today, but within CMCS it could be described succinctly and executed automatically. As with databases today, it may be important to provide predefined templates for such complex queries rather than requiring users to formulate them from scratch in the low-level query grammar.

We anticipate that a generic web-based DASL client can provide significant functionality for simple queries since DASL supports methods by which the client can dynamically learn what search grammars are supported and can discover the complete list of property names (keys) that exist. Thus, a generic DASL client may be able to provide scaffolding for helping users construct queries such as drop down lists of the available chemistry-related query terms and search operators, without itself being chemistry specific. We are not aware of such a mature DASL client at this time, so effort may be required within CMCS to enhance a simpler client or to develop one. However, because the client is not chemistry specific, there are significant opportunities to work with SAM and other projects using DASL in its development.
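To make the simple-query case concrete, here is a sketch of a generic client building a search request body without any chemistry-specific logic. The element names follow the `DAV:` basicsearch grammar as it was later published in RFC 5323; a server implementing an earlier DASL draft may differ. The chemistry namespace, property names, and scope URL are hypothetical and would in practice come from the property list the server advertises.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"
CHEM = "http://example.org/schema/chem"  # hypothetical namespace


def q(namespace, name):
    """Return the ElementTree qualified name for namespace + local name."""
    return f"{{{namespace}}}{name}"


def build_basicsearch(scope_href, select_props, where_prop, where_value):
    """Build a basicsearch body: select properties where one equals a value."""
    root = ET.Element(q(DAV, "searchrequest"))
    search = ET.SubElement(root, q(DAV, "basicsearch"))

    select = ET.SubElement(ET.SubElement(search, q(DAV, "select")), q(DAV, "prop"))
    for namespace, name in select_props:
        ET.SubElement(select, q(namespace, name))

    scope = ET.SubElement(ET.SubElement(search, q(DAV, "from")), q(DAV, "scope"))
    ET.SubElement(scope, q(DAV, "href")).text = scope_href
    ET.SubElement(scope, q(DAV, "depth")).text = "infinity"

    eq = ET.SubElement(ET.SubElement(search, q(DAV, "where")), q(DAV, "eq"))
    ET.SubElement(ET.SubElement(eq, q(DAV, "prop")), q(*where_prop))
    ET.SubElement(eq, q(DAV, "literal")).text = where_value
    return ET.tostring(root, encoding="unicode")


query = build_basicsearch(
    scope_href="http://example.org/dav/thermo/",  # hypothetical repository
    select_props=[(CHEM, "formula"), (CHEM, "method")],
    where_prop=(CHEM, "formula"),
    where_value="CH4",
)
print(query)
```

Because the property names are passed in as data, a drop-down of available query terms could feed this function directly.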

The more complex query example above would rely on the Resource Description Framework (RDF) [38] to support semantic relationships and the dependency schema developed in CMCS. The mechanism for this may be derived from the SAM project, which intends to provide an enhanced grammar through DASL to aid in the construction of such queries. Interestingly, since a SAM-enabled notebook is layered and semantic relationships are ultimately defined in XML, the execution of such queries may simply involve server-side translation of the query into the basic DASL grammar within SAM and final execution within the standard DASL engines of federated repositories. Alternatively, similar mechanisms could be developed on the client side within CMCS, although this would involve significant additional work.

Content Submission and Retrieval

Once CMCS users have identified information they wish to retrieve, whether through the search described above, the pedigree browsing tools described below, or a reference obtained in another manner, it is a simple matter to download the data to the local machine. If the information needed is metadata, it will already be represented in the XML-formatted response to the query, or can be retrieved using the DAV 'propfind' method. If it is data that is required, it can be retrieved using the DAV 'get' method. Since DAV is an extension of HTTP, browsers can execute this 'get' request given the URL of the data item. Capabilities for supporting such requests will be developed in the MCS portal.
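The two retrieval paths can be sketched with the Python standard library: a PROPFIND request for metadata, a plain GET for the data itself, and a parser for the XML multistatus response. The sketch only constructs the requests and parses a sample response; it does not contact a server. The URLs, the chemistry namespace, and the sample response are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

PROPFIND_BODY = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<D:propfind xmlns:D="DAV:"><D:allprop/></D:propfind>'
)


def make_propfind_request(url):
    """Request all properties (metadata) of a single resource."""
    return urllib.request.Request(
        url, data=PROPFIND_BODY, method="PROPFIND",
        headers={"Depth": "0", "Content-Type": "application/xml"},
    )


def make_get_request(url):
    """Request the data itself; any HTTP client can issue this."""
    return urllib.request.Request(url, method="GET")


def parse_multistatus(xml_text):
    """Return {href: {qualified property name: text}} for 200-status props."""
    result = {}
    root = ET.fromstring(xml_text)
    for response in root.findall("{DAV:}response"):
        href = response.findtext("{DAV:}href")
        props = {}
        for propstat in response.findall("{DAV:}propstat"):
            if " 200 " not in propstat.findtext("{DAV:}status", ""):
                continue  # skip properties the server could not return
            prop = propstat.find("{DAV:}prop")
            if prop is not None:
                for child in prop:
                    props[child.tag] = (child.text or "").strip()
        result[href] = props
    return result


# Hypothetical response, shaped like a WebDAV multistatus document.
SAMPLE = """<D:multistatus xmlns:D="DAV:" xmlns:c="http://example.org/schema/chem">
  <D:response>
    <D:href>/dav/thermo/methane.xml</D:href>
    <D:propstat>
      <D:prop><c:formula>CH4</c:formula></D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

metadata = parse_multistatus(SAMPLE)
request = make_propfind_request("http://example.org/dav/thermo/methane.xml")
print(request.get_method(), metadata)
```

Passing either request object to `urllib.request.urlopen` would send it to a real server.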

While the methods above provide a means to retrieve metadata and data, they do not address the issue of making the information usable within applications. To do this, a means of translating the returned values into the data formats expected by the applications is required. Similarly, while submission of new data and metadata is conceptually simple and we may provide some simple capabilities within the portal for manual submission, discussion of the technology to support submission from within applications, and of the policy and procedural issues associated with adding material to a curated data store, is deferred until later sections.

Data Pedigree and Dependency Browsing

Data pedigree refers to information about how a particular piece of data was produced. In a narrow sense, this implies information about who created the data, when it was produced, and the technique used to create it. In the computational domain, the latter would include information on what software was run, its version number, and the specific parameters used as input to the calculation. As discussed previously, in a broader sense pedigree information may include information on assumptions that limit the data's range of applicability, and sensitivity information tracing the uncertainty in a given value to uncertainties in the technique used and in the values used to calculate it. A pedigree browsing mechanism would allow researchers to discover and navigate through this information.

In CMCS, we anticipate a variety of sources of pedigree information: applications, notebook entries, metadata generators that extract information from data objects, etc. Such information will be standardized based on the pedigree and dependency schema(s) defined/adopted by the CMCS community. We propose to provide a tool for browsing this information within the MCS portal. Such a tool would provide a visual representation of the pedigree information and allow users to shift focus forward and backward along pedigree chains. SAM specifies a basic component for such navigation. We anticipate guiding its development with requirements from CMCS users and embedding it within a pedigree browsing tool that would be capable of communicating with other MCS tools through drag-and-drop of data URLs, allowing, for example, a user to drag data sets returned in response to a query into the pedigree browser to understand their history.
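Shifting focus forward and backward along pedigree chains amounts to walking a dependency graph in either direction. The sketch below shows one way to do that with a breadth-first traversal over data URLs. The class, its method names, and the example URLs are hypothetical; it does not represent the SAM navigation component.

```python
from collections import defaultdict, deque


class PedigreeGraph:
    """Directed dependency graph keyed by data URL."""

    def __init__(self):
        self._inputs = defaultdict(set)   # derived item -> items it used
        self._outputs = defaultdict(set)  # source item -> items derived from it

    def add_dependency(self, derived, source):
        """Record that `derived` was computed using `source`."""
        self._inputs[derived].add(source)
        self._outputs[source].add(derived)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first walk; returns reachable nodes, nearest first."""
        seen, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for neighbour in sorted(edges.get(node, ())):
                if neighbour not in seen:
                    seen.add(neighbour)
                    order.append(neighbour)
                    queue.append(neighbour)
        return order

    def ancestors(self, url):
        """Backward along the chain: everything `url` depends on."""
        return self._walk(url, self._inputs)

    def descendants(self, url):
        """Forward along the chain: everything derived from `url`."""
        return self._walk(url, self._outputs)


# Hypothetical URLs forming a chain across scales.
QC = "http://example.org/qc/energy-1"
THERMO = "http://example.org/thermo/enthalpy-1"
RATE = "http://example.org/kinetics/rate-1"
MECH = "http://example.org/mechanism/mech-1"

graph = PedigreeGraph()
graph.add_dependency(THERMO, QC)
graph.add_dependency(RATE, THERMO)
graph.add_dependency(MECH, RATE)

print(graph.ancestors(MECH))
print(graph.descendants(QC))
```

A URL dropped onto the browser would serve as the starting node for either walk.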

It should also be noted that the concept of Active Tables can be explained in part as a dependency browsing mechanism. Active Tables combine a means of representing dependencies to the user with additional capabilities to ensure consistency across dependency networks and a means for updating dependency information. Thus, we anticipate opportunities for technology sharing and interactions between Active Tables and the portal pedigree browser that can be explored during the project. Similar overlap exists with the sensitivity analyses involved in chemical mechanism refinement and reduction, which, again, may lead to possibilities for linkage with the pedigree browser development.

Literature Cited/References
  1. XML Specification, http://www.w3.org/TR/2000/REC-xml-20001006.

  2. J. S. Binkley, et al., "Combustion Simulation and Modeling," Proceedings of the Workshop on Combustion Simulation and Modeling, Reston, VA, June/July 1998.

  3. C. M. Pancerella, L. A. Rahn, and C. L. Yang, "The Diesel Combustion Collaboratory: Combustion Researchers Collaborating over the Internet", Proceedings of ACM/IEEE SC99 Conference, Portland, OR, November 1999.

  4. R. A. Kendall, E. Aprà, D. E. Bernholdt, E. J. Bylaska, M. Dupuis, G. I. Fann, R. J. Harrison, J. Ju, J. A. Nichols, J. Nieplocha, T. P. Straatsma, T. L. Windus, and A. T. Wong, "High Performance Computational Chemistry; an Overview of NWChem a Distributed Parallel Application," Computer Physics Communications, 128, pp. 260, 2000.

  5. D. E. Bernholdt, E. Aprà, H. A. Früchtl, M.F. Guest, R. J. Harrison, R. A. Kendall, R. A. Kutteh, X. Long, J. B. Nicholas, J. A. Nichols, H. L. Taylor, A. T. Wong, G. I. Fann, R. J. Littlefield, and J. Nieplocha, "Parallel Computational Chemistry Made Easier: The Development of NWChem", Int. J. Quantum Chem.: Quantum Chem. Symposium 29, pp. 475-483, 1995.

  6. M.F. Guest, E. Aprà, D. E. Bernholdt, H. A. Früchtl, R. J. Harrison, R. A. Kendall, R. A. Kutteh, X. Long, J. B. Nicholas, J. A. Nichols, H. L. Taylor, A. T. Wong, G. I. Fann, R. J. Littlefield, and J. Nieplocha, "High Performance Computational Chemistry: NWChem and Fully Distributed Parallel Applications", in Advances in Parallel Computing, 10, High Performance Computing: Technology, Methods, and Applications, Eds. J. Dongarra, L. Gradinetti, G. Joubert, and J. Kowalik, (Elsevier Science B. V.), pp. 395-427, 1995.

  7. Extensible Computational Chemistry Environment, http://www.emsl.pnl.gov:2080/docs/ecce/index.html.

  8. RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1, ftp://ftp.isi.edu/in-notes/rfc2616.txt, June 1999.

  9. E. J. Whitehead, Jr. and M. Wiggins, "WebDAV: IETF Standard for Collaborative Authoring on the Web," IEEE Internet Computing, Vol. 2, No. 5, pp. 34, September-October 1998.

  10. B. Ruscic, J. V. Michael, P. C. Redfern, L. A. Curtiss, and K. Raghavachari, "Simultaneous Adjustment of Experimentally Based Enthalpies of Formation of CF3X, X = nil, H, Cl, Br, I, CF3, CN, and a Probe of G3 Theory," J. Phys. Chem. A. 102, pp. 10889-10899, 1998.

  11. B. Ruscic, M. Litorja, and R. L. Asher, "Ionization Energy of Methylene Revisited: Improved Values for the Enthalpy of Formation of CH2 and the Bond Energy of CH3 via Simultaneous Solution of the Local Thermochemical Network", J. Phys. Chem. A. 103, pp. 8625-8633, 1999.

  12. M. Frenklach, "Modeling", in Combustion Chemistry (W. C. Gardiner, Jr., Ed.), Springer-Verlag, New York, Chap. 7, pp. 423-453, 1984.

  13. M. Frenklach, H. Wang, and M. Rabinowitz, "Optimization and Analysis of Large Chemical Kinetic Mechanisms Using the Solution Mapping Method - Combustion of Methane," J. Prog. Energy Combust. Sci. 18, pp. 47-73, 1992.

  14. Chemkin, http://www.ca.sandia.gov/chemkin/.

  15. Chemical Markup Language, http://www.xml-cml.org/.

  16. H. N. Najm, J. H. Chen, J. F. Grcar, R. Armstrong, C. Kennedy, J. Ray, W. Koegler, A. Lutz, M. Allendorf, D. Klinke, A. McDaniel, N. Nystrom, R. Subramanya, and R. Reddy, "MPP DNS of diesel autoignition", SAND2001-8075, November 2000.

  17. R. Armstrong, D. Gannon, A. Geist, K. Keahey, S. Kohn, L. McInnes, and S. Parker, "Toward a Common Component Architecture for High Performance Scientific Computing" Proceedings of 1999 Conference on High Performance Distributed Computing, Redondo Beach, CA, August 1999.

  18. M. Thompson, W. Johnston, S. Mudumbai, G. Hoo, and K. Jackson, "Certificate-based Access Control for Widely distributed Resources", Usenix Security Symposium '99, March 1999.

  19. Session Directories for Setting up and Monitoring CORE2000/Habanero Conferences via Java, CORBA, and LDAP, http://www.emsl.pnl.gov:2080/docs/collab/presentations/papers/wsd.WebNet98.html.

  20. J. D. Myers, C. Fox-Dobbs, J. Laird, et al., "Electronic Laboratory Notebooks for Collaborative Research", Proceedings of the Fifth Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises (WET ICE '96), Stanford, CA, June 1996.

  21. K. A. Keating, J. D. Myers, J. G. Pelton, R. A. Bair, D. E. Wemmer, and P. D. Ellis, "Development and Use of a Virtual NMR Facility", Journal of Magnetic Resonance, 143, pp. 172-183, 2000.

  22. XSIL - Extensible Scientific Interchange Language, http://www.cacr.caltech.edu/SDA/xsil/.

  23. I. Foster and C. Kesselman (eds.), The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, 1998.

  24. UMWorktools, https://worktools.si.umich.edu/.

  25. NCSA's OPIE system, http://www.ncsa.uiuc.edu/opie/.

  26. The NPACI User HotPage, https://hotpage.npaci.edu/.

  27. PortalML (Portal Markup Language), http://www.oasis-open.org/cover/portalML.html.

  28. R. Butler, D. Engert, I. Foster, C. Kesselman, S. Tuecke, J. Volmer, and V. Welch, "A National-Scale Authentication Infrastructure", IEEE Computer, 33(12), pp. 60-66, 2000.

  29. S. Tuecke and C. Kesselman, "Security and Policy for Group Collaboration," Proposal to SciDAC 01-06, March 2001.

  30. M. Thompson, "Distributed Security Architectures: Middleware for Distributed Computing," Proposal to SciDAC 01-06, March 2001.

  31. "NIST Scientific and Technical Databases" - http://www.nist.gov/srd/index.htm.

  32. "NIST Chemistry WebBook" - http://webbook.nist.gov/chemistry/.

  33. "NIST 17. NIST Chemical Kinetics Database: Version 2Q98" - http://www.nist.gov/srd/nist17.htm.

  34. "NIST Standard Reference Data - Thermochemical Databases" - http://www.nist.gov/srd/thermo.htm.

  35. "NIST Standard Reference Data Products Catalog (Surface Data)" - http://www.nist.gov/srd/surface.htm.

  36. J. Myers, "Scientific Annotation Middleware (SAM)," Proposal to SciDAC 01-06, March 2001.

  37. "DAV Searching and Locating (DASL) protocol," http://www.webdav.org/dasl/.

  38. "Resource Description Framework (RDF) language," http://www.w3.org/RDF/.

  39. W. Appelt, "WWW-Based Collaboration with the BSCW System," Proceedings of SOFSEM’99, Springer Lecture Notes in Computer Science 1725, pp. 66-78, Milvoy (Czech Republic), 1999.

  40. XSL Transformations (XSLT), http://www.w3.org/TR/xslt.

  41. D. Gracio, "Center for Collaborative Problem Solving in the Earth Sciences," Proposal to SciDAC 01-06, March, 2001.

  42. Access Grid, http://www-fp.mcs.anl.gov/fl/accessgrid/default.htm.

  43. J. C. Corchado, Y.-Y. Chuang, P. L. Fast, J. Villa, W.-P. Hu, Y.-P. Liu, G. C. Lynch, K. A. Nguyen, C. F. Jackels, V. S. Melissas, B. J. Lynch, I. Rossi, E. L. Coitino, A. Fernandez-Ramos, R. Steckler, B. C. Garrett, A. D. Isaacson, and D. G. Truhlar, POLYRATE, version 8.5.1, University of Minnesota, Minneapolis, MN, 2000.

  44. S. J. Klippenstein, A. F. Wagner, R. C. Dunbar, D. M. Wardlaw, S. Robertson, and J. A. Miller, VARIFLEX, version 1.07, A Chemical Kinetics Computer Program, Argonne National Laboratory, December 2000.

  45. R. G. Susnow, A. M. Dean, W. H. Green, P. K. Peczak, and L. J. Broadbelt, "Rate-Based Construction of Kinetic Models for Complex Systems", Journal of Physical Chemistry, A 101, pp. 3731-40, 1997.

  46. B. Bhattacharjee, W.H. Green, and P.I. Barton, "Globally Optimal Model Reduction", presented at the AIChE National Meeting, Los Angeles, CA, November 2000.

  47. J. C. Hewson and M. Bollig, "Reduced Mechanisms for NOx Emissions from Hydrocarbon Diffusion Flames," Proc. of The Combustion Institute, 26, pp. 2171-2179, 1996.

  48. M. Bollig, H. Pitsch, J. C. Hewson, and K. Seshadri, "Reduced n-Heptane Mechanism for Nonpremixed Combustion," Proc. of The Combustion Institute, 26, pp. 729-737, 1996.

  49. S. R. Tonse, N. W. Moriarty, N. J. Brown, and M. Frenklach, "PRISM: Piecewise Reusable Implementation Strategy for Chemical Kinetics," Israel J. Chem., 39, pp. 97-106, 1999.

  50. A. Trouve, "Terascale High-Fidelity Simulations of Turbulent Combustion with Detailed Chemistry," submitted to SciDAC 01-08, March 2001.

 
 
 

Last Modified 02/25/04