Collaboratory for Multi-scale Chemical Science
Project Lead: Larry Rahn
Institutional Points-Of-Contact:
- Sandia National Laboratories: Larry Rahn
- Pacific Northwest National Laboratory: Brett Didier
- Argonne National Laboratory: Al Wagner
- Lawrence Livermore National Laboratory: William Pitz
- Los Alamos National Laboratory: David Montoya
- National Institute of Standards and Technology: Thomas C. Allison
- Massachusetts Institute of Technology: William Green
- University of California at Berkeley: Michael Frenklach
Introduction
Rapid advances in computational hardware and software
along with innovative experimental techniques are revolutionizing the rate at
which chemical science research can produce the new information necessary to
advance combustion technology, straining the traditional methods of
communication through peer-reviewed literature and static databases. The
Collaboratory for Multi-scale Chemical Science (CMCS) Pilot Project
brings together leaders in scientific research and technological
development across multiple DOE laboratories, other government laboratories,
and academic institutions to develop an informatics-based approach to
synthesizing multi-scale information to create knowledge in the chemical
sciences.
CMCS is using advanced collaboration and
metadata-based data management technologies to develop a
Chemical Sciences portal providing support for distributed research, community
communications, and
data discovery, management, and annotation capabilities. The portal assists in
documenting and browsing data pedigree and in communicating cross-scale
dependencies between data produced at one
scale and the results of computations using it at the next. A variety of
standards-based mechanisms for extracting metadata from files,
translating between schema, converting data formats, and integrating external
applications are designed to
minimize the work required to adopt CMCS capabilities.
The CMCS project also involves a set of efforts to demonstrate interactions
between the portal and
key national chemistry resources (data and software) and to support pilot
groups using cutting-edge chemical informatics techniques. These efforts
span from quantum mechanics to fluid flow and, together, are designed to
demonstrate the potential of the CMCS infrastructure
to qualitatively and quantitatively change the way chemical knowledge is
produced. If successful,
CMCS will significantly enhance the coordination of research efforts across
related
sub-disciplines in the chemical sciences, focus research at one scale on
obtaining or refining values critical in the next, reduce work performed
using limited or outdated values, and enhance the ability of the chemistry
community to
meet national research challenges.
Excerpts from the CMCS Proposal to the DOE National Collaboratories
Program
Background and Significance
Multi-scale Dependencies - The Need for Collaboration
The area of chemical sciences is representative of many
DOE programs in that it addresses complex multi-scale phenomena. The situation
is similar in earth system studies, fusion research, high-energy physics,
biology, and other areas of science - an understanding of environment and
device scale phenomena requires more than simply applying one type of
computation, with increased computing power, across scales. Different physical
phenomena dominate system dynamics at these different scales, leading to a
variety of models and experiments relevant in the different regimes.
Information from one regime is used as input for the next, essentially
"bootstrapping" from the atomistic to the device level. One of the major
bottlenecks in such a multi-scale research enterprise is the passing of
information from one level to the next in a consistent and validated
manner.
The scientific process described above leads
to a data- and model-centric view of the communications between
sub-disciplines
working at different scales. Data at
one level is analyzed to develop a model that produces data used in turn by
another, repeatedly across the range of scales and types of chemical
information required. However, in this process more than just the raw data
values need to be communicated. Confidence in a value's accuracy, its
uncertainty, dependencies on other data, etc. must all be considered when
using
it in further computational and experimental research. In the direction of
decreasing length and time scales, information about the sensitivity of models
to particular data may place a premium on very accurate values for certain
fundamental quantities. Enabling the rich bi-directional exchange of both data
and metadata between scales is a critical issue in making progress.
Traditionally, this information flow has
been accomplished through the research literature and, more recently, through
databases of chemical values. Discovery of new information in these sources is
a manual process. Further, the information is fragmented. Determining whether
results presented in a paper depend on obsolete values from a different regime
may require searching through several papers and databases. These factors make
communication difficult and time consuming and increase the likelihood of
redundant and irrelevant research.
Across the domains represented in DOE's
Scientific Discovery through Advanced Computing (SciDAC), communication of
expertise and the flow of information between sub-disciplines targeting
different physical regimes will be as critical as increasing computational
capabilities within a domain in effectively and efficiently producing
practical
results from basic research. Current manual approaches to coordinating
multi-scale research cannot themselves scale to the amounts of data that will
be generated through SciDAC and to the level of effectiveness and efficiency
required to tackle national science issues in a cost-effective manner. The
multi-scale communications challenges facing SciDAC researchers are not
discipline specific. Thus, a solution to these issues in the chemical sciences
will provide a model for multi-scale science that can guide efforts in other
domains.
An Informatics Approach to Multi-scale Science
To overcome
current barriers to collaboration and knowledge transfer among researchers
working at different scales, a number of enhancements must be made to the
information technology infrastructure of the community:
- A collaboration infrastructure is required to enable real-time and
  asynchronous collaborative development of standards for data and metadata
  description, inter-scale scientific communication, geographically
  distributed disciplinary collaboration, and project management.
- Tools now used to generate and analyze data at each scale must be modified
  to enable generation and storage of the required metadata in a format that
  allows interoperability with other tools and collaboratory functions, and
  must be made available for use by geographically distributed collaborators.
- Repositories are required to store chemical sciences data and metadata in a
  way that preserves data integrity and allows web access.
- New tools are required to search and query metadata, and to retrieve data
  across all scales, disciplines, and locations. These tools should be
  available via an integrated user-customizable interface or portal.
The complexities
of managing information within such an infrastructure are daunting and the
creation, communication, and use of the additional information could quickly
become unwieldy. However, recent technological advances, in particular the
development of the Extensible Markup Language (XML) [1] for defining machine-
and human-readable metadata based on standard schema, have significantly
reduced the barriers to creating such a comprehensive informatics environment.
We propose a
Collaboratory for Multi-scale Chemical Science (CMCS) focusing on combustion
research that will demonstrate that an integrated multi-scale approach to
scientific and engineering research is not only possible but can produce
significant benefits in harnessing research to address real-world issues. The
field of combustion is critical to the DOE mission for clean and efficient
energy, and the DOE has ongoing investments in research across the full range
of relevant scales and disciplines. The
CMCS will bring an integrated, informatics-based approach to combustion
research that enhances and begins to automate the flow of information between
sub-disciplines.
CMCS efforts to
develop tools supporting the multi-scale analysis of chemical systems for
combustion will be directly applicable to other communities in the chemical
sciences and related fields.
Furthermore, CMCS will provide a model for multi-scale science that can
be replicated in other domains.
Combustion: A Multi-scale Chemical Sciences Pilot

Figure 1. Combustion modeling requires the integration
of scientific knowledge over a large range of scales.
Fossil fuel energy supply and fossil-fueled combustion systems are the
cornerstones of the industrial and commercial sectors of the U.S. economy,
accounting for 85% of the energy consumed in the United States each year.
In the private sector, the combustion of
fossil fuels provides a level of comfort and mobility for U.S. citizens that is
unrivaled in the world. Fossil fuels continue to be inexpensive and the supply
of fossil fuels remains stable, although heavy dependence on foreign sources
has led to major economic and societal dislocations in the past twenty-five
years and threatens to do so again in the future.
Also, recent changes in international environmental mandates for lower
CO2 emissions have emerged as
strong drivers for increased combustion efficiency. Despite continuing
investments in alternative energy sources, the
importance of hydrocarbon fuels, as they relate to the economy and quality of
life in the United States, is unlikely to change in the foreseeable
future.
The advancement of the DOE mission for efficient,
low-impact energy sources and utilization relies upon continued significant
advances in fundamental chemical sciences and the effective use of the
knowledge accompanying these advances across a broad range of disciplines and
scales. This challenge is exemplified in the development of predictive
computational models for realistic combustion devices.
Combustion modeling requires the
integration of computational physical and chemical models that span space and
time scales from atomistic processes to those of the physical combustion
device itself as illustrated in Figure 1.
Combustion systems involve
three-dimensional, time-dependent, chemically reacting turbulent flows that
may include multiphase effects with liquid droplets and solid particles in
complex physical configurations.
Against this fluid-dynamical backdrop, chemical reactions occur that
determine the energy production in the system, as well as the emissions that
are produced. For complex fuels, the
chemistry involves hundreds to thousands of chemical species participating in
thousands of reactions. These chemical
reactions occur in an environment that is defined by both thermal conduction
and radiation. Reaction rates as a function of temperature and pressure are
determined
experimentally and by a number of methods using data from quantum mechanical
computations. The collaborative creation, discovery, and exchange of
information across all of these scales and disciplines are required [2] to
meet DOE's mission requirements.
Architecture
The description above provides a brief glimpse of the
inherent complexity that must currently be managed manually by combustion
researchers. CMCS does not seek to eliminate this complexity, but to provide
community-wide capabilities for managing it. As shown in Figure 2, CMCS will
provide the majority of its capabilities through a web-based Multi-scale
Chemical Science (MCS) portal. This portal will provide a customizable access
point for accessing community knowledge bases, interacting with other members
of the multi-scale chemistry community, performing and documenting
experiments,
and supporting the multiple paths of information flow necessary to integrate
these activities. These capabilities rely on an underlying Grid [23]
infrastructure including a sophisticated metadata and annotation management
subsystem. Standard chemical information interchange schema combined with
metadata/data translation mechanisms enable loose federation of underlying
group and community data stores and global tracking of information such as
data
pedigrees. Chemistry domain applications, modified or extended to understand
the community-developed schema, help automate the collection of and directly
exploit data pedigree and other newly available data annotations.

Figure 2. Architecture diagram for CMCS showing portal integration of domain
applications and data resources from across the multi-scale chemistry
community.
MCS Community Portal
The multi-scale chemical science (MCS) portal will be
developed as the focal point of the collaboratory.
As shown in Figure 3, the portal will provide a broad range of
capabilities. Bringing these capabilities together across chemistry
sub-domains
will provide convenience while helping researchers think and act in the larger
combustion context. For the purposes of discussion, portal capabilities are
divided into groups related to accessing community knowledge resources,
interacting within the community, and generating new knowledge through
research
projects. These activity-based groupings are clearly not orthogonal and our
design does not make any such distinctions. Rather, the design prescribes the
use of common services across all portal tools to enable facile communications
between them, with the goal of enhancing researchers' ability to organize and
move information through the various activities beyond what is currently
possible.

Figure 3. Prototype Design of Portal for Multi-scale
Chemical Science.
Figure 3 shows details of the portal's overall design.
Users will be able to create profiles that define their preferred view(s) of
the portal. Profiles can be used to
aggregate users into communities of interest such as "Turbulent reacting flow
model development" or "XML schema for reduced chemical mechanisms".
Security is integrated into the portal
architecture, providing authentication, authorization control, and encryption
capabilities that can be passed to underlying resources.
The MCS portal will be customizable so that it supports
the needs of the user and the user's workflow.
It will allow users to emphasize information relevant to a subset of
chemical scales or to emphasize data submission and search capabilities, tools
for organizing and understanding relationships in the data, or the
capabilities
for interaction with colleagues while helping maintain contextual awareness of
the overall community. Users will have access to portal configuration tools to
create custom views and save them in personal, group, or community
profiles.
A variety of web-based technologies exist for the
development of such a portal. The WorkTools system [24], developed at the
University of Michigan, is currently in use within the Space Physics and
Aeronomy Research Collaboratory (SPARC) as a means of accessing data and
colleagues. It allows researchers to select a set of tools such as chat boxes
and live instrument data feeds and position them as desired within a
persistent
web page. Similarly, NCSA's OPIE system [25] allows surfers to arrange tools
within a browser window through the same click-and-drag operations used to
arrange windows on a computer desktop. Within the DOE community, work has been
done to allow Grid resources to be accessed through portals as exemplified by
the NPACI HotPage site [26], and additional work to develop standards for
portal layout via a PortalML language [27] is anticipated within SciDAC.
Commercial tools for portal development also exist, targeted at companies
wishing to create "My Yahoo" style enterprise portals for employees and
customers. While each of these systems
has strengths and weaknesses, many could provide the basic capabilities needed
within MCS. In developing the portal, we will evaluate existing and emerging
portal development environments, select one, and adapt it as needed to meet
MCS
requirements, potentially in collaboration with the environment's developers.
Once the basic infrastructure is designed, we will begin
to integrate specific MCS capabilities. Since significant technology
development will be required to achieve the desired levels of functionality
and
integration, we will follow an incremental approach to developing the overall
portal, using it to enable access to prototype capabilities very early in the
project.
Over time, more integrated and feature-rich versions of the individual tools
will be developed and made available through the portal. To prioritize efforts
and ensure overall system usability, specific use cases based on the CMCS
guiding scenario will be developed early in the project and periodically
updated to reflect the growing understanding of research.
System Security
Due to its central role in CMCS, the portal is a logical
place to coordinate system security. We envision the portal providing single
sign-on capabilities across the tools represented. The wide range of tools to
be integrated and the bi-directional flow of information between private and
public repositories proposed in CMCS (as detailed in subsequent sections) make
system security a challenging issue. Obtaining sufficient expertise with
security technologies and in the design and deployment of secure services in
distributed environments was a primary consideration in the selection of CMCS
development team members.
Existing public key infrastructure (PKI) technologies can
be leveraged to provide the basic authentication, authorization, encryption,
and non-repudiation services required in CMCS. Because the portal will be
web-based, standard web mechanisms for authenticating users, including using
public-key certificates, can be applied. Web browsers and servers currently
incorporate the technology necessary to request and validate user credentials
and to set up secure socket layer (SSL) encrypted communications. The Grid
Security Infrastructure (GSI), implemented, in collaboration with others, by
the Distributed Systems Laboratory (DSL) at Argonne National Laboratory [28]
is
a PKI-based system that includes delegation capabilities, allowing a globally
trusted public-key identity to be used to securely obtain a local credential
in
a non-PKI system, e.g. a UNIX username/password. A SciDAC proposal,
"Security and Policy for Group Collaboration" [29] led by ANL, offers
advancements in GSI that will be useful
in later stages of CMCS development. In CMCS, GSI would allow an MCS portal
user's PKI credentials to be used to automatically obtain credentials for
back-end systems such as databases. The advanced attribute-based Akenti
system,
which is used within the DOE2000 Diesel Collaboratory project, is intended to
provide scalable security services in highly distributed, collaborative,
multi-institutional environments. The work described in LBNL's "Distributed
Security Architectures: Middleware for Distributed Computing" submission to
SciDAC [30], which proposes to integrate GSI credentials and Akenti and also
to
integrate Akenti access control with the Distributed Authoring and Versioning
(DAV) extension to the HyperText Transfer Protocol (HTTP), would provide
significant benefits to CMCS. (The significance of DAV in the CMCS is
discussed
later in the proposal.) Finally, digital signature and timestamping services
are also becoming readily available through software development kits and
Internet services.
Thus, it should be possible to assemble a strong basic
security infrastructure for the MCS portal from existing components through
careful integration and deployment efforts. However, two aspects of CMCS may
require advances to meet community needs. The first deals with limiting the
amount of resources that can be used on the user's behalf when the portal
delegates the user's credentials to a back-end or external system. Similarly,
the length of time over which such delegation is allowed to occur may also
need
to be limited. Appropriate policies for various CMCS use scenarios will need
to
be defined and technologies to implement them must be developed. A variety of
approaches can be used to provide incremental capabilities in this area, and
we
anticipate the development of general limited delegation capabilities within
the Grid community. We will seek to collaborate with any such efforts by
providing requirements and testing the systems developed. The second aspect of
CMCS security requirements that will need to be refined during the project
relates
to the capabilities provided for attaching private annotations to publicly
available data. The general solution of providing fine-grained access controls
on each such relationship is clearly too cumbersome. Defining an appropriate
level of access granularity to balance ease-of-use with user requirements will
require experimentation during the project lifecycle.
Knowledge Management
CMCS will provide a unifying interface to
combustion-related chemical science information. This information is currently
scattered in public and private databases, flat files, and in the scientific
literature, and is organized primarily along sub-disciplinary lines.
Information is locked in a variety of formats and provided with varying
degrees
of validation and context. The NIST databases
[31] have been widely praised by the combustion community for their organizing
influence, but the fraction of data contained therein is small. In CMCS, we
propose to leverage the NIST and other databases, and provide added value to
working researchers in a number of directions:
- Providing a single location from which to access the wide range of data
  needed for combustion research
- Reducing or eliminating the difference in access protocols and procedures
  between resources
- Developing community standards for representing information that crosses
  domain boundaries
- Using these definitions, combined with translation capabilities, to allow
  federated searches across multiple databases
- Using similar mechanisms to automate the reformatting of information
  retrieved from databases for use in research applications and vice versa
- Developing machine-processable representations of data pedigree and
  dependency relationships
- Providing tools for manual and automated traversal of pedigree and
  dependency relationships to support group and community data validation
  efforts and exploration of sensitivity analyses
- Allowing arbitrary annotations and comments to be attached to data values
  to support community expertise building and consensus formation activities
The underlying architecture proposed to address these
issues, and the implementation plans for providing search, retrieval, and
submission capabilities and for managing and exploring data pedigree
information are detailed in the following subsections. Additional capabilities
related to generation and exploitation of data annotations, and to the
refinement and validation of information through community processes are
discussed in sections 1.3.6 and 1.3.7 respectively. An initial exploration of
the type of architecture proposed here,
and a further discussion of its benefits, is reported in a draft paper in
Appendix D.
Metadata/Annotation Management Overview
Metadata is commonly defined as information 'about' data
values and data sets. However, such a definition is very dependent on one's
perspective. For example, whether a chemical formula is metadata about a
molecular geometry, or whether geometry and other information such as the heat
of formation are metadata about a chemical formula is a matter of perspective.
Such differences of opinion, once encoded in software, are an endless source
of
barriers to cross-scale collaboration. Within this proposal, we modify this
definition to equate the term metadata with data values that have meaning
across domains. In the example above, the chemical formula is metadata because
it has meaning for both quantum- and thermo-chemists. Heat of formation would
not be considered metadata until we expand scope and realize that both
thermochemists and kineticists ascribe meaning to it.
Although this shows that our definition continues to have
some context-dependence, it is more powerful than the original. By our
definition, data is opaque and meaning-free outside a sub-discipline and, as a
corollary, efforts to standardize formats and meanings between collaborators
to
support inter-scale search capabilities, application interoperability, etc.
can
be confined to metadata. Further, the system architecture can treat data as
opaque as well and no restrictions need be placed on its format. In contrast,
because metadata must be understood and manipulated, it must be formatted in a
way that exposes its meaning in machine-comprehensible form. An important
consequence of this bifurcation is that it minimizes the effort required to
allow two parties to collaborate - no changes are required to any
applications,
and no agreements need be reached about the meaning of terms, except those
directly concerning the values that will be exchanged.
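The split between opaque data and machine-readable metadata described above can be sketched as a small data structure. This is only an illustration in Python, not the CMCS design: the `Resource` class, the `find` helper, the property names, and the type labels are all hypothetical, and the payloads are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """An opaque payload plus metadata that has meaning across domains."""
    data: bytes                                   # never parsed by the system
    media_type: str                               # hypothetical type label
    metadata: dict = field(default_factory=dict)  # terms both parties agreed on


def find(resources, **criteria):
    """Return resources whose metadata matches every criterion.

    Only the metadata is inspected; the payload format is irrelevant here.
    """
    return [r for r in resources
            if all(r.metadata.get(k) == v for k, v in criteria.items())]


# Hypothetical entries from two sub-disciplines that share one agreed term.
store = [
    Resource(b"<quantum chemistry output>", "application/x-qc-output",
             {"formula": "CH4", "author": "group-a"}),
    Resource(b"<thermochemistry table>", "application/x-thermo-table",
             {"formula": "CH4", "author": "group-b"}),
    Resource(b"<quantum chemistry output>", "application/x-qc-output",
             {"formula": "C2H6", "author": "group-a"}),
]

matches = find(store, formula="CH4")
print([r.media_type for r in matches])
```

In this sketch the two groups only had to agree on the meaning of `formula`; neither payload format needed to change for the cross-domain search to work.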
Such an architecture is at the heart of the DOE2000
electronic notebooks and is the reason they can be easily extended to handle
new annotation types; the complexities of handling the annotation data can be
confined to the components that create and render the annotation. No
translation of the annotation data is required by the notebook and the base
notebook system only assigns meaning, and has code to manipulate, metadata
such
as author name, creation date, data type, digital signatures, etc. In defining
the CMCS architecture, we looked to exploit the architecture explored in
DOE2000 electronic notebooks while taking advantage of technologies that have
matured since their inception. The directions being proposed for the evolution
of electronic notebooks and the broader concept of metadata-based system
design
proposed within the "Scientific Annotation Middleware (SAM)" [32] submission
to
SciDAC have been a key influence.
The basis for both CMCS and SAM architectures is the
formatting of metadata using XML [1]. XML is a powerful language for
encoding the definition of technical terms in a human- and machine-readable
form. XML's expressive power, together with the availability of technologies
for manipulating it (authoring, parsing, validating, translating, etc.), has
made it a de facto standard for information exchange in new systems.
Since both efforts will leverage the Web-based DAV
protocol [9], CMCS anticipates being able to take advantage of work done under
the
SAM proposal. DAV is an Internet Engineering Task Force (IETF) standard set of
extensions to the HTTP/1.1 protocol to support basic data management over the
web including storage and retrieval of typed, opaque data files/objects,
content locking, hierarchical collections and annotation of the data with
arbitrary metadata [9]. It defines the formatting of metadata in properties
consisting of XML key:value pairs and provides operations for creating,
removing, and querying them. The extensible DAV Searching and Locating (DASL)
protocol [33] adds methods for server-side search capabilities. It provides a
basic search grammar and can be extended with additional grammars, e.g. XML
Query.
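To make the property mechanism concrete, the following Python sketch builds the XML body of a DAV PROPPATCH request, the method the DAV standard defines for setting properties on a resource. It only constructs the request body and does not contact a server. The namespace URI `http://example.org/cmcs/schema`, the property names, and the referenced path are hypothetical placeholders, not part of any CMCS schema.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"                               # namespace defined by the DAV standard
CMCS_NS = "http://example.org/cmcs/schema"    # hypothetical metadata namespace

# Choose readable prefixes for the serialized document.
ET.register_namespace("D", DAV_NS)
ET.register_namespace("C", CMCS_NS)


def build_proppatch(properties):
    """Build a PROPPATCH body that sets each key/value pair as a property."""
    root = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_el = ET.SubElement(root, f"{{{DAV_NS}}}set")
    prop_el = ET.SubElement(set_el, f"{{{DAV_NS}}}prop")
    for name, value in properties.items():
        child = ET.SubElement(prop_el, f"{{{CMCS_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical annotation: a formula plus a pointer to the data it came from.
body = build_proppatch({
    "formula": "CH4",
    "derivedFrom": "/data/qc/ch4-geometry",
})
print(body)
```

A client would send this body with the PROPPATCH method to the URL of the annotated object; reading the properties back uses the companion PROPFIND method.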
A layered set of services built on top of DAV/DASL that
provide successively more specialized capabilities has been outlined in the
SAM
proposal. The brief description here highlights functionality directly
relevant
for CMCS. The Metadata Management Services (MMS) will support simple
federation
of DAV servers, allowing propagation of storage and retrieval requests and
queries down through a hierarchy of servers. It will also provide mechanisms
for registering metadata generation tools that can parse data of specified
types and generate new DAV properties. The SAM Semantic Services (SS) layer adds a
standard for representing semantic relationships between DAV objects. A
Notebook Services (NS) layer defines records management capabilities in terms
of specific semantic relationships and property definitions. A set of
interface
components, including a graphical relationship browser, and programming
interfaces provides access to the services.
A SAM-based notebook will be built from these components that
could be leveraged by the CMCS to develop novel functionality.
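The registration of metadata generation tools mentioned above might look roughly like the following Python sketch. It is not the SAM interface, which this text does not specify: `register_generator`, `generate_properties`, and the property names are hypothetical. The example generator assumes the common XYZ geometry layout (atom count on the first line, a comment on the second, then one atom per line), and the type label is an assumption as well.

```python
GENERATORS = {}   # data type label -> list of registered generator functions


def register_generator(media_type):
    """Decorator that registers a metadata generator for one data type."""
    def decorator(func):
        GENERATORS.setdefault(media_type, []).append(func)
        return func
    return decorator


@register_generator("chemical/x-xyz")   # assumed label for XYZ geometry files
def xyz_summary(payload):
    """Derive simple properties from an XYZ-style geometry file."""
    lines = payload.decode("ascii").splitlines()
    atom_count = int(lines[0])
    # Atom records start on the third line; the element symbol comes first.
    elements = sorted({line.split()[0] for line in lines[2:2 + atom_count]})
    return {"atomCount": str(atom_count), "elements": " ".join(elements)}


def generate_properties(media_type, payload):
    """Run every generator registered for the type and merge the results."""
    props = {}
    for func in GENERATORS.get(media_type, []):
        props.update(func(payload))
    return props


# Illustrative input only: a two-atom hydrogen molecule.
sample = b"2\nillustrative hydrogen molecule\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
print(generate_properties("chemical/x-xyz", sample))
```

Data of an unregistered type simply yields no generated properties, which is consistent with the data itself remaining opaque to the system.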
Community Standard Schemas
This high-level description of the proposed CMCS metadata
management infrastructure provides the context necessary to discuss the tasks
that will be undertaken in CMCS to provide community knowledge management
capabilities. The critical initial steps are the standardization of
definitions
for relevant chemical information that will be exchanged by CMCS researchers
and the formal representation of these definitions in XML. Such
representations
are referred to as schema. CMCS researchers will engage their respective
chemical science communities in these schema development efforts through the
developing MCS portal and traditional community forums. Although we do not
wish
to under-represent the effort that will be required, it should be noted that
overall scope is relatively small. CMCS does not require a single, global
schema that represents all of chemical knowledge. Only information that will
be
exposed as metadata need be defined. Further, schema can be evolved within the
system and variants can be accommodated. Thus, for example, if the
stakeholders
for two community resources wish to define metadata quantities beyond the
minimal set on which agreement can be reached, annotations can be supported in
both dialects. If complete agreement can be reached at a later date, a
translator can be registered with the system to allow queries expressed in the
new schema to match metadata defined with the old standards.
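The translator registration described above can be illustrated with a short Python sketch. The schema labels `schema-old` and `schema-new`, the property names, and the function names are hypothetical; the point is only that a query written against the new schema is rewritten before it is compared with metadata stored under the old one.

```python
TRANSLATORS = {}   # (source schema, target schema) -> translation function


def register_translator(source, target, func):
    """Make a translation between two schema dialects known to the system."""
    TRANSLATORS[(source, target)] = func


def translate_query(query, source, target):
    """Rewrite a query's property names from one schema to another."""
    if source == target:
        return dict(query)
    return TRANSLATORS[(source, target)](query)


def matches(properties, query):
    """True if the stored properties satisfy every term of the query."""
    return all(properties.get(k) == v for k, v in query.items())


# Hypothetical renaming agreed on when the two dialects were reconciled.
NEW_TO_OLD = {"chemicalFormula": "formula"}


def new_to_old(query):
    # Names without a mapping are passed through unchanged.
    return {NEW_TO_OLD.get(name, name): value for name, value in query.items()}


register_translator("schema-new", "schema-old", new_to_old)

old_record = {"formula": "CH4"}            # annotated under the old schema
new_query = {"chemicalFormula": "CH4"}     # expressed in the new schema

print(matches(old_record, new_query))
print(matches(old_record, translate_query(new_query, "schema-new", "schema-old")))
```

Because the translation is applied to the query, existing annotations do not have to be rewritten when a schema evolves.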
As shown in Figure 2, we anticipate the definition of five
potentially overlapping schema within CMCS that define the information that
must be exchanged across the boundaries between the five chemistry sub-domains
in the guiding scenario as well as a general schema for representing
dependency
data. The inter-domain schema will most likely leverage nascent community
efforts such as the Chemical Markup Language (CML) [15], which defines a set
of
basic molecular properties such as chemical formula and molecular geometry.
Efforts at PNNL and NIST to define quantum
chemistry, and thermochemistry and kinetics information, respectively, can
also
be leveraged. If significant overlap is observed in the developing schema, it
may be possible to pull common elements into a broader chemistry schema. Since
there are no limitations in DAV and XML in using multiple schemas to define
properties on a single object, any such adjustments can be made with little
impact on CMCS scope.
Since the definition of dependency relationships is not
chemistry-specific, it may be possible to form a broad collaboration amongst
problem solving environment developers to define the concepts necessary to
track data pedigree and map the sensitivity relationships that relate the
uncertainty in one quantity to the uncertainty in derived quantities. Such a
common schema would enable re-use of tools for traversing such relationships
in multiple projects.
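As a rough illustration of what a tool for traversing such relationships could do, the Python sketch below walks a pedigree graph to find every result that depends, directly or indirectly, on a given value. The item names and the `DERIVED_FROM` table are hypothetical; a real system would read these relationships from stored metadata rather than from a dictionary.

```python
from collections import deque

# Hypothetical pedigree: each key was derived from the items in its list.
DERIVED_FROM = {
    "rate-constant-R1": ["barrier-height-R1", "heat-of-formation-CH3"],
    "rate-constant-R2": ["barrier-height-R2"],
    "mechanism-v1": ["rate-constant-R1", "rate-constant-R2"],
    "flame-simulation-A": ["mechanism-v1"],
}


def downstream(item, derived_from):
    """Return every result that depends, directly or indirectly, on item."""
    # Invert the pedigree so the walk goes from inputs to results.
    used_by = {}
    for result, inputs in derived_from.items():
        for source in inputs:
            used_by.setdefault(source, []).append(result)

    seen, queue = set(), deque([item])
    while queue:                      # breadth-first walk over dependents
        current = queue.popleft()
        for result in used_by.get(current, []):
            if result not in seen:
                seen.add(result)
                queue.append(result)
    return seen


# Which results should be revisited if this value is revised?
affected = downstream("heat-of-formation-CH3", DERIVED_FROM)
print(sorted(affected))
```

The same walk, run in the opposite direction over `DERIVED_FROM` itself, would list the full pedigree of a single result.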
Resource Wrapping and Registration / Metadata-enabled Data
Repositories
DAV is quickly gaining in popularity. Major applications
including Microsoft Office and Oracle's Internet File System ship with DAV
support. Database and application framework vendors are in various stages of
offering support for developing DAV views of an underlying relational
database.
Public domain DAV servers exist that support flat file data repositories and
servers based on the open-source MySQL relational database are expected.
Thus,
we expect that developing basic DAV interfaces for CMCS community databases
will be a relatively straightforward task that can be completed early in the
project. NIST databases of thermochemical and kinetics data will be initial
targets for this work. It is less clear
how quickly support for DASL and query grammars such as XML Query will become
available, and it may be necessary to implement some basic functionality in
these areas to achieve CMCS goals.
While the number of community data resources that will be
wrapped as part of the CMCS pilot project is limited, it should be well within
the scope of individual researchers and institutions to develop wrappers for
additional databases. The translation
and metadata generation capabilities available through SAM should simplify
interaction with resources that have already been converted to XML using
non-CMCS schema. We will actively pursue such integration activities during
the
project lifecycle with the aim of building momentum for the long-term support
of CMCS capabilities within the community.
Once resources are DAV-enabled, a mechanism is needed to
make the portal knowledge-management capabilities aware of them. The
registration process should be relatively straightforward, involving the
specification of the resource's uniform resource locator (URL) and any schema
translations that should be applied when accessing it through the MCS portal.
For submission operations, and for non-public resources, interaction of the
MCS
portal and resource security systems may involve more complexity and require
configuration of a GSI-based credential delegation. We will initially develop
a
form-based resource registration capability accessible through the portal that
will support the simple case. We will investigate the practicality of
providing
web-based configuration for the more general case involving security system
integration and work with the CMCS community to determine whether some
options,
such as the selection of particular schema translations, should be exposed on
a
per-user basis through the general portal customization mechanism. As
described
in more detail in later sections, we intend to allow personal and group
information in notebooks to be registered as searchable resources as well.
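The simple registration case described above could be captured in a small record like the following Python sketch. The class, field names, and the example translation identifier are hypothetical; the source specifies only that a registration carries the resource URL and any schema translations, with credential delegation as an additional concern for non-public resources.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class ResourceRegistration:
    """Hypothetical record for one DAV-enabled resource known to the portal."""
    url: str
    schema_translations: list = field(default_factory=list)
    # True for non-public resources that would need credential delegation.
    requires_credential_delegation: bool = False

    def validate(self):
        parts = urlparse(self.url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute HTTP(S) URL: {self.url!r}")
        return self


registration = ResourceRegistration(
    url="http://example.org/dav/kinetics/",
    schema_translations=["example-kinetics-to-interchange"],
).validate()
print(registration)
```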
Multi-scale Integrated Search
The multi-scale search capability within the MCS portal
will provide researchers with access to the registered data stores in terms of
the community defined interchange schema regardless of the format of the
resources themselves. The types of anticipated queries range from simple
searches for all measured or calculated values of a molecular property to
searches for contact information for groups who have calculated properties
that have significant effects on the uncertainty in a derived reaction rate.
The latter, which might be used to initiate a discussion about obtaining
updated values of the more fundamental properties, would be a laborious
process
today, but within CMCS it could be described succinctly and executed
automatically. As with databases today, it may be important to provide
predefined templates for such complex queries rather than requiring users to
formulate them from scratch in the low-level query grammar.
We anticipate that a generic web-based DASL client can
provide significant functionality for simple queries since DASL supports
methods by which the client can dynamically learn what search grammars are
supported and can discover the complete list of property names (keys) that
exist. Thus, a generic DASL client may be able to provide scaffolding for
helping users construct queries, such as drop-down lists of the available
chemistry-related query terms and search operators, without itself being
chemistry-specific. We are not aware of a mature DASL client of this kind at this
time,
so effort may be required within CMCS to enhance a simpler client or to
develop
one. However, because the client is not chemistry-specific, there are
significant opportunities to work with SAM and other projects using DASL in
its
development.
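To make the idea of a simple query concrete, the Python sketch below assembles the body of a DASL search request using the DAV:basicsearch grammar as described in the DASL drafts. It only builds the XML and sends nothing. The scope path, the chemistry namespace, and the property names are hypothetical; a generic client would instead fill them from what the server reports.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"
CHEM = "http://example.org/cmcs/chemistry"  # hypothetical namespace


def qname(namespace, name):
    return f"{{{namespace}}}{name}"


def build_basicsearch(scope_href, select, where_prop, literal):
    """Build a basicsearch body: select properties where where_prop == literal."""
    root = ET.Element(qname(DAV, "searchrequest"))
    search = ET.SubElement(root, qname(DAV, "basicsearch"))

    # Properties to return for each matching resource.
    select_prop = ET.SubElement(ET.SubElement(search, qname(DAV, "select")),
                                qname(DAV, "prop"))
    for namespace, name in select:
        ET.SubElement(select_prop, qname(namespace, name))

    # Where to search, and how deep.
    scope = ET.SubElement(ET.SubElement(search, qname(DAV, "from")),
                          qname(DAV, "scope"))
    ET.SubElement(scope, qname(DAV, "href")).text = scope_href
    ET.SubElement(scope, qname(DAV, "depth")).text = "infinity"

    # A single equality condition on one property.
    eq = ET.SubElement(ET.SubElement(search, qname(DAV, "where")),
                       qname(DAV, "eq"))
    ET.SubElement(ET.SubElement(eq, qname(DAV, "prop")), qname(*where_prop))
    ET.SubElement(eq, qname(DAV, "literal")).text = literal
    return ET.tostring(root, encoding="unicode")


query = build_basicsearch(
    scope_href="/data/",
    select=[(CHEM, "enthalpyOfFormation")],
    where_prop=(CHEM, "formula"),
    literal="CH4",
)
print(query)
```

Nothing in the builder is chemistry-specific: the property names arrive as data, which is what would let a generic client populate drop-down lists from the property names it discovers.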
The more complex query example above would rely on the
Resource Description Framework (RDF) [34] to support semantic relationships and
the dependency schema developed in CMCS. The mechanism for this may be derived
from the SAM project, which intends to provide an enhanced grammar through
DASL
to aid in the construction of such queries. Interestingly, since a SAM-enabled
notebook is layered and semantic relationships are ultimately defined in XML,
the execution of such queries may simply involve server-side translation of
the
query into the basic DASL grammar within SAM and final execution within the
standard DASL engines of federated repositories. Alternatively, similar
mechanisms could be developed on the client side within CMCS, although it
would
involve significant additional work.
Content Submission and Retrieval
Once CMCS users have identified information they wish to
retrieve, either using the search described above or pedigree browsing tools
described below, or via a reference obtained in another manner, it is a simple
matter to download the data to the local machine. If the information needed is
metadata, it will already be represented in the XML formatted response to the
query, or can be retrieved using the DAV 'propfind' method. If it is data
that
is required, it can be retrieved using the DAV 'get' method. Since DAV is an
extension of HTTP, browsers can execute this 'get' request given the URL of
the
data item. Capabilities for supporting such requests will be developed in the
MCS portal.
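The metadata path can be sketched in Python as follows: one function prepares a PROPFIND request without sending it, and another reads property values out of a DAV multistatus response. The sample response, URL, and chemistry namespace are invented for illustration; the element names in the DAV: namespace are those used by WebDAV.

```python
import urllib.request
import xml.etree.ElementTree as ET

PROPFIND_BODY = (b'<?xml version="1.0"?>'
                 b'<D:propfind xmlns:D="DAV:"><D:allprop/></D:propfind>')

# Invented sample of what a server might return for one resource.
SAMPLE_RESPONSE = """<D:multistatus xmlns:D="DAV:"
    xmlns:c="http://example.org/cmcs/chemistry">
  <D:response>
    <D:href>/data/ch4.xml</D:href>
    <D:propstat>
      <D:prop><c:formula>CH4</c:formula></D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""


def propfind_request(url):
    """Prepare (but do not send) a PROPFIND for the resource itself."""
    return urllib.request.Request(
        url, data=PROPFIND_BODY, method="PROPFIND",
        headers={"Depth": "0", "Content-Type": "application/xml"})


def parse_multistatus(xml_text):
    """Return {href: {qualified property name: text}} for successful propstats."""
    result = {}
    for response in ET.fromstring(xml_text).findall("{DAV:}response"):
        href = response.findtext("{DAV:}href")
        props = result.setdefault(href, {})
        for propstat in response.findall("{DAV:}propstat"):
            status = propstat.findtext("{DAV:}status", default="")
            prop = propstat.find("{DAV:}prop")
            if prop is None or " 200 " not in status:
                continue  # skip properties the server could not return
            for element in prop:
                props[element.tag] = element.text
    return result


request = propfind_request("http://example.org/data/ch4.xml")
metadata = parse_multistatus(SAMPLE_RESPONSE)
print(request.get_method(), metadata)
```

Passing the prepared request to `urllib.request.urlopen` would issue it against a real server; retrieving the data itself is an ordinary HTTP GET of the same URL.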
While the methods above provide means to retrieve metadata
and data, they do not address the issues of making the information usable
within applications. To do this, a means of translating the returned values
into the data formats expected by the applications is required.
Similarly, while
submission of new data and metadata is conceptually simple and we may provide
some simple capabilities within the portal for manual submission, discussion
of
the technology to support submission from within applications, and of the
policy and procedural issues associated with adding material to a curated data
store, is deferred to later sections.
Data Pedigree and Dependency Browsing
Data pedigree refers to information about how a particular
piece of data was produced. In a narrow sense, this implies information about
who created the data, when it was produced, and the technique used to create
it. In the computational domain, the latter would include information on what
software was run, its version number, and the specific parameters used as
input
to the calculation. As discussed previously, in a broader sense pedigree
information may include information on assumptions that limit the data's
range
of applicability and sensitivity information tracing the uncertainty in a
given
value to uncertainties in the technique used and uncertainty in values used to
calculate it. A pedigree browsing mechanism would allow researchers to
discover
and navigate through this information.
In CMCS, we anticipate a variety of sources of pedigree
information, including applications, notebook entries, and metadata generators
that extract information from data objects. Such information will be
standardized based on the pedigree and dependency schema(s) defined/adopted by
the CMCS community. We propose to provide a tool for browsing this information
within the MCS portal. Such a tool would provide a visual representation of
the
pedigree information and allow users to shift focus forward and backward along
pedigree chains. SAM specifies a basic component for such navigation. We
anticipate guiding its development with requirements from CMCS users and
embedding it within a pedigree browsing tool that would be capable of
communicating with other MCS tools through drag-and-drop of data URLs,
allowing, for example, a user to drag data sets returned in response to a
query into the pedigree browser to understand their history.
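Shifting focus backward and forward along pedigree chains amounts to walking a dependency graph in either direction. The Python sketch below shows one minimal way to do that; the class, method names, and data URLs are hypothetical and are not part of SAM or the CMCS schema.

```python
from collections import deque


class PedigreeGraph:
    """Hypothetical in-memory index of 'derived from' links between data URLs."""

    def __init__(self):
        self._sources = {}  # item -> items it was derived from
        self._derived = {}  # item -> items derived from it

    def add_dependency(self, derived, source):
        self._sources.setdefault(derived, set()).add(source)
        self._derived.setdefault(source, set()).add(derived)

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal; the seen set guards against revisiting nodes.
        seen, queue = set(), deque([start])
        while queue:
            for neighbor in edges.get(queue.popleft(), ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def ancestors(self, item):
        """Everything the item depends on, directly or indirectly."""
        return self._walk(item, self._sources)

    def descendants(self, item):
        """Everything that depends on the item, directly or indirectly."""
        return self._walk(item, self._derived)


graph = PedigreeGraph()
graph.add_dependency("/data/thermo/ch4", "/data/geometry/ch4")
graph.add_dependency("/data/rates/ch4+oh", "/data/thermo/ch4")
print(graph.ancestors("/data/rates/ch4+oh"))
print(graph.descendants("/data/geometry/ch4"))
```

Keying the graph on data URLs matches the drag-and-drop scenario above, where a URL dropped into the browser becomes the starting point for traversal.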
It should also be noted that the concept of Active Tables
can be explained in part as a dependency browsing mechanism. Active Tables
combine a means of representing dependencies to the user with additional
capabilities to ensure consistency across dependency networks and a means for
updating dependency information. Thus, we anticipate opportunities for
technology sharing and interactions between Active Tables and the portal
pedigree browser that can be explored during the project. Similar overlap
exists with the sensitivity analyses involved in chemical mechanism refinement
and reduction, which, again, may lead to possibilities for linkage with the
pedigree browser development.
Literature Cited/References
XML Specification,
http://www.w3.org/TR/2000/REC-xml-20001006.
J. S. Binkley, et al.,
"Combustion Simulation and Modeling," Proceedings
of the Workshop on Combustion Simulation and Modeling, Reston, VA,
June/July 1998.
C. M. Pancerella, L. A. Rahn,
and C. L. Yang, "The Diesel Combustion Collaboratory: Combustion Researchers
Collaborating over the Internet", Proceedings
of ACM/IEEE SC99 Conference, Portland, OR,
November 1999.
R. A. Kendall, E. Aprà, D. E. Bernholdt, E. J. Bylaska,
M. Dupuis, G. I. Fann, R. J. Harrison, J. Ju, J. A. Nichols, J. Nieplocha, T.
P. Straatsma, T. L. Windus, and A. T. Wong, "High Performance Computational
Chemistry; an Overview of NWChem a Distributed Parallel Application,"
Computer Physics Communications, 128,
pp. 260, 2000.
D. E. Bernholdt, E. Aprà, H. A. Früchtl, M.F. Guest, R.
J. Harrison, R. A. Kendall, R. A. Kutteh, X. Long, J. B. Nicholas, J. A.
Nichols, H. L. Taylor, A. T. Wong, G. I. Fann, R. J. Littlefield, and J.
Nieplocha, "Parallel Computational Chemistry Made Easier: The Development of
NWChem", Int. J. Quantum Chem.: Quantum
Chem. Symposium 29, pp. 475-483,
1995.
M.F. Guest, E. Aprà, D. E. Bernholdt, H. A. Früchtl, R.
J. Harrison, R. A. Kendall, R. A. Kutteh, X. Long, J. B. Nicholas, J. A.
Nichols, H. L. Taylor, A. T. Wong, G. I. Fann, R. J. Littlefield, and J.
Nieplocha, "High Performance Computational Chemistry: NWChem and Fully
Distributed Parallel Applications", in Advances in Parallel Computing, 10, High
Performance Computing: Technology, Methods, and Applications, Eds. J.
Dongarra, L. Grandinetti, G. Joubert, and J. Kowalik, (Elsevier Science B. V.),
pp. 395-427, 1995.
Extensible Computational Chemistry Environment,
http://www.emsl.pnl.gov:2080/docs/ecce/index.html.
RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1,
ftp://ftp.isi.edu/in-notes/rfc2616.txt,
June 1999.
E. J. Whitehead, Jr. and M. Wiggins, "WebDAV: IETF Standard
for Collaborative Authoring on the Web," IEEE
Internet Computing, Vol. 2, No. 5, pp. 34, September-October 1998.
B. Ruscic, J. V.
Michael, P. C. Redfern, L. A. Curtiss, and K. Raghavachari, "Simultaneous
Adjustment of Experimentally Based Enthalpies of Formation of CF3X, X = nil,
H, Cl, Br, I, CF3, CN, and a Probe of G3 Theory," J. Phys. Chem. A. 102,
pp. 10889-10899, 1998.
B. Ruscic, M. Litorja, and R. L. Asher,
"Ionization Energy of Methylene Revisited: Improved Values for the Enthalpy
of
Formation of CH2 and the Bond Energy of CH3 via Simultaneous Solution of the
Local Thermochemical Network", J. Phys.
Chem. A. 103, pp. 8625-8633, 1999.
M. Frenklach, "Modeling", in Combustion Chemistry
(W. C. Gardiner, Jr., Ed.), Springer-Verlag, New York, Chap. 7, pp. 423-453,
1984.
M. Frenklach, H.
Wang, and M. Rabinowitz, "Optimization and Analysis of Large Chemical Kinetic
Mechanisms Using the Solution Mapping Method - Combustion of Methane," J.
Prog. Energy Combust. Sci. 18, pp.
47-73, 1992.
Chemkin,
http://www.ca.sandia.gov/chemkin/.
Chemical Markup Language,
http://www.xml-cml.org/.
H. N. Najm, J. H. Chen, J. F. Grcar, R.
Armstrong, C. Kennedy, J. Ray, W. Koegler, A. Lutz, M. Allendorf, D. Klinke,
A. McDaniel, N. Nystrom, R. Subramanya, and R. Reddy, "MPP DNS of diesel
autoignition", SAND2001-8075, November 2000.
R. Armstrong, D. Gannon, A. Geist, K. Keahey, S.
Kohn, L. McInnes, and S. Parker, "Toward a Common Component Architecture for High Performance
Scientific Computing" Proceedings of
1999 Conference on High Performance Distributed Computing, Redondo Beach,
CA, August 1999.
M. Thompson, W.
Johnston, S. Mudumbai, G. Hoo, and K. Jackson, "Certificate-based Access
Control for Widely Distributed Resources", Usenix Security Symposium
'99, March 1999.
Session Directories
for Setting up and Monitoring CORE2000/Habanero Conferences via Java, CORBA,
and LDAP,
http://www.emsl.pnl.gov:2080/docs/collab/presentations/papers/wsd.WebNet98.html.
J. D. Myers, C.
Fox-Dobbs, J. Laird, et al.,
"Electronic Laboratory Notebooks for Collaborative Research",
Proceedings of the Fifth Workshop on
Enabling Technologies: Infrastructure for Collaborative Enterprises (WET ICE
'96), Stanford, CA, June 1996.
K. A. Keating, J. D. Myers, J. G. Pelton, R. A. Bair,
D. E. Wemmer, and P. D. Ellis, "Development and Use of a Virtual NMR
Facility", Journal of Magnetic Resonance, 143,
pp. 172-183, 2000.
XSIL - Extensible Scientific Interchange Language,
http://www.cacr.caltech.edu/SDA/xsil/.
I. Foster and C. Kesselman (eds.), The Grid: Blueprint for a New
Computing
Infrastructure, Morgan Kaufmann Publishers, 1998.
UMWorktools, https://worktools.si.umich.edu/.
NCSA's OPIE system, http://www.ncsa.uiuc.edu/opie/.
The NPACI User HotPage,
https://hotpage.npaci.edu/.
PortalML (Portal
Markup Language),
http://www.oasis-open.org/cover/portalML.html.
R. Butler, D. Engert, I. Foster, C. Kesselman, S.
Tuecke, J. Volmer, and V. Welch, "A National-Scale Authentication
Infrastructure", IEEE Computer,
33(12), pp. 60-66, 2000.
S. Tuecke and C.
Kesselman, "Security and Policy for Group Collaboration," Proposal to
SciDAC 01-06, March 2001.
M. Thompson, "Distributed
Security Architectures: Middleware for Distributed Computing," Proposal to
SciDAC 01-06, March 2001.
"NIST Scientific and Technical Databases" -
http://www.nist.gov/srd/index.htm.
"NIST Chemistry WebBook" -
http://webbook.nist.gov/chemistry/.
"NIST 17. NIST Chemical Kinetics Database: Version 2Q98" -
http://www.nist.gov/srd/nist17.htm.
"NIST Standard Reference Data - Thermochemical Databases" -
http://www.nist.gov/srd/thermo.htm.
"NIST Standard Reference Data Products Catalog (Surface
Data)" -
http://www.nist.gov/srd/surface.htm.
J. Myers, "Scientific Annotation Middleware (SAM)," Proposal to SciDAC
01-06, March 2001.
"DAV Searching and Locating (DASL) protocol,"
http://www.webdav.org/dasl/.
"Resource Description Framework (RDF) language,"
http://www.w3.org/RDF/.
W. Appelt, "WWW-Based Collaboration with the BSCW System,"
Proceedings of SOFSEM’99, Springer Lecture Notes in Computer
Science 1725, pp. 66-78, Milovy (Czech Republic), 1999.
XSL Transformations (XSLT),
http://www.w3.org/TR/xslt.
D. Gracio, "Center for Collaborative Problem Solving in the Earth Sciences,"
Proposal to SciDAC 01-06, March, 2001.
Access Grid,
http://www-fp.mcs.anl.gov/fl/accessgrid/default.htm.
J. C.
Corchado, Y.-Y. Chuang, P. L. Fast, J. Villa, W.-P. Hu, Y.-P. Liu, G. C.
Lynch,
K. A. Nguyen, C. F. Jackels, V. S. Melissas, B. J. Lynch, I. Rossi, E. L.
Coitino, A. Fernandez-Ramos, R. Steckler, B. C. Garrett, A. D. Isaacson, and
D.
G. Truhlar, POLYRATE, version 8.5.1, University of Minnesota, Minneapolis, MN,
2000.
S. J.
Klippenstein, A. F. Wagner, R. C. Dunbar, D. M. Wardlaw, S. Robertson, and J.
A. Miller, VARIFLEX, version 1.07, A Chemical Kinetics Computer Program,
Argonne National Laboratory, December 2000.
R. G.
Susnow, A. M. Dean, W. H. Green, P. K. Peczak, and L. J. Broadbelt,
"Rate-Based Construction of Kinetic Models for Complex Systems",
Journal of Physical Chemistry, A 101,
pp. 3731-40, 1997.
B. Bhattacharjee,
W.H. Green, and P.I. Barton, "Globally Optimal Model Reduction",
presented at the AIChE National Meeting, Los Angeles, CA, November 2000.
J. C. Hewson and M. Bollig, "Reduced Mechanisms for NOx
Emissions from Hydrocarbon Diffusion Flames," Proc. of The Combustion
Institute, 26, pp. 2171-2179, 1996.
M. Bollig, H. Pitsch, J. C. Hewson, and K. Seshadri,
"Reduced n-Heptane Mechanism for Nonpremixed Combustion," Proc. of The
Combustion Institute, 26, pp. 729-737, 1996.
S. R. Tonse, N. W.
Moriarty, N. J. Brown, and M. Frenklach, "PRISM: Piecewise Reusable
Implementation Strategy for Chemical Kinetics," Israel J. Chem., 39,
pp. 97-106, 1999.
A. Trouve,
"Terascale High-Fidelity Simulations of Turbulent Combustion with Detailed
Chemistry," submitted to SciDAC 01-08, March 2001.
|