OR/14/022 Data standards for one way, static transfer of data

From Earthwise
Barkwith A K A P, Pachocka M, Watson C, Hughes A G. 2014. Couplers for linking environmental models: Scoping study and potential next steps. British Geological Survey Internal Report, OR/14/022.

General

The way data is organised, formatted and transferred within groundwater and other numerical process modelling teams within BGS has historically been controlled by the individual carrying out the research and influenced by the technologies used. This often results in a mass of loosely controlled text files stored on local and networked computer drives. These files contain source data, metadata on the methodology used to create the model, metadata on the model outputs and the resultant outputs. By learning lessons from the BGS corporate software team, the process modelling teams could improve model and data management, reduce duplication of effort and enable greater data reuse.

The BGS has invested a vast amount of money and time in the professionalisation of information management, specialising in the storage of geological data from a wide range of sources and standardising digital formats to maximise opportunities for data reuse. Through these efforts the BGS has built up a robust digital infrastructure and staff expertise in the fields of relational databases, application design and web-based communications. To date much of this knowledge has not been applied to process modelling in the BGS, but there are ongoing efforts to rectify this, for example by adapting international spatial metadata standards for use with process models and by introducing the source code repository and versioning system, Subversion.

Whilst the BGS aims to improve how static data relating to process models is managed, there remains a wider issue of how such data is incorporated into the model coupling technologies. The most popular coupling technology in the BGS to date has been the FluidEarth software development kit (SDK) for the OpenMI 1.4 standard, which does not support the linking of process model components to static data sources in a model workflow (referred to as a composition). The OpenMI 2.0 standard does include support for the linking of static data sources, but this functionality is yet to be tested by BGS staff.

Three data source types are most likely to be used in a linked model composition: text files, relational databases and web services. Each of these can be used in an ad hoc or a standardised way. The following lists provide an overview of the key standards, technologies and organisations that relate to the storage and transfer of gridded data, the most common spatial representation in mathematical process models.

A number of organisations publish standards for spatial data structure. These include:

  • ISO, the International Organization for Standardization, has traditionally focussed on the logical data models required to describe phenomena; these tend to be published in the form of UML models.
  • OGC, the Open Geospatial Consortium, aims to gain consensus on standards by building upon existing real-world implementations; its standards are therefore, it could be argued, more useful in applied use cases than those of ISO.
  • W3C, the World Wide Web Consortium, is the main standards organisation for the World Wide Web and was set up by Tim Berners-Lee. It aims to ensure compatibility and agreement between the industry leaders behind the web. W3C standards that may relate to IEM technologies include HTML, SOAP, SPARQL, XML and WSDL.

Other more proprietary organisations such as ESRI, Microsoft and Oracle define file formats and interfaces which often relate to international standards or become standards in their own right, simply because these technologies are so widely used.

Specific standards that relate to the datasets which are likely to be involved in linked models include:

  • CSW, Catalog Service for the Web, is one part of the OGC Catalog Service specification, which the OGC describes as follows: “Catalogue services support the ability to publish and search collections of descriptive information (metadata) for data, services, and related information objects. Metadata in catalogues represent resource characteristics that can be queried and presented for evaluation and further processing by both humans and software. Catalogue services are required to support the discovery and binding to registered information resources within an information community.”
  • GML, Geography Markup Language, is an OGC XML standard for geographic systems; it describes features, geometries, coordinate reference systems and more. One of the primary purposes of GML is to help connect various geographic databases.
  • WCS, Web Coverage Service, is an OGC standard that provides access to, subsetting of, and processing on a ‘coverage’ (a spatio-temporal feature conveying different values at different locations).
  • WCPS, Web Coverage Processing Service, is maintained by the OGC and provides a language for querying raster data over the web.
  • WFS, Web Feature Service, is an OGC standard that provides an interface allowing clients to query and access geographical features across the web.
  • WMS, Web Map Service, is a specification published by the OGC that defines a protocol for serving georeferenced map images over the internet. As the images themselves tend not to be analysed in the same way as the data received via a WFS call, this service may be less relevant to the challenge of linking models.
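
To illustrate how the WFS and WCS interfaces listed above are typically addressed, the following sketch builds key-value-pair request URLs using only the Python standard library. The endpoint address, the feature type name and the coverage identifier are hypothetical placeholders, and the axis labels used for subsetting depend on the coverage being served; the parameter names follow the WFS 2.0 and WCS 2.0 key-value-pair encodings. No request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical service endpoint; substitute the address of a real OGC server.
BASE_URL = "https://example.org/ogc"


def wfs_get_feature_url(type_name, bbox, count=100, base_url=BASE_URL):
    """Build a WFS 2.0 GetFeature request (key-value-pair encoding)."""
    params = [
        ("service", "WFS"),
        ("version", "2.0.0"),
        ("request", "GetFeature"),
        ("typeNames", type_name),
        # bbox is (min_x, min_y, max_x, max_y) in the layer's default CRS.
        ("bbox", ",".join(str(v) for v in bbox)),
        ("count", str(count)),
    ]
    return base_url + "?" + urlencode(params)


def wcs_get_coverage_url(coverage_id, subsets, fmt="image/tiff", base_url=BASE_URL):
    """Build a WCS 2.0 GetCoverage request that trims the coverage per axis."""
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
        ("format", fmt),
    ]
    # One 'subset' parameter per axis, e.g. subset=Lat(50,51).
    for axis, (low, high) in subsets.items():
        params.append(("subset", f"{axis}({low},{high})"))
    return base_url + "?" + urlencode(params)


def query_of(url):
    """Decode the query string of a URL back into a dict of lists."""
    return parse_qs(urlsplit(url).query)
```

A model component could pass the resulting URL to any HTTP client; the WFS response would typically be GML features, whereas the WCS response is the gridded data itself.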

Various technologies and libraries have been created to support the management of spatial data. Noteworthy examples include:

  • GDAL, Geospatial Data Abstraction Library is, according to gdal.org “a translator library for raster geospatial data formats that is released under an X/MIT style Open Source license by the Open Source Geospatial Foundation. As a library, it presents a single abstract data model to the calling application for all supported formats. It also comes with a variety of useful commandline utilities for data translation and processing.”
  • Oracle Spatial, although a less generic solution than most of those mentioned in this section, is particularly relevant to the BGS as the corporate database is hosted on an Oracle 11g server. The BGS corporate database contains a wealth of spatial data that could theoretically be consumed by process models, not least the Geological Object Store of modelled objects.
  • Oracle Spatial has an implementation of CSW
  • Through ArcSDE it is possible to access and edit Oracle Spatial data in a GIS environment
  • GDAL is able to read and write raster data in Oracle Spatial GeoRaster format

Direct database connections provide powerful ways to store and access spatio-temporal data and metadata. Connection technologies include:

  • ADO (ActiveX Data Objects), a Microsoft middleware layer that sits between a programming language and OLE DB.
  • ODBC, the Open Database Connectivity standard API for accessing data from a wide range of database platforms. Drivers exist for all major database management systems and for many other sources such as Microsoft Excel and CSV files.
  • OLE DB, another Microsoft solution, is an API that allows access to data in a variety of formats, including non-relational data sources. It is now a legacy technology that has been superseded by ODBC.
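
As a minimal sketch of how gridded data and its metadata might be held in relational tables and read back through a direct connection, the example below uses Python's built-in sqlite3 module as a stand-in for a corporate database. This is a substitution made only to keep the example self-contained; an ODBC connection would normally go through a separate driver package. The table and column names are hypothetical.

```python
import sqlite3


def create_grid_store(conn):
    """Create one table for grid metadata and one for the cell values."""
    conn.executescript(
        """
        CREATE TABLE grid_meta (
            grid_id   INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            units     TEXT NOT NULL,
            cell_size REAL NOT NULL,
            x_origin  REAL NOT NULL,
            y_origin  REAL NOT NULL
        );
        CREATE TABLE grid_cell (
            grid_id INTEGER NOT NULL REFERENCES grid_meta(grid_id),
            row_idx INTEGER NOT NULL,
            col_idx INTEGER NOT NULL,
            value   REAL,
            PRIMARY KEY (grid_id, row_idx, col_idx)
        );
        """
    )


def store_grid(conn, name, units, cell_size, x_origin, y_origin, values):
    """Insert the metadata row, then one row per grid cell; return the grid id."""
    cur = conn.execute(
        "INSERT INTO grid_meta (name, units, cell_size, x_origin, y_origin) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, units, cell_size, x_origin, y_origin),
    )
    grid_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO grid_cell (grid_id, row_idx, col_idx, value) VALUES (?, ?, ?, ?)",
        [(grid_id, r, c, v) for r, row in enumerate(values) for c, v in enumerate(row)],
    )
    conn.commit()
    return grid_id


def read_window(conn, grid_id, row_range, col_range):
    """Return (row, col, value) tuples for an inclusive row/column window."""
    return conn.execute(
        "SELECT row_idx, col_idx, value FROM grid_cell "
        "WHERE grid_id = ? AND row_idx BETWEEN ? AND ? AND col_idx BETWEEN ? AND ? "
        "ORDER BY row_idx, col_idx",
        (grid_id, row_range[0], row_range[1], col_range[0], col_range[1]),
    ).fetchall()
```

Keeping the metadata in its own table means that the units, origin and cell size travel with the values, so a consuming model can request only the window it needs rather than parsing a whole text file.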

Atmosphere

Atmospheric datasets tend to fall into three generic categories: Gridded Binary (GRIB), Network Common Data Form (netCDF) or the Hierarchical Data Format (HDF) system. All are intended for use with modern atmospheric datasets, which encompass information about the atmosphere, sea, and ocean. The same systems are used for observed and simulated data, as observational data is often used to initialise atmospheric models, particularly those adopted for short-term weather prediction. Atmospheric datasets avoid the use of gridded ASCII files, as the volume of data produced renders these file types unsuitable.

The Climate and Forecast (CF) standard for atmospheric datasets was conceived at the turn of the century and is increasingly gaining acceptance as the de facto convention. CF aims to distinguish quantities (description, units, prior processing, etc.) and to locate data spatio-temporally as a function of other independent variables, such as a coordinate system (Gregory, 2003[1]). Each method of storing data for transfer has its own advantages, so the method selected should be the one best suited to the data concerned.

GRIB

The Gridded Binary (GRIB) format is commonly used to store meteorological datasets, both forecast and historical. The GRIB standard is described in detail in the World Meteorological Organization (WMO) code manual (WMO, 1995[2]). There have been three versions of the GRIB standard; however, the first (GRIB 0) was only used on a limited number of projects. The second version (GRIB 1) has been used operationally for a number of years. The third-generation format (GRIB 2) is currently used operationally by some institutions, and its use is expanding.

A GRIB file is a set of self-contained records, which retain their usability when the file is broken down into individual records. Each record is composed of two main parts, the header and the data, the latter of which is in binary format.
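
The self-contained nature of GRIB records can be sketched as follows. Each record (message) opens with the characters "GRIB", carries its own edition number and total length near the start, and closes with "7777", so a file can be split into records without decoding the data. The sketch assumes that messages are stored back to back with no padding between them, and it ignores the special length encoding used by very large GRIB 1 messages; the function names are illustrative only.

```python
import struct


def read_grib_indicator(message: bytes):
    """Return (edition, total_length_in_bytes) from the start of a GRIB message."""
    if len(message) < 8 or message[:4] != b"GRIB":
        raise ValueError("not a GRIB message")
    edition = message[7]  # octet 8 holds the edition number in GRIB 1 and GRIB 2
    if edition == 1:
        # GRIB 1: octets 5-7 hold the total message length (big-endian).
        total_length = int.from_bytes(message[4:7], "big")
    elif edition == 2:
        if len(message) < 16:
            raise ValueError("truncated GRIB 2 indicator section")
        # GRIB 2: octets 9-16 hold the total message length (big-endian).
        (total_length,) = struct.unpack(">Q", message[8:16])
    else:
        raise ValueError(f"unsupported GRIB edition: {edition}")
    return edition, total_length


def split_messages(buffer: bytes):
    """Split back-to-back GRIB messages into a list of individual records."""
    records = []
    offset = 0
    while offset < len(buffer):
        _, length = read_grib_indicator(buffer[offset:])
        record = buffer[offset:offset + length]
        # Every complete message ends with the four characters '7777'.
        if len(record) != length or record[-4:] != b"7777":
            raise ValueError("truncated or corrupt GRIB message")
        records.append(record)
        offset += length
    return records
```

Because each record carries its own length, a transfer tool can forward or store individual records from a larger file and each one remains independently readable.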

HDF

Hierarchical Data Format (HDF, HDF4, or HDF5) is the name of a set of file formats and libraries designed to store and organise large amounts of numerical data. The HDF format, libraries and associated tools are available under a liberal, Berkeley Software Distribution (BSD)-like license for general use. HDF is supported by many commercial and non-commercial software platforms, including Java, MATLAB/Scilab, Octave, IDL, Python, and R. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).

HDF is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called vgroups. There are currently two major versions of HDF, HDF4 and HDF5, which differ significantly in design and API.

HDF4 is the older version of the format, although it is still actively supported by The HDF Group. It supports a proliferation of different data models, including multidimensional arrays, raster images, and tables. Each defines a specific aggregate data type and provides an API for reading, writing, and organising the data and metadata. New data models can be added by the HDF developers or users. The HDF4 format has many limitations. It lacks a clear object model, which makes continued support and improvement difficult. Supporting many different interface styles (images, tables, arrays) leads to a complex API. Support for metadata depends on which interface is in use; SD (Scientific Dataset) objects support arbitrary named attributes, while other types only support predefined metadata. Perhaps most importantly, the use of 32-bit signed integers for addressing limits HDF4 files to a maximum of 2 GB, which is unacceptable in many modern scientific applications.

The HDF5 format is designed to address some of the limitations of the HDF4 library, and to address current and anticipated requirements of modern systems and applications. HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of a SQL database, but access is available for non-array data. HDF5 simplifies the file structure to include only two major types of object:

  • Datasets, which are multidimensional arrays of a homogeneous type
  • Groups, which are container structures that can hold datasets and other groups

This results in a hierarchical filesystem-like data format. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.
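
The two-object model described above can be illustrated with a small in-memory sketch. This is a conceptual model only, written in plain Python; it is not the HDF5 library or any real binding to it, it does not write files, and it does not enforce a homogeneous element type. Libraries such as h5py expose comparable operations (groups, datasets and named attributes) on real HDF5 files. The class and method names here are chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A multidimensional array (here a nested list) plus named attributes."""
    data: list
    attrs: dict = field(default_factory=dict)


@dataclass
class Group:
    """A container holding datasets and other groups, plus named attributes."""
    children: dict = field(default_factory=dict)
    attrs: dict = field(default_factory=dict)

    def create_group(self, name):
        group = Group()
        self.children[name] = group
        return group

    def create_dataset(self, name, data, **attrs):
        dataset = Dataset(data=data, attrs=dict(attrs))
        self.children[name] = dataset
        return dataset

    def lookup(self, path):
        """Resolve a filesystem-like path such as '/model/run1/head'."""
        node = self
        for part in (p for p in path.split("/") if p):
            if not isinstance(node, Group):
                raise KeyError(path)  # cannot descend into a dataset
            node = node.children[part]
        return node


# Build a small hierarchy: /model/run1/head with metadata attached as attributes.
root = Group()
run1 = root.create_group("model").create_group("run1")
run1.attrs["description"] = "illustrative model run"
run1.create_dataset("head", [[1.0, 2.0], [3.0, 4.0]], units="m")
```

Higher-level structures such as images or tables are then conventions layered on top of these three ingredients: groups, datasets and attributes.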

The latest version of NetCDF, version 4, is based on HDF5.

NetCDF

NetCDF is a set of interfaces for array-oriented data access and a distributed collection of data access libraries for C, Fortran, C++, Java, and other languages. The NetCDF libraries support a machine-independent format for representing scientific data. Together, the interfaces, libraries, and format support the creation, access, and sharing of scientific data.

The NetCDF format is self-describing, in that the file includes information about the data it contains. NetCDF files are also platform independent, so they can be accessed by computers with different ways of storing integers, characters, and floating-point numbers. One major advantage of the NetCDF format is its ability to handle large datasets that are unsuitable for other formats. The NetCDF libraries are designed to be backwards compatible, so data stored in older versions of the format remain accessible.
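
As a small illustration of self-description and platform independence, the sketch below identifies a file's format from its first few bytes and reads the record count from a classic NetCDF header. Classic NetCDF files begin with the characters "CDF" followed by a version byte, and store numbers in big-endian order regardless of the machine that wrote them; NetCDF-4 files begin with the HDF5 signature because they are HDF5 files. The sketch assumes the signature sits at the very start of the file, and the function names are illustrative.

```python
import struct

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # also the start of NetCDF-4 files
HDF4_SIGNATURE = b"\x0e\x03\x13\x01"
STREAMING = 0xFFFFFFFF  # classic-format marker for an indeterminate record count


def sniff_format(header: bytes) -> str:
    """Guess the container format from the leading bytes of a file."""
    if header.startswith(b"CDF\x01"):
        return "netCDF classic"
    if header.startswith(b"CDF\x02"):
        return "netCDF 64-bit offset"
    if header.startswith(HDF5_SIGNATURE):
        return "HDF5 (possibly netCDF-4)"
    if header.startswith(HDF4_SIGNATURE):
        return "HDF4"
    if header.startswith(b"GRIB"):
        return "GRIB"
    return "unknown"


def classic_record_count(header: bytes):
    """Return the number of records in a classic NetCDF file, or None if streaming."""
    if sniff_format(header) not in ("netCDF classic", "netCDF 64-bit offset"):
        raise ValueError("not a classic-format NetCDF header")
    if len(header) < 8:
        raise ValueError("header too short")
    # '>I' forces big-endian decoding, so the result is the same on any host.
    (numrecs,) = struct.unpack(">I", header[4:8])
    return None if numrecs == STREAMING else numrecs
```

In practice the first argument would be the first few bytes read from a file opened in binary mode; a coupling tool could use such a check to choose the appropriate reader before handing data to a model.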

CF

The Climate and Forecast (CF) convention is intended for use with state estimation and forecasting data, in the atmosphere, ocean, and other physical domains. It is used by many atmospheric institutions and projects around the world. It was designed primarily to address gridded data types such as numerical weather prediction model outputs and climatology data in which data binning is used to impose a regular structure. However, the CF conventions are also applicable to many classes of observational data and have been adopted by a number of groups for such applications. CF originated as a standard for data written in netCDF, but its structure is general and it has been adapted for use with other data formats. For example, using the CF conventions with HDF data has been explored.

CF conventions are for the description of Earth sciences data, intended to promote the processing and sharing of data files. The metadata defined by the CF conventions are generally included in the same file as the data, thus making the file self-describing. The conventions provide a definitive description of what the data values found in each CF variable represent, and of the spatial and temporal properties of the data, including information about grids, such as grid cell bounds and cell averaging methods. This enables users of files from different sources to decide which variables are comparable, and is a basis for building software applications with powerful data extraction, grid remapping, data analysis, and data visualisation capabilities.
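
The idea of using CF metadata to decide whether variables from different sources are comparable can be sketched with plain dictionaries standing in for the attributes attached to each variable. The attribute names (standard_name, long_name, units, cell_methods) are those defined by the CF conventions; the checks themselves are a deliberately simplified assumption, since a real tool would also convert between compatible units and compare grids and coordinates.

```python
def cf_problems(attrs):
    """List basic omissions in a variable's CF attributes (simplified check)."""
    problems = []
    if "units" not in attrs:
        problems.append("missing units")
    if "standard_name" not in attrs and "long_name" not in attrs:
        problems.append("missing standard_name or long_name")
    return problems


def comparable(a, b):
    """Treat two variables as comparable only if they describe the same quantity,
    in the same units, processed in the same way (a strict, simplified rule)."""
    if "standard_name" not in a or "standard_name" not in b:
        return False
    keys = ("standard_name", "units", "cell_methods")
    return all(a.get(k) == b.get(k) for k in keys)


# Illustrative attribute sets, as might be read from two different model outputs.
temperature_model_a = {
    "standard_name": "air_temperature",
    "units": "K",
    "cell_methods": "time: mean",
}
temperature_model_b = dict(temperature_model_a)
rainfall_model_a = {
    "standard_name": "precipitation_flux",
    "units": "kg m-2 s-1",
    "cell_methods": "time: mean",
}
```

Because these attributes are stored in the same file as the data, such a check can be made without any knowledge of the model that produced the file.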

The CF conventions have been adopted by a wide variety of national and international programmes and activities in the Earth sciences. For example, they were required for the climate model output data collected for the Coupled Model Intercomparison Project (CMIP), which is the basis of Intergovernmental Panel on Climate Change assessment reports. They are promoted as an important element of scientific community coordination by the World Climate Research Programme. They are also used as a technical foundation for a number of software packages and data systems, including the Climate Model Output Rewriter (CMOR), which is post-processing software for climate model data, and the Earth System Grid, which distributes climate and other data. The CF conventions have also been used to describe the physical fields transferred between individual Earth system model software components, such as atmosphere and ocean components, as the model runs.

References

  1. Gregory, J M, 2003. The CF metadata standard. Technical Report 8, CLIVAR.
  2. WMO, 1995. Manual on Codes. WMO Publication Number 306, Volume 1, Part B, 1995 Edition, plus Supplements.