Project Overview
Introduction
INSPIRE is a European Union directive that requires member states and agencies to share spatial data in order to better support environmental decision making. The key approach to enable this data exchange is the use of a common data model accessed using open standards such as OGC WFS (Open Geospatial Consortium Web Feature Services). This prototype demonstrates how FME can be used to implement this approach by reading from disparate data sources with different data models and transforming their schema to conform to the common INSPIRE schema for Protected Areas. This is done with a Loader workspace which reads the disparate sources, processes the required schema transformations and loads them into a PostGIS spatial database. It also demonstrates how FME can be used to construct an OGC web service to publish this data. A workspace was built to read from PostGIS and write out GML 3.2.1 data that complies with the INSPIRE Protected Areas schema. This was then published to the WFS service on FME Server.
Thus this prototype successfully demonstrates that FME Desktop can be used to design data transforms and FME Server can be used to publish them via OGC web services in order to support spatial data integration of different real world data sources into the common data structure defined in the INSPIRE protected area schema. All this was done without any programming or scripting. The only resources required were XML documents for the getCapabilities.xml, describeFeature.xml, a sample record for protected sites and a sample parent document that were adapted to serve as templates. It is hoped that this prototype will serve as an example implementation for other jurisdictions and inspire other users to consider using Spatial ETL tools such as FME to bridge the gap between their diverse data sources and the evolving complexities of the INSPIRE specifications with minimal risk and effort.
A special thanks to Metria, our partners in Sweden, for their central role and participation in this project.
Source Data
This demo uses data from three sources: Natura2000 (a protected areas database for Europe), Swedish NVR (protected areas data from Swedish Environment), and Helcom (a database of protected areas for the Baltic region). All of these have different native schemas which had to be mapped to the INSPIRE protected areas schema.
Processing Steps
Loader workspace (Source to PostGIS):
1. Read the source database (Natura2000, NVR, Helcom or alternate).
2. Join all the required site fields together with a FeatureMerger based on site code.
3. Create any required field not present in source.
4. Process each feature through a SchemaMapper. This defines all the attribute mappings from the source schema to the destination Inspire protected areas schema. It is also used to create many of the required fields not present in the source and then sets default values for those fields.
5. Extract OGC geometry as GML 3.2 XML and store in a _geometry text attribute.
6. Generate the XML for each child entity (one to many relationships), such as protected site activity, with XMLTemplater. XMLTemplater uses an XML prototype of the subrecord entity and then substitutes attributes into that template using fme:get-attribute() and fme:get-xml-attribute() functions located at the appropriate place within the template. This works like a document merge application.
7. Merge activities into protected site features containing the list of XML snippets for the activities and one large XML snippet for the geometry.
8. Write the complete protected site feature to the ProtectedSite table in the Inspire PostGIS database, which is modelled after the Inspire Protected Areas schema.
Exporter workspace (PostGIS to GML and WFS):
1. Read the source Inspire ProtectedSite table from the PostGIS database. Use the extents supplied with the BBox parameters via the WFS getFeature request.
2. Generate the XML for each protected site feature with XMLTemplater, substituting attributes into protectedSites_featureTemplate.xml using fme:get-attribute() and XML snippets using fme:get-xml-attribute().
3. Merge into a single feature containing the list of the protected site XML snippets.
4. Generate the XML for the entire dataset with XMLTemplater, substituting the protectedSites features XML into protectedSites_datasetTemplate.xml using fme:get-xml-list-attribute().
5. Test for schema validity against the Inspire schemas (this test is optional, as disabling the test improves performance).
6. Write the completed XML doc using the text file writer.
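As a rough illustration of step 1 of the Exporter, the sketch below turns a WFS-style BBOX parameter into a parameterized PostGIS query. It is not taken from the workspace itself: the function names, the `geom` column name, the SRID of 4326, and the x,y axis order are all assumptions. `ST_MakeEnvelope` and the `&&` bounding-box operator are standard PostGIS.

```python
def parse_bbox(bbox_param):
    """Parse 'minx,miny,maxx,maxy[,crs]' into four floats (x,y order assumed)."""
    parts = bbox_param.split(",")[:4]  # ignore an optional trailing CRS token
    if len(parts) != 4:
        raise ValueError("BBOX needs four coordinates")
    min_x, min_y, max_x, max_y = (float(p) for p in parts)
    if min_x > max_x or min_y > max_y:
        raise ValueError("BBOX minimums must not exceed maximums")
    return min_x, min_y, max_x, max_y


def bbox_query(bbox_param, table="ProtectedSite", geom_column="geom", srid=4326):
    """Build SQL plus parameters selecting rows whose bounding box overlaps the BBOX.

    table and geom_column are trusted, hard-coded identifiers (assumptions),
    never user input; only the coordinates are passed as query parameters.
    """
    sql = (
        f'SELECT * FROM "{table}" '
        f'WHERE "{geom_column}" && ST_MakeEnvelope(%s, %s, %s, %s, %s)'
    )
    return sql, (*parse_bbox(bbox_param), srid)


sql, params = bbox_query("10,55,13,58")
```

The returned pair can be handed to any DB-API driver that uses `%s` placeholders.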
Note that in the FME Viewer the WFS data can easily be overlaid with other vector or raster data sources such as a WMS boundary layer to give context to the protected sites. If you have any problems, please contact support@safe.com and make sure you specify Inspire Protected Areas Demo in the subject line.
DEVELOPMENT
Prototype Goal
Build a scalable, INSPIRE compliant data publishing and distribution system. This means complying with the standardized specifications for the common INSPIRE data model and employing OGC standards for web services.
Design Criteria
One of the most important requirements for this system is that it be scalable. This means it should be easy to add new data sources, easy to increase processing power to serve more users and process more data as demand increases, and easy to incorporate new data types. Naturally, everything must be INSPIRE compliant. This is accomplished by using OGC standards such as WFS and verifying that all data output complies with the INSPIRE schema. Specifically, this means that the GML streamed via the WFS must be validated against the GML 3.2.1 protected sites xsds supplied by JRC.
Two other criteria were also central to the potential future success of systems based on this prototype. These criteria are ones that are sometimes overlooked by standards based approaches that seek ideal solutions but do not always consider the gritty realities of integration with existing workflows. Any new INSPIRE based system needs to be non-intrusive. That means that existing workflows, tools, and data models must not be impacted. The INSPIRE system shouldn’t dictate how work is to be done. Contributing organizations must be able to innovate as they want. The beauty of FME is that it allows for complex integrations across very different systems, enabling users to bridge gaps such as those that often occur between open standards and proprietary systems. The reality is that many organizations will continue to use proprietary systems, and ways need to be found to integrate these with open standards based initiatives such as INSPIRE.
Approach
FME’s abilities to work with complex XML have historically been limited. Traditional GIS and databases are relational based. This means that data structures are typically flat whether or not they include geometry. On the other hand, GML structures such as those used in INSPIRE are often complex, object-oriented, nested structures. For example, a house has rooms which have furniture which has drawers etc. This becomes difficult to model in the relational context in an automated way.
With FME 2010 and onwards we have developed a template approach to solving this problem. An XML template is used to define a structure as complex or nested as you want. Then, wherever you want to insert attribute values, all you need to do is insert fme:get-attribute() functions in the XML template. This is typically done within a record template that contains the XML to describe the data associated with one feature type. Then, when you want to roll all your feature types into parent objects or documents you can use the fme:get-list-attribute function. When attributes contain XML fragments we had to use fme:get-xml-attribute calls to allow the tags to be dealt with correctly. When all is said and done we simply assemble all the XML into one text document and write it out using our textfile writer. We also can pass the XML through an XMLValidator transformer to validate the output against the application schema as a last step before writing.
So ultimately this involves a process something like a document merge, where the XML template holds the structures that don’t change and the FME get functions substitute in the values that do change from one record to the next. While this might look like a shortcut, the end result is XML that validates against the GML 3.2.1 INSPIRE application schemas, the ends justifying the means. It is also a much more accessible approach, since the author only really needs to understand the data structures, which is typically required of FME users anyway. Gone is the need to code complicated scripts in XSLT, which is more typically the specialized domain of XML programmers.
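The document-merge idea can be illustrated outside FME with a few lines of Python. This is only an analogy, not XMLTemplater's own syntax: the `{name}` placeholders stand in for fme:get-attribute() calls, and `render_template`, the template, and the element names are all hypothetical. The one distinction it tries to capture is that plain values are escaped, while attributes that already hold XML fragments are inserted untouched (the role fme:get-xml-attribute plays above).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical record template; element names are illustrative only.
RECORD_TEMPLATE = (
    "<site>"
    "<name>{siteName}</name>"
    "<code>{siteCode}</code>"
    "{geometry_xml}"
    "</site>"
)


def render_template(template, attributes, xml_attributes=()):
    """Substitute attribute values into the template.

    Plain values are XML-escaped; names listed in xml_attributes are
    assumed to hold well-formed XML already and are inserted as-is.
    """
    values = {}
    for key, value in attributes.items():
        if key in xml_attributes:
            values[key] = value
        else:
            values[key] = escape(str(value))
    return template.format(**values)


feature = {
    "siteName": "Lake & Forest Reserve",
    "siteCode": "SE0001",
    "geometry_xml": "<geom>POINT</geom>",
}
record = render_template(RECORD_TEMPLATE, feature, xml_attributes={"geometry_xml"})
ET.fromstring(record)  # raises ParseError if the result is not well-formed
```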
The other challenge inherent in any system which integrates disparate systems with a common data model is schema transformation. FME has supported this capability for a long time. One of the key tools for this has been SchemaMapper. However, in the past, the SchemaMapper interface has been a challenge to learn. Recent improvements to FME have made it much more intuitive. The basic idea is to separate the data process authoring from the data remodelling problem. This way data domain experts are free to author schema mapping without having to be data process experts. In this case we use a spreadsheet or csv file to define all the mappings from each of the source data models to the common INSPIRE schema. We also use the same tables to define new fields not present in the data source and provide them with the required default values. Conditional mapping rules and coded domains can also be applied. The SchemaMapper transformer is then inserted in the appropriate location within the integration data flow (after the required joiners) and automatically applies these data structure transformation rules across all features that pass through it, no matter what feature type they are from or attributes they contain.
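To make the table-driven idea concrete, here is a minimal Python sketch of applying a CSV mapping table to a feature. It does not reproduce SchemaMapper's actual table layout: the column names (`source_field`, `target_field`, `default`), the field names, and both functions are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical mapping table. A row with an empty source_field defines a
# new target field that is filled with its default value.
MAPPING_CSV = """source_field,target_field,default
SITECODE,inspireId,
SITENAME,siteName,
,legalFoundationDocument,unknown
"""


def load_mapping(text):
    """Read the mapping table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def apply_mapping(feature, mapping):
    """Rename source attributes to target names and fill in defaults."""
    out = {}
    for row in mapping:
        source = row["source_field"]
        target = row["target_field"]
        if source and source in feature:
            out[target] = feature[source]
        else:
            out[target] = row["default"]
    return out


mapping = load_mapping(MAPPING_CSV)
mapped = apply_mapping({"SITECODE": "SE0001", "SITENAME": "Example"}, mapping)
```

Because the rules live in the table, supporting another source in this sketch means supplying another table, with no change to the code.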
Thus SchemaMapper and XMLTemplater are two tools within FME that are crucial to the functioning of this prototype and allow FME to transform both the data schema, and the XML structure into the form needed to comply with the defined INSPIRE standards. Because of the common data model approach, much of the effort involved was invested in building the processes to map the first data source to the INSPIRE schema. Thereafter, these processes were easily replicated and modified to accommodate different source datasets. All that had to be done for new data sources was to ‘fill in the blanks’ on the mappings and set the source attributes.
Challenges
There were a number of challenges that had to be overcome in order to implement this project. Within FME, we had to add some additional XML functions for the XMLTemplater. One of these was the fme:get-xml-list-attribute which simplified the process of rolling all the feature output for one feature type into the parent XML document without having to script a for loop.
In terms of the overall project, significant effort was required to develop the mappings from the source schemas to the destination INSPIRE schema. The INSPIRE schema is by nature rather complex so it does not necessarily match closely any established dataset standard. It took considerable time to digest the documentation of the INSPIRE standard and interpret the meaning of each of the fields and decide which is the best match from the source data or what is the most appropriate default value. Because of our approach with the SchemaMapper defined above, it was very easy to separate this task from the development work around authoring the transformation workspaces. This meant that one task was not unnecessarily held up by the other. It also allowed team members to focus on their strengths. Those that knew the datasets well were assigned schema mapping tasks while others who knew FME and XML processing well worked more on workspace development.
One major unknown was how to generate valid and INSPIRE compliant GML. First, this was a challenge since out of the box FME does not yet have a GML 3.2.1 writer. Even for the standard FME GML 3.1.1 writer, in the past we have relied on XSLT to reformat our FME GML output to comply with customized application schemas. In this case we used the XMLTemplater approach described above. The first step was to interpret the INSPIRE schemas and use them to construct prototype XML records and parent documents. This was made more challenging simply because so little real world INSPIRE data actually exists, or at least we had very little available to us. So we had to create a sample ProtectedSite record manually based on our best interpretation of the schema and then test this to see if it indeed complied.
This wasn’t too hard to do for basic attribute types, but did pose more problems for geometry and child data types. As FME does not include a GML 3.2 writer, this meant using an older OGC WKT geometry extraction tool and then parsing this text to conform. However, through the course of the project a new parameter was added to the GeometryExtractor that allowed for the extraction of geometry to GML 3.2.1 as well as GML 3.1.1, which simplified this process. The next major challenge was how to deal with nested objects or properties. Especially within Natura2000, protected sites can have child impacts and protected entities. Fortunately the INSPIRE schema was designed to accept these. However, we had to find a way to associate the parent with its child entities, and then render the appropriate XML. This was accomplished using FeatureMergers with lists. The lists were then passed through special XML templates for each child entity type, with get-list-attribute calls to generate XML fragments for each child. Finally, get-xml-list-attribute calls were used to roll up all the children per record for each feature. Once we figured out how to do this for one child entity type, this approach was relatively easy to replicate for other child types.
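The parent/child roll-up can be sketched in plain Python as a group-by on the join key followed by fragment generation. This is an analogy to the FeatureMerger-with-lists step rather than a description of how FME implements it; `attach_children`, the `site_code` key, and the `<activity>` element are hypothetical names.

```python
from collections import defaultdict
from xml.sax.saxutils import escape


def attach_children(sites, activities, key="site_code"):
    """Group child records by parent key and attach them as one XML snippet."""
    by_site = defaultdict(list)
    for activity in activities:
        by_site[activity[key]].append(activity)

    merged = []
    for site in sites:
        # One fragment per child, concatenated in input order.
        fragments = [
            "<activity>{}</activity>".format(escape(child["name"]))
            for child in by_site.get(site[key], [])
        ]
        merged.append({**site, "activities_xml": "".join(fragments)})
    return merged


sites = [{"site_code": "A"}, {"site_code": "B"}]
activities = [
    {"site_code": "A", "name": "grazing"},
    {"site_code": "A", "name": "fishing"},
]
result = attach_children(sites, activities)
```

A site with no children simply receives an empty snippet, so the parent template can insert it unconditionally.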
Another big unknown in this project was publishing INSPIRE GML 3.2.1 data via our WFS Service on FME Server. There are a lot of WFS servers and clients out there, but many of them are limited to a specific profile of GML, such as simple features. In order for our WFS to comply with INSPIRE standards, it had to be able to transmit application schema specific GML. Early on, we identified that our WFS Service initially only allowed output from our GML 3.1.1 writer. We had to relax this condition so that any workspace that generated XML could publish to the WFS service. Secondly, the getCapabilities and describeFeature documents were automatically generated by FME Server. We added the ability to disable this and allow the author to supply custom getCapabilities and describeFeature documents that complied with the target application schema. Finally, the feature type name in the database had to match the name of the feature type in the WFS and ultimately the one in the INSPIRE schema in order for publishing to work. This troubleshooting process on FME Server was a challenge at times because it involved reworking how the WFS service works and as such did not provide a lot of feedback when we didn’t do this quite right. However, now that we have done it once it is much easier to replicate. For example, adding new data sources involved no new effort in terms of the FME Server configuration, other than physically loading the new data, which meets our goals for scalability. Overall, publishing transformations to Server was one of the steps in the project requiring the least effort, and only minimal enhancements to FME Server were required to support this.
One challenge related to scalability is data volume, particularly in relation to geometry. We encountered this problem when we introduced the spatial database for data staging to support multiple sources. When we transformed straight from source to GML, this was not an issue since XML does not limit field width. However, spatial databases typically do, and some polygons in Natura2000 had hundreds of thousands of points. Because we decided to render the geometry to XML during the load process, we ran out of storage room in the database for a few of the largest features. In the end we had to add filters to drop records that exceed the database capacity and log warnings as needed. However, to mitigate this problem in the future, we have added a new AttributeCompressor transformer that allows users to compress and decompress the XML geometry attributes, which will in turn improve performance and reduce the database footprint by at least an order of magnitude.
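The effect of compressing a long XML geometry string before storing it in a text column can be tried with Python's standard library. This sketch uses zlib plus base64 purely as an assumption; the source does not say which algorithm or encoding AttributeCompressor uses, and the achievable ratio depends heavily on the data.

```python
import base64
import zlib


def compress_text(text):
    """Compress a string and return an ASCII-safe representation."""
    raw = zlib.compress(text.encode("utf-8"), 9)
    return base64.b64encode(raw).decode("ascii")


def decompress_text(packed):
    """Reverse compress_text."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")


# Synthetic coordinate list standing in for a large polygon (illustrative only).
coords = " ".join(f"{i * 0.001:.3f} {i * 0.002:.3f}" for i in range(5000))
gml = "<posList>" + coords + "</posList>"
packed = compress_text(gml)
```

Comparing `len(packed)` with `len(gml)` on real geometries is a quick way to judge whether the saving justifies the extra decompression step at export time.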
Fully complying with the INSPIRE schema presented some challenges. While we had the basic XML structure correct, there were a couple of issues that initially caused us problems during validation. One of the tools new to FME since 2010 is the XMLValidator. This was handy to help automate XML evaluation and problem diagnosis. With it we found out that we had some namespace problems. The protected areas schema involves many namespaces. Also the XMLTemplater requires that all namespaces used be declared even within the XML fragments. It took us a while to figure out what namespaces were actually required at what level, and what namespace elements were optional or even detrimental. With some help from the GML development team, we finally crafted a set of namespace declarations that met our needs and allowed us to eliminate namespace errors. We also had to learn how to use XMLTemplater with XML fragments. In some cases we had to add wrapper <temp></temp> tags around the XML fragments so that they would remain valid XML. These <temp> tags were later removed when the fragments were assembled into the parent XML document. A new enhancement to preserve white space through XMLTemplate processing also helped us a lot with keeping our XML output readable.
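Both points above, the temporary wrapper element and the need to declare namespaces inside fragments, can be reproduced with Python's `xml.etree.ElementTree`. The helper names are hypothetical and this is a sketch of the general XML behaviour, not of XMLTemplater internals.

```python
import xml.etree.ElementTree as ET


def parse_fragments(fragment_text):
    """Wrap sibling fragments in a temporary root so they parse as one document."""
    wrapper = ET.fromstring("<temp>" + fragment_text + "</temp>")
    return list(wrapper)  # the children, without the <temp> wrapper


def assemble(parent_tag, fragment_text):
    """Move the parsed fragments under the real parent element."""
    parent = ET.Element(parent_tag)
    parent.extend(parse_fragments(fragment_text))
    return ET.tostring(parent, encoding="unicode")


def is_well_formed(fragment_text):
    """Return False when the fragment fails to parse, e.g. an undeclared prefix."""
    try:
        parse_fragments(fragment_text)
        return True
    except ET.ParseError:
        return False


document = assemble("site", "<a>1</a><b>2</b>")
undeclared = is_well_formed("<gml:pos>1 2</gml:pos>")
declared = is_well_formed(
    '<gml:pos xmlns:gml="http://www.opengis.net/gml/3.2">1 2</gml:pos>'
)
```

The fragment with the `gml:` prefix parses only once the namespace is declared within the fragment itself, which mirrors the requirement described above.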
We also had validation issues with our date / time data types and some duplicate gml_ids. Date data type problems were resolved with a combination of FME string processing and DateFormatter transformer manipulations. The duplicate gml_id problem was partly caused by our PostGIS reader. When reading geometry collections, such as a multi parcel protected area, the reader generates multiple records, each with a component geometry but also with the same attributes. Because we store the XML geometry in a string attribute in the database, we no longer need to read the actual feature geometry. So all we had to do was add a DuplicateRemover based on gml_id and we were able to eliminate these duplicate records. We also decided to use a GOIDGenerator to create universally unique ids for our gml_id. This became important when using multiple data sources, as some of our counter based ids became duplicates when new data source loaders were created.
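The two fixes for duplicate ids can be sketched in Python as follows. This is an analogy to the DuplicateRemover and GOIDGenerator steps, not their implementation: the function names and the `PS_` prefix are assumptions. The prefix is there because a gml:id is an XML ID, which must not begin with a digit, and a bare UUID often does.

```python
import uuid


def drop_duplicates(records, key="gml_id"):
    """Keep the first record seen for each key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] in seen:
            continue
        seen.add(record[key])
        unique.append(record)
    return unique


def new_gml_id(prefix="PS_"):
    """Return an id that stays unique across independently run loaders."""
    return prefix + uuid.uuid4().hex


records = [
    {"gml_id": "PS_1", "part": 1},
    {"gml_id": "PS_1", "part": 2},
    {"gml_id": "PS_2", "part": 1},
]
unique_records = drop_duplicates(records)
```

Unlike a per-loader counter, a random UUID needs no coordination between loaders, which is why it avoids the collisions described above.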
Overall we had a range of challenges – some expected and some surprises. Most of these challenges related to the complexities of the schemas involved and our learning curve in dealing with the intricacies of XML structures related to the INSPIRE schemas. However, the flexibility of FME allowed us to overcome these challenges and meet the original prototype objective: a scalable, INSPIRE compliant data publishing and distribution system for protected areas data from multiple jurisdictions.
Value Added
Besides maximizing adaptability to existing systems and data structures, another characteristic of successful prototypes is the extent to which they “give back value” to organizations that choose to adopt them. Data sharing systems such as this INSPIRE prototype can go a long way beyond satisfying the complex requirements for compliance with INSPIRE standards. They can also more broadly promote better data sharing both within an organization and between organizations. Also, once a spatial ETL system such as this is set up, it is relatively easy to integrate new data sources not necessarily required by INSPIRE, but ones that add significant value to working with the INSPIRE data. For example, background imagery, climate model data (NetCDF) and real time sensor data can all be easily integrated with the INSPIRE data to provide a richer context to end users of the data. This can help support data quality review as well as broader analysis capabilities. It also makes it much easier to exchange data with other organizations whether or not they support the INSPIRE standards. An example of this is the data download service, which comes with FME Server and allows users to download data in any format and any coordinate system they desire, e.g. an ED50 Shape file. Finally, the server based approach builds in the performance needed to scale as demand increases, and allows the implementation of security where needed to manage data access.
Future Development
In the future we would like to add more data sources from other agencies and jurisdictions. We would also like to add support to other INSPIRE themes. An OGC CSW Metadata prototype was developed in parallel to this project, which could be integrated with this prototype in order to implement INSPIRE metadata for protected areas.
In terms of data quality checks, there is a lot more we could do. As a start, the requirements of schema validation actually do catch a lot of data problems that might otherwise be missed, since certain relationships and data integrity checks are part of the validation process (unique ids etc). However, we could also add other quality checks on both the geometries (e.g. min area) and attributes (standardizing place names etc).
Enhancements could be made to our viewer to allow for more user exploration of the XML attributes. Currently FME Viewer does not display XML formatting characters. FME Data Inspector does display these, but has limits on how many characters can be displayed per field. Of course, Workbench can be used to parse any WFS request using XQueries, so FME can consume and extract any desired information as well as serve it. What is needed is to explore end user requirements and how spatial transformation tools can help digest INSPIRE services once they are implemented.
Finally, performance is an area that will take more work. WFS is by design limited in how much data can be transmitted per query. While the system as a whole can handle many transactions, we encountered problems when requests were made for more than 15 or 20 MB of data, or data for an area larger than 1 degree X 1.5 degrees. Mitigation approaches could be explored such as caching, tiling, and generalization depending on extent range. WMS could also be used to display data over large areas with WFS available as users zoom to a higher level of detail.
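One of the mitigations mentioned, tiling, could look roughly like the sketch below: a large request extent is split into tiles no bigger than the size at which problems were observed, so that each tile can be requested separately. The function name is hypothetical, and reading "1 degree X 1.5 degrees" as width by height is an assumption.

```python
import math


def tile_bbox(min_x, min_y, max_x, max_y, max_width=1.0, max_height=1.5):
    """Split an extent into equal tiles no larger than max_width x max_height."""
    cols = max(1, math.ceil((max_x - min_x) / max_width))
    rows = max(1, math.ceil((max_y - min_y) / max_height))
    dx = (max_x - min_x) / cols
    dy = (max_y - min_y) / rows

    tiles = []
    for row in range(rows):
        for col in range(cols):
            tiles.append((
                min_x + col * dx,
                min_y + row * dy,
                min_x + (col + 1) * dx,
                min_y + (row + 1) * dy,
            ))
    return tiles


tiles = tile_bbox(10, 55, 13, 58)
```

Tiling by extent does not bound the response size for dense areas, so it would likely need to be combined with the caching or generalization options also mentioned above.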