XQuery Examples
Introduction
With the growing number of XML based data sources and web services responding to the users with XML streams, many of us started to feel the need for a tool that would allow for extracting data from XML in an easy and flexible way. As usual, our team responded very quickly to this demand and this is why FME 2009 was empowered with XQuery- a language designed to query collections of XML data. XQuery for XML is similar to what SQL is for databases.
Here, I am going to give a few examples how XQuery can be used and show its power and convenience. I am not an XQuery expert in any way and my examples do not pretend to be a replacement of the language documentation or manuals, I would just like to share my positive impressions about this new, powerful combination of FME and XQuery. I will give examples of element and attribute extraction, FLWOR (standing for FOR-LET-WHERE-ORDER BY-RETURN - analogous to SELECT-FROM-WHERE in SQL) queries, simple conditions (IF-THEN-ELSE), and join expressions.
Old Way of Querying XML
Before continuing with queries, let's have a look at the old-fashioned way of extracting data from XML (here we talk about an arbitrary XML where we don't have the mapping of XML elements to FME features via xfMAP). In FME 2008 and before, any XML data arriving via web services (for example, as an HTTPFetcher
response) had to be searched with a StringSearcher. The desired element would be placed into a list attribute, which then had to be indexed and renamed. So, for each element, you would need at least three transformers:
Regular expressions in the StringSearcher could be quite complex:
<postal[[:graph:][:space:]]*>([A-Z]+[0-9]+[A-Z]+)</postal>
and also, not very reliable. For example, multiple occurrences of the same element would be quite hard to track. The StringSearcher does not distinguish between different XML parts, so it is not at all a good tool for parsing that kind of data.
New Way
FME 2009 got a rich set of transformers for manipulating XML data: the XMLFeatureMapper, XSLTProcessor, XQueryExploder, XQueryExtractor, and XQueryUpdater.
Usually, I try to use some real world example to illustrate how new functionality is implemented. This time, I am using the Google Geocoding API, and namely, the Reverse Geocoding operation, which, when given a pari of coordinates (a point feature) returns the address closest to that point.
I placed a point on the Safe building using Google Earth, got its coordinates with a CoordinateFetcher, made a URL, and submitted it to the Google Geocoding service with am HTTPFetcher:
http://maps.google.com/maps/geo?ll=49.137,-122.857&output=xml&key=my_google_api_key
You may want to get your own Google API Key here -
http://code.google.com/apis/maps/signup.html.
I specified XML as the output format (the alternatives are CSV, JSON, and KML, which is essentially the same as XML, but with a different MIME type). Here is an extract of the XML sent back:
<?xml version="1.0" encoding="UTF-8" ?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Response>
<name>49.138000,-122.857000</name>
<Status>
<code>200</code>
<request>geocode</request>
</Status>
<Placemark id="p1">
<address>7455 132 St, Surrey, BC, Canada</address>
<AddressDetails Accuracy="8" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0">
<Country>
<CountryNameCode>CA</CountryNameCode>
<CountryName>Canada</CountryName>
<AdministrativeArea>
<AdministrativeAreaName>BC</AdministrativeAreaName>
<Locality>
<LocalityName>Surrey</LocalityName>
<Thoroughfare>
<ThoroughfareName>7455 132 St</ThoroughfareName>
</Thoroughfare>
<PostalCode>
<PostalCodeNumber>V3W</PostalCodeNumber>
</PostalCode>
</Locality>
</AdministrativeArea>
</Country>
</AddressDetails>
<Point>
<coordinates>-122.8566950,49.1378180,0</coordinates>
</Point>
</Placemark>
<Placemark id="p2">
...........
As you may notice, the structure of the <AddressDetails> tag is quite complex. This portion of XML is actually
xAL (Extensible Address Language), a special language designed for addresses. It's pretty normal with in the XML world that whenever you have to learn a new language, you also discover two or three others. During this exercise, I learned about XQuery, which is based on
XPath syntax and, along with the xAL, there is also
xNL (Extensible Name Language). That means four new languages (to me).
Transformer Interface
The parameters dialog of the XQueryExtractor transformer (I mostly use this one) looks quite scary:
However, it is not so bad - seven of the parameters just specify where to find the data and how to query it, and the rest of the parameters have default values.
There is also an XQUERY Expression Window where we can type our queries:
Query results go to the attribute specified in the "
Result Attribute" parameter.
Simple XQueries
Simple XQuery expressions are also valid XPath expressions. And XPath is a language for navigating through the XML document. Loosely speaking, a path to XML parts is similar to the paths we get used to in our operation systems. For example, the path to the <address> tag looks as follows:
/kml/Response/Placemark/address
However, we need one more thing before this query will work with our example. The top node <
kml> contains an attribute
xmlns="http://earth.google.com/kml/2.0". This attribute defines a
namespace, which makes the elements and attributes unique. Since the namespace is present in the top element, we have to declare it and use it to specify all the elements (unless there is another namespace is introduced), so our query should look as follows (
see the attached simpleQueries.fmw):
(: Query #1 :)
declare namespace x="http://earth.google.com/kml/2.0";
/x:kml/x:Response/x:Placemark/x:address
Note that smiley faces (: :) are used for commenting.
A double slash "//" indicates that the query should look down the whole tree, not necessarily at the specified level, so the query can be maa bit simpler:
(: Query #2 :)
declare namespace x="http://earth.google.com/kml/2.0";
//x:address
Both queries above will give us the following result on the original XML:
<address xmlns="http://earth.google.com/kml/2.0">7455 132 St, Surrey, BC, Canada</address>
There is another namespace in this XML defined at the <AddressDetails> level. This namespace specifies the unique names of the xAL. If we make a query that goes into this part of the XML, we should declare two namespaces and use them both, each at the appropriate levels:
(: Query #3 :)
declare namespace x="http://earth.google.com/kml/2.0";
declare namespace y="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0";
/x:kml/x:Response/x:Placemark/y:AddressDetails/y:Country/y:CountryName
or, if we don't use elements above the second namespace:
(: Query #4 :)
declare namespace y="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0";
//y:CountryName
Both queries above will give us the following result:
<CountryName xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0">Canada</CountryName>
Of course, XQuery (and XPath) are much more powerful than simple path expressions they have a rich set of functions that help to extract and transform data.
For example, we often don't need the XML tag, rather, we want to get the contents of the tags. The modified query #1 after extending it with a string() function will look as follows:
(: Query #5 :)
declare namespace x="http://earth.google.com/kml/2.0";
string(/x:kml/x:Response/x:Placemark/x:address)
The result is:
7455 132 St, Surrey, BC, Canada
Here are a couple of other function examples showing how to extract coordinates. This query will take the substring before the comma of the contents within the <coordinates> tags. This returns the X-coordinate of the geocoded point:
(: Query #6 :)
declare namespace x="http://earth.google.com/kml/2.0";
substring-before(string(//x:coordinates), ",")
To get the Y-coordinate, we have to convert the contents into a sequence and remove it's first and last elements:
(: Query #7 :)
declare namespace x="http://earth.google.com/kml/2.0";
remove(remove(tokenize(string(//x:coordinates), ","),3),1)
A great XQuery function reference can be found
here.
FLWOR Expressions
The workspace in the previous section used a small section of the XML from the original data received from Google. What if we run one of the queries shown above on the whole XML file? (See FLWOR.fmw workspace)
(: Query #8 :)
declare namespace x="http://earth.google.com/kml/2.0";
data(//x:address)
The transformer will return the following string (I set the "Return Value" parameter to "Separated Values" and used "|" as a separator; line breaks were added manually for better visibility):
7455 132 St, Surrey, BC, Canada|British Columbia V3W 1J8, Canada|
Surrey, BC, Canada|
Surrey, BC, Canada|
Greater Vancouver Regional District, British Columbia, Canada|
British Columbia, Canada|
Canada|
North America
This happens because the Google API returns multiple points of differeing accuracy levels - from continent and country levels (lowest) to the address or premise levels (highest). Imagine, we need the highest accuracy possible for each point we geocode. To do so, we have to scan the entire XML file and find such <AddressDetails> element where the "
Accuracy" attribute is bigger than that of other <
AddressDetails>. This is a situation where we have to use
FLWOR expressions. Again, FLWOR stands for five clauses - FOR, LET, WHERE, ORDER BY, and RETURN.
Here is an example:
(: Query #9 :)
declare namespace x="http://earth.google.com/kml/2.0";
declare namespace y="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0";
for $n in //x:Placemark
let $maxAcc:= max(//y:AddressDetails/number(@Accuracy))
where $n//y:AddressDetails/number(@Accuracy) = $maxAcc
return data($n/x:address)
FOR scans through all <Placemark> nodes. LET assigns the highest accuracy to the $maxAcc variable. WHERE takes only the node where the accuracy of the current <Placemark> (in <AddressDetails> node) is equal to $maxAcc. RETURN sends back the result:
7455 132 St, Surrey, BC, Canada
If we need a certain level of accuracy, we can explicitly indicate it like this:
(: Query #10 :)
declare namespace x="http://earth.google.com/kml/2.0";
declare namespace y="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0";
for $n in //x:Placemark
where $n//y:AddressDetails/number(@Accuracy) = 3
return data($n/x:address)
And the result is:
Greater Vancouver Regional District, British Columbia, Canada
Join Expressions
Join is one of the most powerful tools in SQL. XQuery also allows joining data. For example, we would like to give the user a better explanation of what the numbers in the Accuracy attribute mean. We need an XML file that will contain both accuracy numbers and descriptions.
I added such a lookup table directly in the XQuery window, although I could use a Concatenator transformer to create an attribute containing two XML parts. Here is the query:
(: Query #11 :)
declare namespace x="http://earth.google.com/kml/2.0";
<dml>
{
//x:Placemark
}
<AccuracyLookup>
<AddressAccuracy Accuracy="0">Unknown location</AddressAccuracy>
<AddressAccuracy Accuracy="1">Country level</AddressAccuracy>
<AddressAccuracy Accuracy="2">Region (state, province, prefecture, etc.) level</AddressAccuracy>
<AddressAccuracy Accuracy="3">Sub-region (county, municipality, etc.) level</AddressAccuracy>
<AddressAccuracy Accuracy="4">Town (city, village) level</AddressAccuracy>
<AddressAccuracy Accuracy="5">Post code (zip code) level</AddressAccuracy>
<AddressAccuracy Accuracy="6">Street level</AddressAccuracy>
<AddressAccuracy Accuracy="7">Intersection level</AddressAccuracy>
<AddressAccuracy Accuracy="8">Address level</AddressAccuracy>
<AddressAccuracy Accuracy="9">Premise (building name, property name, shopping center, etc.) level</AddressAccuracy>
</AccuracyLookup>
</dml>
Note that I added the outer <dml> tags, this way we get a single XML document. Here <dml> means Dmitri's Markup Language :-)
Now we can join the two XML parts - <Placemarks> and <AccuracyLookup>. Again, this should be a FLWOR expression. The FOR clause says that we are scanning <AddressDetails> and AddressAccuracy elements, and in WHERE we are looking for the matching attribute Accuracy. As a RESULT, we ask for a concatenation of the address and the accuracy descriptions:
(: Query #12 :)
declare namespace x="http://earth.google.com/kml/2.0";
declare namespace y="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0";
for $addr in //x:Placemark,
$accr in //AddressAccuracy
where $addr//@Accuracy = $accr/@Accuracy
return concat (data($addr/x:address), " (Accuracy: ", data($accr), ")")
Here is the result (line breaks were added manually, asterisk was used as a separator):
7455 132 St, Surrey, BC, Canada (Accuracy: Address level)*
British Columbia V3W 1J8, Canada (Accuracy: Post code (zip code) level)*
British Columbia V3W 4M7, Canada (Accuracy: Post code (zip code) level)*
Surrey, BC, Canada (Accuracy: Post code (zip code) level)*
Surrey, BC, Canada (Accuracy: Town (city, village) level)*
Greater Vancouver Regional District, British Columbia, Canada (Accuracy: Sub-region (county, municipality, etc.) level)*
British Columbia, Canada (Accuracy: Region (state, province, prefecture etc) level)*
Canada (Accuracy: Country level)*
North America (Accuracy: Unknown location)
Conditions
XQuery also supports traditional IF-THEN-ELSE expressions. Let's have another look at the source XML. Some Placemark nodes with lower accuracies don't have PostalCode elements:
...
<Placemark id="p7">
<address>British Columbia, Canada</address>
<AddressDetails Accuracy="2" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0">
<Country>
<CountryNameCode>CA</CountryNameCode>
<CountryName>Canada</CountryName>
<AdministrativeArea>
<AdministrativeAreaName>BC</AdministrativeAreaName>
</AdministrativeArea>
</Country>
</AddressDetails>
...
We may want to find such placemarks and give a notification to our user (see If.fmw workspace. I am going to be simple and use "n/a" for those nodes that don't have postal codes.
(: Query #13 :)
declare namespace y="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0";
for $pc in //y:AddressDetails
return
if (not($pc//y:PostalCodeNumber))
then ("n/a")
else (data($pc//y:PostalCodeNumber))
Again, here we have a FLWOR expression where the RETURN clause contains a condition specifying that if there is no PostalCodeNumber element, return "n/a", otherwise use the value of this element. This gives us the following result (asterisk is used as a separator character):
V3W*V3W 1J8*V3W 4M7*V3W*n/a*n/a*n/a*n/a*n/a
Demo workspaces
ReverseGeocoding
Most of the expressions shown above are used in the Reverse Geocoding workspace. This workspace takes three points stored in FFS format, constructs a URL, gets an XML from Google, extracts several attributes including address, postal code and accuracy, and places geocoded points.
There is one step not explained yet. I used an XQueryExploder to get separate features for each Placemark node with the highest accuracy. One of the source points gives two placemarks. As a result, three source points produced four geocoded points. The query used in the Exploder is very simple:
declare namespace x="http://earth.google.com/kml/2.0";
//x:Placemark
This means that the transformer will make a separate feature from every Placemark element (and after the previous step, we've got only four elements).
NOTE: The address Google gave us (7455 132 St, Surrey) does not exactly the building where Safe Software is located. Our real building number 7445. Initially I thought that this was bad, but then I decided to leave it as is - just as a warning that Reverse Geocoding may be not very reliable - do not use it for your mailing needs.
I hope this page will help you to get an idea of how great the combined power of FME and XQuery might be and that you will find it useful in your data transformations.
HTML Creation and User Defined Functions
This demo shows how to extract all necessary data from an XML file and put it into nicely formatted HTML with just one XQuery transformer.
Here is our source XML, taken from
the
Earth Tools Webservices example:
<?xml version="1.0" encoding="UTF-8" ?>
<sun xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.earthtools.org/sun.xsd">
<version>1.0</version>
<location>
<latitude>49</latitude>
<longitude>-123</longitude>
</location>
<date>
<day>1</day>
<month>1</month>
<timezone>-8</timezone>
<dst>1</dst>
</date>
<morning>
<sunrise>09:08:20</sunrise>
<twilight>
<civil>08:30:58</civil>
<nautical>07:50:26</nautical>
<astronomical>07:11:50</astronomical>
</twilight>
</morning>
<evening>
<sunset>17:23:31</sunset>
<twilight>
<civil>18:00:53</civil>
<nautical>18:41:25</nautical>
<astronomical>19:20:02</astronomical>
</twilight>
</evening>
</sun>
We can write any tags directly with queries. In this case, the XQuery expressions should be placed into curly braces {}:
<table border="1">
<tr>
<th>Sunrise</th><th>Sunset</th>
</tr>
<tr>
<td>{data(//sunrise)}</td><td>{data(//sunset)}</td>
</tr>
This code will generate a valid portion of HTML, which can be viewed with any browser. Here I place the resulting code without any modification, and it is rendered according to all HTML rules, and instead of XQuery expressions, we get the actual values:
Sunrise Sunset
09:08:20 17:23:31
Now, because we have three different types of twilight, we may end up with a lot of similar portions of code:
<tr>
<th>Civil</th>
<td>08:30:58</td>
<td>18:00:53</td>
</tr>
The structure is always the same, the values are different. Can we define functions in XQuery? Yes, we can:
declare function local:twilights($twl)
{
<tr>
<th>{concat(upper-case(substring(name($twl[1]), 1, 1)),substring(name($twl[1]), 2))}</th>
<td>{data($twl[1])}</td>
<td>{data($twl[2])}</td>
</tr>
};
and the function calls look as follows:
{local:twilights(//civil)}
{local:twilights(//nautical)}
{local:twilights(//astronomical)}
Note that we have two elements called civil, nautical, and astronomical - for mornings and evenings. It means that what we send to the function is a sequence <civil>08:30:58</civil>,<civil>18:00:53</civil>. This is why we have to address them with square brackets [ ] - this way we extract the atomic values.
The first expression looks complex, but it simply changes the case of the first character in the word.
The final query is a bit long, but really, it does a lot (use xquery2html_kml.fmwt template):
(: Query #14 :)
declare function local:twilights($twl)
{
<tr>
<th>{concat(upper-case(substring(name($twl[1]), 1, 1)),substring(name($twl[1]), 2))}</th>
<td>{data($twl[1])}</td>
<td>{data($twl[2])}</td>
</tr>
};
<body>
<h3>
{let $m:= data(//month)
return
if ($m = "1") then "January"
else if ($m = "2") then "February"
else if ($m = "3") then "March"
else if ($m = "4") then "April"
else if ($m = "5") then "May"
else if ($m = "6") then "June"
else if ($m = "7") then "July"
else if ($m = "8") then "August"
else if ($m = "9") then "September"
else if ($m = "10") then "October"
else if ($m = "11") then "November"
else "December"},
{data(//day)}
</h3>
<h4>at
{let $lat:= number(//latitude)
return
if ($lat < 0)
then concat(abs($lat), " deg S")
else concat($lat, " deg N")},
{let $lng:= number(//longitude)
return
if ($lng < 0)
then concat(abs($lng), " deg W")
else concat($lng, " deg E")}
</h4>
<h5>timezone:
{data(//timezone)}
(
{let $dst:= number(//dst)
return
if ($dst = 1) then "dst" else ""}
)
</h5>
<table>
<tr>
<th></th><th>Sunrise</th><th>Sunset</th>
</tr>
<tr>
<td></td><td>{data(//sunrise)}</td><td>{data(//sunset)}</td>
</tr>
<tr>
<th></th><th colspan="2" align="center">Twilights</th>
</tr>
<tr>
<th></th><th>Morning</th><th>Evening</th>
</tr>
{local:twilights(//civil)}
{local:twilights(//nautical)}
{local:twilights(//astronomical)}
</table>
</body>
I called my resulting attribute
kml_description_raw_text. Now if I send my output
to KML, and click on the point, I get the following placemark balloon with our HTML:
