When we transfer data, we have to decide how that data will be formatted. A common, simple file format lets one system write the data and another read it easily. But we didn’t always have a simple format that could be used for this purpose.
We could use comma-delimited files, database exports, or something similar, listing the field names in the first row, but those weren’t necessarily easy to read, especially with complex data like you would find in an object-oriented environment, where an object may have other objects as its children.
In older days, we would share the format of the file, especially if it was made up of complicated data formats, but that limits who can read it. The person who was going to read it would have to know what the fields were and in what order. The developer would have to work with that format, and any changes to the format could cause the data transfer to fail.
A solution to that was the XML file format. It is a self-descriptive format, which means that you can read the file and see the fields that make up the data. As the file format changes, you can still read the file.
You can simply ignore data fields that you don’t need, but use the new fields as you expand your project.
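As a rough sketch of that idea, the Python snippet below reads a book record by field name using the standard library's xml.etree.ElementTree module. The record, the read_book helper, and the publisher field are all made up for illustration; publisher stands in for a field that was added to the format after the reader was written.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: "publisher" plays the role of a newer field
# that this reader does not know about and does not need.
DATA = """
<book>
  <title>This is my Book</title>
  <publisher>Example Press</publisher>
  <author>John Smith</author>
</book>
"""

def read_book(xml_text):
    """Return only the fields this program cares about, looked up by name."""
    book = ET.fromstring(xml_text)
    return {
        "title": book.findtext("title"),
        "author": book.findtext("author"),
    }

record = read_book(DATA)
print(record)
```

Because the lookup is by name, the extra field is skipped without any special handling, and the reader can start using it later by adding one more lookup.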
While it seemed like a perfect solution, the end developer still often needs to know how the data is formatted and how it works. That way they can convert the data into their internal format and be able to use it. XML can be handled generically by a system whose only job is to convert data, but outside of that, you will need to understand the data.
XML stands for eXtensible Markup Language, and it is a markup language much like HTML. The thing to remember is that it is for storing data. It doesn’t, and can’t, do anything on its own, such as running code or converting information. The data just sits there.
However, there are some differences between HTML and XML. HTML is designed to display data, whereas XML is designed to transport data between systems or to store it. HTML is made up of predefined tags, and those are all that are supposed to be used; XML has no predefined tags, so you define your own.
The first line in an XML file is optional, but should be the following:
<?xml version="1.0" encoding="UTF-8"?>
If it does exist, it must be the very first line, with nothing before it. If it doesn’t, you should use UTF-8 for your file encoding. Note, however, that some parsers expect to see this line and will not read the file if it is missing. Yes, it is optional and they shouldn’t do that, but it does happen.
This is like the doctype tag that starts an HTML file. It lets the program which is reading it, or writing it, know what the format is, before trying to read and parse it.
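To make the declaration rules concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. It assumes Python 3.8 or later, where tostring() accepts the xml_declaration argument; the books/book data is only a placeholder. Other parsers may be stricter or more lenient than this one.

```python
import xml.etree.ElementTree as ET

root = ET.Element("books")
ET.SubElement(root, "book", title="This is a Book")

# xml_declaration=True asks the serializer to emit the declaration line.
with_decl = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(with_decl.decode("utf-8"))

# This parser accepts the same document with or without the declaration.
without_decl = b'<books><book title="This is a Book" /></books>'
same_root = ET.fromstring(with_decl).tag == ET.fromstring(without_decl).tag
print(same_root)

# A declaration that is not at the very start of the data is rejected.
try:
    ET.fromstring(b"\n" + with_decl)
    misplaced_ok = True
except ET.ParseError:
    misplaced_ok = False
print(misplaced_ok)
```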
So what does the data look like? One can look at it as a tree structure with a root node and many child nodes, with some of the children having their own children. Exactly one root node is required, and it gives the parser a place to start.
Each node is made up of a tag and, optionally, attributes. The syntax rules are fairly strict, and a file must follow them to count as a proper (well-formed) XML file that different systems can read.
Generally, a file has an outer tag, which contains all other tags. For example, a books tag might contain a series of book tags. This general format has a plural tag name, which contains one or more singular tags. This is by no means a requirement, just a practice you might see.
A tag can hold information in three ways: through child tags, through its own text content, or through attributes.
Every tag must have a closing tag. If a tag has no content, the closing slash can instead be placed at the end of the opening tag, as in <book />.
Attributes are written as name/value pairs. The name is written bare, but the value must be enclosed in quotes.
<books>
  <book>
    <title>This is my Book</title>
    <author>John Smith</author>
    <edition>1</edition>
  </book>
</books>

<books>
  <book title="This is a Book" author="John Smith" edition="1"></book>
</books>

<books>
  <book title="This is a Book" author="John Smith" edition="1" />
</books>
All three of these examples are practically the same. They store the same kind of data; however, they are read slightly differently: in the first the values are child tags, and in the other two they are attributes.
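The difference in how the two styles are read can be sketched in Python with the standard xml.etree.ElementTree module. The helper names from_elements and from_attributes are invented for this example, and the title is made identical in both samples so the results can be compared directly.

```python
import xml.etree.ElementTree as ET

AS_ELEMENTS = """
<books>
  <book>
    <title>This is a Book</title>
    <author>John Smith</author>
    <edition>1</edition>
  </book>
</books>
"""

AS_ATTRIBUTES = """
<books>
  <book title="This is a Book" author="John Smith" edition="1" />
</books>
"""

def from_elements(xml_text):
    book = ET.fromstring(xml_text).find("book")
    # Values stored as child tags are read from each child's text.
    return {child.tag: child.text for child in book}

def from_attributes(xml_text):
    book = ET.fromstring(xml_text).find("book")
    # Values stored as attributes are read from the tag's attribute map.
    return dict(book.attrib)

print(from_elements(AS_ELEMENTS))
print(from_attributes(AS_ATTRIBUTES))
```

Either way the program ends up with the same name/value pairs; only the access path differs. Note that every value arrives as text, so the edition is the string "1" until the program converts it.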
In the examples you can see that child tags and attributes belong to their parent. Tags cannot overlap: a child tag must be closed before its parent is. Luckily this makes it easy to determine what data goes with what entity.
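A parser enforces this nesting rule for you. The sketch below, again using Python's standard xml.etree.ElementTree module, wraps parsing in a hypothetical is_well_formed helper and feeds it one properly nested document and one where the closing tags are crossed.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

nested = "<book><title>This is my Book</title></book>"
crossed = "<book><title>This is my Book</book></title>"  # tags overlap

print(is_well_formed(nested))   # accepted
print(is_well_formed(crossed))  # rejected with a ParseError internally
```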
Because the XML file type is well known, there are a lot of libraries available for programming languages. Some are built into the language itself, while others can be imported from a standard or external library.
Different languages will have different libraries and different rules for how to work with them. Many languages might provide two different libraries or sets of classes. This way the developer can choose how to read the file.
Some libraries read the entire file into memory. While this is fine for small files, it can take a lot more memory than parsing the file piece by piece as you process it, especially if your file is very large.
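Python's standard xml.etree.ElementTree module happens to offer both approaches, so it can serve as an illustration. In this sketch the function names and the tiny in-memory sample are assumptions made for the example; a real program would pass a file path instead of a BytesIO object.

```python
import io
import xml.etree.ElementTree as ET

# Small in-memory stand-in for what would normally be a large file.
SAMPLE = b"""<books>
  <book title="First" />
  <book title="Second" />
  <book title="Third" />
</books>"""

def titles_all_at_once(source):
    # Builds the whole tree in memory before we look at any of it.
    root = ET.parse(source).getroot()
    return [book.get("title") for book in root.iter("book")]

def titles_streaming(source):
    titles = []
    # iterparse hands over each element as its closing tag is read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            titles.append(elem.get("title"))
            elem.clear()  # drop this element's contents once processed
    return titles

print(titles_all_at_once(io.BytesIO(SAMPLE)))
print(titles_streaming(io.BytesIO(SAMPLE)))
```

Both functions return the same list. The streaming version trades a little extra code for the ability to discard each record as soon as it has been handled.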
While each library is different, you will often find methods like getChild(), getNext(), getParent(), getAttribute(), and getValue(). These are all used to access different nodes and attributes.
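The exact names vary by library. As one illustration, Python's standard xml.etree.ElementTree module expresses the same ideas with indexing, find(), get(), and the text attribute. It has no parent accessor, so code that needs the parent has to keep track of it while walking down the tree. The sample data and the describe helper are made up for this sketch.

```python
import xml.etree.ElementTree as ET

DATA = """
<books>
  <book edition="1">
    <title>This is my Book</title>
    <author>John Smith</author>
  </book>
</books>
"""

def describe(node, depth=0):
    """Return one indented line per node, visiting children in order."""
    text = (node.text or "").strip()
    lines = ["  " * depth + f"{node.tag} attrs={node.attrib} text={text!r}"]
    for child in node:  # iterating over a node yields its children
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(DATA)
first_book = root[0]                   # child by position
edition = first_book.get("edition")    # attribute by name
title = first_book.find("title").text  # child by name, then its value

print(edition, title)
print("\n".join(describe(root)))
```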
If you know the layout of the file, you may specify which child node or nodes you are interested in by the index. However, if you don’t know the layout, you may search by name.
Searching by a (numeric) index requires that the format remain constant, which isn’t something that can always be guaranteed. Consider the following example from the file: https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <gesmes:subject>Reference rates</gesmes:subject>
  <gesmes:Sender>
    <gesmes:name>European Central Bank</gesmes:name>
  </gesmes:Sender>
  <Cube>
    <Cube time="2020-04-03">
      <Cube currency="USD" rate="1.0785"/>
      <Cube currency="JPY" rate="117.10"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="27.539"/>
      <Cube currency="DKK" rate="7.4689"/>
      <Cube currency="GBP" rate="0.87850"/>
      <Cube currency="HUF" rate="365.15"/>
    </Cube>
  </Cube>
</gesmes:Envelope>
If we load that into an XML parser we can access the Cube elements to get the values with something like:
xmlDoc.DocumentElement.ChildNodes[2].ChildNodes[0].ChildNodes
But that can be a bit challenging to write and maintain, especially if they add, move, or remove a node for some reason before you get to the last set of nodes.
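Searching by name avoids that dependence on position. The sketch below is one possible approach in Python with the standard xml.etree.ElementTree module, not the only way to do it. It uses a trimmed copy of the sample, and the read_rates helper and the "ecb" prefix are names chosen for this example. Because the Cube tags live in the document's default namespace, the search has to supply that namespace URI.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above (two currencies kept for brevity).
SAMPLE = """
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <gesmes:subject>Reference rates</gesmes:subject>
  <Cube>
    <Cube time="2020-04-03">
      <Cube currency="USD" rate="1.0785"/>
      <Cube currency="JPY" rate="117.10"/>
    </Cube>
  </Cube>
</gesmes:Envelope>
"""

# The prefix "ecb" is our own choice; only the namespace URI must match.
NS = {"ecb": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref"}

def read_rates(xml_text):
    root = ET.fromstring(xml_text)
    rates = {}
    # Find every Cube, at any depth, that carries a currency attribute.
    for cube in root.iterfind(".//ecb:Cube[@currency]", NS):
        rates[cube.get("currency")] = cube.get("rate")
    return rates

rates = read_rates(SAMPLE)
print(rates)
```

This lookup keeps working if a new node is added before the Cube section, because it never counts positions. The rates are left as strings here; converting them to numbers is up to the caller.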
Introduction to XML was originally found on Access 2 Learn