XML Introduction

An XML (short for Extensible Markup Language) document consists of:

  1. the prolog (optional)
  2. the document type definition (DTD, optional)
  3. the root element (which contains further nested elements and thus forms a tree structure)

Comments and processing instructions may appear outside of tags, including before and after the root element.
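
The document parts listed above can be made visible with a DOM parser. The following is a minimal sketch using Python's standard-library minidom; the sample document, the "render" processing instruction, and the function name are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample: declaration, comment, processing instruction,
# internal DTD, root element, and a trailing comment.
SAMPLE = """<?xml version="1.0"?>
<!-- leading comment -->
<?render mode="compact"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>
<note>Hello</note>
<!-- trailing comment -->"""

def top_level_nodes(xml_text):
    """Return the class names of the nodes directly below the document node."""
    doc = minidom.parseString(xml_text)
    return [type(node).__name__ for node in doc.childNodes]

node_types = top_level_nodes(SAMPLE)
print(node_types)
```

The XML declaration itself does not show up as a node; the comments, the processing instruction, and the DTD sit next to the single root element.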


The basic prolog looks like this: <?xml version="1.0" ?>
An extended version: <?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>

Attributes explained:

  • version: XML version
  • encoding: Character set, defaults to UTF-8
  • standalone: states whether the document depends on external markup declarations such as an external DTD ("no") or is self-contained ("yes")
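
To illustrate the encoding attribute, here is a small sketch using Python's xml.etree.ElementTree. The document content is hypothetical; the point is that the parser reads the declared encoding from the prolog when it is handed raw bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical document stored as ISO-8859-1 bytes:
# the letter "ü" is the single byte 0xFC in this encoding.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
    "<name>Jürgen</name>"
).encode("iso-8859-1")

# The parser decodes the bytes according to the encoding in the prolog.
root = ET.fromstring(raw)
decoded_name = root.text
print(decoded_name)
```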

Document Type Definition

The DTD defines the structural rules against which documents are validated. It declares the allowed elements together with their content and attributes, analogous to a database schema.

Reasons to use DTD:

  • validation purposes
  • gathering information about the document
  • enforcing the same structure across multiple documents
  • comparability of documents
  • automated processing of specific document types

The DTD construct looks like this:

<!DOCTYPE root-element [
    <!-- element, attribute, and entity declarations -->
]>

The above DTD can be directly placed in the respective XML document. To reference an external DTD file:

<!DOCTYPE root-element SYSTEM "path-to-dtd.dtd">

An example of an external DTD file (note that it does not contain the doctype declaration itself):

<!ELEMENT root-element (test-element*)>
<!ELEMENT test-element (#PCDATA)>
<!ATTLIST test-element id ID #REQUIRED>
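
A parser exposes the doctype declaration even when it does not validate. The sketch below uses Python's standard-library minidom, which is a non-validating parser and does not load the referenced file; the document body is a made-up instance of the DTD above.

```python
from xml.dom import minidom

# Hypothetical instance document referencing the external DTD.
DOC = """<?xml version="1.0"?>
<!DOCTYPE root-element SYSTEM "path-to-dtd.dtd">
<root-element>
  <test-element id="a1">first</test-element>
</root-element>"""

doc = minidom.parseString(DOC)

# The doctype node carries the declared root name and the DTD location.
doctype_name = doc.doctype.name
doctype_system_id = doc.doctype.systemId
print(doctype_name, doctype_system_id)
```

Checking the document against the declarations requires a validating parser, which is outside the scope of this sketch.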


Elements

The general syntax for element declarations is <!ELEMENT name category> or <!ELEMENT name (content)>. The following category keywords are available:

Syntax                   Meaning
<!ELEMENT name EMPTY>    An empty element
<!ELEMENT name ANY>      Element with arbitrary content

Content is furthermore specified through these keywords:

Syntax                               Meaning
<!ELEMENT name (#PCDATA)>            Element containing parsed character data
<!ELEMENT name (child1, child2)>     Element containing child1 followed by child2 (strict order)
<!ELEMENT name (child1 | child2)>    Element containing either child1 or child2

Note: there is no (#CDATA) content keyword. CDATA exists only as an attribute type and as the <![CDATA[...]]> section syntax, so unparsed character data cannot be declared as element content.

Occurrences of these children can also be specified:

Syntax                      Meaning
<!ELEMENT name (child)>     Exactly one child
<!ELEMENT name (child?)>    0..1 children
<!ELEMENT name (child*)>    0..N children
<!ELEMENT name (child+)>    1..N children
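
Content models behave much like regular expressions over the sequence of child element names. The following sketch makes that analogy concrete in Python. It is not a DTD validator: the regex translations are written by hand, and the function name is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex counterparts of the content models above.
# Every child name is followed by one space so names cannot run together.
MODELS = {
    "(child)": r"child ",
    "(child?)": r"(child )?",
    "(child*)": r"(child )*",
    "(child+)": r"(child )+",
    "(child1, child2)": r"child1 child2 ",
    "(child1 | child2)": r"(child1 |child2 )",
}

def children_match(model, xml_text):
    """Check the child element names of the root element against a model."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return re.fullmatch(MODELS[model], names) is not None

print(children_match("(child*)", "<name/>"))  # zero children are allowed
print(children_match("(child+)", "<name/>"))  # at least one child is required
```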


Attributes

A single ATTLIST can declare any number of attributes for an element; this is equivalent to writing several ATTLIST declarations for the same element.

<!ATTLIST element-name attribute-name attribute-type attribute-value>

<!ATTLIST human
            id ID #REQUIRED
            salary (Dollar | Euro) "Dollar">

As the example shows, an attribute can be restricted to an explicit set of values instead of a named type. The following table lists some of the attribute types:

Syntax                 Meaning
CDATA                  Character data
(val1 | val2 | ...)    Explicit value range (enumeration)
ID                     A unique ID
IDREF                  Reference to the ID of another element
NMTOKEN                A valid XML name token
ENTITY                 Name of an unparsed entity
ENTITIES               Set of unparsed entity names

Note: the values of set types (such as ENTITIES) are separated by whitespace.

Now a list of valid attribute-values:

Syntax            Meaning
"value"           Default value, used when the attribute is omitted
#REQUIRED         Attribute is required
#IMPLIED          Attribute is optional
#FIXED "value"    Fixed value that cannot be changed
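
Default values take effect even without validation when the declarations are part of the internal DTD subset. The sketch below assumes Python's expat-based xml.etree.ElementTree, which does not validate but does fill in declared defaults; the human element and its values are illustrative.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset with a required ID and an enumerated attribute
# that defaults to "Dollar". The instance omits the salary attribute.
DOC = """<!DOCTYPE human [
  <!ELEMENT human EMPTY>
  <!ATTLIST human
            id ID #REQUIRED
            salary (Dollar | Euro) "Dollar">
]>
<human id="h1"/>"""

human = ET.fromstring(DOC)

# The parser supplies the declared default for the missing attribute.
attributes = dict(human.attrib)
print(attributes)
```

Because the parser is non-validating, it would not report a missing #REQUIRED attribute or a value outside the enumeration.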

XML Schema

XML Schema is an alternative to DTDs. Schemas are themselves written in XML, support more data types, and offer more precise referencing than the ID/IDREF mechanism.

Note: XML schemas make extensive use of the namespace mechanism (see the namespace section below).

XML Schema Structure

The schema element is the root element of every XML Schema:

<?xml version="1.0" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <!-- element and type declarations -->
</xsd:schema>

Simple Element

<xsd:element name="test-element" type="xsd:string" default="Default Value"/>
<xsd:element name="test-element2" type="xsd:string" fixed="Fixed Value"/>

DTDs have no direct equivalent, because they can declare default and fixed values only for attributes, not for element content. The closest DTD declarations are:

<!ELEMENT test-element (#PCDATA)>
<!ELEMENT test-element2 (#PCDATA)>

Types: xsd:string, xsd:decimal, xsd:integer, xsd:boolean, xsd:date, xsd:time.

Note: simple elements cannot contain attributes.

Complex Element

<xsd:element name="employee" type="person-info"/> <!-- Reference to a complex type (similar to nesting in DTD) -->

<xsd:complexType name="person-info">
    <xsd:sequence><!-- firstname, then lastname (similar to DTD: comma separation) -->
        <xsd:element name="firstname" type="xsd:string"/>
        <xsd:element name="lastname" type="xsd:string"/>
    </xsd:sequence>
</xsd:complexType>
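
Because a schema is itself an XML document, it can be read with any XML parser. The following sketch loads a complete version of the complex type above with Python's xml.etree.ElementTree and lists the declared child elements. It only inspects the schema; it does not validate anything against it.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema vocabulary, in ElementTree's {uri}name notation.
XSD = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="employee" type="person-info"/>
  <xsd:complexType name="person-info">
    <xsd:sequence>
      <xsd:element name="firstname" type="xsd:string"/>
      <xsd:element name="lastname" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""

schema = ET.fromstring(SCHEMA)

# Locate the complex type by name and collect its element declarations in order.
person_info = schema.find(f"{XSD}complexType[@name='person-info']")
fields = [(e.get("name"), e.get("type")) for e in person_info.iter(f"{XSD}element")]
print(fields)
```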

Referencing an XML Schema

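An XML document references its schema through attributes on the root element. If the schema has a target namespace:

<?xml version="1.0" ?>
<root-element xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="namespace_path_for_schema schema.xsd">
    <!-- document content -->
</root-element>

If the schema has no target namespace:

<?xml version="1.0" ?>
<root-element xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="schema.xsd">
    <!-- document content -->
</root-element>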


Entities

An entity is a named unit of data within an XML document. Entity references are resolved before validation takes place.

<!ENTITY entityname "value">

Entities fall into two categories:

Parsed entity (XML-fragment):
- Internal: defined within a DTD
- External: defined in another document

Unparsed entity (miscellaneous data):
- Value of an attribute with type ENTITY or ENTITIES
- Reference to an external file

Predefined Entities

Entity    Character
&lt;      <
&gt;      >
&amp;     &
&apos;    '
&quot;    "

Using an already defined entity: &entityname;
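
As a short illustration, the sketch below declares an internal entity and mixes it with predefined entities. It uses Python's xml.etree.ElementTree; the entity name "author" and its value are made up.

```python
import xml.etree.ElementTree as ET

# "author" is a hypothetical internal entity declared in the DTD subset.
DOC = """<!DOCTYPE note [
  <!ENTITY author "Jane Doe">
]>
<note>&author; wrote: 1 &lt; 2 &amp; 3 &gt; 2</note>"""

# The parser replaces every entity reference before the text is handed out.
note_text = ET.fromstring(DOC).text
print(note_text)
```

Parsers expand internal entities automatically, which is why entity declarations from untrusted sources should be treated with care.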

References In XML

References to entities (as we already know): &entityname;.
References to elements: an element with an ID attribute can be referenced through IDREF(S).

Note: references via IDREF(S) only work within a single document.


Namespaces

Motivation: combining different XML documents leads to conflicts when they contain elements with the same name. Namespaces avoid this by binding names to a URI. General syntax: xmlns:PREFIX="URI".

<store xmlns:s="http://gimu.org/s">
    <s:example s:id="1">Example</s:example>
</store>

Note: an unprefixed attribute does not inherit the namespace of its parent element (this also applies to the default namespace).
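
The effect of prefixes can be inspected with Python's xml.etree.ElementTree, which expands every qualified name to the form {URI}local-name. In this sketch the unprefixed lang attribute is an addition for illustration; it shows that attributes without a prefix stay outside the namespace.

```python
import xml.etree.ElementTree as ET

DOC = """<store xmlns:s="http://gimu.org/s">
    <s:example s:id="1" lang="en">Example</s:example>
</store>"""

root = ET.fromstring(DOC)

# The prefix used in the query is local to this mapping; only the URI matters.
example = root.find("s:example", {"s": "http://gimu.org/s"})

print(root.tag)        # unprefixed element, no namespace
print(example.tag)     # prefixed element, expanded with the URI
print(example.attrib)  # prefixed attribute is qualified, lang is not
```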

Default Namespace

A default namespace applies to the element on which it is declared and to all of its descendant elements without a prefix.

<html xmlns="http://www.w3.org/1999/xhtml">
    <title>All unprefixed elements belong to the xhtml namespace</title>
</html>
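
A quick check of the default namespace with Python's xml.etree.ElementTree: the sketch adds a hypothetical class attribute to show that elements pick up the default namespace while unprefixed attributes do not.

```python
import xml.etree.ElementTree as ET

DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
    <title class="main">All unprefixed elements belong to the xhtml namespace</title>
</html>"""

root = ET.fromstring(DOC)
title = root[0]

print(root.tag)            # html, qualified by the default namespace
print(title.tag)           # title, qualified as well
print(list(title.attrib))  # the attribute name stays unqualified
```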


XPath

XPath treats XML documents as trees. It distinguishes the following node types:

  • root node
  • element nodes
  • attribute nodes
  • text nodes
  • processing instruction nodes
  • comment nodes
  • namespace nodes

Explicit Path

The XPath query /PersonalFile/Particulars/Firstname selects the Firstname elements that are children of Particulars, which in turn is a child of the root element PersonalFile.


The results below assume a root element AAA whose children are BBB, BBB, CCC, BBB:

Query               Result
/AAA/BBB            <BBB/>, <BBB/>, <BBB/> (all BBB children of AAA)
/AAA/BBB[1]         <BBB/> (first one)
/AAA/BBB[last()]    <BBB/> (last one)
/AAA/*              <BBB/>, <BBB/>, <CCC/>, <BBB/> (any child of AAA)
//BBB               <BBB/>, <BBB/>, <BBB/> (all BBB elements, regardless of their position in the tree)
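
These queries can be tried out with Python's xml.etree.ElementTree, which supports a limited XPath subset. Paths are evaluated relative to an element, so /AAA/BBB becomes BBB on the root element and //BBB becomes .//BBB. The sample document is an assumption derived from the results in the table.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: root AAA with the children BBB, BBB, CCC, BBB.
root = ET.fromstring("<AAA><BBB/><BBB/><CCC/><BBB/></AAA>")

def tags(elements):
    """Return the tag names of the selected elements."""
    return [element.tag for element in elements]

all_bbb = tags(root.findall("BBB"))        # /AAA/BBB
first_bbb = root.find("BBB[1]")            # /AAA/BBB[1]
last_bbb = root.find("BBB[last()]")        # /AAA/BBB[last()]
any_child = tags(root.findall("*"))        # /AAA/*
bbb_anywhere = tags(root.findall(".//BBB"))  # //BBB

print(all_bbb, any_child, bbb_anywhere)
```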

Processing Instructions

Processing instructions pass information to the application that processes the document: <?name data?>
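
A processing instruction consists of a target (the name) and free-form data. The sketch below reads both with Python's standard-library minidom; the xml-stylesheet instruction and the file name style.xsl are illustrative.

```python
from xml.dom import minidom

# Hypothetical document with one processing instruction before the root.
DOC = '<?xml-stylesheet type="text/xsl" href="style.xsl"?><root/>'

doc = minidom.parseString(DOC)
instruction = doc.firstChild

# The target names the addressed application; the data is passed on verbatim.
pi_target = instruction.target
pi_data = instruction.data
print(pi_target, pi_data)
```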

Character Data

Arbitrary character data, including markup such as HTML, can be embedded in XML without being parsed.
This is done by enclosing the data in <![CDATA[...]]>:

<![CDATA[
    <b>Bold text</b><br/>
]]>
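
The effect of a CDATA section can be checked with Python's xml.etree.ElementTree. The wrapping content element is a made-up name; the markup inside the section arrives as plain text instead of child elements.

```python
import xml.etree.ElementTree as ET

# "content" is a hypothetical wrapper element for the CDATA section.
DOC = "<content><![CDATA[<b>Bold text</b><br/>]]></content>"

content = ET.fromstring(DOC)

# The markup is kept as literal text; no child elements are created.
cdata_text = content.text
child_count = len(content)
print(cdata_text, child_count)
```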