XML Compare

This open source project is for an Oracle PL/SQL based utility used to compare two XML documents and provide an output identifying the differences.

Overview

XML can be stored in a native data type within the Oracle database known as XMLTYPE. It is not possible to perform a direct comparison between items declared with this data type which has facilitated the need for this XML compare utility.

Its has a range of options to control the sensitivity of the comparison, which are on par with many of the leading XML diff tools, where it is able to ignore such things as white space, comments, processing instructions, order of elements and attributes, etc.

The utility has two modes of operation. It can simply return a Boolean result of the comparison; or it can identify all of the XML nodes that differ between the two documents being compared.

This is currently the only PL/SQL based XML diff tool available. It not intended to be a standalone utility, but to compliment Oracle’s already extensive list of supplied PL/SQL packages, and be integrated into other PL/SQL based applications.

This PL/SQL package compares two Oracle XMLTYPE data types and identifies differences in content. It does not perform a file straight comparison, nor influenced by whitespace such as tab indentation and spaces between tags, but uses XMLDOM to navigate the document and make a node by node comparison, and return back all differences.

To download a free version of the  dbi_xml_compare package click here.

Options

The sensitivity and behaviour of the comparison can be controlled by a number of options:

  • Ignore Comments: Comment nodes can be excluded from the comparison.

  • Ignore Processing Instructions: Processing instruction nodes can be excluded from the comparison.

  • Ignore Element Order: The order of child elements can be ignored, otherwise elements are compared by identical position. There are limitations in ignoring order in that there cannot be any ambiguity between what are equivalent elements between documents. If ambiguity is found, then the comparison will fail with a recommendation that the comparison is performed with element ordering preserved.

  • Ignore Attribute Order: The order of attributes can be ignored. As attribute names must be unique within an element, there is no issue with ambiguity of attributes.

  • Ignore Namespace: Elements and attributes qualified with a namespace can have the namespace excluded from the comparison. So if two identical documents, one with namespace qualifiers and definition, and one without, will show no difference.

  • Ignore Whitespace. This is whitespace in the value of a text node or an attribute node, and not in the overall textual form of the XML document, i.e. whitespace between tags. Whitespace are carriage returns, line fees, tabs and space characters. The whitespace can ignored in one of two ways: either all adjacent whitespace is normalised into a single space character; or all whitespace characters are removed for comparison purposes.

Comparison Modes

The utility provides two modes of operation depending on the aim of the comparison. If the comparison needs to know whether the XML documents are different, or not, then the Boolean mode of operation is available. Alternatively, if all differences need to be identified, then the difference list mode of operation will return an array containing information on each node that does not compare successfully.

Difference List: This function returns an associative array of nodes and their values that do not successfully compare, and the reason for not comparing.

The offending nodes are identified using an XPATH like notation, based on the node names and child element order sequence. It is not correctly formed XPATH, but merely a descriptive means of identifying the node.

The possible reasons why a node does not compare are:

  • Values do not match

  • No matching node between documents

  • Ambiguous element (applicable only if element ordering is being ignored)

The function in the package that supports this mode is the get_xml_differences function:

FUNCTION get_xml_differences
(
    p_xml_1 IN XMLTYPE,
    p_xml_2 IN XMLTYPE,
    p_ignore_comments IN PLS_INTEGER := gc_true,
    p_ignore_element_order IN PLS_INTEGER := gc_true,
    p_ignore_attribute_order IN PLS_INTEGER := gc_true,
    p_ignore_proc_instr IN PLS_INTEGER := gc_true,
    p_ignore_namespace IN PLS_INTEGER := gc_true,
    p_ignore_whitespace IN PLS_INTEGER := gc_whitespace_preserve
)
RETURN t_nodes;

The arguments to this package are documented as follows:
 

Argument Settings
p_ignore_comments 0 = include comments in the comparison.
1 = ignore comment nodes.
p_ignore_element_order 0 = element order must be identical in both documents.
1 = element order can be ignored. NOTE: the ability to establish which elements are the same relies on the success of the algorithm used to unique identify an element (see documentation).
p_ignore_attribute_order 0 = attribute order must be identical in
both documents.
1 = attribute order can be ignored.
p_ignore_proc_instr 0 = include processing instructions in the
comparison.
1 = ignore processing instruction nodes.
p_ignore_namespace 0 = use the namespace on elements and
attributes.
1 = ignore the namespace (use local names).
p_ignore_whitespace 0 = preserve whitespace in text node and
attribute node values.
1 = normalise adjacent whitespace characters
(blank, tab, line feed, carriage return)
into a single blan character.
2 = strip whitespace completey from text
node and attribute node values.

 

The results are returned in an associative array defined as follows:

TYPE t_node IS RECORD
(
    document_id PLS_INTEGER,
    node_id VARCHAR2(
4000),
    node_values t_values,
    difference_type PLS_INTEGER
);


TYPE t_nodes IS TABLE OF t_node INDEX BY BINARY_INTEGER;

 

The components to the record type are defined as follows:

 

Column Description
document_id 1 = First document
2 = Second document
node_id The node identifier represented in an XPATH like notation. The hierarchy of elements are represented in the same way as XPATH, but the elements are indexed with the position as a child node on an element and not as the occurrence of a particular named element, e.g.

<Document>
  <Paragraph>
    <Font Class="Arial"/>
    <Word Id="1">Hello</Word>
    <Word Id="2">World</Word>
  </Paragraph>
</Document>

XPATH would represent the highlighted attribute:  /Document[1]/Paragraph[1]/Word[2]/@Id, but with this utility, which is XMLDOM driven, it would be represented as: /Document{1}/Paragraph{1}/Word{3}/Id{1} where the word element is the third child element under the paragraph element, regardless of the names of the other elements. The attribute is also indexed if attribute order is not being ignored.

node_value The maximum size of a text node value is 64k characters. To avoid the use of CLOBs which are not practical for comparison purposes, the value is broken up into more convenient chunks represented by a further associative array.

TYPE t_values IS TABLE OF VARCHAR2(4000) INDEX BY BINARY_INTEGER;

A size of 4000 characters was favoured over a maximum size of 32k to make it more convenient to persist the data to the database in VARCHAR2 form.

difference_type 1 = Node values do not match.
2 = No equivalent node.
3 = Ambiguous element.

Boolean: The function returns a true (1) or false (0) result on whether the documents are identical. There is a third return result (-1) to indicate that there is ambiguity with one or more elements, effectively this is saying a comparison is not possible.

FUNCTION is_identical_xml
(
    p_xml_1 IN XMLTYPE,
    p_xml_2 IN XMLTYPE,
    p_ignore_comments IN PLS_INTEGER := gc_true,
    p_ignore_element_order IN PLS_INTEGER := gc_true,
    p_ignore_attribute_order IN PLS_INTEGER := gc_true,
    p_ignore_proc_instr IN PLS_INTEGER := gc_true,
    p_ignore_namespace IN PLS_INTEGER := gc_true,
    p_ignore_whitespace IN PLS_INTEGER := gc_whitespace_preserve
)
RETURN PLS_INTEGER;

The argument to this function are identical to the difference list version.

 


webmaster@db-innovations.co.uk : Copyright © 2005 - 2010 Database Innovations Ltd : Last modified: 27/08/10