figshare
Browse
2008DelprattodPhD.pdf (2.46 MB)

Space efficient in-memory representation of XML documents

Download (2.46 MB)
thesis
posted on 2009-11-20, 12:23 authored by O'Neil Davion Delpratt
Extensible Markup Language (XML) is a multi-purpose text-based format, used for storage, transmission and manipulation of data. XML documents are often held in main memory and processed via standard interfaces such as the Document Object Model (DOM). However, XML is inherently verbose, and the in-memory representation of XML documents by existing DOM implementations is up to ten times larger than the file size. This is a problem for machines with limited memory, such as mobile devices, where processing even moderately-sized XML documents requires more memory than is available. We focus on in-memory representations of XML documents for situations where space is limited and where rapid processing time is important. We propose a compact representation of XML documents that uses succinct or highly space-efficient data structures, that allows XML processing to be executed efficiently. Succinct data structures use space that approaches the information-theoretic lower bound on the space that is required to represent the data, and support operations upon the representation in constant time. In the context of XML documents, we study and improve succinct representations for ordinal trees by adding features that make them more suitable for use in XML documents. We explore fast and space-efficient representations of the textual data of XML documents. Our basic approach is to concatenate all the textual data in the XML document into a single string, and extract individual textual values by computing the appropriate substring of the concatenated string. Computing the substring requires us to store offsets into the text. The storage of the offsets is surprisingly expensive, if stored naively (as 32 or 64-bit integer values). We give a succinct representation and provide data-aware representations (adapted from work on inverted indices in information retrieval), and show their close connection. We describe Succinct DOM (SDOM), which is a DOM implementation that has low, stable and predictable memory usage. We show, via an experimental evaluation, that SDOM is extremely fast. A variant, SDOM-CT, applies BZip-based compression to textual and attribute data, and its space usage is comparable with “query-friendly” XML compressors. Some of these compressors support navigation and/or querying (e.g. subpath queries) of the compressed file. SDOM-CT does not support querying directly, but remains extremely fast: it is several orders of magnitude faster for navigation than query-friendly XML compressors that support navigation (and only a few times slower than popular DOM implementations such as the Apache Foundation’s Xerces-C).

Funding

PPARC e-Science Studentship;PPA/S/E2003/03749

History

Supervisor(s)

Raman, R.

Date of award

2009-07-03

Awarding institution

University of Leicester

Qualification level

  • Doctoral

Qualification name

  • PhD

Language

en

Usage metrics

    University of Leicester Theses

    Categories

    Keywords

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC