Parser APIs
This C implementation of the XML processor (or parser) follows the W3C XML
specification (rev REC-xml-19980210) and implements the required behavior of
an XML processor in terms of how it must read XML data and the information
it must provide to the application.
The following is the general behavior of this parser:
- If an input document's character encoding cannot be determined
automatically by a BOM (Byte Order Mark) or XMLDecl, then UTF-8 is assumed.
A separate, fast single-byte code path exists, as well as the multibyte
path. To use this fast track, if your documents are single-byte (ASCII,
ISO-8859, EBCDIC, etc), make sure to specify the correct input encoding and
to not let it default to UTF-8.
- Output encoding (DOM/SAX data) will be in the same encoding as the
first input encountered. To explicitly set the output encoding, use the
xmlinitenc function and pass in the extra outcoding argument.
UTF-16 is supported.
- Messages are printed to stderr unless an error message handling
callback function is given.
If you provide a message handler (and context), a numeric error code,
error message, and context will be passed to this function instead. Error
message text will be in UTF-8, and any data included as part of a message
will be converted to UTF-8.
- DOM is the default interface for accessing a parsed document. To
use SAX instead, specify a structure of SAX callback functions (and SAX
context) at initialization time. Not all SAX functions need be provided;
you can set any or all to NULL and only process those events you care about.
- The default behavior for the parser is to check that the input is
well-formed, but not to validate. Set the xmlinit function flag
XML_FLAG_VALIDATE to turn on validation.
- Whitespace processing is fully conformant with the XML 1.0 spec,
i.e. all whitespace is reported back to the application but it is indicated
which whitespace is "ignorable". Some applications may want to
set the XML_FLAG_DISCARD_WHITESPACE flag which will discard all whitespace
between an end-element tag and the following start-element tag (such as
newlines).
- Validation problems are printed (or passed to the error message
callback) but do not halt validation. Set the flag
XML_FLAG_STOP_ON_WARNING to cause validation to cease immediately on the first
warning (as for an error).
Calling Sequence
The sequence of calls to the parser can be:
Parsing a single document:
- xmlinit - xmlparsexxx - xmlterm
Parsing multiple documents, but only the latest document needs to be
available:
- xmlinit - xmlparsexxx - xmlclean - xmlparsexxx - xmlclean ... xmlterm
Parsing multiple documents, all documents must be available:
- xmlinit - xmlparsexxx - xmlparsexxx ... xmlterm
Memory Callbacks
Memory callback functions may be used if you wish to use your
own memory allocation. If they are used, all of the functions should be
specified. Allocated memory does not need to be initialized.
The memory allocated for parameters passed to the SAX callbacks or for
nodes and data stored with the DOM parse tree will not be freed until one
of the following is done:
- xmlclean is called.
- xmlterm is called.
Error Message Callbacks
By default, error messages are printed to stderr. An error message
callback may be provided at initialization time, however. If given,
error numbers and text are passed to that function, and the user may do
whatever they wish with them. Location information (line number and
source filename) is available through the xmlwhere function. This
function should only be called while an error is in progress (i.e. while
in the error callback function). Error message callback functions should
be declared using the XML_MSGHDLRF function prototype macro.
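As an illustration of the paragraph above, here is a minimal sketch of a handler that records messages instead of letting them go to stderr. The expansion of XML_MSGHDLRF is not shown in this document, so the argument list used here (context, message text, numeric code) is an assumption; with the real library, declare the handler through the macro. The names err_log and log_message are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char oratext;   /* as declared in this document */
typedef unsigned int  uword;     /* as declared in this document */

#define ERR_LOG_MAX   8          /* hypothetical capacity */
#define ERR_TEXT_MAX  128        /* hypothetical per-message limit */

/* Hypothetical message context: keeps the first few messages. */
typedef struct err_log {
    size_t count;                           /* messages received in total */
    uword  codes[ERR_LOG_MAX];              /* numeric error codes */
    char   text[ERR_LOG_MAX][ERR_TEXT_MAX]; /* copies of the UTF-8 text */
} err_log;

/* Assumed handler shape; the real prototype comes from XML_MSGHDLRF. */
void log_message(void *msgctx, const oratext *msg, uword errcode)
{
    err_log *log = (err_log *)msgctx;

    if (log == NULL)
        return;
    if (log->count < ERR_LOG_MAX) {
        size_t i = log->count;
        log->codes[i] = errcode;
        /* Copy the text rather than keeping the pointer (a conservative
           assumption about how long the message buffer stays valid). */
        strncpy(log->text[i], msg ? (const char *)msg : "", ERR_TEXT_MAX - 1);
        log->text[i][ERR_TEXT_MAX - 1] = '\0';
    }
    log->count++;
}
```

Under these assumptions, the handler and a pointer to a zeroed err_log would be passed to xmlinit as msghdlr and msgctx.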
I/O Callbacks
Document input is handled through a set of I/O callback functions. For most
access methods (HTTP, FTP, filesystem, etc), built-in callbacks are provided.
For other methods, notably stream, the user must specify their
own callbacks, as none will be provided. Any of the built-in callbacks may
be overridden with user-defined ones.
The function xmlaccess sets the callbacks for
the given access method (xmlacctype).
Thread Safety
If threads are forked off somewhere in the midst of the init-parse-term
sequence of calls, you will get unpredictable behavior and results.
Data Types Index
| oratext
| String pointer used for all data encodings, cast as needed; for UTF-16, to (ub2 *)
|
| xmlctx
| Top-level XML context
|
| xmlmemcb
| Memory callback structure (optional)
|
| xmlsaxcb
| SAX callback structure (SAX only)
|
| xmlacctype
| XML access type (HTTP, FTP, File, etc)
|
| ub4
| 32-bit (or larger) unsigned integer
|
| uword
| Native unsigned integer
|
Function Index
Data Structures and Types
typedef unsigned char oratext;
typedef struct xmlctx xmlctx;
Note: The contents of xmlctx are private and must not be accessed by users.
struct xmlmemcb
{
void *(*alloc)(void *ctx, size_t size);
void (*free)(void *ctx, void *ptr);
void *(*realloc)(void *ctx, void *ptr, size_t size);
};
typedef struct xmlmemcb xmlmemcb;
Note: This is the memory callback structure. Allocations
do not need to be initialized (e.g. like malloc, not calloc).
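A minimal sketch of a filled-in memory callback structure follows. It wraps the system allocator and counts calls through the callback context. The structure is redeclared from the listing above only so that the sketch compiles on its own; mem_stats, the counted_* functions, and make_counting_memcb are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Redeclared from this document; with the real library, include its
   header instead. */
struct xmlmemcb {
    void *(*alloc)(void *ctx, size_t size);
    void  (*free)(void *ctx, void *ptr);
    void *(*realloc)(void *ctx, void *ptr, size_t size);
};
typedef struct xmlmemcb xmlmemcb;

/* Hypothetical callback context: simple call counters. */
typedef struct mem_stats {
    size_t allocs;
    size_t frees;
    size_t reallocs;
} mem_stats;

static void *counted_alloc(void *ctx, size_t size)
{
    mem_stats *s = (mem_stats *)ctx;
    void *p = malloc(size);          /* uninitialized memory is acceptable */
    if (p != NULL && s != NULL)
        s->allocs++;
    return p;
}

static void counted_free(void *ctx, void *ptr)
{
    mem_stats *s = (mem_stats *)ctx;
    if (ptr != NULL && s != NULL)
        s->frees++;
    free(ptr);
}

static void *counted_realloc(void *ctx, void *ptr, size_t size)
{
    mem_stats *s = (mem_stats *)ctx;
    void *p = realloc(ptr, size);
    if (p != NULL && s != NULL)
        s->reallocs++;
    return p;
}

/* Fill in all three members, as the text recommends. */
xmlmemcb make_counting_memcb(void)
{
    xmlmemcb cb;
    cb.alloc = counted_alloc;
    cb.free = counted_free;
    cb.realloc = counted_realloc;
    return cb;
}
```

The resulting structure would be passed to xmlinit as memcb, with a pointer to a mem_stats object as memcbctx.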
struct xmlsaxcb
{
sword (*startDocument)(void *ctx);
sword (*endDocument)(void *ctx);
sword (*startElement)(void *ctx, const oratext *name,
const struct xmlattrs *attrs);
sword (*endElement)(void *ctx, const oratext *name);
sword (*characters)(void *ctx, const oratext *ch, size_t len);
sword (*ignorableWhitespace)(void *ctx, const oratext *ch,
size_t len);
sword (*processingInstruction)(void *ctx, const oratext *target,
const oratext *data);
sword (*notationDecl)(void *ctx, const oratext *name,
const oratext *publicId,
const oratext *systemId);
sword (*unparsedEntityDecl)(void *ctx, const oratext *name,
const oratext *publicId,
const oratext *systemId,
const oratext *notationName);
sword (*nsStartElement)(void *ctx, const oratext *qname,
const oratext *local,
const oratext *namespace,
const struct xmlattrs *attrs);
/* The following 8 fields are reserved for future use. */
void (*empty1)();
void (*empty2)();
void (*empty3)();
void (*empty4)();
void (*empty5)();
void (*empty6)();
void (*empty7)();
void (*empty8)();
};
typedef struct xmlsaxcb xmlsaxcb;
Note: Callbacks for SAX-like API.
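The handlers below are a small sketch of functions that could be assigned to the startElement, endElement, and characters members of xmlsaxcb. They track element depth and copy character data into the SAX context, because SAX data pointers are only valid while the callback runs. The sword typedef is an assumption (a signed native integer), struct xmlattrs is left opaque, the text copy assumes a single-byte output encoding, and sax_state and the on_* names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char oratext;  /* as declared in this document */
typedef int sword;              /* assumption: signed native integer */
struct xmlattrs;                /* opaque here; defined by the library */

#define TEXT_MAX 256            /* hypothetical buffer size */

/* Hypothetical SAX context shared by the handlers. */
typedef struct sax_state {
    unsigned elements;          /* start tags seen */
    unsigned depth;             /* current nesting depth */
    unsigned max_depth;         /* deepest nesting seen */
    size_t   text_len;          /* bytes stored in text */
    char     text[TEXT_MAX];    /* accumulated character data */
} sax_state;

sword on_start_element(void *ctx, const oratext *name,
                       const struct xmlattrs *attrs)
{
    sax_state *st = (sax_state *)ctx;
    (void)name;
    (void)attrs;
    st->elements++;
    st->depth++;
    if (st->depth > st->max_depth)
        st->max_depth = st->depth;
    return 0;                   /* zero lets the parse continue */
}

sword on_end_element(void *ctx, const oratext *name)
{
    sax_state *st = (sax_state *)ctx;
    (void)name;
    if (st->depth == 0)
        return 1;               /* non-zero stops the parse */
    st->depth--;
    return 0;
}

sword on_characters(void *ctx, const oratext *ch, size_t len)
{
    sax_state *st = (sax_state *)ctx;
    size_t room = (TEXT_MAX - 1) - st->text_len;
    if (len > room)
        len = room;             /* truncate rather than overflow */
    if (len > 0)
        memcpy(st->text + st->text_len, ch, len);
    st->text_len += len;
    st->text[st->text_len] = '\0';
    return 0;
}
```

To use them, one would zero an xmlsaxcb, assign these three members, leave the rest NULL, and pass a pointer to a zeroed sax_state as the SAX context.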
typedef unsigned int ub4;
typedef unsigned int uword;
Functions
- Purpose
- Sets the I/O callback functions for the given access method.
- Syntax
uword xmlaccess(xmlctx *ctx, xmlacctype access, XML_OPENF((*openf)),
XML_CLOSEF((*closef)), XML_READF((*readf)));
- Parameters
ctx (IN) - The XML context
access (IN) - access method enum, XMLACCESS_xxx
openf (IN) - Open-input callback function
closef (IN) - Close-input callback function
readf (IN) - Read-input callback function
- Comments
- Sets the I/O callback functions for the given access method. Most
methods have built-in callback functions, so none need be provided by the user.
The notable exception is XMLACCESS_STREAM, user-defined streams, where the
user must set the stream callback functions themselves.
- The three callback functions are invoked to open, close, and read from
the input source. The functions should have been declared using the
function prototype macros XML_OPENF, XML_CLOSEF and XML_READF.
- XML_OPENF is the open function, called once to open the input
source. It should set its persistent handle in the xmlihdl
union, which has two choices, a generic pointer (void *), and
an integer (as unix file or socket handle). This function
must return XMLERR_OK on success. Args:
ctx (IN) - XML context
path (IN) - full path to the source to be opened
parts (IN) - path broken down into components; opaque pointer
length (OUT) - total length of input source, if known (0 if not known)
ih (OUT) - the opened handle is placed here
- XML_CLOSEF is the close function; it closes an open source and
frees resources. Args:
ctx (IN) - XML context
ih (IN) - input handle union
- XML_READF is the reader function; it reads data from an open
source into a buffer, and returns the number of bytes read:
- If <= 0, an EOI condition is indicated.
- If > 0, then the EOI flag determines if this is the terminal data.
On EOI, the matching close function will be called automatically. Args:
ctx (IN) - XML context
path (IN) - full path to the source to be opened; only
provided here for use in error messages
ih (IN) - input handle union
dest (OUT) - destination buffer to read data into
destsize (IN) - size of dest
nraw (OUT) - number of bytes read
eoi (OUT) - hit End of Information?
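The read contract described above can be sketched with a reader over an in-memory stream. The real prototype comes from the XML_READF macro, whose expansion is not shown in this document, so the parameter types and the int return type here are assumptions. The xmlihdl member names are not given either, so a hypothetical stand-in union (demo_ihdl) is used; mem_stream and mem_read are hypothetical names as well.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char oratext;  /* as declared in this document */

/* Hypothetical stand-in for the xmlihdl union (pointer or integer). */
typedef union demo_ihdl {
    void *ptr;
    int   fd;
} demo_ihdl;

/* Hypothetical user stream: a block of memory and a read cursor. */
typedef struct mem_stream {
    const unsigned char *data;
    size_t len;
    size_t pos;
} mem_stream;

/* Copies up to destsize bytes into dest, stores the count in *nraw,
   sets *eoi once the stream is exhausted, and returns the count. */
int mem_read(void *ctx, const oratext *path, demo_ihdl *ih,
             oratext *dest, size_t destsize, size_t *nraw, int *eoi)
{
    mem_stream *ms = (mem_stream *)ih->ptr;
    size_t left = ms->len - ms->pos;
    size_t n = left < destsize ? left : destsize;

    (void)ctx;
    (void)path;                 /* only useful for error messages */
    if (n > 0)
        memcpy(dest, ms->data + ms->pos, n);
    ms->pos += n;
    *nraw = n;
    *eoi = (ms->pos == ms->len);
    return (int)n;              /* 0 also signals end of input */
}
```

A matching open function would place the mem_stream pointer in the handle union and return XMLERR_OK; the close function would release whatever the open function acquired.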
- Purpose
- Initializes the C XML parser. It must be called before any parsing
can take place.
- Syntax
xmlctx *xmlinit(uword *err, const oratext *incoding,
XML_MSGHDLRF((*msghdlr)), void *msgctx,
const xmlsaxcb *saxcb, void *saxcbctx,
const xmlmemcb *memcb, void *memcbctx, const oratext *lang);
xmlctx *xmlinitenc(uword *err, const oratext *incoding, const oratext *outcoding,
XML_MSGHDLRF((*msghdlr)), void *msgctx,
const xmlsaxcb *saxcb, void *saxcbctx,
const xmlmemcb *memcb, void *memcbctx, const oratext *lang);
- Parameters
err (OUT) - Numeric error code, on failure
incoding (IN) - default input character set encoding
outcoding (IN) - output (DOM/SAX data) character set encoding (xmlinitenc only)
msghdlr (IN) - Error message handler function
msgctx (IN) - Context for the error message handler
saxcb (IN) - SAX callback structure (filled with function pointers)
saxcbctx (IN) - Context for SAX callbacks
memcb (IN) - Memory function callback structure
memcbctx (IN) - Context for the memory function callbacks
lang (IN) - Language for error messages
- Comments
- Do not call any other XML parser functions if this is not successful!
- This function should only be called once before parsing any XML files.
xmlterm should be called after all parsing and DOM use has
completed. Multiple parses should call xmlclean between runs
if only the current document needs to be available. Until xmlclean is called,
data pointers from all previous parses will continue to be valid.
- All arguments may be NULL except for err, which is required. On
success, an XML context (xmlctx *) is returned. If this is NULL, a
failure occurred and the numeric error code is stored in *err.
- Data Encoding
- The encoding of input documents is detected automatically (by BOM,
XMLDecl, etc). If the encoding cannot be determined, incoding is
assumed. If incoding is not specified (NULL), UTF-8 is assumed.
incoding should be an IANA/MIME encoding name, e.g. "UTF-16", "ASCII", etc.
- NOTE: A separate, fast code path exists for single-byte character
sets like ASCII, ISO-8859, and EBCDIC. This path is considerably
faster than the UTF-8 multibyte path, so if you are sure your input
documents are single-byte, you are strongly encouraged to say so by
setting the incoding.
- The encoding which data will be presented as (through DOM/SAX) is given
as outcoding. If not specified, UTF-8 is chosen. Unicode (UTF-16)
is supported. Since DOM/SAX APIs specify (oratext *) as data pointers,
for Unicode these should be cast to (ub2 *).
- NOTE: For backwards compatibility (until the next major release),
xmlinit will set the outcoding to the input encoding of the first
document parsed, to simulate the old behavior. For xmlinitenc,
the output encoding is explicitly specified.
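The cast mentioned above can be sketched as follows for UTF-16 output data. The ub2 typedef is an assumption (a 16-bit unsigned integer), and the sketch further assumes that the data is terminated by a zero code unit, is suitably aligned, and is in native byte order. The name utf16_units is hypothetical.

```c
#include <stddef.h>

typedef unsigned char  oratext;  /* as declared in this document */
typedef unsigned short ub2;      /* assumption: 16-bit unsigned integer */

/* Count the 16-bit code units before the terminating zero unit. */
size_t utf16_units(const oratext *data)
{
    const ub2 *p = (const ub2 *)data;  /* the cast described in the text */
    size_t n = 0;

    if (p == NULL)
        return 0;
    while (p[n] != 0)
        n++;
    return n;
}
```

The result is a count of code units, not characters: a character outside the Basic Multilingual Plane occupies two units.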
- Error Messages, Language
- By default, error messages are printed to stderr. To handle messages
yourself, specify a handler function pointer. The formatted error
string and numeric error code will be passed to your function, along
with the user-defined message context msgctx. The error strings will
be UTF-8; any data included as part of the error message will be
converted to UTF-8. If you need the line number and path/URL where the
error occurred, the xmlwhere function returns this information,
but it may only be called from the user's callback function (while the
error is in progress).
- The error language is specified as lang, e.g. "AMERICAN", "JAPANESE",
"FRENCH", etc, and defaults to American.
- SAX vs DOM
- By default, a DOM parse tree is built. To use SAX instead, specify a
SAX callback structure (saxcb). The callbacks will be invoked with
the given SAX context pointer. If any of the SAX functions returns
an error (non-zero), parsing stops immediately.
- Memory Allocation
- The parser allocates memory in large chunks, then doles parts of the
large chunks out as needed. The default system memory allocator (malloc,
etc) will be used to allocate, and the default system memory freer (free,
etc) will be used to free the chunks unless a memory callback structure
is provided. Note that the memory allocation function is a replacement
for malloc, and so does not need to return initialized memory. If memcb
is given, it contains function pointers to alloc/free functions which will
be used instead. The user-defined memory callback context memcbctx is
passed to the callback functions.
- Error Codes
| XMLERR_NLS_INIT
| The National Language Service package could not be initialized.
Perhaps an installation or configuration problem.
|
| XMLERR_INVALID_MEMCB
| A memory callback structure (memcb) was specified, but it did not have
alloc and free function pointers.
|
| XMLERR_BAD_ENCODING
| An encoding was not known. Use IANA/MIME names for encodings, and
make sure NLS data is present.
|
| XMLERR_INVALID_LANG
| The language specified for error messages was not known.
|
| XMLERR_LEH_INIT
| The LEH (catch/throw) package could not be initialized. An internal
error, contact support.
|
- Purpose
- Frees any memory used during the previous parse.
- Syntax
void xmlclean(xmlctx *ctx);
- Parameters
- ctx (IN) - The XML parser context
- Comments
- Recycles memory within the XML parser, but does not free it to the
system-- only xmlterm finally releases all memory back to the
system. If xmlclean is not called between parses, then the data
used by the previous documents remains allocated, and pointers to
it are valid. Thus, the data for multiple documents can be accessible
simultaneously, although only the current document can be manipulated
with DOM.
- If you just want to access one document's data at a time (within a
single context), then call xmlclean before each new parse.
- Purpose
- These functions invoke the XML parser on various input sources. The
parser must have been initialized successfully with a call to
xmlinit first.
- Syntax
uword xmlparse(xmlctx *ctx, const oratext *uri,
const oratext *incoding, ub4 flags);
uword xmlparsebuf(xmlctx *ctx, const oratext *buffer, size_t len,
const oratext *incoding, ub4 flags);
uword xmlparsedtd(xmlctx *ctx, const oratext *filename,
oratext *name, const oratext *incoding, ub4 flags);
uword xmlparsefile(xmlctx *ctx, const oratext *path,
const oratext *incoding, ub4 flags);
uword xmlparsestream(xmlctx *ctx, const void *stream,
const oratext *incoding, ub4 flags);
- Parameters
ctx (IN/OUT) - The XML parser context
uri (IN) - URI of XML document (xmlparse only)
buffer (IN) - input buffer (xmlparsebuf only)
len (IN) - length of the buffer (xmlparsebuf only)
stream (IN) - input stream (xmlparsestream only)
incoding (IN) - default input character set encoding
flags (IN) - mask of parser options
- Comments
- Parser options are specified as flag bits OR'd together into
the flags mask. Flag bits are:
| XML_FLAG_VALIDATE
| Turn validation on
|
| XML_FLAG_DISCARD_WHITESPACE
| Discard extraneous whitespace (end-of-line etc)
|
| XML_FLAG_STOP_ON_WARNING
| Stop validation on warnings
|
| XML_FLAG_DTD_ONLY
| Parse an external DTD, not a complete XML document
|
| XML_FLAG_WARN_DUPLICATE_ENTITY
| Emit warning when duplicate entities are declared
|
| XML_FLAG_FORCE_INCODING
| Force input documents to be read in incoding
|
- By default, the parser does not validate the input-- you must use
the flag XML_FLAG_VALIDATE to enable validation. To parse an external
DTD (as opposed to a complete XML document), set the XML_FLAG_DTD_ONLY
flag. Validation problems are considered warnings, not errors, and by
default validation will continue after warnings have occurred. To treat
validation problems as errors, set the flag XML_FLAG_STOP_ON_WARNING.
- The default behavior for whitespace processing is to be fully
conformant to the XML 1.0 spec, i.e. all whitespace is reported
back to the application, but it is indicated which whitespace is
"ignorable". However, some applications may prefer to set the
XML_FLAG_DISCARD_WHITESPACE which will discard all whitespace
between an end-element tag and the following start-element tag.
- The default input encoding is specified as incoding,
which overrides the incoding parameter to xmlinit. If the input's
encoding cannot be determined automatically (by BOM, XMLDecl, etc) then
it is assumed to be incoding (which defaults to UTF-8).
If the flag XML_FLAG_FORCE_INCODING is set, the document will be assumed to
be in incoding regardless of the XMLDecl, BOM, etc. Only an
external protocol declaration (HTTP charset, etc) overrides a forced incoding.
- Data pointers returned by DOM APIs remain valid until xmlclean
or xmlterm is called.
- For SAX, the data pointers only remain valid for the duration of
the user's callback function. That is, once the callback function
has returned, the data pointers become invalid. If longer access
is needed, the data can be stored in the XML memory's pool using
stringSave (or stringSave2 for UCS2 data).
- Streams: A stream is a user defined entity here-- all that's passed
in is a stream/context pointer, which is in turn passed to the
I/O callback functions. The parser does not reference the stream
directly.
- DTD: The DTD parser invokes the XML parser on an external DTD, not
a complete document. It is used mainly by the Class Generator so
that classes may be generated from a DTD without needing a complete
(dummy) document.
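As an illustration of the flag bits described in the comments above, here is a minimal sketch of a helper that builds the flags mask. The numeric values below are placeholders so that the sketch compiles on its own; they are not the library's values, and the real definitions come from the parser's header. The name build_parse_flags is hypothetical.

```c
typedef unsigned int ub4;       /* as declared in this document */

/* Placeholder bit values, NOT the library's; the #ifndef guards let the
   real header definitions take precedence when they are available. */
#ifndef XML_FLAG_VALIDATE
#define XML_FLAG_VALIDATE           0x01u
#endif
#ifndef XML_FLAG_DISCARD_WHITESPACE
#define XML_FLAG_DISCARD_WHITESPACE 0x02u
#endif
#ifndef XML_FLAG_STOP_ON_WARNING
#define XML_FLAG_STOP_ON_WARNING    0x04u
#endif

/* OR the requested options into a single mask. */
ub4 build_parse_flags(int validate, int stop_on_warning, int discard_ws)
{
    ub4 flags = 0;

    if (validate)
        flags |= XML_FLAG_VALIDATE;
    /* Stopping on warnings only matters when validation is on. */
    if (validate && stop_on_warning)
        flags |= XML_FLAG_STOP_ON_WARNING;
    if (discard_ws)
        flags |= XML_FLAG_DISCARD_WHITESPACE;
    return flags;
}
```

The returned mask would be passed as the flags argument of xmlparse, xmlparsebuf, and the related functions.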
- Purpose
- Terminates the XML parser. It should be called after
xmlinit, and before exiting the main program.
- Syntax
uword xmlterm(xmlctx *ctx);
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function tears down the parser. It frees all allocated memory,
giving it back to the system (through free or the user's memory
callback). Contrast to xmlclean, which recycles memory internally
without giving it back to the system.
- No additional XML parser calls can be made until xmlinit
is called again to get a new context.
- Purpose
- Return error location information for the last (current) error.
- Syntax
uword xmlwhere(xmlctx *ctx, ub4 *line, oratext **path, uword idx);
- Parameters
ctx (IN) - the XML parser context
line (OUT) - line# where the error occurred
path (OUT) - source path/URL where error occurred
idx (IN) - error# in stack (starting at 0)
- Comments
- Returns the location information for the idx'th error on the stack.
This function should only be called while an error is in progress, i.e.
from within an error message callback function. Since errors occur in
nested inputs (document A includes document B includes document C which
contains an error), more than one location is available. The highest-
level input file is idx 0, then the next level down is 1, etc. If only
the highest level is desired, just call once with idx=0. If all levels
are desired, loop starting with idx=0 and incrementing until the
function returns FALSE.
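The loop described above can be sketched as follows. So that the sketch compiles without the library, the lookup function is passed in as a pointer with the shape of xmlwhere as documented here, and a stand-in (fake_where) exercises the loop. The names where_fn, print_error_stack, and fake_where are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

typedef unsigned char oratext;  /* as declared in this document */
typedef unsigned int  ub4;      /* as declared in this document */
typedef unsigned int  uword;    /* as declared in this document */
typedef struct xmlctx xmlctx;   /* opaque, as in this document */

/* Function-pointer type with the shape of xmlwhere as documented here. */
typedef uword (*where_fn)(xmlctx *ctx, ub4 *line, oratext **path, uword idx);

/* Print every level of the location stack, starting at idx 0, until the
   lookup returns FALSE. Returns the number of levels printed. */
uword print_error_stack(where_fn where, xmlctx *ctx, FILE *out)
{
    uword idx = 0;
    ub4 line = 0;
    oratext *path = NULL;

    while (where(ctx, &line, &path, idx)) {
        fprintf(out, "  level %u: %s, line %u\n", idx,
                path ? (const char *)path : "(unknown)", line);
        idx++;
    }
    return idx;
}

/* Stand-in for xmlwhere, used only to exercise the loop without the
   library: reports two nested inputs, then FALSE. */
uword fake_where(xmlctx *ctx, ub4 *line, oratext **path, uword idx)
{
    static oratext outer[] = "a.xml";
    static oratext inner[] = "b.xml";

    (void)ctx;
    if (idx == 0) { *line = 3;  *path = outer; return 1; }
    if (idx == 1) { *line = 17; *path = inner; return 1; }
    return 0;
}
```

Inside an error message callback, the same helper would be called with xmlwhere itself in place of fake_where, assuming the library prototype matches the syntax shown above.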
- Purpose
- Return current location information while parsing.
- Syntax
void xmlLocation(xmlctx *ctx, ub4 *line, oratext **path);
- Parameters
ctx (IN) - the XML parser context
line (OUT) - current line#
path (OUT) - current source path/URL
- Comments
- Returns the current location information while parsing. This function
may be called at any time, but before a document has begun parsing, or after
the document has finished parsing, 0 will be returned for the path and line#.
- Purpose
- Creates a new document in memory.
- Syntax
xmlnode* createDocument(xmlctx *ctx)
xmlnode* createDocumentNS(xmldomimp *imp, oratext *uri,
oratext *qname, xmlnode *dtd);
- Parameters
ctx (IN) - XML parser context
imp (IN) - XML DOMImplementation (see getImplementation)
uri (IN) - New document's namespace URI
qname (IN) - Namespace qualified name of new document (DOCUMENT_NODE's name)
dtd (IN) - DTD this document is associated with
- Comments
- The original function createDocument has now been
standardized in DOM 2.0 CORE. For compatibility, the old function
remains with its original usage, and the new CORE function is called
createDocumentNS.
Creates a new document in memory. An XML document is always rooted in
a node of type DOCUMENT_NODE-- this function creates that root
node and sets it in the context. There can be only one current document
and hence only one document node; if one already exists, this function
does nothing and returns NULL.
- For createDocumentNS, if a DTD is specified, its ownerDocument
attribute will be set to the document being created.
- Purpose
- Creates a new document type (DTD) node.
- Syntax
xmlnode* createDocumentType(xmldomimp *imp, oratext *qname, oratext *pubid, oratext *sysid);
- Parameters
imp (IN) - XML DOMImplementation (see getImplementation)
qname (IN) - Namespace qualified name of new document type (DTD)
pubid (IN) - External subset public identifier
sysid (IN) - External subset system identifier
- Purpose
- Return value of document's standalone flag.
- Syntax
boolean isStandalone(xmlctx *ctx)
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function returns the boolean value of the document's standalone
flag, as specified in the XML declaration (<?xml ... ?>).
- Purpose
- Return value of "simple encoding" flag.
- Syntax
boolean isSingleChar(xmlctx *ctx)
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function returns the boolean value of the document's "simple"
encoding flag. If the document is single-byte encoded (ASCII, ISO-8859,
EBCDIC, etc), TRUE is returned; otherwise, encoding is multibyte or
Unicode and FALSE is returned. See also the getEncoding function
which returns the name of the specific encoding, and isUnicode
which tests for Unicode data.
- Purpose
- Return value of Unicode encoding flag.
- Syntax
boolean isUnicode(xmlctx *ctx)
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function returns the flag which determines whether DOM/SAX data
for this context is in Unicode (UCS2).
- Purpose
- Returns the IANA/MIME name of the character encoding
used by the document, e.g. "ASCII", "ISO-8859-1", "UTF-8", "UTF-16", etc.
- Syntax
oratext *getEncoding(xmlctx *ctx)
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function returns the name of the document's encoding, e.g. "ASCII",
"UTF-8", etc. See also the isSingleChar function, which can be
used to simply determine if the document is single or multibyte, and
the isUnicode function, which determines if the input is Unicode
(UTF-16).
- Purpose
- Associates an xdkdomdoc object with xmlctx.
- Syntax
void associateDomDocument(xmlctx *ctx, void *doc)
- Parameters
ctx (IN) - the XML parser context
doc (IN) - pointer to the xdkdomdoc to be associated
- Comments
- This routine associates an xdkdomdoc to a given xmlctx. This association
allows usage of the XSLT, XPath, and XML Schema Validator APIs with an
xdkdomdoc. The XDB DOM (OCIDomDocument) object is one particular type
of xdkdomdoc. xdkdomdoc is defined in xdkdomdoc.h.