Author: David Beazley (beazley@cs.uchicago.edu)
Python and SAX
==============
This chapter provides details on Python's support for SAX (Simple API
for XML). SAX is a widely used interface specification for parsing
and processing XML. Many high-level XML processing modules use SAX
for their internal processing, and there are many reasons why you might
want to use it yourself. An introduction to XML parsing is provided
in the previous chapter and is not repeated here. Instead, this
chapter builds upon that material and is intended to serve as more of
a SAX reference.
Introduction to SAX
-------------------
First, SAX is not a package that you download and install. Instead, it
is a common API that is used to manipulate XML. SAX is based on an
event model in which XML documents are scanned sequentially and
different handler methods are invoked for different document features.
This style of XML processing was already illustrated in the previous
chapter describing the Expat module. SAX merely generalizes
that approach and makes it more widely applicable.
The big picture of SAX is that it subdivides XML processing into a
collection of different tasks, each defined by a special interface.
In Python, these interfaces are implemented by a collection of
classes. Figure 1 illustrates the different pieces:

                                                    +--> ContentHandler
                                                    |
                 Raw Data             SAX Events    |--> DTDHandler
    InputSource ----------> XMLReader ------------->+
                                                    |--> EntityResolver
                                                    |
                                                    +--> ErrorHandler

At the lowest level, XML data is read from a byte-stream that is
encapsulated by a special InputSource object. This is usually just a
wrapper around an ordinary file. The data read from an InputSource
object is then fed into an XMLReader object. An XMLReader is just a
generalized parsing interface that sits on top of a low-level XML parser
such as Expat or xmlproc. The XMLReader then sends different events
(SAX events) on to a collection of four different types of handler
objects. The handlers are used to handle different document
features. For example:

    ContentHandler  - Ordinary document text, elements, and attributes.
    DTDHandler      - DTD declarations needed to parse the rest of
                      the document (unparsed external entities).
    EntityResolver  - Handling of external entities.
    ErrorHandler    - Error handling.

In addition, SAX defines a number of miscellaneous interfaces
that are used during processing. For instance, a special Locator
object is used to keep track of the current parsing location and
an AttributesImpl object is used to store attributes.

Since SAX processing is essentially divided into two parts, parsing
and handling, two different Python modules contain most of the
functionality. The xml.sax.xmlreader module contains the classes
pertaining to the parsing of XML and includes the InputSource,
XMLReader, Locator, and AttributesImpl classes. The xml.sax.handler
module contains the classes for handling different SAX events and
includes the ContentHandler, DTDHandler, EntityResolver, and
ErrorHandler classes. More generally, you will use the
xml.sax.xmlreader module to set up a parser. You will then use the
xml.sax.handler module to implement the actions performed during
parsing.
The rest of this chapter documents the different SAX interfaces
and how they are used.
XML Readers
-----------
To read XML documents with SAX, you first have to instantiate a
special XMLReader object. An XMLReader is really nothing more than a
standardized wrapper around an existing XML parser such as Expat or
xmlproc. To create an XMLReader for an existing parser, use the
xml.sax.make_parser() function. For example:

    from xml.sax import make_parser
    p = make_parser()          # Create default parser

Without any arguments, the make_parser() function consults an internal
list of parsers and chooses the first one that is installed. If it's
necessary to use a specific parser for some reason, an optional
list naming the parser modules to try first can also be supplied.
For example:

    p = make_parser(["xml.sax.drivers2.drv_xmlproc"])  # Create xmlproc parser
    p = make_parser(["xml.sax.drivers2.drv_expat"])    # Create expat parser

Regardless of the underlying parsing engine, all XMLReader objects support
a common set of methods.

p.parse(source)

    Starts processing SAX events for input source. source is often an open
    file object, but it can also be a filename, a URL, or an InputSource
    object (described shortly). For example:

        p.parse("file.xml")                   # Filename
        p.parse(open("file.xml"))             # Open file
        p.parse("http://dead.com/file.xml")   # URL

parse() does not return until it has processed the entire input stream.
Therefore, before calling parse(), you need to set up all of the handlers
that will be used to process the different types of SAX events. The
following methods are used to do this:

p.setContentHandler(handler)

    Sets the current ContentHandler. A ContentHandler is used to process
    XML elements and text.

p.getContentHandler()

    Returns the current ContentHandler.

p.setDTDHandler(handler)

    Sets the current DTDHandler. A DTDHandler is used to process DTD
    declarations needed for parsing (unparsed entities and attributes).

p.getDTDHandler()

    Returns the current DTDHandler.

p.setEntityResolver(handler)

    Sets the current EntityResolver. An EntityResolver is used to handle
    all external entities.

p.getEntityResolver()

    Returns the current EntityResolver.

p.setErrorHandler(handler)

    Sets the current ErrorHandler. An ErrorHandler is used to control how
    parsing errors are handled.

p.getErrorHandler()

    Returns the current ErrorHandler.

If a particular handler isn't defined, those SAX events are usually
ignored. For instance, if you don't define a ContentHandler, the
parser won't do much of anything.
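
The set-up-then-parse sequence described above can be sketched as
follows. This is only an illustration, not part of the SAX API: it
assumes a current Python 3 standard library (where the default parser
is Expat), and the ElementCounter class, the count_elements() helper,
and the sample document are all made up for the example.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler that counts how often each element starts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_bytes):
    """Parse xml_bytes and return a dict mapping element names to counts."""
    p = make_parser()
    counter = ElementCounter()
    p.setContentHandler(counter)      # handlers must be installed first
    p.parse(io.BytesIO(xml_bytes))    # a file-like object is a valid source
    return counter.counts


if __name__ == "__main__":
    print(count_elements(b"<recipe><item/><item/></recipe>"))
```

Since no DTDHandler, EntityResolver, or ErrorHandler is installed in
this sketch, the parser falls back on its default behavior for those
event types.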
In addition to setting handlers, XMLReaders provide an interface
for enabling or disabling certain XML features. This is controlled
by the following pair of functions:

p.getFeature(featurename)

    Returns the current setting for parsing feature featurename.

p.setFeature(featurename,value)

    Change the value of featurename to value.

featurename is usually set to one of the following constants defined in
the xml.sax.handler module. value is set to 1 or 0 to indicate a true
or false value. The default values depend on the underlying
parsing engine being used.

feature_namespaces

    Perform namespace processing.

feature_namespace_prefixes

    Use the original prefixed names and attributes in namespace
    declarations. Disabled by default.

feature_string_interning

    Intern element names, prefixes, attribute names, namespace URIs,
    and local names using the built-in intern() function. Disabled
    by default.

feature_validation

    Report all validation errors.

feature_external_ges

    Include all external general entities.

feature_external_pes

    Include all external parameter entities including those
    defined in an external DTD.

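For example:

    from xml.sax import make_parser, handler

    p = make_parser()
    p.setFeature(handler.feature_namespaces,1)
    # Validation needs a validating parser such as xmlproc;
    # a non-validating parser such as Expat rejects this request.
    p.setFeature(handler.feature_validation,1)
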
Most features are read-only during parsing. Therefore,
setFeature() is almost always used before calling
p.parse().
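
To make the effect of a feature concrete, the sketch below turns on
feature_namespaces before parsing and records what the handler
receives. It assumes a current Python 3 standard library with the
Expat-based default parser; the NSCollector class, the
collect_ns_names() helper, and the example.com namespace URI are
invented for illustration. With namespace processing enabled, the
parser reports elements through startElementNS() rather than
startElement().

```python
import io
from xml.sax import make_parser, handler


class NSCollector(handler.ContentHandler):
    """Hypothetical handler that records namespace-qualified names."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) tuple
        self.names.append(name)


def collect_ns_names(xml_bytes):
    """Return the (uri, localname) pairs seen and the feature setting."""
    p = make_parser()
    p.setFeature(handler.feature_namespaces, True)   # set before parse()
    collector = NSCollector()
    p.setContentHandler(collector)
    p.parse(io.BytesIO(xml_bytes))
    return collector.names, p.getFeature(handler.feature_namespaces)


if __name__ == "__main__":
    doc = b'<r:recipe xmlns:r="http://example.com/r"><r:title/></r:recipe>'
    print(collect_ns_names(doc))
```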
In addition, XMLReaders have a number of properties that are used
to control behavior of the parser. Properties are controlled by
the following methods:

p.getProperty(propertyname)

    Return the current setting for property propertyname.

p.setProperty(propertyname,value)

    Change the value of property propertyname to value.

Values for propertyname are also defined in xml.sax.handler.

property_lexical_handler

    Optional extension handler for processing lexical
    events such as comments.

property_declaration_handler

    Optional extension handler for processing DTD
    declarations other than those handled by the
    DTDHandler.

property_dom_node

    DOM iteration.

property_xml_string

    The literal string of characters that was the
    source for the current event.

As of this writing, none of the above properties appear to be
supported in Python (so they are really only mentioned because they
are part of the SAX API).
Finally, the following method is used to control locale settings
of the parser.

p.setLocale(locale)

    Set the locale for errors and warnings. This function
    may or may not be supported and it is acceptable for
    SAX parsers to ignore locale requests.

If an error occurs in any of the SAX functions, one of the
following exceptions will be raised:

SAXException(msg [,exception])

    Base class for all other SAX exceptions. msg is a
    human readable message and exception optionally
    contains a different exception (this is used to
    encapsulate more specific errors).

SAXParseException(msg, exception, locator)

    Raised when a parsing error occurs. In addition to
    containing an error message, this exception provides
    information about the error location that can
    be obtained using the Locator interface (see
    the ContentHandler section for details).

SAXNotRecognizedException(msg,exception)

    Raised when an XMLReader is asked to support an unrecognized
    feature or property.

SAXNotSupportedException(msg,exception)

    Raised when an XMLReader is asked to enable a feature or
    property that the implementation does not support.

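
As a rough illustration of how these exceptions surface in practice,
the following sketch catches a SAXParseException for a malformed
document and a SAXNotRecognizedException for an unknown feature name.
It assumes a current Python 3 standard library with the default
Expat-based parser and default error handling; the try_parse() and
feature_known() helpers and the example.com feature name are
hypothetical.

```python
import io
from xml.sax import (make_parser, handler,
                     SAXParseException, SAXNotRecognizedException)


def try_parse(xml_bytes):
    """Return None on success, or a short error description on failure."""
    p = make_parser()
    try:
        p.parse(io.BytesIO(xml_bytes))
    except SAXParseException as e:
        # The exception carries location data via the Locator interface
        return "line %d: %s" % (e.getLineNumber(), e.getMessage())
    return None


def feature_known(featurename):
    """Report whether the default parser recognizes featurename."""
    p = make_parser()
    try:
        p.getFeature(featurename)
    except SAXNotRecognizedException:
        return False
    return True


if __name__ == "__main__":
    print(try_parse(b"<a><b></a>"))          # mismatched tags
    print(feature_known(handler.feature_namespaces))
    print(feature_known("http://example.com/no-such-feature"))
```

Catching SAXException instead would handle all of these cases at
once, since it is the base class of the others.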
InputSource Objects
-------------------
Internally, all input to a SAX parser is supplied using a special
InputSource object. InputSource really just defines a common I/O
interface that can be used by a parser. An instance i of an
InputSource class has the following methods:

i.setPublicId(id)

    Sets the public identifier.

i.getPublicId()

    Returns the public identifier.

i.setSystemId(id)

    Sets the system identifier.

i.getSystemId()

    Returns the system identifier.

i.setEncoding(encoding)

    Sets the character encoding. encoding is typically a string like
    'iso-8859-1' or 'utf-8' that would ordinarily appear in an
    <?xml encoding="..."?> declaration.

i.getEncoding()

    Returns the character encoding.

i.setByteStream(bytefile)

    Sets the byte-stream of the object to a Python file-like object.
    The bytefile object should be a raw byte-oriented file and not
    a file that performs automatic character conversion (e.g., do not
    pass a file that automatically decodes Unicode characters).

i.getByteStream()

    Returns the byte-stream for the input source.

i.setCharacterStream(charfile)

    Sets the character stream. charfile must be a file-like object that
    knows how to decode bytes into characters according to the
    input's encoding scheme. Usually this is a Unicode file object
    as might be created by the codecs module.

i.getCharacterStream()

    Returns the character stream.

If an application wants to supply XML data through a non-traditional
mechanism, the InputSource class can be specialized and modified as
needed. The modified input can then be passed to the parse() method.
For example:

    from xml.sax import xmlreader
    from xml.sax import make_parser

    class MyInput(xmlreader.InputSource):
        ...
        # specialize InputSource
        ...

    p = make_parser()
    ...
    i = MyInput("foo.xml")       # Create InputSource object
    p.parse(i)                   # Parse

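
Subclassing is not always required. As a minimal sketch, a plain
InputSource can be configured with the methods listed above and handed
to parse(). The example assumes a current Python 3 standard library;
the TextCollector class, the parse_bytes() helper, and the
"in-memory.xml" system identifier are invented for illustration.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, data):
        self.chunks.append(data)


def parse_bytes(data, encoding=None, system_id="in-memory.xml"):
    """Parse raw XML bytes through an explicitly configured InputSource."""
    src = InputSource(system_id)          # system id is only a label here
    src.setByteStream(io.BytesIO(data))   # raw bytes, no decoding layer
    if encoding is not None:
        src.setEncoding(encoding)         # e.g. 'iso-8859-1' or 'utf-8'
    collector = TextCollector()
    p = make_parser()
    p.setContentHandler(collector)
    p.parse(src)
    return "".join(collector.chunks)


if __name__ == "__main__":
    print(parse_bytes(b"<t>hello</t>"))
```

Because a byte stream is supplied, the parser reads from it directly
and the system identifier is used only for reporting purposes.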
Content Handlers
----------------
Most of the work in a SAX parser is performed by a special
ContentHandler object. The module xml.sax.handler defines a class
ContentHandler that defines the interface. However, you often create
a handler by subclassing a default implementation. For example:

    from xml.sax import make_parser
    from xml.sax import saxutils

    # Define a simple SAX handler
    class SimpleHandler(saxutils.DefaultHandler):
        def startElement(self,name,attrs):
            print 'Start: ',name,attrs
        def endElement(self,name):
            print 'End: ',name
        def characters(self,data):
            print 'Data: ', repr(data)

    sh = SimpleHandler()
    p = make_parser()
    p.setContentHandler(sh)
    p.parse("file.xml")          # Filename, URL, or open file object

The following methods of a ContentHandler are invoked during
parsing.

c.setDocumentLocator(locator)

    Called by the parser to give the handler an object that can be used
    for location tracking. This is especially useful for reporting
    error messages and knowing where different SAX events occur.
    locator is a special Locator object that provides the following
    methods:

        locator.getColumnNumber()
        locator.getLineNumber()
        locator.getPublicId()
        locator.getSystemId()

    The getColumnNumber() and getLineNumber() methods return the
    approximate location within a file. The getPublicId() method returns
    the public name of an entity (if available). The getSystemId() method
    returns the system name being used. This is often a filename
    or URL.

Here is a very simple example of how these functions might be used:

    class SimpleHandler(saxutils.DefaultHandler):
        # Obtain a locator object
        def setDocumentLocator(self,locator):
            self.locator = locator
        def startElement(self,name,attrs):
            col = self.locator.getColumnNumber()
            line = self.locator.getLineNumber()
            pubid = self.locator.getPublicId()
            sysid = self.locator.getSystemId()
            print 'startElement (%d,%d,%s,%s): %s' % (line,col,pubid,sysid,name)
        def endElement(self,name):
            col = self.locator.getColumnNumber()
            line = self.locator.getLineNumber()
            pubid = self.locator.getPublicId()
            sysid = self.locator.getSystemId()
            print 'endElement (%d,%d,%s,%s): %s' % (line,col,pubid,sysid,name)
        def characters(self,data):
            print 'characters: ', repr(data)

Here is a little sample output to see what kind of values are returned:

    startElement (16,0,None,guac.xml): recipe
    characters: u'\n'
    characters: u' '
    startElement (17,3,None,guac.xml): title
    characters: u' '
    characters: u'\n'

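
As the sample output suggests, the text between tags can arrive as
several separate characters() events, including runs of whitespace.
A handler that needs the complete text of an element therefore has to
buffer the pieces itself. The following is one possible sketch,
assuming a current Python 3 standard library; the TitleText class and
the titles_in() helper are hypothetical, and the element name "title"
is borrowed from the sample output.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class TitleText(ContentHandler):
    """Hypothetical handler that gathers the full text of <title> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None              # None while outside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []            # start buffering

    def characters(self, data):
        if self._buf is not None:
            self._buf.append(data)    # may be called many times per element

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles.append("".join(self._buf).strip())
            self._buf = None          # stop buffering


def titles_in(xml_bytes):
    """Return the whitespace-stripped text of every <title> element."""
    h = TitleText()
    p = make_parser()
    p.setContentHandler(h)
    p.parse(io.BytesIO(xml_bytes))
    return h.titles


if __name__ == "__main__":
    print(titles_in(b"<recipe>\n  <title>\n  Guacamole\n  </title>\n</recipe>"))
```

This sketch does not handle a title nested inside another title; a
stack of buffers would be needed for that case.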