Here’s the final day of my stream-of-consciousness blog on the 2006 XML Conference in Boston.
Scholarship winning paper: Paolo Marinelli (and Stefano Zacchiroli)
Co-constraint Validation in a Streaming Context
Streaming processing relies on processing during parsing and, as such,
it’s event-based. The goal of this is, in part, to never build an
in-memory representation of the document (usually because of size
concerns). Streaming often provides higher performance for less memory.
Some of the work on streaming currently is in XQuery and XPath
evaluation, XSLT processors (from IBM and Intel), and streaming
validation with Xerces2 (among others).
Validation is the process of verifying that an XML document conforms to
structural constraints (conforming to a grammar), datavalue constraints,
and co-constraints (based on other parts of the document). Two examples
of co-constraints: an xs:element must not define both a @type and an
xs:complexType|xs:simpleType child (XML Schema), or an item is clothing
if @type="CL" and it isn’t if @type="NC".
Schema languages fall into two camps: grammar-based (DTD, XML Schema,
RELAX NG) and rule-based (Schematron, xlinkit). Grammar-based languages
have limited co-constraint support, whereas rule-based languages have
more support for co-constraints but limited support for structural and
datavalue constraints.
SchemaPath is a minimal extension to XML Schema with a new construct,
xs:alt, and a new type, xs:error. It has conditional type assignment,
which provides support for the definition of co-constraints. An example:
<xs:element name="element">
<xs:alt cond="@type and (xs:complexType or xs:simpleType)"
type="xs:error" />
<xs:alt type="ElementDecl"/>
</xs:element>
The presence of co-constraints raises some interesting problems for
streaming validation. “If x has a preceding sibling y its value must be
an integer; otherwise, if it contains y it must contain just that
element; otherwise, if it is followed by a sibling y its value must be a
decimal; otherwise x is invalid.”
    P           P           P
   / \          |          / \
  Y   X         X         X   Y
     123        |        2.3
                Y
SchemaPath’s approach to the problem (after providing “appropriate
support for the definition of co-constraints”): managing conditional
declarations and evaluating XPath conditions in a streaming fashion. The
outcome of the validation is the PSVI (type definition, validity).
However, conditional declarations may make it impossible to know the
current validity state at a given point in the stream.
<xs:element name="x">
<xs:alt cond="preceding-sibling::y" type="xs:integer" priority="3.0"/>
<xs:alt cond="child::y" type="CMY" priority="2.0"/>
<xs:alt cond="following-sibling::y" type="xs:decimal" priority="2.0"/>
</xs:element>
    P
   / \
  X   Y
 2.3
startElement(x) T = {CMY, xs:decimal, xs:error}
endElement(x) T = {xs:decimal, xs:error}, V = {valid, invalid}
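The narrowing of the candidate set T in the trace above can be sketched with a SAX handler. This is only a rough illustration, not SchemaPath’s implementation: it hard-codes the three conditions from the x declaration, assumes that the first matching alternative in the listed order wins, and only tracks candidate types (it never checks that 2.3 really is a decimal). The names CandidateTypes and trace are made up for this sketch.

```python
import xml.sax

ALL_OPEN = {"CMY", "xs:decimal", "xs:error"}  # nothing decided yet


class CandidateTypes(xml.sax.ContentHandler):
    """Track the set T of possible types for each <x> while streaming."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one frame per currently open element
        self.log = []    # (event, sorted candidate types)

    def _note(self, event, frame):
        self.log.append((event, sorted(frame["types"])))

    def startElement(self, name, attrs):
        parent = self.stack[-1] if self.stack else None
        frame = {"name": name, "seen_y": False, "pending": [], "types": None}
        if name == "x":
            # preceding-sibling::y is already known at the start tag
            if parent is not None and parent["seen_y"]:
                frame["types"] = {"xs:integer"}
            else:
                frame["types"] = set(ALL_OPEN)
            self._note("startElement(x)", frame)
        elif name == "y" and parent is not None:
            parent["seen_y"] = True
            # child::y settles a parent x that is still undecided
            if parent["name"] == "x" and len(parent["types"]) > 1:
                parent["types"] = {"CMY"}
                self._note("resolved(x)", parent)
            # following-sibling::y settles earlier x siblings still waiting
            for waiting in parent["pending"]:
                waiting["types"] = {"xs:decimal"}
                self._note("resolved(x)", waiting)
            parent["pending"].clear()
        self.stack.append(frame)

    def endElement(self, name):
        frame = self.stack.pop()
        if name == "x":
            if len(frame["types"]) > 1:
                # no child y was seen, so CMY is out
                frame["types"].discard("CMY")
                if self.stack:
                    # wait for a possible following sibling y
                    self.stack[-1]["pending"].append(frame)
                else:
                    frame["types"] = {"xs:error"}
            self._note("endElement(x)", frame)
        # this element closed with no later y: its waiting x children fail
        for waiting in frame["pending"]:
            waiting["types"] = {"xs:error"}
            self._note("resolved(x)", waiting)


def trace(xml_bytes):
    """Parse the document and return the logged candidate sets."""
    handler = CandidateTypes()
    xml.sax.parseString(xml_bytes, handler)
    return handler.log
```

Feeding it `<p><x>2.3</x><y/></p>` reproduces the sequence in the trace: three candidates at the start tag, two at the end tag, and a single type only once the y sibling shows up.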
XPath conditions are evaluated in parallel, with an event-based
evaluator for each condition (each reports either elementSatisfies or
elementWillNeverSatisfy). Now, reverse axes are still a problem (we can’t
start the processing when we hit the start tag because it’ll be too
late). To overcome this, each XPath condition is preprocessed, changing
reverse axes to absolute location paths. The evaluation of such paths is
done with an extension of XAOS (IBM Research), extended because XAOS
doesn’t support all forward axes, functions and operations, or location
paths within predicates.
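[Three diagrams, whose layout didn’t survive: (1) the XPath condition as
a tree: Root, a descendant step to x, and an “or” node under x whose
branches are a child step to a and an absolute path from Root with a
descendant step to b; (2) a sample document tree with nodes labeled
name, id, depth: /0,0 then r1,1 then x2,2 and x3,2, with a4,3 under one
of the x elements; (3) the resulting matching structure, linking Root,0
to x,2 and [x,3], each with its own “or” node (or,2 and or,3), with a,4
below and the second Root branch alongside.]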
XPath subsets:
1. Neither downward nor forward axes and no absolute location paths
(conditions just on preceding nodes, type at the start tag, validity at
the end tag)
2. Neither forward axes nor absolute location paths (conditions on
contained nodes, type by the end tag, validity at the start tag)
3. Complex XPath (conditions on following nodes, type and validity after
the end tag)
Rewriting an absolute path [apparently not that hard]:
/descendant::x[preceding-sibling::y]
        ||
        ||
        \/
/descendant::x[/descendant::y/following-sibling::node() == self::node()]
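To convince myself that the two expressions pick the same nodes, here’s a small brute-force check in Python. It’s a sketch only: ElementTree has no preceding-sibling or following-sibling axes, so both conditions are spelled out by hand, and the function names and sample document are made up.

```python
import xml.etree.ElementTree as ET


def with_reverse_axis(root):
    """x elements that have a preceding sibling y (the original condition)."""
    hits = []
    for parent in root.iter():
        seen_y = False
        for child in parent:
            if child.tag == "x" and seen_y:
                hits.append(child)
            if child.tag == "y":
                seen_y = True
    return hits


def with_forward_axes(root):
    """x elements that are a following sibling of some y (the rewritten one)."""
    followers = set()
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == "y":
                # everything after this y is one of its following siblings
                followers.update(id(node) for node in children[i + 1:])
    return [x for x in root.iter("x") if id(x) in followers]


doc = ET.fromstring(
    "<r><p><y/><x>1</x></p><p><x>2</x><y/></p><p><x>3</x><y/><x>4</x></p></r>"
)
same = [x.text for x in with_reverse_axis(doc)] == [
    x.text for x in with_forward_axes(doc)
]
```

The second function only ever looks forward from a y, which is what makes the rewritten form usable in a stream.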
Managing Content with the Atom Publishing Protocol
Andrew Savikas (O’Reilly)
The traditional publishing process was a reasonable workflow if you
only needed to lay plumbing in a few directions. However, as we move
forward and try to interchange a lot of formats (DocBook, FrameMaker,
InDesign, Word, OpenOffice.org), you get a mess of pipes rather than a
parallel pair. The value for the future has moved into aggregation as
much as publication, with less control over more input types. Think about
pulling in content from bloggers and the internet: you can’t enforce
your template and styles on stuff you’re pulling from disparate
sources.
How do we do this? Let’s build an “API for Content”. Make it easier to
publish/update content in a way that’s accessible from a variety of
applications (CRUD). We wanted to be consistent, but not too restricted,
standards-based, modular and scalable, and with content validation (is a
.png really a PNG?).
Initial attempts included a consistent folder system, FTPing ZIP files,
consistent naming, and registered listeners for notifications. However,
the Atom Publishing Protocol seemed well-suited for our problems (it’s
not just for bloggers). CRUD is already built into APP by using HTTP
verbs. Listeners are already taken care of by an Atom feed.
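The CRUD-to-verb mapping is simple enough to sketch in a few lines of Python. This is an illustration, not O’Reilly’s code: the example.org URLs are placeholders, build_request is a made-up helper, and nothing is actually sent over the network.

```python
from urllib.request import Request

# CRUD operations and the HTTP verbs APP uses for them
VERBS = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}


def build_request(operation, url, body=None, content_type="application/atom+xml"):
    """Build (but do not send) the HTTP request for one CRUD operation."""
    headers = {"Content-Type": content_type} if body is not None else {}
    return Request(url, data=body, headers=headers, method=VERBS[operation])


# create: POST a new entry to a (hypothetical) collection URL
entry = b"<entry xmlns='http://www.w3.org/2005/Atom'><title>Chapter 1</title></entry>"
create = build_request("create", "http://example.org/collection", entry)

# delete: DELETE the (hypothetical) URL of an existing member
delete = build_request("delete", "http://example.org/collection/1")
```

Uploading an image or a PDF would look the same with a different content_type, which is where the “is a .png really a PNG?” check would have to happen on the server side.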
Producers now push content to the APP, where it’s held in an XML
repository or a media repository and then pushed to the Atom feeds. The
four uploadable/validatable formats: DocBook 4.4 (because we didn’t see
the value in our own customizations), XHTML (for online content),
images, and PDF. The Atom feeds coming out of the end go to different
customers: SafariU, Safari, international offices. Each feed has a list
of the content with links for modifying/downloading it and some
metadata.
Centralizing on APP allows us to shift the customization to the
application rather than upstream. Each application knows it needs to
plug in to 120V AC and transform it to the 48V DC it needs (which means
that the other applications don’t have to be downgraded to 48V just for
the lowest-common-denominator effect).
We’re trying to work in feedback to the system by adding value after the
content has been uploaded (perhaps updating the metadata with sales
data).
Tradeoffs and challenges:
Downsides:
- There is some duplication of code across applications
- Still some need for a separate CMS (this doesn’t solve versioning and workflow problems)
- Links to other content challenging (references, Atom IDs)
- Applications can only rely on basic (minimal) validity
Upsides:
- Applications can rely on basic (minimal) validity (both in schema and mime-type) [this isn’t part of Atom, it’s something he added]
- Eliminate “highest common denominator” effect (where annoying applications can constrain things unduly)
Meta-stylesheets
Michael Kay (Saxonica)
Because XSLT is an XML vocabulary, we at least get the benefits of XML
to deal with the pain of typing a verbose language and a lot of angle
brackets.
UI Management
An online banking application needed 400+ different output screens
(which were all very similar). The requirement was that there were
multiple “skins” (for branding) and support for multiple languages.
Business rules should be flexible (for promotions, perhaps).
Business rules were encoded in XML:
<rule name="max-savings-withdrawl">
  <cond when="current-bal > 1000" value="current-bal * 0.5"/>
  <cond when="current-bal &lt; 1000" value="current-bal"/>
</rule>
The classical approach uses xsl:import, with one stylesheet per output
screen and common components in a library. However, the drawback to this
approach is that the granularity of reuse is too low, call-backs aren’t
simple, call-template is annoyingly verbose in XSLT1, and it’s hard to
handle multiple variants in multiple dimensions (adding language to
skins, for example).
One approach (had it been available) to fight some of these drawbacks
would be FXSL. It provides first-class functions (which fixes your
callback problem), but this demands an understanding of functional
programming.
The final, adopted solution:
Screen descriptions (400)      Style descriptions
                 \               /
                  \             /
                   \           /
Message             \         /
Translations------------META-STYLESHEET--------Business-Rules
                  ______/     |     \______
                 /            |            \
                /             |             \
          Deployed         Deployed         Deployed
          Stylesheet       Stylesheet       Stylesheet
The drawback of using the META-STYLESHEET was that the screen
descriptions were essentially a home-brewed programming language and
were poorly documented.
Excel to XML Conversion
They started with a spreadsheet, took that raw XML and sanitized it,
then made it semantic. This company had a lot of spreadsheets, but they
were fairly well designed. What happened was that each new type of
spreadsheet defined a whole new complete pipeline rather than reusing
existing work (at the last stage of transforming the sanitized XML into
semantic XML).
Their solution was a mini transformation language (XLEX). It describes
the structure of the spreadsheet and the desired mapping.
<sheet nr="1">
  <element name="sales">
    <frameset repeat="down" height="1">
      <element name="region">
        <frameset repeat="across" width="1">
          <element name="month">
            <value x="1" y="2"/>
          </element>
        </frameset>
      </element>
    </frameset>
  </element>
</sheet>
Sanitized XML
XLEX description =============> Semantic XML
                     /
                    /
           XSLT interpreter
This solution is slow (because of the interpreter), so they’re planning
on compiling the XLEX description into XSLT:
XLEX -> XSLT compiler -> Generated Stylesheet------\
                                                    ===========> Semantic XML
Sanitized XML -------------------------------------/
Report Generator
An Orbeon pipeline to allow custom reports: XForms processor (select what
you want selected and displayed) -> XSLT transform (generating both the
query and the later display transform) -> XQuery (talking to Tamino,
doing data selection only) -> XSLT transform (for high-performance
display) -> XHTML. “I’m a great believer in pipeline processing.” The
XForms are actually server-side (HTML and JavaScript for the user).
Schema Generator
How to get generic types into small subtypes (a generic phone number
into a UK phone number). Often, the differences are relatively small but
are carried through from parents down to children. Automating the
generation makes changes in the generic schema less painful (you just
regenerate your own schemas). However, the generic schema needs to be
written inside some specific constraints.
Generic
message
schema
|
|
Constraints (codesets)--------XSLT
|
|
Specialized
message
schema
Observations
- When generating stylesheets use xsl:namespace-alias (it’s not that frightening, really)
- XSLT and XML Schema are, of course, quite namespace-sensitive (make sure your declarations and prefixes are correct)
- Unless you’re sure that source schemas follow coding restrictions, don’t read or analyze them
- When generating XQuery, just pretend it’s XML and use character maps to get around that lie (the pain is in the <) [or XQueryX]
- “Pipelining is invaluable for assembling the components (looking forward to XProc)”
- XQuery is good at inspecting stylesheets but bad at transforming them (namespaces and no identity transform both hurt)
- This isn’t the highest-performance technique
- Schema-awareness helps debugging (it’s worth it to write the schema beforehand)
- XQueryX will give the ability to manipulate fine-grained path expressions
- Wouldn’t a standard evaluate() function be really useful?
- XPath with no namespace context dependencies would be nice, too
- XProc will hopefully allow dynamically-generated pipelines
- Code generation really works well in the XML environment (and has a long history)
Panel: DITA
Alan Houser (Group Wellesley)
Introduction to the Darwin Information Typing Architecture
DITA was developed (in the late 90s) by IBM as a successor/replacement for IBMIDDoc and a
way to get out of a book-centric model. It was donated to OASIS and 1.0
was finalized by the DITA Technical Committee in February 2005
(approved by OASIS June 2005). DITA 1.1 is currently under development.
New stuff being worked on for 1.1: book-like structures and
metadata/attribute-based filtering.
Internationalization, localization, and translation have become more
complex and pressing business needs, which is one area that DITA was
intentionally designed to support.
DITA, as an architecture, is much more than a set of tags. Adopting XML
publishing often means developing an information model, an authoring
tool and workflow, a publishing component, and content management. DITA
tries to offer a lower [-er important] barrier to entry by addressing all of these XML
publishing needs.
Publishing minimalism showed that users need task-oriented documentation
rather than something to read start-to-finish.
DITA Design
- Topics
- stand-alone and reusable information units
- Information Typing
- task, concept, and reference types out of the box
- Specialization and Generalization
- ability to specialize into domains (contentious)
- Attribute-based Formatting (not element-based)
- support of specialization and customization
- Maps
- a rich set of sequence|navigation|hierarchy|group-based navigation stuff
- Metadata-based Filtering
- at build time decide whether to exclude or annotate metadata (for different output types)
- Content Re-Use
- explicit support at topic, block, and word/phrase (anything with an id)
Where to Start
Vendor support is growing (slow to start but speeding up). The DITA Open
Toolkit is a de facto reference implementation.
“Porn is still more popular than DITA”
Sean Angus (XyEnterprise)
The story of Research in Motion (Blackberry[?]) adopting DITA. They hoped to increase
the number of products and variants, allow customized information,
accelerate time-to-market, reduce operating costs [never heard that
before…], and support localization. DITA made sense because it’s
well-supported (DITA-specific vendor support increasing), an open
standard (reduce administration costs/time), and gives the ability to
re-use content. DITA was replacing FrameMaker (not structured), which
wasn’t scalable with the number of new variants.
Challenges in transitioning to DITA revolved around keeping production
readiness (they didn’t want downtime), content analysis before content
conversion (“is this a topic or a reference?”), new management
requirements, training and education (XML, DITA, and a new toolset,
all starting in the content conversion phase), and a cultural shift
(thinking in terms of reusable content).
Results:
- Administration easier
- Automated publishing (mostly) of different formats and languages
- Translation management has probably provided the best ROI (with the topic-level focus)
Cost savings for translations are sometimes “as high as 75%”. “Productivity improvements
of 20%” (get tech writers to author content, not struggle with DTP). “ROI
in 14 months.”
Scott Hudson (Flatirons Solutions)
Interoperability with DocBook and DITA (Why Can’t We Just Get Along?)
DocBook and DITA are fraternal twins. DocBook is large, well-documented,
book-centric, and has multiple outputs. DITA is designed for
interoperability, topic-centric, built around re-use, and also has
multiple outputs.
Which is the “best” standard? “Stop it, and can you just get along?” Why
should interoperability even matter? Content exchange with other
companies or parts of your own company may necessitate it.
Alternatively, interoperability is important during a transitional
phase, and a lot of tool support is built for DocBook (that isn’t there
for DITA).
[Discussion of the mapping between DITA and DocBook elements with
pictures. They’re pretty similar.]
DocBook and DITA are not interoperable. However, both standards specify
ways of customizing the standard (perhaps toward the other).
To get interoperability, you could:
- specialize your DITA elements with DocBook names (to help DocBook authors) (what a pain)
- likewise, DITA elements could be added to DocBook (bah, not DocBook)
- write round-trip transforms (and allow references back and forth) (but now you don’t have any validation and you’ve got a lot of transforms to manage)
- A better way: transform content into a neutral interchange format (reducing the number of transforms), which would allow you to play nice with ODF. Called: DocStandards Interoperability Framework
Flatirons is working on the interoperability framework, and is getting
near the end of the element mapping. Follow their white papers (on their
site) and the list at OASIS: docstandards-interop-tech@lists.oasis-open.org
John Hunt (IBM)
[Makes a joke out of the fact that even the XML geeks don’t like seeing
the slides as XML source (rather than showing them in a pretty way). He
developed a DITA specialization for slideshows. “What about the styling
and the presentation?” In this case, he writes another DITA document to
describe how the other DITA document (the actual slide content) should
be laid out. This “maintains the separation of form and content”. Now he
uses the DITA Open Toolkit (with a few setups) to do the two-step
process necessary to generate some pretty HTML-Slidy (linked below). Now
we have our pretty presentation.]
Lessons from the above shtick: DITA maintains XML best practices to
separate presentation and content, is semantically-rich, and CSS can
provide a pragmatic styling strategy (attach style properties based on
content selectors) [yeah, but CSS selectors sure ain’t XPath].
This proof-of-concept uses a DITA domain specialization to extend
elements (p, say) to support child data elements (<p><text-indent
value="2em"/></p>). Content topic + Policy (formatting) topic = pretty
(maybe). To make this work for real users, it would need an extensive
“starter kit” with reasonable defaults (to get users up and running
without customization necessary upfront).
Relational database integration with RDF/OWL
What is an RDF/OWL ontology? It’s something we can encode and then use
to describe metadata about resource classes and (in particular) their
relations. OWL is a W3C update of DAML+OIL and there’s quite a bit of
government work in this area. It’s a good fit for AI work, but also
extends to stuff for regular folks, and is showing up a lot in
discussions of the Semantic Web (where every person was developing their
own ontology). “If they build an ontology, people will use it” is the
“Field of Dreams attitude.”
RDF is a data model (not a syntax), composed of triples (Subject,
Predicate (attribute name), Object (attribute value)). An example:
urn:isbn:0554313113, http://purl.org/dc/elements/1.1/creator, “Herman
Melville”. This type of solution is great for loosely structured
data, but…
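The triple model is easy to play with in plain Python. This is a toy sketch, not an RDF library: the first triple is the example from the talk, the second one is made up for illustration, and match is a hypothetical helper.

```python
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

triples = [
    ("urn:isbn:0554313113", DC_CREATOR, "Herman Melville"),  # from the talk
    ("urn:example:book-2", DC_CREATOR, "Example Author"),    # made up
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# everything Herman Melville is the object of
by_melville = match(triples, obj="Herman Melville")
```

Leaving all three positions as wildcards returns every triple, which is what the `?s ?p ?o` pattern does in SPARQL.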
Integrating RDBMS with RDF/OWL
Use case: different address book databases with different structures
and field names, where I want to combine the information in them (getting
more value out of each of them). To do this demo: generate some data,
load it into MySQL, tell D2RQ (an RDBMS/RDF interface server) about the
databases, get a dump of the RDF data, create an ontology for that data,
and finally issue ontology-aware SPARQL queries against that data.
To generate the data in this case, he just cranked up Outlook and Eudora
and filled out all of the fields (just to find the structure), then used
Python to generate some reasonable data.
Load it into MySQL with basic CREATE DATABASE (hint: D2RQ makes
you define a primary key).
Tell D2RQ about the databases with generate-mapping (a D2RQ
command line utility), combine them, and start the server with the
combined mapping file.
Get some data to start ontology creation with a simple SPARQL query:
CONSTRUCT { ?s ?p ?o }
WHERE { ?s ?p ?o }
Use some XSLT to help cat the RDF together, where I can pass in the RDF
I care about…:
<rdfcat xmlns:xi="http://...">
  <xi:include href="myfile1.rdf"/>
  <xi:include href="myfile2.rdf"/>
  <xi:include href="myfile3.rdf"/>
</rdfcat>
[Demo of Ontology editing with SWOOP. One trick: You can ask it to open
an “ontology” which is actually just straight data (with no OWL
properties or metadata) and it defines some properties about your data
(already creating a lot of ontology for you). Save this.]
After SWOOP did the easy part, add more value to the ontology
(BusinessState in Outlook is equivalent to Eudora’s WorkState) (make a
generic phone that’s a superproperty of BusinessPhone,
PrimaryPhone,…). If you define email as an “inverse functional property”
it’ll combine all the records with the same email as the same person.
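What the reasoner does with those two declarations can be imitated in plain Python. This is a sketch of the effect only, not of OWL inference: the records, the "email" field name, and merge_by_email are invented for illustration, and on conflicting values it simply keeps the first one seen.

```python
# equivalent properties: Eudora's name -> Outlook's name
EQUIVALENT = {"WorkState": "BusinessState"}


def merge_by_email(records):
    """Treat records sharing an email as one person (inverse functional)."""
    people = {}
    for record in records:
        person = people.setdefault(record["email"], {})
        for field, value in record.items():
            # fold equivalent property names into one before merging
            person.setdefault(EQUIVALENT.get(field, field), value)
    return people


records = [
    {"email": "pat@example.org", "BusinessState": "NY"},                          # Outlook-style
    {"email": "pat@example.org", "WorkState": "NY", "PrimaryPhone": "555-0100"},  # Eudora-style
    {"email": "lee@example.org", "WorkState": "CA"},
]
people = merge_by_email(records)
```

After the merge, a query on BusinessState finds both people, whichever address book their record came from.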
Now take the new ontology and use rdfcat to add the ontology file to all
of the cat’d RDF (apply it to everything).
Issue queries (with the tool Pellet):
PREFIX e: <http://...eudora>
SELECT * WHERE {
?s e:entries_workState "NY"
}
Symmetric relations (think spouses) can make the data quite interesting
(if the husband doesn’t have a home phone, give me the wife’s) (try to
do that in Eudora).
One caveat: you dump all of the data to a disk file, which probably
isn’t scalable.
XSL-FO 2.0: Laying Out The Future
Liam Quin (W3C) on new stuff in XSL-FO 2.0
XSL-FO is an XML vocabulary for formatting documents for both print and
screen, primarily driven by content (like books) with master pages.
You’ll probably generate XSL-FO from arbitrary XML using XSLT. There are
probably about 20 XSL-FO implementations (but some are pretty quiet),
with a new one appearing every 3 or so months. Many XML editors support
XSL-FO.
XSL-FO 1.1 is a W3C Recommendation as of this week [hooray!] and
implementations are already shipping.
In September 2006 they held a requirements workshop in Heidelberg with
40 people [!] to talk about the way forward for XSL-FO 2.0. There were
implementers, educators, documentors, users, and potential users.
Technical work will start early 2007.
What should they do? XSL-FO 1.x only explored a part of the (very
large) problem space, and people in Heidelberg were surprised that the
original (historical) XSL requirements document still had a lot of good
stuff. Specifically: a Pull Mode (“layout-driven formatting”),
non-rectangular text areas and advanced layout (thanks SVG, think
magazines), “copyfitting” (with try tables or constraint sets: “make
this stuff fit, and shrink X to do it”), joint work with the SVG WG,
improved/advanced Japanese/Chinese/Mongolian text support (a good thing,
as it indicates use), and explicit MathML support (this may apply to a
lot of general typographic needs).
[I’m taking off back to San Francisco now]

