Here’s the continuation of my stream-of-consciousness blog on the 2006 XML Conference in Boston. The conference is running
four concurrent tracks: Publishing, Enterprise, Web, and Hands-On. I’ll
be jumping around between different tracks and trying to give a quick
summary of each session.
[Update: Eric van der Vlist wrote about the panel described below]
Darin McBeath (Elsevier)
The first custom LexisNexis search hardware terminals were small and
based on a very flat markup. Then, Mead Data Central was acquired by Reed
Elsevier (which was very interested in STM and SGML). In 1997, the
core of Elsevier’s scientific publications (with good SGML markup) became searchable with the
LexisNexis search tools on the web. By 2001, Elsevier was embracing Web
Services (SOAP), with core functions of Search, Browse, and Retrieve
abstracted to access a central repository. Standardization around XML
with componentized DTDs, common element pools, and publicly published
DTDs was moving forward by 2002. They’re still using DTDs, not Schemas
(and developed their own internal validation tools). Today, new products
are XML-based and use WS. Partner/customers are becoming more important
when developing these WS.
Informal Publisher Survey
Are people really using XML in Publishing? A short survey was sent to
publishers anonymously. Responses:
- Which 5 technologies have had the greatest impact on publishing in the
last 10 years?
- XML, XSLT, PDF, Java, the Internet
- How and where is XML utilized by your organization?
- Structuring content, messaging between systems, application
configuration, standards adherence
- What are the biggest challenges that XML poses for publishers?
- Migration of proprietary content, performance, retraining,
over/misuse, industry acceptance of XML standards, XML knowledge is
confined to the technical groups
- What are the main benefits of XML?
- Portability, not proprietary, human readable, repurposing, standards
- What are the main weaknesses of XML?
- Namespaces, XML Schema is overly complex, everyone can do it,
external entity resolution, evolving standards, performance
- Is SGML still used within your organization?
- Legacy application, Editorial applications
- Do you feel your organization truly leverages the power of XML?
- Not everyone is convinced of the benefits, legacy inertia, costs
- Is XQuery a tool that enables publishers to unleash the true power of
their content?
- Yeah “enables”–but we’ve barely scratched the surface, we’re now able
to mix data and visualize in a totally new way, XQuery brings a whole
range of new capabilities and for the first time enables publishers to
‘database’ their content and effectively use it
- doGet(), doPut()
- calls doXQuery to execute a query and return the result
- Uses findXQuery to get a prepared query, binds URI
parameters to XQuery parameters, returns the result
- Finds prepared queries or prepares them
- removed as many XPath statements as possible (auto-labeling was especially slow)
- removed as many elements as possible
- removed all unused attributes
- reduced the data footprint (shorter elements)
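The flow in those bullets can be sketched roughly as follows — a minimal Python sketch, where all the names (prepare, find_xquery, do_xquery, the cache) are my own reconstructions, not the actual implementation:

```python
# Hypothetical sketch of the servlet flow described above;
# none of these names come from the presenter's real code.

QUERY_CACHE = {}  # prepared queries, keyed by query name

def prepare(name):
    """Stand-in for compiling a .xquery file into an executable query."""
    return lambda params: "<result query='%s' params='%d'/>" % (name, len(params))

def find_xquery(name):
    """Find a prepared query, or prepare and cache it."""
    if name not in QUERY_CACHE:
        QUERY_CACHE[name] = prepare(name)
    return QUERY_CACHE[name]

def do_xquery(query, params):
    """Execute a prepared query and return the result."""
    return query(params)

def do_get(query_name, uri_params):
    """Bind URI parameters to XQuery parameters and return the result."""
    return do_xquery(find_xquery(query_name), uri_params)
```

The cache mirrors the “finds prepared queries or prepares it” bullet: the first request for a query pays the compilation cost, later requests reuse it.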
The Silver Bullet for Publishers
Can XQuery (and XML databases) be the silver bullet for publishers, and is XQuery ready for prime time?
What is XQuery?
XQuery is a Query Language (an SQL replacement), an Integration Language, a Transformation
Language, a Full-Text Search Language (not in the current spec), and a
Programming Language (no current update mechanism). There are 9 XQuery
talks at the XML Conf this year, up from only 3 last year. Saxonica is
doing XQuery transformation, hybrid XML databases include IBM and Oracle,
native databases are available from MarkLogic and eXist, and DataDirect is doing data integration.
XML databases and XQuery make sense for document-centric
applications–XML->XML->XML. They let you avoid predefining a
granularity (you can search at the <article> level, the <p> level, or anywhere in between).
An example using XQuery is O’Reilly
Labs, where content analytics are publicly available (lines of code,
# of figures, common word tag cloud). Another example in publishing is
custom publishing portal SafariU, where
chapters or sections from a corpus of hundreds of books can be combined
into a custom aggregate.
SCOPUS is another example product built on XQuery and 50 million pieces
of content. JetBlue uses XQuery to handle its internal manuals, and
Oxford Press offers search functionality via XML databases.
Web Services can be implemented with the XQuery doing the SOAP message
processing, routing, and request processing without serializing anything.
Mind the gap between the DTD people and the cutting-edge programmers.
Mashups (and Web 2.0) are here to stay. Realize that people are going to
come to your content from other directions than your front door (Google
Book Search). There’ll be a continued focus on markup.
Douglas Crockford (Yahoo!) presents JSON (http://www.json.org), and Simon St. Laurent says
it’s sad to see that we’ve not really realized XML in one of the core
places it was supposed to work best (in the simple XML web).
Ajax, with an alternative to page replacement, needs a good way to
deliver its data. The first solution was ad hoc, then a database model,
then a document model, finally there were some attempts to use a
programming language model. JSON is one of the latter. JSON is
a subset of ECMA-262 3e and works with a lot of programming languages.
It’s not a document format, not a markup language, and not a general serialization format.
“Discovered” by Doug in 2001 while at State Software doing what would
later be called Ajax applications. (Others had been doing similar things
before this). 2002 brought JSON.org, with a web page describing the
format. There’s been no explicit promotion since then–it’s only been
adopted because people were having the same problems and liked this
better. Ajax in 2005 made JSON hot, and 2006 brought an RFC.
JSON Basics and Structure
Its ancestors include NewtonScript (”which, you’ll remember, will be part of a trillion dollar industry”). Its values are Strings, Numbers, Arrays, Booleans, Objects, and Null.
Strings are 0 or more Unicode characters, with no separate character
class. Numbers include integers, real or scientific, but no octal or hex
(for simplicity), and no NaN or Infinity. Two booleans: true, false.
Null is a value which isn’t anything. Objects are where it starts to get
interesting: an unordered container of key/value pairs. Keys are strings
and values are JSON values (which maps well to structs, records,
hashtables, etc). Whitespace is allowed between tokens to allow
pretty-printing. Finally, an array is an ordered sequence of values,
with the initial index number (0/1) not specified.
For PHP users confused when to use arrays vs objects: use objects when the keys are strings, use arrays when the keys are integers.
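Seen from any language with a JSON library, the mapping is direct: objects become string-keyed maps and arrays become ordered lists. A quick Python illustration (the data is made up):

```python
import json

text = '''{
  "conference": "XML 2006",
  "tracks": ["Publishing", "Enterprise", "Web", "Hands-On"],
  "day": 2,
  "sold_out": false,
  "notes": null
}'''

data = json.loads(text)
# The object becomes a dict (string keys), the array becomes a list
# (ordered, integer-indexed), and true/false/null map to True/False/None.
```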
application/json is the registered mime type. JSON is strictly
Unicode, versionless, and very stable (anticipate no changes to JSON).
Some supersets of JSON include YAML (a YAML decoder is a JSON decoder)
and languages having JSON built in.
“JSON is the ‘X’ in AJAX.” JSON data is built into the page, and is
much shorter than many other data interchange formats. Use
parseJSON() to evaluate it rather than eval() (which has no security).
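The same parse-don’t-eval principle applies outside JavaScript too; a Python sketch of why a dedicated parser is the safe choice:

```python
import json

# Data from the wire should go through a real JSON parser, which can
# only ever produce data structures...
untrusted = '{"user": "monolo", "admin": false}'
data = json.loads(untrusted)

# ...whereas eval() would execute whatever the sender put in the string,
# e.g. something like '__import__("os").system(...)'. (Python's eval
# would also reject JSON's bare false/null literals, so it isn't even
# a working shortcut here.)
```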
Instead of dynamic script tag hacks and hidden iframes, Doug is trying to
get a new JSONRequest functionality into browsers to offer a safe way of
doing mashups and getting around the single-source requirements in
regular JS.
Slides from this talk are available here [ed: cool].
Jason Hunter (MarkLogic) [ed: full disclosure: I’ve worked with this guy
before] is doing his session to a huge crowd. This is the first time
he’s given this talk [cool]. He’ll try to distill down his experience
from publishers (Elsevier, O’Reilly, Thomson, Oxford…).
How is web publishing changing the web and how is the web changing web publishing?
Increasing content sizes, with more public content, more private
content, and more government content. Users’ expectations are increasing.
They want easy, immediate, and searchable access to all content (thanks,
Google). Google Book Search already has screenshots of pages and
searchable text. “This is more valuable than a printed book.” The nice
part of this technology also allows users access to rare books as easily
as mass-market texts (he gives an example of an Origin of
Species website with an original edition page scan next to the transcribed text).
People are standardizing around XML, partly with the hope that tools
would be ready for them once they’d switched. Thankfully, XML processing
capabilities are expanding and tools are evolving.
Answers not links (they aren’t good enough anymore). They also want to
be one step away from answers, not two (the answers on Google, not in
the link). As a publisher, you want to do this as well (hint: leverage
your XML’s structure). This will give you the big step above Google
Book Search (which only has flat text). Like the keynote, Jason gave the
O’Reilly Labs site as an example,
where specific categories, publication year, and other fields can
help differentiate search results (beating GBS). Another example uses
“the most beautiful markup I’ve ever seen” from Oxford Press, where
bibliographic entries can be boiled down to an at-a-glance page.
Next, Sweat the Content. There are two ways to make more money: either
create more content or do more with what you have. This can be
done as microproducts, where content is sliced into tiny pieces based on
readers’ very specific interests. One example: filter all of Oxford’s
African American History content and deliver it to the specialists (who
wouldn’t want to buy the rest of that content). Another is an Elsevier
product to do citation analysis (”should we give this person tenure?”)
to show how often and where writers are cited. Custom publishing [again
mentioned at the keynote this morning] is another place where content
can be sweated, as SafariU shows.
Another trend is Content in Context: learn interesting things from the
location of text, situation of the users, and historical context. Can
you search for a medical condition near the words “contraindicated” and
“can kill you” to help me decide whether to prescribe a drug? Will you
weight the results more strongly if it’s in the Conclusion section?
Emphasis on Google, as they’re now on everyone’s mind. There’s a fear there:
you own the content and want to own the users, but publishers have
an opportunity for personalized landing pages, better search (thanks,
markup), and instant AdWords registration (when people start searching
for new words, register them for cheap before others realize).
Everyone wants User Participation, with blogging, reviewing, and tagging
for free from the users. Nature is now experimenting with user-centric peer
review. Additionally, search and guidance can evolve with help based on
collective intelligence (”Did you mean …?”).
Users are now demanding Personalized Access. Give them a customized RSS
feed just for them. Also, why not search only the books that I tell you
are in my bookshelf?
Personalization also applies to Custom Learning, where Editors can reuse
and repackage content. The content can be chunked into object-driven
sections to allow future personalization.
XML also gives you the ability to Leverage Your Structure. This can help
the search, of course, but also browsing (break the content at the right
level), special views, historical analysis, and for printing actual
books. One example of a special view is a Press View, where pictures get
special prominence (each given with some context and their captions),
just some short summary text, and later letters to the Editor about it.
What to do if you don’t have rich content? Enrichment: adding structure
with people, places, things, etc. “Tell me every time Captain Kirk
talks to an alien named _____”, once you’ve got <person>
and proximity (within 100 words…).
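A naive word-window version of that proximity test can be sketched in a few lines of Python; a real XML database would run this against <person> markup and full-text indexes rather than raw strings:

```python
def near(text, term_a, term_b, window=100):
    """True if term_a and term_b occur within `window` words of each other.

    A toy illustration of proximity search: split into words, find the
    positions of each term, and compare the distance between positions.
    """
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)
```

For example, `near("Kirk hailed the passing alien vessel", "kirk", "alien")` is true at the default window but false at `window=2`.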
Content Analytics: how many million paragraphs in all O’Reilly books?
What are the most common index terms in Category:XML? [Elements, XSLT, …]
And, outside of any technology, there’s a lot of focus on Agility.
People are feeling their way forward and have dreams, but aren’t sure
what will work and what won’t. Everything must be agile in design,
development, and in business. “The Red Queen Paradox: You must
run as fast as you can just to stay in place.”
Joel Amoussou (Efasoft)
What is JSR 170: Java Content Repository API (JCR)? It is not a
CMS API. What’s the difference between a CMS and a Content Repository? A
CMS is usually a layer on top of a content repository that adds on
versioning, administration, authentication, and display templates.
Many industries are now trying to integrate their CMSes and their other
systems (CRM, others). To do this, we need a well-defined model and API.
Right now, there are as many as 800 different CMSes available. JSR 170
provides both a repository model and an API rather than 100s of
different APIs. In other words, we need an equivalent of JDBC/SQL to
query content repositories.
A main goal of JSR 170 is cross-repository integration. This allows
library services (check in/out), exchange between repositories,
aggregation, and observation (notify me when something is updated,
general subscription feeds).
JSR 170’s repository model is very hierarchical (DOM-like). It
[hopefully] removes the proprietary tie-in to particular (compliant)
repositories. Jackrabbit is an open source implementation, there are
already Eclipse plugins, and a TCK (test suite) is available to check
vendor compliance. JSR 170’s DOM-like model uses a slightly extended XPath 2.0 [!] to
navigate the model.
Two levels of compliance are specified. Level 1 is read-only, but still
structured, with property types and node types, with XPath available as
a query language. Level 2 adds write compatibility, with referential
integrity and access-control. Optional features include versioning, JTA
support, observation, SQL queries, and locking.
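The hierarchical model is easy to picture as a tree of nodes with properties, addressed by slash-separated paths. A toy Python sketch of the idea (this is not the javax.jcr API, just an illustration):

```python
class Node:
    """A repository node: a name, properties, and named children (DOM-like)."""

    def __init__(self, name, properties=None):
        self.name = name
        self.properties = properties or {}
        self.children = {}

    def add_node(self, name, properties=None):
        """Add and return a child node."""
        child = Node(name, properties)
        self.children[name] = child
        return child

    def get_node(self, path):
        """Resolve a slash-separated path like 'site/articles/a1'."""
        node = self
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node

# Build a tiny repository and address a node by path.
root = Node("")
articles = root.add_node("site").add_node("articles")
articles.add_node("a1", {"title": "XQuery for Publishers"})
```

Path-based addressing like this is what lets an XPath-style query language navigate the repository the way it would navigate a document.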
There are already native JSR 170 implementations (CRX, Jeceira, eXo,
Apache Jackrabbit), compatible OSS implementations (Magnolia, Jahia, Lenya,
jLibrary, JBoss Portal), and commercial implementations via connectors (IBM, WebLogic, Oracle).
JSR 170 started out Java-only, but is being ported right now to Python,
PHP, and .NET [look at the Jackrabbit sites for links, maybe?].
Michael Wechner (Wyona)
A presentation of what needs to be changed in the CMS space.
Each CMS is designing their own UI. This is silly, as the UI could be
standardized (mostly) like email clients. Yulup is a small extension for
Firefox that interacts between the CMS server and the user. A page can
be opened in Firefox, edited in a simple editor, and saved on the
server. The goal is decoupling the server from the UI. Perhaps this is
the sort of thing Berners-Lee was thinking about as the “writable web”?
Interacting with these servers in a simple, offline-possible way that
doesn’t need FTP or WebDAV for transfers would be quite nice,
especially if the user could edit in a slightly-modified copy of Word.
Yulup, one client, discovers that pages are editable via auto-discovery
and introspection (thanks Atom). The client sees the link when the (page
in this case) is loading, follows it, and begins a discussion with the
server about shared abilities (what is the union of the server’s and the client’s capabilities?).
But why invent a new protocol? Well, although Atom is growing more
popular in different spheres, it’s not really intended for this space.
WebDAV is old, but many implementations are broken. Rather than just
discussing endlessly, let’s try to influence other implementations by
creating our own opinionated implementation.
Yulup isn’t constrained to just XHTML, it can also edit generic XML or
wiki-text. There’s already some authentication available and locking
on checkout. In addition, it has an Atom [APP, wahoo!] interface (auto-discovery).
How does XML fit or not fit into future Web (2.0 or lack thereof)? The
panel is a discussion between four experts: Elliotte Harold (Polytechnic University), Eric van der Vlist (DYOMEDEA),
and Jason Hunter (MarkLogic) and is being led by Simon St. Laurent (O’Reilly)
[ed: Not sure how much of this discussion I’ll actually transcribe]
JH: As always, business needs are pushing new technological needs and, in
a pretty circle, these new technologies are opening new business
needs. Also, Web 2.0’s pushing people to open their APIs is actually
quite a nice feature (that wouldn’t have happened before).
EH: The question is whether Web 2.0 should actually be based on XML.
Many people find XML too complex and hard to use (and don’t see the
benefits): JSON, HTML5, “binary gook”. Much of this resistance seems to
be tunnel vision (only looking at our own, narrow problem domain). If
you really can rely on the programmer on the other end of your
application’s wire being you or a friend of yours, you may not need XML.
“As long as nobody’s talking to anybody else, you can get away with
these things [ad hoc solutions specific to a problem domain]”. XML and
WS are all about talking to people you don’t usually talk to: people on
other platforms, in other languages, across databases, and, most
importantly, across areas of expertise. Web 2.0 isn’t about the “small
contained applications” of the last 50 years. It’s about
unconstrained, unplanned, unexpected convergences of domains. This
requires maximum flexibility in tools (there’s always the guy hacking in
Emacs…). The only data format for this type of flexibility is XML.
Demand XML for all your Web 2.0 data interchange.
EvdV: We’re now moving into a phase of innovation (this has happened
before, it comes in cycles–think of 10 years ago with XML). We’ve
consolidated XML so much that it’s impossible to move XML forward (look at the
difficulty of shifting between XML 1.0 and 1.1). “We need a new
innovation phase and I hope that Web 2.0 is this new phase.” Will Web
2.0 actually be about XML? It may not even be really important? What’s
important is to get XML on the web.
SStL: I hope that we will actually see some real innovation (after
hearing for years that everything cool in XML is actually just SGML).
But we have, indeed, come from the era of the county village, where
everyone knew everyone else. We no longer even have face-to-face or
postal interaction with our customers, even though we’re still getting
fantastic things done (even though we don’t know each other). I see many
things splitting (XML doesn’t have to be everything for everybody; go
look at JSON, data people!). Using XML at every stage may not matter
for the actual business.
Does the 2.0 number matter if it’s always “perpetual beta”?
JH: Yeah, it just needs to show that it’s a new innovation. It’s a good
time to be a techie–”you can make money again.”
EvdV: It’s not the name I would have chosen, but it’ll work now that
it’s out there. It will make switching to Web 3.0 (the Semantic web?) easier.
How much of a threat is vendor politics to innovation (XForms vs Ajax being
implemented in the browser)?
EH: Ajax and XForms aren’t orthogonal. No browsers are actually
implementing XForms, whereas almost all support Ajax.
EvdV: I wouldn’t mind XForms, and you could even generate an HTML+Ajax
app out of XForms.
SStL: Web 2.0 seems to have become possible only after that last vendor
war. Ajax support even seems to be sort-of an accident.
Why should everyone give away data? How can I convince others to give away theirs?
JH: As with everything, business people do this to get more money.
Google tries to emphasize that you own your own data, but Tim O’Reilly
argued in the NYT that obscurity is worse than anything. And, if you can
become the de-facto source for X, it doesn’t matter if I give away Xes,
as long as you know to come to me for Xes. Popularity first, monetary gain second.
SStL: Economists should really look at how Google can give bloggers,
say, money from 3rd party actors.
Jonathan Robie (DataDirect)
What’s hard about doing data on the web?
Well, many new applications
these days involve many different databases, and perhaps a few WS.
n-Manuals, n-Databases, many many things to integrate. If you’re not
careful, each consumer will write their own solution, and you’ll get a
Cartesian product number of implementations. The XQuery vision, on the
other hand, tries to get all the clients using XML+XQuery. With some
work (getting stuff into XML), you can query all of your underlying databases using XQuery.
XQuery makes sense because it’s got native support for XML. This trumps
most conventional programming languages (that need libraries to work with
XML). In addition, XQuery is designed for data integration and its XML
output is already directly useful (contrast that with handing someone a
database table). With any luck, your programmers will be more productive
because of the above.
Performance should be good, as XQuery was designed to be optimizable.
Exposing an XQuery Servlet
Client talks HTTP to a server that combines resources to generate a
response. XQuery is a good “cook” in this environment. Actually, a cook
is a good metaphor in this situation.
[defines REST, I won’t bother to repeat]
[He now demos the portfolio for the user ‘monolo’. The ‘monolo’ is in the URL,
which we can just change. The response is some XML representing the portfolio.]
A servlet can expose two different APIs: one is the server API
(authentication, returning results), the other the XQuery API. The mapping
from URL to query is regular, looking for an XQuery file in a folder (based
on the URL) and passing parameters from the end of the URL to the query.
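That URL-to-query mapping might look like the following sketch; the directory layout and naming here are assumptions, not the presenter’s actual code:

```python
def route(url_path):
    """Map a request path to an XQuery file plus positional parameters.

    Hypothetical layout: /portfolio/monolo ->
    ('queries/portfolio.xquery', ['monolo']).
    """
    parts = url_path.strip("/").split("/")
    query_file = "queries/%s.xquery" % parts[0]
    params = parts[1:]  # trailing path segments become query parameters
    return query_file, params
```

This is why changing ‘monolo’ in the demo URL changes only the bound parameter, never the query itself.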
What is XQJ?
“JDBC for XQuery.” XQuery->XQuery Engine->XML Result. XQJ works
well in J2EE applications, where the XQJ compliant applications talk to
a data access layer connecting many databases.
What is good about XQJ?
It’s like JDBC, so it’s not that hard to learn and it’s also an industry
standard. It winds up being the glue because it fits into any Java
architecture. Additionally, its queries can be created or parameterized
at run-time. Finally, the interfaces are designed for use in J2EE applications.
XQuery Servlet Walkthrough
XQuery for Data Integration
“If everything looks like a nail, all you need is a hammer.” If you can
get everything to look like XML, XQuery can handle it pretty easily. A
common approach is to take data from a relational database. In this
situation, you probably don’t want any extra columns, so do some amount
of optimization. Surprisingly (perhaps), this sort of optimization is
also important for large XML documents.
XQuery implementations can help with query rewriting and ignoring parts of the
content that don’t matter. EDI is one example of an XML converter.
Another source of data is external Web Services (and you can call these
directly from XQuery [depending on implementation]).
Robert Gaschen (AMSEC)
How do Navy manuals get put together? Tech Writer->xml
That process works well except for the Ship Valve Technical
Manual, an 8-10k page book (if printed all together). Reorganized
in the mid-90s, the new book was chunked based on the submarine class.
However, the consumer (sailor) was unhappy, as it was all-electronic and
a pain to aggregate their specific requirements. These problems needed to be fixed quickly and cheaply.
The new workflow, to solve this problem was simply this: tech
writer->xml data->transform->xml data->distribution. It
worked! But not as well as could be hoped: larger sections would lock up
the browser, full-text search was very slow, the ability to pan and
zoom graphics was lost, and print was still a static PDF.
Then it was time to enhance the system to fix these issues:
Old search was Java-based, but the new one was based on XSLT, optimized
for quick, full-text search. Graphics were PNG, but panning and zooming were restored
by shifting to SVG. Printing capability was restored by putting it
back on the end-user with FOP and some small XSL-FO (mirroring the XPP output).
[DocBook dinner tonight, which I may or may not write about.]