When you see a data field with text like 2007-07-05 you are probably looking at a date in ISO 8601 date format. Year, month, day: YYYY-MM-DD

IS 8601 is in an international standard which gives several standard syntaxes for representing Gregorian dates and times. The full English title is ISO 8601:2004 Data elements and interchange formats — Information interchange — Representation of dates and times. It is only about 33 pages long; you can purchase it from your local standards body or from ISO, and as is common practice for ISO standards, there are final drafts available for free on the Internet. It is maintained by ISO TC 154 who have the Dr Who-ish name of Time Task Force.

Before IS 8601 there were multiple other standards for dates and times. For example, IS 2711 which allowed formats like 5 Jan 2000 and dates using ordinals. 2711 was withdrawn as an ISO standard in 1988, superceded by 8601. However, other nations and bodies have continued to use many of the other ex-standard formats, because it is convenient to have time written according to local conventions. The difficulty, you see, with IS 8601 is that they managed to get a nice unambiguous format for dates by adopting a format that no-one non-technical used: year, month, day. (I won’t be dealing with times in this blog.)

The second rub with IS 8601 is that it defines multiple formats. So as well as YYYY-MM-DD for years, such as 2000-01-01 you can also have the basic form 20000101. And the same date could also be represented as YYYY-DDD as 2000-001 with a basic form version of 2000001, where the DDD is an ordinal counting the number of days into the year. And it could be specified as a relative date too: you could specify it as a duration (from a base of, say, 1999-01-01) using the syntax PdddD where P means period and D means days: so P365D or even P12M where M is month. You can do the same with an explicit base and get the notations 19990101/P365D for example. The more exotic you get, the more chance that you have a need for what the standard calls “a mutual agreement” where the exchanging parties agree on what the notation means, because it could have several different meanings under the standard.

The third rub with IS 8601 is that it is based on the Gregorian calendar. This is unsurprising, in view of the dominance of the West and its ex-colonies in trade and standards adoption. However it imposes a conceptual, processing and formatting burden in places where other dates formats are used. It is not as uncommon as people think: I have lived in Taiwan and Japan, where non-Gregorian calendars are used for example. And obviously the Islamic calendar is in wide use.

XML Schema Standards and Dates

XML Schemas (XSD) is a Technical Recommendation made by an industry consortium W3C, which allows direct participation by representatives of fee-paying members. XML Schemas supports a wide range of ISO 8601-ish date formats. For most formats, there is no difference, and there is even an explicit appendix ISO 8601 Date and Time Formats which gives clear information.

XSD provides many different datatypes for dates, times and durations. However, it does not allow all ISO8601 syntaxes, and it does alter others. For example, ISO 8601 allow a year 0000. This is not allowed under the AC/DC calendar system, where you go straight from 1BC to 1AD. XSD’s date types disallow the year 0. Most importantly, the date notation used is the extended one with the minuses, so 2000-01-01 not 20000101. XSD allows you to “derive types” to restrict dates to certain values or ranges.

When we were discussing date formats in the W3C XML Schemas Working Group, I tried to get localized date formats allowed: I think it is the same principle as IRIs: it is good if a human can author directly (or generate directly) in the form or notation that they use to think about the data. However, my brilliant idea was rejected by the XML Schema Working Group (with the MS representative taking quite a strong stand that neutral/standard formats should be used, not localized ones) probably because I did not have a proof of concept. Since then, Jenni Tennison’s DTTL data typing language has come along, and is being adopted as part of the ISO DSDL multi-part standard. It is, I believe, exactly the right way to go: allow notations in the format that makes most sense to the stakeholders and application requirements, but provide a mapping to neutral/fixed-syntax formats.

In that sense, my personal belief is that ISO 8601 is a relic of a pre-markup and pre-schema mentality. That does not mean it is not valuable nor that it should not be maintained, nor indeed that it shouldn’t be the first port of call when looking at date formats. But it pushes localization to be an application consideration whereas I think it is just as legitimate and feasible to make it a markup/parsing (i.e. schema) level issue. This is not only because localized formats (rigorously described with an appropriate declarative schema language) make it easier for humans to read and write, but also because where the consumers and generators of data are computers and humans are relatively unimportant in the pipeline or critical path, then data field notations localized (again, rigorously described) for optimal computer performance is entirely appropriate and smart.

Office Document Formats

ODF and Open XML both use ISO 8601 dates in the YYYY-MM-DD form throughout for all dates. (The ODF spec uses US date MM/DD/YYY formatting in palces in its text, but don’t let that confuse you.)

ODF has quite a nice, basic and consistent approach to dates in spreadsheets: read and store them in a kind of ISO 8601 format but also allow a “null date” (such as 1899-12-31) to be specified to allow conversions of date into numbers. Spreadsheets very often actually store, manipulate or transfer dates as ordinal values from an index point: this makes calculations with dates very straightforwards. Representing dates as ordinals is also used in other ISO standards: for example, the SQL_DATE data type gives the number of days since January 1, 1841. (It gives this count as a simple integer.) See section 8.5.2 Calculation settings in ISO ODF for more information.

The draft specification for ISO Open XML, from Ecma, does have one oddity, which has attracted much controversy. In SpreadsheetML table cells only, dates are actually saved as durations, as ordinals. The base is set by an attribute on the workbook, and reflects the supported ranges of Excel on different platforms (on Windows, Excel does not support dates before 1900; on the Macintosh, Excel does not support dates before 1904; putting in such a date will be serialized out as 0 into SpreadsheetML.)

The reasons for saving as as duration rather than a date are obvious: it reflects the internal format directly, allows faster loading and save times to the XML, and allows faster loading and saving times when interfacing with an SQL system that uses SQL_DATE etc. The economic value of load/store times for Office documents is enormous, and it would be quite inappropriate to apply the criteria that one might use, say, for DOCBOOK documents, to standard office formats: I actually think that ODF gets it quite wrong here, and that best practice should dictate that optimized formats should be available. However, by the same token, I think that SpreadsheetML gets it wrong, and that it also should allow reading of data in ISO 8601 format as well as in its optimized notation.

The logical question that comes up is Should SpreadsheetML use ISO 8601 duration format rather than just raw ordinal integers. If the ISO 8601 standard notation was used, SpreadsheetML should use <v>P1D</v> to mark up the first day in the range, rather than <v>1</v>. However, the P and D are redundant, because the notation is clearly marked up by attributes (and documentation). This is the old issue of where the barriers should be between information in markup and information in embedded formats. I don’t see that <v>P1D</v> has any benefits over <v>1</v> frankly: it would seem to be an exercise in nominalism and pointless compliance.

<digression>The additional difficulty here is that we are let down by XSD here, again: XSD doesn’t allow the type of an element to be selected in part or whole by an attribute value on an ancestor, unlike ISO Schematron and ISO RELAX NG. XSD is completely deficient in support for these kinds of idioms, because the database mindset of its developers let them to conceive of attributes as merely funny kinds of elements rather than as metadata on an element, of the same importance and character as the element name. So XSD doesn’t allow attributes to select type; therefore Open XML would have to compromise its design, where elements are highly generic (i.e. data values in spreadsheets are in a “v” for value element) in order to allow values to be typed; however then Open XML could declare the value to be an xsd:duration which would then require the P1D notation. Another approach in XSD would be to use xsi:type where the v element is a union of integers, durations, string etc. However, then we would need to consider how to fit shared string references into the datatyping framework. Too much work! </digression>

The second reason why the ordinal values for dates in SpreadsheetML are controversial is because of an out-by-one adjustment that is needed for some functions for the first two months in 1900. To me, this is just a silly edge case: remembering that spreadsheets from Mac Excel don’t even get back to 1900, and on Windows they don’t go before 1900: it is hardly the wholesale subversion of the Gregorian calendar that you might suspect from various comments on the Web. ODF perhaps punts the issue, by allowing date indexes to start on 1899-12-31 or on 1900-01-01 (examples they give) and so they leave it up to the application developer or document generator to figure out which one is appropriate.

In my blog last month on Principles for reviewing standards, I took the position, which I think is the most reasonable one, that for embedded data fields the standard forms should be provided and optimized forms may be provided. From that POV, Open XML should also allow ISO 8601 durations and/or dates as well as the simple duration ordinal. And ODF should allow duration ordinals as a matter of best practice,