Primetime Hypermedia

MP3 Sound Bites

by Jon Udell
09/03/2004

In the inaugural column of this series on hypermedia, I mentioned an MP3 clipping service I wrote to enable quotation of sound bites. Before I explain how it works, let's review why it exists. Audio content--and of particular interest to me, spoken-word audio content--is flourishing. In the tech world, Doug Kaye's ITConversations web site is a great example. It features audio interviews with IT personalities, as well as recorded speeches from conferences--including the recent O'Reilly Open Source Convention. Kaye's audio engineering credentials are impeccable, but nowadays anyone can pick up a microphone and speak into an MP3 file. Today, for example, I listened to Dave Winer's thoughts on the business model for Wi-Fi and blogs, recorded while he was driving northward in Wisconsin. In my own journalistic work, I increasingly record and post audio interviews.

Although the amount of audio content keeps growing, the time available for listening remains constant. Until and unless we achieve a radical breakthrough in speech-to-text translation--and I'm not holding my breath--we'll need to find another way to make audio content more granular, and easier to consume selectively. In the realm of hypertext we do this by quoting passages--that is, lifting fragments out of texts that we read, and placing them (with attribution) into texts that we write. Hypertext pioneer Ted Nelson wanted things to work differently. He wanted us to include (his word: transclude) passages, not copy them. But Nelson's vision of Xanadu never materialized. XLink and XPointer, the key standards designed to support transclusion, remain obscure. What we've actually got are whole-document URLs plus a few strategies for subdocument addressing. The common one is destination anchors--blog permalinks, for example--that mark locations within documents. Less common, and requiring active server participation, are annotation systems that work with arbitrary byte ranges.

In the realm of hypermedia, there's a fairly natural way to address parts of audio or video files. Everyone understands the notion of an AV timeline. Media players support random access along that timeline. Some, for example QuickTime Pro, can also select clips. That's one model for blogging audio quotes. You'd select a sound bite, copy it, create the clip as a new audio file, and post it. That works, but it severs the contextual relationship between the original file and the clip that was excerpted from it. A strategy that preserves that relationship would enable companion web technologies to preserve and enhance the context. Think about how Bloglines, Technorati, and Feedster assemble the conversations swirling around blog items, or about how del.icio.us and furl.net assemble sets of bookmarks related to those same items. Canonical URLs are what make all this magic possible, and we ought to use them as much as possible on the hypermedia Web.

So here's the canonical URL for Dave Winer's audioblog posting:

http://cyber.law.harvard.edu/blogs/gems/dave/coffeeNotesAugust30.mp3

It's a 9MB, 21-minute chunk of audio. In the excerpt that most interests me, which begins at 15:40 and ends at 18:48, Dave talks about how a city-sponsored blog--in his example, a hypothetical blog for Madison, Wisconsin--would enable visitors to tap into local activities and events. It's a complete thought, related to but distinct from other segments in which Dave muses about finding hotels and the business model for franchised free Wi-Fi. I might like to refer my own city planners to Dave's idea. Here's one way to do that:

Click here to listen to Dave's three-minute pitch for city-sponsored blogs.

What will happen when you click on that URL is, sadly, rather unpredictable. Depending on your combination of operating system, browser, and media player (with or without an associated plugin), it might fail to play, it might play after fully downloading, or (ideally) it might play while downloading. Currently, I'm only achieving the ideal effect using QuickTime--with Safari only on OS X, and with both Firefox and MSIE on Windows. When you start shoving bits at a browser behind a Content-Type: audio/mpeg header, it seems, almost anything is liable to happen. But give it a try anyway, and if that fails, select this address:

http://udell.infoworld.com:8002/?site=cyber.law.harvard.edu &url=/blogs/gems/dave/coffeeNotesAugust30.mp3 &beg=0:15:40&end=0:18:48&dur=0:21:29

and paste it into your favorite media player (removing the spaces I inserted for formatting), then use the player's Open URL, Open Location, or Open Stream feature to play it.
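The address above is an ordinary query string with five parameters: site, url, beg, end, and dur. As a minimal sketch, assuming modern Python 3 and its standard urllib.parse module (this is not the service's own code, and the helper names build_clip_url and parse_clip_url are made up for illustration), such an address could be assembled and taken apart like this:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Host and port of the clipping service, as given in the address above.
SERVICE = "http://udell.infoworld.com:8002/"


def build_clip_url(site, path, beg, end, dur):
    """Assemble a clip address from its five query parameters (hypothetical helper)."""
    params = {"site": site, "url": path, "beg": beg, "end": end, "dur": dur}
    # safe="/:" leaves slashes and colons unescaped, so the result matches
    # the hand-written form of the address.
    return SERVICE + "?" + urlencode(params, safe="/:")


def parse_clip_url(clip_url):
    """Recover the query parameters as a plain dict of strings (hypothetical helper)."""
    query = parse_qs(urlsplit(clip_url).query)
    return {name: values[0] for name, values in query.items()}


clip_url = build_clip_url(
    site="cyber.law.harvard.edu",
    path="/blogs/gems/dave/coffeeNotesAugust30.mp3",
    beg="0:15:40",
    end="0:18:48",
    dur="0:21:29",
)
print(clip_url)
print(parse_clip_url(clip_url))
```

Building the address programmatically also sidesteps the formatting spaces that have to be removed by hand from the printed version.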

What's happening behind the scenes is absurdly naive. Here's the Python script that implements the clipping service.

This code does The Simplest Thing That Could Possibly Work and, amazingly, it does. After querying the remote server for the length of the MP3 file, it maps the start and end times passed on the URL line, along with the MP3 file's duration, to a byte range. Then it uses the HTTP 1.1 Range header to ask the remote server to reach blindly into the middle of the requested MP3 file and return the indicated range. The code just ignores the kilobyte or so of ID3 metadata that might exist at the beginning of the file. It likewise pays no attention to MP3 frame boundaries. It's virtually certain that the returned fragment will neither begin nor end cleanly on a frame boundary. But MP3 players are evidently prepared to deal with this kind of abuse. All the ones I've tried gamely hunt for the first whole frame and begin playing.
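The mapping described above can be sketched in a few lines. This is an illustration in modern Python 3, not the original script: it assumes the bitrate is roughly constant, so that byte position is proportional to elapsed time, and the function names (timecode_to_seconds, clip_byte_range, fetch_clip) are invented for the example. Like the approach described, it ignores ID3 metadata and frame boundaries.

```python
import urllib.request


def timecode_to_seconds(timecode):
    """Convert 'h:mm:ss' (or 'mm:ss') to a whole number of seconds."""
    seconds = 0
    for field in timecode.split(":"):
        seconds = seconds * 60 + int(field)
    return seconds


def clip_byte_range(beg, end, dur, content_length):
    """Map start/end timecodes to an inclusive byte range by simple proportion.

    Assumes a roughly constant bitrate; ID3 tags and frame boundaries are ignored.
    """
    dur_s = timecode_to_seconds(dur)
    first = content_length * timecode_to_seconds(beg) // dur_s
    last = content_length * timecode_to_seconds(end) // dur_s - 1
    return first, min(last, content_length - 1)


def fetch_clip(mp3_url, beg, end, dur):
    """Ask the remote server for the file length, then for just the clip's bytes."""
    head = urllib.request.Request(mp3_url, method="HEAD")
    with urllib.request.urlopen(head) as response:
        content_length = int(response.headers["Content-Length"])
    first, last = clip_byte_range(beg, end, dur, content_length)
    ranged = urllib.request.Request(
        mp3_url, headers={"Range": f"bytes={first}-{last}"}
    )
    with urllib.request.urlopen(ranged) as response:
        return response.read()


# 9_000_000 is a hypothetical round figure standing in for the real
# Content-Length of the 9MB file; the timecodes are the ones quoted above.
print(clip_byte_range("0:15:40", "0:18:48", "0:21:29", 9_000_000))
```

The fetch_clip function is shown only to make the two HTTP steps concrete; it has not been exercised against any particular server, and a server that ignores the Range header would simply return the whole file.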

Note that some (but not all) media players use the HTTP 1.1 Range header to permit direct random access into large MP3 files. If you begin loading the above MP3 file into RealPlayer or Winamp, you'll discover that you can immediately drag the slider to minute 18 and start playing at that point. (Other players--including QuickTime and Windows Media Player--don't support this random-access feature.) This was a revelation to me, it surprises nearly everyone I've shown it to, and it really blurs the distinction between streaming and downloading.

Now, what I don't know about the MP3 format and the internals of MP3 players would fill volumes, and mine is certainly neither a robust nor a complete solution. But I'm presenting it anyway because it works well enough to demonstrate the idea, and to support exploration of the issues it raises. The most interesting ones have nothing to do with the obvious flaws in my implementation. Clearly a thread-per-request architecture will not scale; the URL syntax should be simplified; and MP3 frames are not always guaranteed to be independent of one another. But solving these problems won't matter until some more basic challenges are dealt with.
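To make the frame-boundary issue a little more concrete, here is a rough sketch of how a decoder might hunt for the first plausible frame header in a fragment that starts at an arbitrary byte. It is a guess at the general technique, not a description of any particular player: it assumes the standard MPEG audio header layout (eleven sync bits followed by version, layer, bitrate, and sample-rate fields), and the function name find_first_frame is hypothetical. Real decoders very likely validate more than this, for example by checking that another header follows at the expected distance.

```python
def find_first_frame(fragment):
    """Return the offset of the first plausible MPEG audio frame header, or -1."""
    for i in range(len(fragment) - 2):
        b0, b1, b2 = fragment[i], fragment[i + 1], fragment[i + 2]
        if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
            continue  # the eleven sync bits are not all set
        version = (b1 >> 3) & 0x03
        layer = (b1 >> 1) & 0x03
        bitrate_index = b2 >> 4
        sample_rate_index = (b2 >> 2) & 0x03
        if version == 0x01 or layer == 0x00:
            continue  # reserved version or layer value
        if bitrate_index == 0x0F or sample_rate_index == 0x03:
            continue  # invalid bitrate or reserved sample rate
        return i
    return -1


# Four bytes of junk (including a stray 0xFF) followed by a header-like
# sequence for MPEG-1 Layer III.
fragment = b"\x00\x12\xff\x00" + b"\xff\xfb\x90\x00"
print(find_first_frame(fragment))
```

Because a sync pattern can also occur by chance inside audio data, a check this shallow can produce false positives, which is one reason a clip cut at an arbitrary byte may begin with a brief glitch.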

Among those challenges, the dysfunctional relationship between browsers and media players looms large. Everyone working at the intersection of these two classes of application discovers two painful truths. First, that you will have to spend an inordinate amount of time trying to figure out how to make things work as smoothly as possible. Second, that you will fail anyway. File types, MIME types, players, and plugins combine in bewildering ways. I'm hoping that a new initiative to modernize the long-stagnant plugin API--announced in June by the Mozilla Foundation along with Adobe, Apple, Macromedia, Opera, and Sun--is a sign of desperately needed progress in this area.

Another fundamental challenge is the scarcity of media servers. The hack I'm demonstrating here is motivated almost entirely by that scarcity. Lots of people can upload a file to a web server. Hardly anybody can upload a file to a media server. If you can manage to do that, your content will likely be timecode-addressable--albeit using methods that few people are aware of. (Last month's column, for example, demonstrated how to excerpt from a RealVideo stream. That procedure has been available for years [1], but is still virtually unknown.) Unless media servers suddenly proliferate wildly, though, we can't rely on them to achieve a more granular hypermedia Web.

I think that web proxies may be part of the answer. My clipping service is such a thing. It can excerpt from an MP3 file sitting on virtually any HTTP-1.1-compliant web server. Note that the service itself can run anywhere, including on your local machine. I've got an instance of it running on the TiBook I'm typing on now. A URL convention that refers to a localhost-based clipping service would be one way to make media fragments appear at canonical URLs while distributing the work of serving them. Another, and probably more attractive, approach would be to Akamaize the service.

In the long run, it doesn't really matter to me whether we adapt HTTP to meet the requirements of granular hypermedia, or make media servers based on alternate protocols more ubiquitous and more web-native, or both. I just want everyone to be able to capture and share ideas that are increasingly embedded in media files but not easily accessible there. Here's a parting example. Although I was at OSCON 2004, I arrived too late for Robert ("r0ml") Lefkowitz's keynote. Fortunately Doug Kaye recorded it. When I listened to that recording, there were a couple of points--on how boring IT constructs such as CRM and billing might be construed as "kewl" and therefore attractive to open source developers--that I wanted to bookmark. Thanks to Doug Kaye's implementation of a version of my clipping service, and to del.icio.us, here are those bookmarks:

the crm systems can talk to each other, because it's really a huge peer to peer system. and we know peer to peer is kewl. (from http://www.itconversations.com/shows/detail139.html)

we're doing 1900 billable events/second, averaging 9 cents each. that is the world's biggest micropayment system. so can we agree that a telecommunications billing system is kewl? (from http://www.itconversations.com/shows/detail139.html)

I hope you'll agree that this way of capturing and sharing ideas is kewl too, and I hope open source developers will want to help make the process work more smoothly.

Notes

[1] In principle, you need only attach start/stop parameters to the rtsp:// URL, like so: rtsp://cdo.earthcache.net/roc-01.media.globix.net/COMP001916MOD1/t_assets/20040630/de051b22c39a6136848374594628ed01cf86197f.rm?start=11:45&end=13:05. In practice, to have the best chance of a correct browser/player interaction, you need to embed that URL in a .ram wrapper file, such as: http://www.oreillynet.com/network/2004/08/03/examples/JavaOneLibertyWebServices.ram.

Jon Udell is an author, information architect, software developer, and new media innovator.

