I recently needed to filter and process some Atom feeds. I know enough XML that I could process them with my own SAX filter, but this seemed like a better opportunity to use the XML::Atom module. Fortunately, it was very easy.

Installation

Installation went well from the CPAN shell. It required a few other modules I didn’t have installed, but this was on a new machine and they all seemed pretty standard. I also didn’t have libxml2 installed, but that’s also a common dependency.

Usage

XML::Atom provides several objects. The important ones for my purposes represent feeds and entries. When I publish an Atom feed, it’s a list of entries. When you subscribe to my feed, you get a chronological list of entries.

My project needed to work through a list of feeds to find new entries to process. XML::Atom’s documentation pointed me in the right direction, though I had to make a few guesses.

Working with Feeds

Suppose I have a feed. Suppose it’s just an XML file on my local hard drive. I need an XML::Atom::Feed object:

    use XML::Atom::Feed;
    my $feed = XML::Atom::Feed->new( 'sample.xml' );

That worked. So far, so good. What can I do with it? A feed has several attributes, including a title, and the Feed object provides accessors for them. Let me check the title:

    warn $feed->title();
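
The constructor call above has no error check. Here is a minimal sketch of one, assuming that new() signals failure either by returning undef (with the reason available from the errstr() class method) or by throwing an exception from the underlying parser. The name load_feed is invented for illustration and is not part of XML::Atom:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Atom): load a feed or die with a
# readable message. Assumes new() returns undef on failure and reports the
# reason through errstr(); the eval also catches exceptions from the parser.
sub load_feed {
    my ($source) = @_;

    # Loaded lazily so this file still compiles where XML::Atom is missing.
    require XML::Atom::Feed;

    my $feed = eval { XML::Atom::Feed->new($source) };
    return $feed if $feed;

    my $reason = $@ || XML::Atom::Feed->errstr || 'unknown error';
    die "Cannot load feed from '$source': $reason\n";
}
```

With that in place, `my $feed = load_feed('sample.xml');` either hands back a usable Feed object or stops with a message that names the file.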

Working with Entries

More importantly, a feed contains entries. My goal was to process those somehow. The code is again simple:

    for my $entry ($feed->entries())
    {
        # process each $entry here
    }

What’s in $entry? An XML::Atom::Entry object. This is where the documentation started to get a little sketchy, but the code is straightforward and sensible. A little guessing worked out fine.

My application must process each entry once. A feed may get refreshed once a day, but newer versions might include already-processed entries. Fortunately, the Atom specification includes a unique identifier for each entry. It’s the responsibility of the feed creator to provide these. It’s easy to fetch them from the Entry objects, though I suspect that I’ll hash them just for a little extra paranoia:

    my $id = $entry->id();
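
The "process each entry once" check, with the hashing idea, might look something like this. It is only a sketch under my own assumptions: SHA-256 from the core Digest::SHA module as the hash, and an in-memory Perl hash as the record of what has been seen. The names id_digest and is_new_entry are made up for illustration:

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
use Encode qw(encode_utf8);

# digest => count; in-memory only, so a real application would need to
# persist this between runs (a file, a database, and so on).
my %seen;

# Hypothetical helper: fixed-length key for an entry ID.
sub id_digest {
    my ($id) = @_;

    # sha256_hex() wants bytes, so encode in case the ID has wide characters.
    return sha256_hex( encode_utf8($id) );
}

# Hypothetical helper: true the first time an ID shows up, false afterwards.
sub is_new_entry {
    my ($id) = @_;
    return !$seen{ id_digest($id) }++;
}
```

Inside the entry loop, a line such as `next unless is_new_entry( $entry->id() );` would then skip anything already processed.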

My application also needs the entry titles:

    my $title = $entry->title();

The really important part of my application uses the content of the entry. That’s the main text. This is where the documentation was unhelpful and I had to read the source. It turns out that there’s a content() method:

    my $content = $entry->content();

That didn’t give me text; it gave me an object. I wanted the text:

    my $body    = $content->body();
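
One thing to watch: the Atom format allows an entry to have no content element at all, and calling body() on nothing would die. The sketch below gathers the three pieces into one hash reference and guards against that case. It assumes content() returns nothing for such an entry, which I have not checked against every version of the module, and entry_fields is a name invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the pieces used above into one hash reference.
# It works on any object with id(), title(), and content() methods, which is
# all the XML::Atom::Entry calls shown above rely on.
sub entry_fields {
    my ($entry) = @_;

    # Scalar assignment, so a missing content element leaves this undef
    # (assumption: content() returns nothing in that case).
    my $content = $entry->content();

    # scalar() keeps the hash aligned even if an accessor returns an empty list.
    return {
        id    => scalar( $entry->id() ),
        title => scalar( $entry->title() ),
        body  => defined($content) ? scalar( $content->body() ) : '',
    };
}
```

In the loop, `my $fields = entry_fields($entry);` then gives `$fields->{id}`, `$fields->{title}`, and `$fields->{body}` in one step.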

Those are all of the pieces I needed to build my application. It’s only a handful of method calls; I’m pleased.

Enhancements

It’s a little bit unrealistic to expect that I’ll only ever parse local feeds. It’s useful to do so when developing so as not to punish someone’s web site with any of my programming errors, but it would be nice to be able to parse live URLs. How does that work? Here’s what I wanted to write:

    my $feed = XML::Atom::Feed->new( 'http://example.com/some_feed.xml' );

That actually worked. The documentation made me think that it required a URI object, but the version I tested (0.25) handled this case nicely.
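
For anyone who would rather not depend on that behavior, here is a sketch of the URI-object form the documentation led me to expect. It assumes the URI module is installed (it is not a core module), and feed_source is a name invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a location string into something to hand to
# XML::Atom::Feed->new(). Web addresses become URI objects; anything else
# is passed through unchanged and treated as a local filename.
sub feed_source {
    my ($location) = @_;

    if ( $location =~ m{\Ahttps?://}i ) {
        require URI;    # not a core module; loaded only when needed
        return URI->new($location);
    }

    return $location;
}
```

The constructor call then becomes `XML::Atom::Feed->new( feed_source($location) )` for local files and live URLs alike.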

I put off doing this project for a while because I’d never consumed Atom data before. It’s surprising how easy it was.