In my last post on the topic of rewriting my podgrabber utility, I promised to post the rewrite-code-in-progress to a Bazaar repository. You can branch from here if you’re interested. In this post, I’m going to discuss the paradigm I’m following for getting files from a webserver, pulling them onto a computer, then onto a portable media device.
The current version of podgrabber has a download manager which takes a URL and saves the file to a particular directory. This download manager was built with a small amount of extensibility, in a very clunky way: it looks at the URL in order to determine how to download the file. Once the files have made it from the webserver to my computer, a single function synchronizes them between my computer and my portable media device.
This approach works, but it doesn’t treat the two problems in a cohesive way, and it isn’t very extensible. Adding a new file source (such as FTP) would probably involve a lot of cut and paste and an ever-growing download method. And synchronizing downloaded files to anything other than an MP3 player that shows up as a USB disk drive would prove quite painful.
As I was contemplating rewriting podgrabber, it occurred to me that downloading files from a webserver and getting files onto an MP3 player really only presented one unified problem: moving files from one place to another. So, we have some source of media files (let’s call it a media store) which knows which files it has available, can perhaps delete some of those files, and can perhaps add new files to itself. Examples of media stores are RSS feeds, a directory of downloaded files on a computer, and a group of media files on a portable media device. Here is the code I have so far for an RSS media store and a hard disk based media store::
import urllib
import mediaFile
import os
import xml.parsers.expat
from elementtree import ElementTree

class ListFailed(Exception):
    """Exception raised if listing files in a mediaStore fails"""
    pass

class DeleteFailed(Exception):
    """Exception raised if deleting a file in a mediaStore fails"""
    pass

class AddFailed(Exception):
    """Exception raised if copying a file to a mediaStore fails"""
    pass

class NoPath(Exception):
    """Exception raised if no getStorePath method has been implemented"""
    pass

class UnspecifiedMediaFileType(Exception):
    """Exception raised if a mediaFile type hasn't been specified"""
    pass

class IMediaStore:
    def getMediaFileType(self):
        """return the class of media file this mediaStore houses"""
        raise UnspecifiedMediaFileType

    def list(self):
        """return a list of mediaFile objects in this mediaStore"""
        raise ListFailed

    def deleteFile(self, fileName):
        """remove a file from this mediaStore"""
        raise DeleteFailed

    def getStorePath(self):
        """return the prefix this store has been configured with"""
        raise NoPath

    def addFile(self, mediaFile):
        """copy specified mediaFile into this mediaStore

        This method breaks a pure 'interface' approach by actually
        implementing the functionality. This addMethod implementation
        should be the same for any mediaStore, so everyone should just
        use this 'interface'."""
        outfile = self.getMediaFileType()(os.path.join(self.getStorePath(), mediaFile.getFileName()), "w")
        while 1:
            chunk = mediaFile.read(8 * 1024)
            if not chunk:
                break
            outfile.write(chunk)
        outfile.finalizeWrite()

class RSSMediaStore(IMediaStore):
    def __init__(self, feedUrl, proxy=None):
        self.feedUrl = feedUrl
        if proxy is None:
            self.proxy = {}
        else:
            self.proxy = proxy
        self.mediaFileType = mediaFile.HTTPFile

    def getStorePath(self):
        return self.feedUrl

    def getMediaFileType(self):
        return self.mediaFileType

    def list(self):
        opener = urllib.FancyURLopener(self.proxy)
        f = opener.open(self.feedUrl)
        feed_text = f.read()
        try:
            feed_tree = ElementTree.fromstring(feed_text)
        except xml.parsers.expat.ExpatError:
            return []
        item_list = [mediaFile.HTTPFile(i.find("enclosure").attrib.get("url", "No URL"))
                     for i in feed_tree.findall("*/item")
                     if i.findall("enclosure")]
        return item_list

class FileSystemMediaStore(IMediaStore):
    def __init__(self, directory):
        self.directory = directory
        self.mediaFileType = mediaFile.DiskFile

    def getStorePath(self):
        return self.directory

    def getMediaFileType(self):
        return self.mediaFileType

    def list(self):
        files = [self.mediaFileType(f)
                 for f in [os.path.join(self.directory, ff) for ff in os.listdir(self.directory)]
                 if os.path.isfile(f)]
        return files

class MTPMediaStore(IMediaStore):
    pass
I’ve created an interface class called IMediaStore, mostly to keep track of what I want this thing to do. Putting the common methods in the IMediaStore class made me feel like I was coding in Java, but it really helps to keep things straight.
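To illustrate how a new kind of store slots into this shape, here is a minimal, self-contained sketch. MemoryFile and MemoryMediaStore are hypothetical names that are not part of podgrabber. They only mimic the method names used above and keep everything in memory, so the sketch runs without a feed or a directory on disk:

```python
import io


class MemoryFile(object):
    """Hypothetical in-memory media file: just enough of the file interface
    (read, write, finalizeWrite, getFileName) for a store to copy it."""

    def __init__(self, name, data=b""):
        self.name = name
        self.buffer = io.BytesIO(data)

    def read(self, size=None):
        if size is None:
            return self.buffer.read()
        return self.buffer.read(size)

    def write(self, chunk):
        return self.buffer.write(chunk)

    def finalizeWrite(self):
        # Rewind so the stored copy can be read back from the start.
        self.buffer.seek(0)

    def getFileName(self):
        return self.name


class MemoryMediaStore(object):
    """Hypothetical store that keeps its files in a dict keyed by file name."""

    def __init__(self):
        self.files = {}

    def list(self):
        return [self.files[name] for name in sorted(self.files)]

    def deleteFile(self, fileName):
        del self.files[fileName]

    def addFile(self, mediaFile, chunk_size=8 * 1024):
        # Same chunked copy loop as IMediaStore.addFile, minus the path handling.
        outfile = MemoryFile(mediaFile.getFileName())
        while True:
            chunk = mediaFile.read(chunk_size)
            if not chunk:
                break
            outfile.write(chunk)
        outfile.finalizeWrite()
        self.files[outfile.getFileName()] = outfile


source = MemoryFile("episode1.mp3", b"x" * 20000)
store = MemoryMediaStore()
store.addFile(source)
print([f.getFileName() for f in store.list()])
```

The point of the sketch is that the code calling addFile never needs to know where the bytes come from or where they end up; it only relies on the shared method names.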
In contrast to the previous generation of podgrabber, I now have nice chunks of functionality which are logically separated and aren’t going to get overly complex. Both the filesystem based media store and the RSS media store know how to get a list of files they contain and return that list.
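The listing step of the RSS store boils down to pulling enclosure URLs out of the feed. The following is a standalone sketch of that idea using the standard library's xml.etree.ElementTree rather than the elementtree package imported above. The function name list_enclosure_urls and the sample feed are made up for illustration:

```python
import xml.etree.ElementTree as ElementTree

# Made-up feed: two items with enclosures and one without.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Episode 2</title>
      <enclosure url="http://example.com/ep2.mp3" type="audio/mpeg" length="1"/>
    </item>
    <item>
      <title>Text-only post</title>
    </item>
    <item>
      <title>Episode 1</title>
      <enclosure url="http://example.com/ep1.mp3" type="audio/mpeg" length="1"/>
    </item>
  </channel>
</rss>
"""


def list_enclosure_urls(feed_text):
    """Return the enclosure URLs in an RSS document, in feed order."""
    try:
        feed_tree = ElementTree.fromstring(feed_text)
    except ElementTree.ParseError:
        # Treat a feed that is not well-formed as an empty store.
        return []
    urls = []
    for item in feed_tree.findall("channel/item"):
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            urls.append(enclosure.get("url"))
    return urls


print(list_enclosure_urls(SAMPLE_FEED))
```

Items without an enclosure are skipped, which is the same filtering the list comprehension in RSSMediaStore.list performs.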
Which brings me to the types of files themselves. I expect the files that I’m dealing with to act something like “real” files on a filesystem. Using urllib makes that a little easier for files on a webserver. But I know that when I start creating an implementation for my MTP-based Creative Zen Vision W, I’m going to run into some difficulty. So I created a media file interface and implemented an HTTP file and a filesystem file::
import urllib
import re
import os

class LocalFilePathDoesNotExist(Exception):
    """Exception raised by MediaFile objects when there is either no file on disk for this media
    file or the object just doesn't know about it."""
    pass

class FileNotReadable(Exception):
    """Exception raised by MediaFile objects when the file is not readable"""
    pass

class FileNotWritable(Exception):
    """Exception raised by MediaFile objects when the file is not writable"""
    pass

class IMediaFile:
    """MediaFiles may be files actually located on a hard drive, on a web
    server, or on a portable media device. This interface describes what can
    be done with media files in order to copy them from one medium to another.

    In order to copy from a MediaStore, the MediaFile must either provide an
    implementation for read or getLocalFilePath. read allows us to copy bytes
    of data from one location to another. getLocalFilePath allows us to either
    open the actual file in read mode or use shutil.copyfile. In order to copy
    to a MediaStore, the MediaFile must either provide an implementation for
    write or getLocalFilePath. write allows us to copy bytes directly to the
    file. getLocalFilePath allows us to either use shutil.copyfile or open the
    local file in write mode and copy bytes in.

    In the case of a media device which we don't have a filesystem interface to
    copy files to or the ability to write bytes directly to (such as an MTP
    device), finalizeWrite can come in handy. One strategy is to write the file
    to a temporary location using either write or shutil.copyfile (after
    determining the tmp file's location with getLocalFilePath) and upon the call
    to finalizeWrite, we can copy the file to the device. In the case of MTP
    devices, there are a number of mtp-* utilities which can copy files on a
    filesystem to the MTP device.
    """
    def __init__(self, location, mode="r", **kw):
        self.location = location
        self.mode = mode
        self.__dict__.update(kw) ##is this a kludge? maybe.
        self._init()

    def read(self, bytes=None):
        raise FileNotReadable

    def write(self, chunk):
        # takes the chunk to write, matching DiskFile.write
        raise FileNotWritable

    def getLocalFilePath(self):
        raise LocalFilePathDoesNotExist

    def getFileLocation(self):
        return self.location

    def finalizeWrite(self):
        raise FileNotWritable

    def getBytesRead(self):
        raise FileNotReadable

    def getBytesWritten(self):
        raise FileNotWritable

    def getFileName(self):
        raise FileNotWritable

    def _init(self):
        pass

# matches e.g. 'attachment; filename="episode.mp3"' and captures the file name
test_quote = """['\"]"""
attach_re = re.compile(r'''^(attachment|inline);\s*filename\s*=\s*''' + test_quote + '''(.*?)''' + test_quote)

class HTTPFile(IMediaFile):
    """HTTP file
    """
    def __str__(self):
        try:
            filePath = self.getLocalFilePath()
        except LocalFilePathDoesNotExist:
            filePath = "No File Path Found"
        # assumed format; the original string literal was lost
        return "<HTTPFile %s>" % filePath

    def __repr__(self):
        try:
            filePath = self.getLocalFilePath()
        except LocalFilePathDoesNotExist:
            filePath = "No File Path Found"
        # assumed format; the original string literal was lost
        return "<HTTPFile %s>" % filePath

    def _init(self):
        self.bytes_read = 0
        try:
            proxy = self.proxy
        except AttributeError:
            proxy = {}
        self.opener = urllib.FancyURLopener(proxy)
        self.opener_file = self.opener.open(self.location)
        self.filename = os.path.basename(self.opener_file.url)
        headers = self.opener_file.headers
        for key, val in headers.items():
            #print key,val
            #kludge to get the filename out of the MIME contents
            if (key == "content-disposition") and ((val.startswith("attachment;")) or (val.startswith("inline;"))):
                attach_match = attach_re.match(val)
                if attach_match:
                    self.filename = attach_match.groups()[1]

    def read(self, bytes=None):
        if bytes:
            chunk = self.opener_file.read(bytes)
        else:
            chunk = self.opener_file.read()
        self.bytes_read += len(chunk)
        return chunk

    def getBytesRead(self):
        return self.bytes_read

    def getFileName(self):
        return self.filename

    def getLocalFilePath(self):
        return self.location

class DiskFile(IMediaFile):
    def __str__(self):
        try:
            filePath = self.getLocalFilePath()
        except LocalFilePathDoesNotExist:
            filePath = "No File Path Found"
        # assumed format; the original string literal was lost
        return "<DiskFile %s>" % filePath

    def __repr__(self):
        try:
            filePath = self.getLocalFilePath()
        except LocalFilePathDoesNotExist:
            filePath = "No File Path Found"
        # assumed format; the original string literal was lost
        return "<DiskFile %s>" % filePath

    def _init(self):
        self.bytes_read = 0
        self.bytes_written = 0
        self.diskFile = open(self.location, self.mode)
        self.filename = os.path.basename(self.location)

    def __del__(self):
        self.diskFile.close()

    def read(self, bytes=None):
        if bytes:
            chunk = self.diskFile.read(bytes)
        else:
            chunk = self.diskFile.read()
        self.bytes_read += len(chunk)
        return chunk

    def write(self, chunk):
        retVal = self.diskFile.write(chunk)
        self.bytes_written += len(chunk)
        return retVal

    def getBytesRead(self):
        return self.bytes_read

    def getBytesWritten(self):
        return self.bytes_written

    def getFileName(self):
        return self.filename

    def getLocalFilePath(self):
        return self.location

    def finalizeWrite(self):
        pass

class MTPFile(IMediaFile):
    pass
In order to move the files onto my Zen, I’ll have to flesh out the MTPFile class. (In case you don’t know, MTP is Microsoft’s “media transfer protocol”.) There are a number of command line utilities which I can wrap in order to interact with an MTP device. There is also a library which would allow me to get closer to the metal, but unfortunately there are no Python bindings for the MTP library. So, I will either wrap the command line utilities or wrap the library with SWIG.
Given this foundation, the following code will take the first 10 items from CNet’s “Buzz Out Loud” podcast and store them in my “/home/jmjones/mediaStoreTest” directory::
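The temporary-file strategy described in the IMediaFile docstring could look roughly like the sketch below. Everything here is an assumption rather than working podgrabber code: StagedMTPFile is a hypothetical name, the default command is libmtp's mtp-sendfile utility, and the argument order (local path first, name on the device second) should be checked against the tool installed on your system. The runner parameter exists only so the sketch can be exercised without a device attached:

```python
import os
import subprocess
import tempfile


class StagedMTPFile(object):
    """Hypothetical MTPFile sketch: buffer writes in a temp file, then hand
    that file to a command line tool when finalizeWrite is called."""

    def __init__(self, location, send_command=("mtp-sendfile",),
                 runner=subprocess.check_call):
        self.location = location            # name the file should get on the device
        self.send_command = list(send_command)
        self.runner = runner                # injectable so no device is needed to try this
        self.bytes_written = 0
        handle, self.tmp_path = tempfile.mkstemp(suffix=os.path.basename(location))
        self.tmp_file = os.fdopen(handle, "wb")

    def write(self, chunk):
        self.tmp_file.write(chunk)
        self.bytes_written += len(chunk)

    def getLocalFilePath(self):
        return self.tmp_path

    def getBytesWritten(self):
        return self.bytes_written

    def getFileName(self):
        return os.path.basename(self.location)

    def finalizeWrite(self):
        self.tmp_file.close()
        try:
            # Assumed invocation: <tool> <local path> <name on device>
            self.runner(self.send_command + [self.tmp_path, self.location])
        finally:
            # Remove the staging file whether or not the transfer succeeded.
            os.remove(self.tmp_path)


# Dry run: record the command instead of executing it.
calls = []
staged = StagedMTPFile("episode1.mp3", runner=calls.append)
staged.write(b"abc")
staged.finalizeWrite()
print(calls[0][0], calls[0][-1])
```

With the default runner, finalizeWrite would actually shell out to the utility and raise subprocess.CalledProcessError if the transfer failed, which a real MTPFile would probably want to translate into FileNotWritable.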
import mediaStore

sourceStore = mediaStore.RSSMediaStore("http://www.cnet.com/i/pod/cnet_buzz.xml")
destStore = mediaStore.FileSystemMediaStore("/home/jmjones/mediaStoreTest")
for f in sourceStore.list()[:10]:
    destStore.addFile(f)
This was a quick tour of my file handling code. Next time, I’ll get into a common way of synchronizing files between two media stores.


Reminds me of: http://jakarta.apache.org/commons/vfs/index.html
...and in Ruby, http://rio.rubyforge.org
Hi, Jeremy,
I'm curious about why you chose to use urllib and parse the RSS feed directly, instead of using Mark Pilgrim's feedparser module. It lets you handle RSS and Atom feeds transparently, and also understands how to tell if the feed content has changed since it was last fetched (if you cache the modified date and etag). Maybe I'm jumping the gun, and you plan to add that optimization at a later phase.
Doug
Hi Doug,
Since I'm only interested in items which have enclosures (which I think is common between RSS and Atom - not totally positive) it was pretty easy to just parse them out manually. If I were doing anything more than that, I'd probably use Mark's feedparser. I'm still going to look at it eventually and see if it adds any greater value to what I'm doing.
Thanks for the post!
The feedparser module makes working with feeds incredibly easy. You don't have to worry about the parsing (which becomes a real pain when you encounter a feed that isn't well-formed) or format (since it handles Atom and RSS). If you track the time when you last fetched a feed, feedparser will only download the full feed again if there have been changes to its contents. That can speed up processing and reduce network overhead.
Have a look at http://blog.doughellmann.com/2007/04/pymotw-queue.html and http://www.doughellmann.com/PyMOTW/fetch_podcasts.py for some simplistic examples. I have some more complex code I wrote for http://www.CastSampler.com to take advantage of the timestamp checking, but I haven't cleaned it up for general release yet. I'll see if I can get to that in the next week or so.
Doug