Toby Segaran is the author of O'Reilly's recently released book, Programming Collective Intelligence. In this eagerly anticipated new title, Toby takes us into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general by mining the "collective intelligence" present in commonly collected user data.
Programming Collective Intelligence is a practical book that demonstrates how to build web applications that mine the enormous amount of data created by people on the Internet. We recently spoke to Toby about his new book and why these kinds of machine learning techniques are so important in the Web 2.0 era.
O'Reilly: Tim O'Reilly called Programming Collective Intelligence the "first practical guide to programming Web 2.0 applications." Was that your intent when you set out to write it, and what tools and techniques does your book give programmers to tackle Web 2.0 topics?
Segaran: The original idea for the book was to teach people the basics of machine learning in a more approachable manner. Typical machine learning books tend to be highly theoretical and use examples that most people have trouble relating to. After I talked the idea over with a couple of people, it became clear that the most germane use of these algorithms was in dealing with all the data collected by the latest wave of web applications, so I decided to center my examples around analysis of data that a web application might collect. When I pitched the book to O'Reilly they loved the idea of doing something around Web 2.0 that would actually teach people new techniques but not scare them off with a "machine learning" title.
O'Reilly: There's a fair amount of theoretical information floating around on these topics, but not so much practical, hands-on information. What were some of the biggest challenges in writing a practical book on programming collective intelligence?
Segaran: The biggest challenge was finding enough examples that people could relate to and datasets that people could build themselves. Someone building a new application would have access to their site's data, but those just working through the book needed datasets from elsewhere. I spent a lot of time talking to my friends to get ideas for interesting social web sites that they had used, searching through long lists of open APIs, and coming up with examples that I felt were really cool.
The other challenge, of course, was trying to explain the algorithms with the assumption that readers were probably not interested in a lot of equations. Taking equations from other books and describing them with words and images was something I worked very hard to get right.
O'Reilly: Let's back up a little. How would you define the terms "Web 2.0" and "collective intelligence?" Does collective intelligence play a key part in what are considered Web 2.0 technologies?
Segaran: Wow, I'm kind of afraid to contribute to the controversy that is the "Web 2.0 definition." I will say that it is almost universally agreed that Web 2.0 is, in part, about user-contributed data and some sort of aggregation thereof -- in that sense, collective intelligence is almost part of the definition of Web 2.0.
O'Reilly: For a long time Amazon has been considered the leader in utilizing collective intelligence, and their recommendation engine has surely been one of the reasons for their success. How do you feel about what Amazon is doing in this field today? Are they staying on top of the latest and greatest technology for using collective intelligence?
Segaran: I am not very familiar with Amazon's internal workings, but I would say it's truly amazing that they manage to give recommendations on such a large scale. The recommendation system has been credited with increasing sales immensely.
O'Reilly: Can you give us some of your favorite real-life examples of how these techniques are being used today?
Segaran: Clearly the recommendation systems of companies like Amazon and Netflix, and Google's PageRank have been hugely successful for those companies. I think what I find most interesting is that smaller companies are using these ideas for more specialized problems -- Collective Intellect, for example, mines message boards and blogs to find relevant information for financial analysis and other vertical subject areas. Another great example is Metaweb, which opens up all sorts of possibilities for harnessing collective intelligence in the future.
O'Reilly: For readers not familiar with Metaweb, can you explain what they're doing and why you find it exciting?
Segaran: Metaweb is building a semantic data storage infrastructure for the web that allows people to collectively build a graph of things and relationships between them. There's a good write-up in the Economist about what they're doing for those who would like more information.
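As an editorial aside, the "graph of things and relationships" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the TinyGraph class, its add and objects methods, the follow helper, and the relation names are all hypothetical, and none of it reflects Metaweb's actual data model or API.

```python
from collections import defaultdict


class TinyGraph:
    """Hypothetical in-memory store of (subject, relation, object) facts."""

    def __init__(self):
        # Maps (subject, relation) to the set of objects linked by that relation.
        self._facts = defaultdict(set)

    def add(self, subject, relation, obj):
        """Record one relationship between two things."""
        self._facts[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        """Return, in sorted order, everything `subject` is linked to by `relation`."""
        return sorted(self._facts.get((subject, relation), set()))


def follow(graph, start, relations):
    """Walk a chain of relations from `start` and return the things reached."""
    frontier = {start}
    for relation in relations:
        frontier = {o for s in frontier for o in graph.objects(s, relation)}
    return sorted(frontier)


# Facts taken from this interview; the relation names are made up for the sketch.
graph = TinyGraph()
graph.add("Toby Segaran", "wrote", "Programming Collective Intelligence")
graph.add("Programming Collective Intelligence", "published_by", "O'Reilly")
graph.add("Programming Collective Intelligence", "uses_language", "Python")

print(follow(graph, "Toby Segaran", ["wrote", "published_by"]))
```

Because anyone can add a fact with the same add call, many contributors can extend the same graph, and a query such as follow can combine facts that different people entered.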
O'Reilly: Why is Python especially well-suited for this kind of work?
Segaran: I chose Python for a few reasons:
O'Reilly: Does your book assume a working knowledge of Python, or will non-Python programmers be able to learn from it also?
Segaran: I tried to make it accessible to programmers of other languages, and I think I was moderately successful. There are a number of blogs online where portions of the code have been translated to Ruby, Lisp, and Java. The code is commented and explained, so even if readers don't want to try to run it, hopefully they'll get a fairly detailed idea of what's happening.
O'Reilly: When opening up the gates to gather user data, sometimes you can get a sizable amount of garbage along with the desired data. How important is document and spam filtering, and what are the best approaches for dealing with these? I see you've devoted an entire chapter to it.
Segaran: Many people will tell you that "filtering" is one of the most important concepts in the modern world. I agree with this characterization. I know it's a cliche, but we really do live in an era of information abundance and time scarcity. Since I receive about 500 spam messages a day and maybe see at most one or two in my inbox, I consider spam filtering to be effective enough at this point -- what's still lacking is something which finds information online for me that I will find interesting; I know there are a number of attempts to solve this problem and I've tried many of them but nothing has retained my attention so far.
In general, most knowledge workers probably spend the majority of their time finding and processing relevant information amongst the billions of documents and data points available to them. Better filtering algorithms could improve our efficiency immensely.
O'Reilly: One reviewer has called Programming Collective Intelligence a "book about analyzing Web data using statistical and AI methods in Python." Would you agree with that characterization, and why or why not?
Segaran: Yes, I suppose the statement is true, taken literally. "Web Data" could mean a lot of different things... this is really about user data: actions, behaviors, stated or implied preferences. I hoped to write a book that would not only teach people these methods, but inspire them to learn more about them and give them ideas for cool features they could add to their own projects.
Bruce Stewart is a freelance technology writer and editor.
Copyright © 2009 O'Reilly Media, Inc.