Like so many other people out on the Internet, I get unsolicited commercial email or "spam". Until recently, I could handle spam by just deleting it or using email aliases. Unfortunately, my server was rendered useless by a spam attack launched by an unknown spammer. The experience forced me to improve my spam defenses. In two articles, I will share the research and results of my effort to implement an anti-spam system. In this first installment, I will briefly cover various anti-spam systems and the system I chose, a network level defense. In the next installment, I'll dig deeper into the details of an implementation with qmail. (The information is general enough that it could be applied to other email systems such as Postfix or Sendmail.)
Let's begin by covering the current state of the anti-spam world. Since spam is such a widespread problem, an increasing number of anti-spam measures have been devised in the last four years. Some measures involve legislation, some techniques require large groups of people, and some are simple techniques that individuals can use. I will cover some of the more popular defenses that individuals or administrators can implement on their own, and mention a few of the up-and-coming systems. None of these systems is perfect, and I do not claim that any of them will work without a hitch.
Here is a brief rundown of the most popular techniques in use, in order of increasing sophistication:
The first technique is to choose a hard-to-guess email address and keep it out of all publicly viewable content. This makes it hard for a spammer to guess your address or hit it with a dictionary attack. If you do get spam, you will just have to ignore it or live with it. If you want to get email from people, you must somehow give them your email address in advance through a non-Internet channel or use a web form as an email proxy to your address. For some people this works, but it usually means their email address is hard to remember. This technique defeats the purpose of the Internet by making it hard to communicate with other people.
The next technique, aliases, escalates the previous one. Essentially, you use several email aliases for the same account. For example, it is easy to configure some mail systems to deliver all mail for tagged variants of firstname.lastname@example.org to one account, where the tag, something, is any valid email character string. You could have an alias for public display on the Internet, such as on a web site or in a Usenet post. You could also have an alias for each web site where you register. This is a great way to find out if that web site you registered with is selling your email address or being lax about protecting your email address.
If a spammer (or web site) abuses one of your addresses, you can just block that particular alias. I used to rely on this technique, and it worked pretty well. Still, it doesn't prevent all the spam. Another issue to consider is that most popular email clients don't support multiple aliases well. Finally, it is worth noting that some anti-spam companies have sprung up around just this idea.
The next technique is content filtering, which is actually an entire family of techniques. It is also the most widely implemented and discussed technique. All the systems work in roughly the same manner. Incoming email is processed by a filter. This filter scans for patterns that may characterize the email as spam or not. It's that simple. The techniques vary in where they filter and what they filter on. The filter could be at the server or it could be at the client. It might look at the email headers, the email body, or both.
Another interesting variation is where the patterns come from. Some systems use your address book for a list of valid senders. Others let you enter your own list of words. Some compile patterns from email sent in by people on the Internet. Others build patterns based on your past email. The most popular variations include collaborative filtering, Bayesian filtering, and fingerprinting. No matter what anyone claims, no system works 100% for everyone. As filtering techniques have improved, spammers have continued to work around them.
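To make the idea concrete, here is a minimal sketch of a content filter of the simplest kind described above: a list of known senders plus a hand-entered list of patterns, applied to the subject and body. The pattern list and the function name are made up for illustration, and real filters (Bayesian, collaborative, fingerprinting) are far more elaborate.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical hand-entered patterns; a real filter would learn or download these.
SPAM_PATTERNS = [
    re.compile(r"make money fast", re.IGNORECASE),
    re.compile(r"click here now", re.IGNORECASE),
]


def looks_like_spam(raw_message, known_senders=frozenset()):
    """Return True if the message matches a spam pattern and the sender is unknown."""
    msg = message_from_string(raw_message)

    # Mail from someone in the address book is always accepted.
    _, sender = parseaddr(msg.get("From") or "")
    if sender.lower() in known_senders:
        return False

    subject = msg.get("Subject") or ""
    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart bodies are ignored in this sketch

    text = subject + "\n" + body
    return any(pattern.search(text) for pattern in SPAM_PATTERNS)
```

The same function could run at the server or in the client; only the place where it is called changes.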
One content-filtering technique, called challenge-response, is on the rise and deserves a separate description. Essentially, the email is filtered by the sender's email address, which is compared against a whitelist of approved addresses. If the sender's email address isn't in the list, a challenge email is sent back to the sender, who must respond before the original message is delivered. The challenge is usually simple for a human to perform but hard for a computer.
Some challenges require the user to type a word in from a noisy image. I've even seen one that asks the user to "count the kittens" in a picture. I think these techniques are very successful, but I worry that they may be alienating certain groups like non-English users or the visually impaired. Also, some people refuse to use this technique since they don't want to annoy or offend the senders. Still, some people praise the technique and consider it so special that there has even been patent activity around it.
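The challenge-response flow described above can be sketched in a few lines. This is only an illustration under simple assumptions: the class and method names are hypothetical, the challenge is reduced to a random token the sender must echo back, and the sending of the challenge email itself is left out.

```python
import secrets


class ChallengeResponseFilter:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self, whitelist):
        self.whitelist = {address.lower() for address in whitelist}
        self.pending = {}  # token -> (sender, message)

    def receive(self, sender, message):
        """Return ("deliver", None) or ("challenge", token) for an incoming message."""
        sender = sender.lower()
        if sender in self.whitelist:
            return ("deliver", None)
        token = secrets.token_hex(8)
        self.pending[token] = (sender, message)
        return ("challenge", token)

    def answer(self, token):
        """Release the held message if the token is valid, and whitelist the sender."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None
        sender, message = entry
        self.whitelist.add(sender)
        return message
```

Once a sender has answered one challenge, later mail from the same address is delivered directly.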
Another category of spam defense is the network level defense. This technique simply involves looking at the IP address of the machine sending an email and deciding if it is allowed or not. This lookup is done against a blocklist, which is just a list of IP addresses considered to be bad. If the IP address is allowed, then the mail connection proceeds and the email is processed. If the IP address is not allowed, then the TCP connection is dropped or the SMTP connection is aborted with a descriptive error like "Your machine is on the XYZ blocklist, bye bye".
This system works because IP addresses are hard to forge and people can't easily get new ones. If an IP becomes useless to a large portion of the Internet, the spammer must spend energy to get a new IP. The benefits of this defense are unique. If it works, it prevents the wasted network and CPU utilization that spam causes on mail servers. It is also unique in that it is geared more toward administrators than end users. Unfortunately, it too has its weaknesses. Currently, spammers routinely take over unsecured hosts on the Internet in order to relay spam. Also, some blocklists end up being ineffective because they are incomplete, inconsistent, or too extreme in their practices.
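The connection-time decision can be sketched as a lookup of the client's address in a list of blocked networks. The blocklist below is a hypothetical local one built from documentation-only address ranges, the host name mail.example.com is invented, and the function name is an assumption; a real server would make this check before accepting any message data.

```python
import ipaddress

# Hypothetical local blocklist (documentation-only address ranges).
BLOCKLIST = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def smtp_greeting(client_ip, blocklist=BLOCKLIST):
    """Return the first SMTP reply line to send to a connecting client."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in blocklist):
        # 554 at connection time tells the client no mail will be accepted.
        return "554 Your machine is on the XYZ blocklist, bye bye"
    # 220 is the normal greeting; the host name is a placeholder.
    return "220 mail.example.com ESMTP"
```

Because the decision needs only the source address, the server spends almost nothing on a blocked host.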
One last technique is cryptographic authentication. This is more of a proposed or future technique than one that is currently used in practice. The idea is similar to using a whitelist of approved emails or hosts. The difference is that you only allow senders that have the proper credentials based on modern cryptography. These credentials would be impossible to forge and expensive to re-purchase continually.
This technique is worth mentioning since there are groups working hard on a secure email infrastructure. Such a system would require an authentication piece as well. If this were built, not only would we be able to send email securely, but we could have the ability to filter spam. Unfortunately, since the existing email infrastructure is so huge and entrenched, it will take a long time for such a system to get built.
All of these techniques have their pros and cons. The cons are especially annoying with naive or poor implementations. They may filter an email and never let the original sender know that their message was marked as spam. They may block a host that should actually be allowed. Some require a lot of user intervention. Most will accidentally block subscribed mailing lists. Some systems that share patterns over the Internet mark forwarded "joke" emails as spam, though this may be a feature. Some require lots of email or time to learn your valid email patterns. All of this will make it harder for new people (customers or anonymous contributors) to communicate with you.
To cut to the chase, I looked at the above techniques and chose a network level defense. The choice was easy for me since that system was the only one that could have protected my machine's resources from the recent spam attack I endured.
On November 19, 2002, I was getting 10-20 TCP connections per second from around 300 different IP networks to my machine at a colocation facility. I checked the source IPs and they were coming from all over the globe. The destination email addresses all conformed to a simple pattern; this indicated that something was performing a simple algorithmic attack. My computer was really sluggish from queuing all of the bounce messages. My qmail queue was over 13,500 messages at that point. In fact, I couldn't even send out email through the machine's localhost interface. The load caused all sorts of timeouts for other systems on the machine.
After about 24 hours, the attack ended. I didn't receive or relay any spam, but I was really upset. I did some research on the net and a few people in the spam community believed that this had the signature of a Klez Cluster Attack. This is an attack where the spammer uses a cluster of machines infected with a Klez virus to relay spam to various hosts on the Internet. Think of it as unsuspecting users donating their machine time to the spam@home project. These types of attacks appear to be increasing steadily, and I'm not the only one upset about it.
So, with that experience, I came up with a simple list of features that my spam defense would have to provide.
With that in hand, I did some research on the existing network level spam defenses and talked to a few friends. Let me go over that research right now.
Network level spam defenses owe their design to the original group, MAPS. The MAPS project started in 1997 as a small private mailing list called the Realtime Blackhole List (RBL). It was composed of like-minded anti-spammers. Paul Vixie, a widely known netizen, was one of the main people involved with the group and he helped publicize their efforts. They created a list of IP addresses that spammers were using and allowed other members to query their database in real time, over the Internet. If an IP address was in that list, and it attempted to send mail through any of the MAPS subscribers' networks, the packets were "black holed" or dropped. This worked well against some of the main spammers who were coming from known networks.
At first, the RBL group used the Border Gateway Protocol (BGP) for distributing this blackhole list or database to other systems. Although BGP was normally used for exchanging global routes between core Internet routers, it could also be used for distributing the RBL database. Since almost all of the systems that could talk BGP were routers, the RBL system was mostly useful to people in control of their routers. It also required good knowledge of the protocol and a decent Internet connection. These features kept the RBL from being useful to a larger set of administrators.
A simpler system was devised to make the RBL much more approachable to normal administrators with fewer resources. In the same way that the MAPS group used an existing protocol for a new purpose, they found another system that would fit this new set of requirements. They chose to use the most successful distributed database system that was already in use, the Domain Name System.
Paul Vixie and a few others already had expert knowledge of DNS, having worked on the BIND DNS server, the most widely used DNS server of the time. Choosing DNS allowed them to reuse a lot of existing software and avoid conflicts with existing firewall rules. Also, because it was DNS, it was already lightweight and well tested. This adaptation on top of DNS is the system used by probably 99% of the network level blocklist providers today. In the anti-spam community, the protocol is called IP4R, which is probably derived from the phrase "IPv4 Reverse Lookup".
When you query a server via IP4R for IP addresses, your query is similar to the query that a host uses when looking up the name associated with an IP address. Suppose you use the 18.104.22.168 IP address for your system, which is called a.b.com. When you connect to a server on the Internet, the destination server will query its DNS servers for the name associated with your IP address. It does that by querying for a DNS record in the namespace 168.22.104.18.in-addr.arpa (the octets of the address in reverse order), starting from the root name servers. If there is a PTR record set up by your ISP for 18.104.22.168 (and the destination server's DNS is set up correctly), the query will eventually return a.b.com as the name associated with that IP.
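As a small illustration of how the reverse-lookup name is formed, the sketch below reverses the octets of an IPv4 address and appends in-addr.arpa. The function name is hypothetical; Python's ipaddress module exposes the same result through its reverse_pointer attribute.

```python
import ipaddress


def reverse_lookup_name(ip):
    """Build the in-addr.arpa name used for a reverse (PTR) lookup of an IPv4 address."""
    # IPv4Address validates the input before we split it into octets.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + ".in-addr.arpa"


print(reverse_lookup_name("18.104.22.168"))
```

To perform the lookup itself, the standard library's socket.gethostbyaddr can be called with the original address.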
When email servers want to use IP4R to see if a host is from a hostile IP address, they do a similar lookup with a few differences. The first difference is that IP4R does not use the same DNS namespace. You would use the blocklist provider's namespace rather than in-addr.arpa. For example, if we used a service from example.com, it may tell us to use the namespace rbl.example.com. The second difference is in the DNS reply from the lookup. A normal IP reverse lookup would expect a reply to return a hostname. An IP4R reply, on the other hand, returns a special IP address to indicate its answer. Let's step through a simple IP4R lookup to illustrate.
First, we would set up our software to query for IP addresses in our example provider's namespace, rbl.example.com. Suppose we are checking on a host with the IP of 18.104.22.168. The DNS query would go out on the Internet for the name 168.22.104.18.rbl.example.com. If the IP address is in the blocklist, the server will reply with the address 127.0.0.2 (in DNS parlance, an "A" or address record of 127.0.0.2). If the IP address isn't in the database, the query will return an empty reply. That is all there is to it. Reusing DNS and keeping the protocol as simple as possible was a great idea.
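The lookup just described can be sketched with nothing more than an ordinary name resolution. This is a minimal sketch, not how any particular mail server does it: the function names are hypothetical, rbl.example.com is the example namespace from the text rather than a real provider, and the resolver is passed in as a parameter so the logic can be exercised without touching the network.

```python
import ipaddress
import socket

LISTED_ANSWER = "127.0.0.2"  # the "A" record the text says a listed address returns


def ip4r_query_name(ip, zone="rbl.example.com"):
    """Reverse the octets of an IPv4 address and append the provider's namespace."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone="rbl.example.com", resolve=socket.gethostbyname):
    """Return True if the provider answers the IP4R query with 127.0.0.2."""
    try:
        answer = resolve(ip4r_query_name(ip, zone))
    except socket.gaierror:
        # No such name: the address is not in the provider's database.
        return False
    return answer == LISTED_ANSWER
```

For instance, ip4r_query_name("192.0.2.1") yields 1.2.0.192.rbl.example.com, and is_listed performs the actual DNS query when called with the default resolver.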
There is also another optional record that an IP4R provider can send as the result of a query. In DNS terminology the record is called a TXT or "text" record. These records just hold character strings. If an IP is in the database, the provider can also return a TXT record in the reply to the query. Within the reply, the IP4R provider can put an explanation of why the record exists in the database and who to contact about it. This is important, and I'll show you how this comes into play in the actual implementation.
The last part of the protocol to mention is required for testing purposes. All IP4R providers should have the 127.0.0.2 address in their database. This is in the loopback network (127.0.0.0/8) and should never be an address on the Internet. Since it can't be on the Internet, we can safely use it to test queries to the IP4R provider. A neat trick for doing this is to use the ping command on the command line. For example, if we use rbl.example.com again, you should be able to do:
# ping 2.0.0.127.rbl.example.com
64 bytes from 127.0.0.2: icmp_seq=0 ttl=255 time=0.1 ms
64 bytes from 127.0.0.2: icmp_seq=1 ttl=255 time=0.0 ms
If your ping starts pinging (like the example above), then you have a proper setup. If you get some other error (usually an "unknown host"), then you know to review your configuration.
Also, since we get the address 127.0.0.2 if an IP is in the blocklist, we can use the same simple technique to see if arbitrary addresses are in the IP4R provider's database. Using the 18.104.22.168 address example again:
# ping 168.22.104.18.rbl.example.com
If it's in there, then you should get ping replies. If the address isn't in the blocklist, then you should get an 'unknown host' or similar error.
Before we move on, let me briefly cover something I glossed over in that last section. In order to use the IP4R protocol, your mail software must support it. The good news is that most email servers (or, more pedantically, Mail Transfer Agents or MTAs) support this. If a protocol is simple in design, it is usually simple to implement.
In the beginning of this article we took a brief tour of the various anti-spam techniques in order to determine the right solution. Then we went deeper into the best one for my situation, a network level spam defense. In the next article, I'll go over which blocklist provider I chose, giving a detailed description of an install with my mail server and discussing the positive results of this effort.
Dru Nelson has been on the Internet since 1988. After starting an ISP in Florida, he moved to the San Francisco Bay area and has been involved with large Internet infrastructure at companies like Four11 (Yahoo Mail), eGroups (Yahoo Groups), and Plaxo. He is now the CTO and co-founder of BrightRoll.com.
Copyright © 2009 O'Reilly Media, Inc.