Like so many Internet abuses–spam, artificial measures to boost search engine rankings, and most recently, anonymous postings by businesses or political campaigns that pretend to be unaffiliated–click fraud is an Internet war that undergoes constant metamorphosis.

Both sides (those who wish to abuse the system and those who want to keep it clean) are always increasing their sophistication, pushing more and more criteria through more and more complex algorithms in an effort to outsmart the foe.

In these sorts of battles, the good guys are always attempting to reduce the burden on humans by automating their responses, and run up against the problems involved in anticipating a creative human foe. Also endemic to these situations are complaints about unfairness, inaccurate classifications, and other weaknesses suggesting a need for standardization.

What interests me about the company Click Forensics is the distributed, P2P-like strategy put into practice by their Click Fraud Network. It attempts to maximize the volume and variety of data available for sifting the good clicks from the bad.

The Network brings in a huge amount of input data and allows fast turn-around for results. The question is whether this technique will work for the domain of click fraud.

As in all the collaborative, Web 2.0 sorts of sharing that are currently making the news (or quietly making revolutions in various fields), the Click Forensics strategy reaches out to large numbers of affected users in an appeal to combine their power and try for ever-better results. The philosophy behind this strategy is intriguing; whether or not it proves successful, it is already yielding some interesting public results.

The basics of PPC quality

Everyone agrees there’s a certain amount of abuse in the pay-per-click (PPC) system of Web advertising. It’s probably perpetrated mostly by competitors of the sites that pay for the clicks, or by ad publishers boosting their profits–but there are many other motivations, and the proportion of each motivation in the mix is unclear.

We do know, though, that people abuse the system by running bots that visit PPC links, by passing through a variety of proxy servers to hide repeat visits, and even by paying people in low-wage countries to click the links.

Nevertheless, click fraud is a mushy concept. Is it fraud if somebody clicks on a link for jewelry because they love to do virtual window-shopping, even though they know they can’t afford to buy? Or what about a kid who clicks on all the pretty pictures?

That’s why I avoided the loaded term “click fraud” in the title to this article, choosing the less eye-catching term “quality.” Some commentators prefer the term “invalid clicks” to the somewhat inflammatory “click fraud.” (Click fraud is a subcategory of invalid clicks.) We’ll return to this topic throughout the article.

Checking for click fraud can be done by search engines and other sites that host ads, or by independent companies such as Click Forensics. The data collected by each company is comparable, but each has its own particular recipes for determining suspicious clicks. Tom Cuthbert, President and CEO of Click Forensics, divided the criteria into three types in a phone conversation with me. A few examples of each type follow:

Technical
Suspicious activities include an IP address in a region you don’t serve, or repeat clicks from a single IP address. A referring page that has nothing to do with your PPC link should raise questions, and if the referring page is a proxy server known to serve up click fraud, you’ve practically got an open-and-shut case.
Behavioral
A very brief visit, or a large number of clicks in a short period of time, suggests the visitor is either a bot or a person who isn’t really reading the site.
Economic or market
PPC links that cost more are at higher risk of click fraud.

Hardly any of the preceding criteria are a dead giveaway. Many people are fast readers or like to compare different sites; that could lead to a legitimately high frequency of clicks. A person may visit your site from a country you don’t sell to because she’s on vacation. But many companies claim to be able to identify likely fraud through rankings that assign weights to each factor–and the factors can run into the hundreds.
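The idea of weighting many weak signals can be sketched in a few lines of Python. This is only an illustration: the signal names, point values, and threshold below are hypothetical, and real systems reportedly weigh hundreds of factors.

```python
# Hypothetical signals and point values, chosen only for illustration.
WEIGHTS = {
    "ip_outside_served_region": 20,   # technical
    "repeat_clicks_same_ip": 30,      # technical
    "known_bad_proxy_referrer": 60,   # technical
    "very_brief_visit": 25,           # behavioral
    "high_cost_keyword": 10,          # economic or market
}


def suspicion_score(signals):
    """Sum the points of the signals that fired, capped at 100."""
    total = sum(WEIGHTS[name] for name, fired in signals.items() if fired)
    return min(total, 100)


def is_suspicious(signals, threshold=50):
    """Flag a click for review when its combined score reaches the threshold."""
    return suspicion_score(signals) >= threshold


# A vacationing customer trips one weak signal and is not flagged.
tourist = {"ip_outside_served_region": True}
# Repeat clicks plus a brief visit on a costly keyword add up.
bot_like = {
    "repeat_clicks_same_ip": True,
    "very_brief_visit": True,
    "high_cost_keyword": True,
}
print(is_suspicious(tourist), is_suspicious(bot_like))  # False True
```

No single signal except the known-bad proxy crosses the threshold on its own, which mirrors the point that few criteria are a dead giveaway by themselves.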

(Another problem related to click fraud is impression fraud. This is the opposite of click fraud: deliberately bringing up ads in search engines without clicking on them. The intent here is to make it seem like the advertiser is wasting the words that it has bid on. The result could be that the advertiser abandons a potentially profitable campaign, or gets booted off by the search engine.)

The wider the breadth of data you have access to, the more easily you can identify bad IP addresses or suspicious patterns. That’s one reason (aside from the complexity of the whole process) advertisers can rarely fight click fraud by themselves. Search engines can mine their entire sets of pages and logs; consequently, major search engines have a head start over minor ones at identifying fraud.

As an example of search engine efforts, Yahoo! claims to run each PPC click through literally thousands of filters to detect low-quality clicks. These filters are tuned constantly, just as email system administrators tune their spam filters. Yahoo! depends on both an internal team of statisticians and reports from customers, which it encourages. Not only can a customer sometimes get a refund; its report can help Yahoo! identify a new pattern of fraud and update its filters. John Slade, Senior Director of global product management for Yahoo! Search Marketing, says, “We’ve given advertisers billions of free clicks.”

While search engines can aggregate data about all their clicks, the advertisers themselves can collect information along some dimensions beyond those that are available to search engines. The advertisers can figure out, for instance, how much time elapses between visits to different pages on their site. The knowledge they get is very limited so long as each advertiser has only its own data, but companies such as Click Forensics can combine the data from many advertisers and mine it for larger trends.
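To make the benefit of pooling concrete, here is a minimal sketch in Python. It assumes each advertiser has already produced its own list of suspect IP addresses; the function name, the data layout, and the two-advertiser cutoff are hypothetical and are not drawn from any vendor’s actual system.

```python
from collections import defaultdict


def ips_seen_across_advertisers(reports, min_advertisers=2):
    """Return IPs that at least `min_advertisers` advertisers listed as suspect."""
    advertisers_by_ip = defaultdict(set)
    for advertiser, suspect_ips in reports.items():
        for ip in suspect_ips:
            advertisers_by_ip[ip].add(advertiser)
    return sorted(
        ip for ip, names in advertisers_by_ip.items()
        if len(names) >= min_advertisers
    )


# Addresses come from the ranges reserved for documentation.
reports = {
    "advertiser_a": ["192.0.2.7", "198.51.100.4"],
    "advertiser_b": ["192.0.2.7"],
    "advertiser_c": ["203.0.113.9", "192.0.2.7", "198.51.100.4"],
}
print(ips_seen_across_advertisers(reports))
# ['192.0.2.7', '198.51.100.4']
```

An address that looks like a one-off to any single advertiser stands out once several advertisers report it.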

The distributed approach

So it makes sense for Click Forensics to provide a data collection service that combines customer data on an ongoing basis. Advertisers can participate in their Click Fraud Network for free if they get fewer than 100,000 clicks per month. Click Forensics offers an enhanced set of services for a fee to customers with more than 100,000 clicks per month.

First, each advertiser in the network loads a page tag into its key web pages, similar to the counters and other tracking devices web sites use for various purposes. This tag, according to VP of Product Development Tom Charvet, sends the Network standard visitor information (IP address, time, referring URL, and so forth) as well as some extra data such as the screen resolution. The tag also sends the browser a cookie that is useful for tracking both the current session and future visits to the advertiser’s site.

The advertiser also sends the Network the web server’s log file entries related to the visit. This also has standard visitor information.

Using forms of analysis familiar to the industry, such as how many pages the visitor looks at and for how long, Click Forensics returns a report about all the site’s visitors to the advertiser on a daily basis, and builds statistics based on data from all advertisers to continuously improve its analysis.
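As a rough illustration of this kind of analysis, the Python sketch below groups page-tag hits by cookie and reports pages viewed and elapsed time per visitor. The `TagHit` record and its field names are assumptions based on the data listed above, not the Network’s actual format.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TagHit:
    """One page view reported by a page tag (field names are hypothetical)."""
    cookie_id: str
    ip: str
    referrer: str
    page: str
    time: datetime


def summarize_sessions(hits):
    """Return {cookie_id: (pages_viewed, seconds_from_first_to_last_hit)}."""
    by_visitor = defaultdict(list)
    for hit in hits:
        by_visitor[hit.cookie_id].append(hit)
    summary = {}
    for cookie_id, visitor_hits in by_visitor.items():
        times = [h.time for h in visitor_hits]
        duration = (max(times) - min(times)).total_seconds()
        summary[cookie_id] = (len(visitor_hits), duration)
    return summary


hits = [
    TagHit("v1", "192.0.2.7", "http://search.example/", "/", datetime(2007, 1, 5, 12, 0, 0)),
    TagHit("v1", "192.0.2.7", "/", "/products", datetime(2007, 1, 5, 12, 1, 30)),
    TagHit("v1", "192.0.2.7", "/products", "/order", datetime(2007, 1, 5, 12, 4, 0)),
    TagHit("v2", "198.51.100.4", "http://search.example/", "/", datetime(2007, 1, 5, 12, 2, 0)),
]
print(summarize_sessions(hits))
# {'v1': (3, 240.0), 'v2': (1, 0.0)}
```

A one-page visit shows a duration of zero here because there is no second hit to measure against, a limitation any tag-based measurement of this sort has to work around.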

While the data supports Click Forensics’ business, they also maintain a public face through some related initiatives:

  • An attempt to standardize report formats so that advertisers can more easily communicate with search engines and publishers.

  • A forum for the advertising industry called the Click Quality Council.

  • A study being conducted by Dr. Catherine Tucker at MIT, subjecting the data from the Network to econometric analysis to figure out what sorts of users engage in click fraud, and the relationships between them.

  • A Click Fraud Index (somewhat like white papers released by other search marketing companies) that reports on trends in the field.

This community approach lets Click Forensics identify patterns and tactics used by fraudsters by watching campaigns in various industries (just as the search engines do), as well as patterns across many search engines and publishers.

Challenges in the search for click fraud

Various ways of collecting data can feed into sophisticated data mining techniques–but are these techniques valid?

Last summer, Google challenged key techniques of data collection, including some used in reports from Click Forensics. The main problems stem from the use of page tags or cookies instead of depending on log files. Shuman Ghosemajumder, Google’s business product manager for Trust & Safety, reiterated the findings in a recent blog and follow-up. Click Forensics also posted a response.

The Google complaint deals with data detection rather than analysis. Detection problems can be fixed, as I confirmed in a phone call with Ghosemajumder, so I won’t focus on it further in this article. But other people have questioned common techniques for data analysis as well, a more troubling charge.

I talked to Michael Stebbins, VP of marketing at ClickTracks, who rejects the idea that the definition of click quality can be the same for every advertiser. “For example,” he told me, “if the average visit time across all your PPC ads is 240 seconds, and one particular ad is sending visitors that stay less than 15 seconds, ClickTracks would flag that one ad suspicious. On another web site, the overall average might be lower than 15 seconds, and that’s fine. This idea of everyone joining hands and singing the same song–setting one standard for a quality visitor–sounds appealing but doesn’t make sense.” This was a hard-won decision taken after much research and discussion by staff.

ClickTracks therefore relies on intrasite analysis. They downplay obvious checks such as multiple clicks from the same IP address, not only because search engines filter out those clicks, but because most fraudsters have moved beyond such simple methods. Instead, ClickTracks compares visitor behavior for ads on the same site and variance in patterns of clicks over time; they also favor combinations of statistics, as described on their site.
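The intrasite comparison Stebbins describes can be sketched as follows. This is not ClickTracks’ actual method: the function name, the data layout, and the one-quarter ratio are hypothetical, and the sketch assumes every ad has at least one recorded visit.

```python
from statistics import mean


def flag_ads(visit_seconds_by_ad, ratio=0.25):
    """Flag ads whose mean visit time falls below `ratio` times the site-wide mean."""
    all_visits = [s for visits in visit_seconds_by_ad.values() for s in visits]
    site_mean = mean(all_visits)
    return sorted(
        ad for ad, visits in visit_seconds_by_ad.items()
        if mean(visits) < ratio * site_mean
    )


# On a site where visits usually last minutes, a 10-second ad stands out.
long_visit_site = {
    "ad_a": [300, 260, 280],
    "ad_b": [250, 230, 240],
    "ad_c": [10, 12, 8],
}
# On a site where every visit is brief, the same 10 seconds is normal.
short_visit_site = {
    "ad_x": [12, 9, 14],
    "ad_y": [10, 11, 13],
}
print(flag_ads(long_visit_site))   # ['ad_c']
print(flag_ads(short_visit_site))  # []
```

Because the baseline is each site’s own average, the same visit length is flagged on one site and ignored on another, which is the point of the argument against a single standard.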

I also talked to Dmitri Eroshenko, CEO of Clicklab. He too denies that click fraud can be detected by applying common parameters to different sites; each advertiser is unique.

So Clicklab rejected the idea of collecting statistics and applying them to customers across the board. They offer instead a premium service, with prices starting at several hundred dollars per month, to carry out manual inspections of log files over periods of weeks.

Eroshenko is pessimistic about the ability of small sites to detect click fraud accurately, if they lack the resources to invest in this kind of expert support. He expressed a related skepticism about most of the statistics published that estimate the incidence of click fraud. Normally, he says, they are based on only a few measures, are not calibrated against any control group, and draw on input that is not validated by human analysis.

The maturation of an industry

Some common themes turned up during my conversations with all these industry experts. One is an attempt to standardize the parameters used to classify click quality; another is increasing transparency in how data is collected and analyzed. Both suggest that a young industry is maturing.

Jeffrey K. Rohrs, President & Chief Internet Strategist at the interactive marketing agency Optiem, pleaded for broad transparency in an open letter to search engines, publishers, and advertisers called The Sausage Manifesto. He believes that the unappealing, secret methods for determining fraud–the current sausage-making in the click fraud industry and search engines–must be exposed somewhat to public light.

ClickTracks provides an example of transparency. Their reports to the advertiser provide details on each session from each suspicious ad (along with various other useful combinations of statistics). The emphasis is on providing enough information for an informed human to make a judgment–not just to accept an automated indication of fraud. For instance, if suspicious activity warrants close examination, the advertiser can list such things as the referring sites and session times on the advertiser’s site associated with each session started by the ad.

On the subject of standardization, the Click Quality Council started by Click Forensics will be just one constituent of a new Click Measurement Working Group started by the Interactive Advertising Bureau. The goal of this initiative is as simple and fundamental as defining “what is a click?” But if it can do its job (currently it’s reported to be moving slowly), it will eliminate the destructive public arguments between search engines, advertisers, and third-party click fraud monitoring firms over the extent of the problem, examples of which were mentioned earlier.

At Yahoo!, Slade promises that Yahoo! will move to the standard set up by the IAB and then submit to an external audit to show that it conforms. Rohrs also looks forward to this kind of standardization and auditing: “The IAB Click Measurement Working Group’s standard will be only step one. It has to be followed by third-party auditing to make sure everybody is counting clicks according to the standard.”

Thoughts about widespread data collection

I respect the research of ClickTracks and Clicklab, which call for an individual assessment–and one that involves human intelligence–by advertisers. But it seems reasonable to me that casting a wide net (perhaps a politically incorrect metaphor in this age of depleted fish populations) helps detect patterns in click quality, as it does in the other Internet problems mentioned at the beginning of this article.

Therefore, I see Click Forensics’ attempt to provide a common framework for everyone to work together as a natural evolution in the field. It’s an application of the popular trend of combined intelligence from many inputs.

True, the data is maintained by Click Forensics and not shared openly. But there are privacy concerns as well as proprietary ones driving this decision: no advertiser can get access to others’ raw data.

In theory, a collaborative information-sharing system would treat all participants as equals and give them equal access to data. Potentially, if others could have access to the data in Click Forensics’ Network, they could discover new techniques for solving the problem. On the other hand, perhaps they could also discover new ways to abuse PPC links. I don’t see an easy way to safely open up the data in this area.

From restitution to repair

Before I move to the final observation in this article, I should recognize explicitly that fraud is a social and legal issue, not just a technical one. Trade publications have reported that the Securities and Exchange Commission and FBI are looking into click fraud. But maybe the lack of news in this area is a positive sign, because applying older legal frameworks to a new and fast-moving Internet area can lead to problems.

In a contrasting movement, many industry leaders I talked to blurred the distinction between fraud and simply trying to get rid of categories of poorly-performing clicks. Of course, proving fraud means proving intent, which is very hard to do. But for purely practical reasons, a shift from detecting bad guys to fixing bad ads may be beneficial. I detected this trend in Click Forensics as well as other firms.

Getting a refund for fraud is useful and feels good, according to Stebbins at ClickTracks, but often it’s not worth the effort. “It’s much more economical to tune out the offending property of the ad. Usually this involves lowering or eliminating the syndication bid price, narrowing the geographic target, or removing the ad altogether. Many of our customers cut spending by 40-50% while maintaining the same number of conversions.” Ultimately, the goal is to get rid of poorly-performing ads.

I know this sounds counterintuitive, because if you come up with a really good ad, you would expect those who have an animus toward you to attack it. But it’s a refrain I heard repeatedly during my conversations.

Another company, Authenticlick, takes this philosophy even further. Authenticlick uses statistical modeling to identify varying degrees of click quality. Authenticlick’s President and CEO Michael Leonard says its AQUA (Authenticlick QUAlity) score can be used by advertisers to optimize their campaigns around better performing sources of traffic, in addition to documenting refund requests on the basis of fraud or low quality clicks. The scoring technology can be used not only by advertisers but by search engines and ad networks as well; it lets them filter out low quality traffic (including click fraud), price clicks according to their quality, and improve their advertiser customers’ ROI.

The proof of the pudding, so far as Web advertising goes, is the conversion rate, which is the number of visitors that end up making purchases, filling out forms, or performing other actions desired by advertisers. Leonard told me that Authenticlick doesn’t use conversion as a criterion for rating clicks, but compares their ratings to conversion rates to demonstrate that their scoring technology works.
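The validation Leonard describes, checking scores against conversion rates after the fact, can be sketched in Python. The 0 to 100 scale, the three bands, and the sample data are hypothetical; the article’s sources do not describe how AQUA scores are actually scaled or computed.

```python
def conversion_rate_by_band(clicks, bands=((0, 33), (34, 66), (67, 100))):
    """Given a list of (score, converted) pairs, return {band: rate or None}.

    The rate is conversions divided by clicks within each score band;
    a band with no clicks maps to None.
    """
    rates = {}
    for low, high in bands:
        outcomes = [converted for score, converted in clicks if low <= score <= high]
        rates[(low, high)] = sum(outcomes) / len(outcomes) if outcomes else None
    return rates


# Made-up sample: the score is assigned first, the conversion is observed later.
clicks = [
    (10, False), (20, False), (25, False), (30, True),
    (50, False), (55, True),
    (70, True), (80, True), (90, True), (95, False),
]
print(conversion_rate_by_band(clicks))
# {(0, 33): 0.25, (34, 66): 0.5, (67, 100): 0.75}
```

If higher-scored bands convert at higher rates, as in this invented sample, the scoring is doing useful work even though conversion itself never entered the score.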

Authenticlick doesn’t identify particular clicks as fraudulent (its position is that this is impossible in most cases); rather, it uses metrics related to each click to assign a score associated with higher or lower conversion rates. The company wants to help the PPC industry evolve by enabling search engines and publishers to differentiate themselves and compete on the basis of the quality of their traffic. As I said earlier, this goal is consistent with what I heard from other companies as well.

And then we’d return to defining quality in terms of customer goals, instead of fighting a war.