ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Creating Google Custom Search Engines

by Bernard Farrell
09/06/2007

Why Do You Need a Better Search Engine?

It's early in the evening and some old school friends just called unexpectedly. They'll be at your house in 90 minutes and you want to make a quick meal with what you have in the house. Where do you find an easy-to-make recipe using the ingredients you've got to hand (chicken pieces, onions, potatoes, cream, wine, and seasonings)? You open your browser, go to Google, search for a recipe with these ingredients, and you're given nearly 600,000 pages. Well, that's not a lot of help. It might take you an hour to find a good recipe among all those sites.

There must be a better way to search just the sites that you'd normally use for recipes and return a smaller set of results that are more likely to help you get a meal on the table. This is where a Google Custom Search Engine can help you.

What Is a Custom Search Engine?

A custom search engine (CSE) tells Google which sites to search and which to avoid when dealing with a search query. This makes it much easier to get specific, guided answers to questions about a particular subject area. If you create a CSE, you can use your expertise in a subject to control where Google looks for information about that topic. And you may even make some money in the process, because the custom search engine returns AdSense advertisements with each set of results. If you have an AdSense account, then the revenue from those advertisements can go to you. Don't get too excited, though: you'll probably earn less than a few dollars a month from AdSense unless your search engine becomes really successful.

You can tune this list of sites over time, adding and removing sites from the list. This makes it easy to improve the results based on the queries entered by people who are using your custom search engine.

Here's an example of a custom search engine for Bermuda. The editor of this CSE has chosen 80 sites that are worth searching to produce results about Bermuda that help him and others. So, if I love scuba diving and I search on this site for the term "scuba tours," I get results that are about scuba tours in Bermuda. If I used the normal Google search page, I'd get over 2 million results and I'd spend much more time looking for Bermuda-related scuba touring information.

A properly built CSE returns search results intended for specific audiences or areas of interest. It's important to get the search engine name and description right so people don't get frustrated when trying to use it. The rest of the work is deciding where to search. Later, you can organize the results into categories to help fine-tune them, but that's an optional step you can skip when you're starting out. It may sound like a lot of work, but you can actually create a new custom search engine in minutes. You'll see how in the next section. Later, I'll show you how to tune the CSE interface, site selection, and results.

Less Than 10 Minutes to a New Search Engine

I'm going to create a custom search engine for recipes. Along the way I'll show you the type of information you'll need to make your own CSE. Before you get going, you need a Google account. If you don't have one, go to the Google create account page.

The first step in creating your custom search engine is to visit the Google CSE home page. Here we'll complete a two-step process: entering the setup data and then trying out the engine. After that, our CSE is ready for use. There are restrictions on what you can do with a custom search engine; these are listed on Google's CSE Terms of Service page.

Before you start, you need some basic information. Don't worry if you're not sure how to complete some of the details; you can easily change any of the CSE settings after you've created your engine.

Specifying the sites to search is one of the most important parts of creating the search engine. Here you control how much of each site is included in the results by how you enter its URL. I'll use the site http://homecooking.about.com/library/archive/ as an example.

URL specified for CSE                                 What's included                       Scope
*.about.com/*                                         The whole about.com domain            All pages in about.com
homecooking.about.com/*                               All of about.com's Homecooking site   Many pages
homecooking.about.com/library/archive/*               Recipes part of the site              Recipes pages
homecooking.about.com/library/archive/blbbindex.htm   Only this page                        One page
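
If you want to check which of your patterns would cover a given page before you enter them, a few lines of Python can approximate the matching. This is only a sketch: it assumes that * stands for any run of characters and that the scheme (http://) is ignored. Google's actual matching rules may differ, and the function names are my own.

```python
import re

def pattern_to_regex(pattern):
    """Turn a CSE-style URL pattern into a regex; '*' matches any run of characters.

    Assumption: this mirrors the wildcard behavior described in the table,
    not Google's real implementation.
    """
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts))

def is_covered(pattern, url):
    """Return True if the URL (with its scheme stripped) falls inside the pattern's scope."""
    bare = re.sub(r"^https?://", "", url)
    return pattern_to_regex(pattern).fullmatch(bare) is not None

patterns = [
    "*.about.com/*",
    "homecooking.about.com/*",
    "homecooking.about.com/library/archive/*",
    "homecooking.about.com/library/archive/blbbindex.htm",
]

page = "http://homecooking.about.com/library/archive/blbbindex.htm"
for pattern in patterns:
    print(pattern, is_covered(pattern, page))
```

Under these assumptions, all four patterns cover the index page, while a page elsewhere on the Homecooking site would be covered only by the first two.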

Unless you're creating the custom search engine for a non-profit, university, or government agency, you have to display advertisements on the results pages. And, of course, you must agree to the Terms of Service.

The completed first page looks something like the following.

CSE page figure
Figure 1. Initial CSE page

When you click the Next button, you're taken to the Try it Out page. Here you can test whether your engine produces results that you like. Don't set your expectations too high for this page, because you'll come back and tune the results further later. I'll try the query for pie crust and see what I get back.

pie crust query figure
Figure 2. Pie crust query

These results are appropriate and they all come from the right site. I check the box to get a confirmation email and press Finish.

Note: this confirmation email has lots of useful pointers for management of your engine, so it's worth requesting it. If you forgot to click this box or you delete the email message, don't worry. Log on to Google and click on the My Search Engines link.

I've placed this search engine online with all of its data so you can see how it works and what has been specified to Google. You can find my site for the search engine at http://recipeclues.com/. And the Google homepage for the Simple Recipes search engine is here.

Now we can start to customize the search engine in a number of useful ways. First, I'll show you how to add more sites to the list that Google uses for results.

Adding More Sites Using Forms

You can add more sites to your search engine in a number of ways. If you've already got a list of sites in mind, you can just go to the My Search Engines page and click on the control panel link for your search engine. Then, click on the Sites tab and you have a form that allows you to add and remove sites and check whether you've already got a site defined in your search engine.

sites tab figure
Figure 3. Sites tab

If you've only got a small number of sites, this interface is just about usable. One challenge is that it only allows you to work with 20 of the sites that you've specified at a time. Initially, that may work. But once you have a custom search engine that's using more than about 40 sites, you'll find this interface very tiring to use.

When you click on the Add Sites button, you'll get a popup dialog that allows you to enter individual sites.

enter individual sites figure
Figure 4. Enter individual sites

On this form, the final option lets you dynamically extract links from a page and add them to your search engine. It's useful when the page you're pointing at is a blogroll or a list of linked sites. This has been recently added to the CSE control interface and is not currently available in the bulk input approach.

I'll use the bulk site form to quickly input several additional recipe sites I want added to my engine.

bulk site form figure
Figure 5. Bulk site form

Note how I've used the wildcard option. In the case of the BBC site, all the recipes start with the same substring, with URLs like http://www.bbc.co.uk/food/recipes/swiftsuppers_vegpasta.shtml. I can put the * right after the first part of the URL; it doesn't need to be separated from the rest by a slash mark.

If you'd like to add sites to your engine as you come across them, then you can install the Google Marker bookmarklet in your browser. This is added as a button to your browser (IE or Firefox). Once in place, you simply click on it when you're at a site you'd like to add to your engine. You'll see the following simple dialog box. Fill it out and press Save to add the site to your engine.

google marker figure
Figure 6. Google Marker

When you're starting out, these form-based approaches for maintaining your search engine work. But as you add sites, you'll find it hard to work with multiple sites quickly and to add various properties to the sites as you input them. Next we'll look at using XML to specify the sites used in the CSE.

Specifying Your CSE Using XML

Before getting into the details here, I need to issue a warning. When you make changes to your search engine using XML files, you may accidentally break your working engine. So, please make sure to keep original versions of the XML files in a safe place in case you need to restore them.
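
One way to make that habit automatic is to take a timestamped copy of each file before you touch it. The following is a minimal sketch; the helper name and the file name in the usage comment are my own choices, not anything Google requires.

```python
import shutil
import time
from pathlib import Path

def backup(path):
    """Copy a file next to itself with a timestamp suffix and return the new path."""
    src = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, dest)  # copy2 also preserves the file's modification time
    return dest

# Usage (assuming you saved the downloaded context file as context.xml):
# backup("context.xml")
```

Each call leaves the original untouched and produces a new copy, so you can always go back to any earlier version.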

You can specify all the settings for your custom search engine in XML format. The easiest way to get going with this is to look at the current settings for your engine in XML form.

Go to the Control Panel for your custom search engine and click on the Advanced tab. Here you can download two different types of information. The Context information defines the global settings for the engine. These include the settings that you entered when first creating the engine, along with things like whether volunteers are allowed and what colors to use when displaying results. The Annotations information is the heart of your search engine. It has all the information about which pages and sites to include in the results and how these are treated.

First, let's get the current context information for the engine by clicking on the last Download in XML Format button on this tab. This doesn't really download anything; it just displays the current context information in another browser window or tab. You can then use your browser to actually save the XML to a file on your system.

<?xml version="1.0" encoding="UTF-8"?>
<GoogleCustomizations>
  <CustomSearchEngine version="1.0" volunteers="true" 
  keywords="homecooking &quot;easy to prepare&quot; &quot;simple cooking&quot;" 
  Title="Simple Recipes Search Engine" 
  Description="Easy-to-prepare recipes for home cooks. If you want to have a meal ready in 60 minutes, look here for good recipes that you can use immediately." 
  language="en" visible="true">
    <Context>
      <BackgroundLabels>
        <Label name="_cse_rlplbd3nkfw" mode="FILTER"/>
        <Label name="_cse_exclude_rlplbd3nkfw" mode="ELIMINATE"/>
      </BackgroundLabels>
    </Context>
    <LookAndFeel nonprofit="false"/>
  </CustomSearchEngine>
</GoogleCustomizations>

Initially, the most important values are those in the <BackgroundLabels> section of the XML. When using XML later to change the sites included in your engine, you'll need the name values that are in the <Label> nodes in this section.
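
If you'd rather not pick these values out by eye, a short script can read them from the saved context file. This sketch uses Python's standard xml.etree.ElementTree module on a trimmed copy of the context XML shown above; the function name is my own.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the context file shown above (most attributes omitted).
CONTEXT_XML = """<GoogleCustomizations>
  <CustomSearchEngine version="1.0" language="en" visible="true">
    <Context>
      <BackgroundLabels>
        <Label name="_cse_rlplbd3nkfw" mode="FILTER"/>
        <Label name="_cse_exclude_rlplbd3nkfw" mode="ELIMINATE"/>
      </BackgroundLabels>
    </Context>
    <LookAndFeel nonprofit="false"/>
  </CustomSearchEngine>
</GoogleCustomizations>"""

def background_labels(context_xml):
    """Return a dict mapping each background label's mode to its name."""
    root = ET.fromstring(context_xml)
    labels = root.findall(".//BackgroundLabels/Label")
    return {label.get("mode"): label.get("name") for label in labels}

for mode, name in background_labels(CONTEXT_XML).items():
    print(mode, name)
```

For your own engine, you would read the saved file instead of the embedded string and note the two names it reports.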

Now we download the data (annotations) in this engine by clicking on the Download in XML button that's in the Annotations section. You can also just visit this Download URL to get the same information. Again, you can save the contents of the resulting browser window to a file.

<?xml version="1.0" encoding="UTF-8"?>
<GoogleCustomizations>
  <Annotations>
    <Annotation about="www.bbc.co.uk/food/recipes/swiftsuppers*">
      <Label name="_cse_rlplbd3nkfw"/>
    </Annotation>
    <Annotation about="www.karenscountrykitchen.com/*">
      <Label name="_cse_rlplbd3nkfw"/>
    </Annotation>
...
    <Annotation about="www.recipeswizard.com/*">
      <Label name="_cse_rlplbd3nkfw"/>
    </Annotation>
    <Annotation about="www.dmoz.org/Home/Cooking/*">
      <Label name="_cse_rlplbd3nkfw"/>
    </Annotation>
  </Annotations>
</GoogleCustomizations>

If you have more than one custom search engine in Google, the annotations download will contain the annotations for all of your search engines. They'll be mixed in together, ordered by the latest sites added to any of the engines.

This is why the label names from the context information are so important. Every <Annotation> node that belongs to a specific search engine carries, in its <Label> node, the name of either that engine's FILTER label or its ELIMINATE label.

You could select the appropriate nodes for a given engine in a text editor, but if there are more than about 50 nodes it gets tiring really quickly. The easiest way to extract the nodes you want is to use XSLT and XPath to transform the downloaded annotations into just those for the search engine you're working on.

I've created an XSLT file to do this transform based on the name values for this search engine. You can modify it to use the name values for your own search engine.
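
If you're more comfortable with a scripting language than with XSLT, the same selection can be sketched in Python. This is an alternative to my XSLT file, not a copy of it. The second engine's label name and the example.org site in the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample download mixing two engines; "_cse_someotherengine" is a hypothetical label.
ANNOTATIONS_XML = """<GoogleCustomizations>
  <Annotations>
    <Annotation about="www.recipeswizard.com/*">
      <Label name="_cse_rlplbd3nkfw"/>
    </Annotation>
    <Annotation about="www.example.org/*">
      <Label name="_cse_someotherengine"/>
    </Annotation>
    <Annotation about="www.dmoz.org/Home/Cooking/*">
      <Label name="_cse_rlplbd3nkfw"/>
    </Annotation>
  </Annotations>
</GoogleCustomizations>"""

def annotations_for_engine(annotations_xml, label_names):
    """Return the annotations XML with only the nodes carrying one of label_names."""
    root = ET.fromstring(annotations_xml)
    container = root.find("Annotations")
    # Iterate over a copy, because we remove children while looping.
    for annotation in list(container):
        names = {label.get("name") for label in annotation.findall("Label")}
        if not names & set(label_names):
            container.remove(annotation)
    return ET.tostring(root, encoding="unicode")

# Keep nodes that carry either the FILTER or the ELIMINATE label of this engine.
mine = annotations_for_engine(
    ANNOTATIONS_XML, ["_cse_rlplbd3nkfw", "_cse_exclude_rlplbd3nkfw"]
)
print(mine)
```

Replace the two label names with the ones from your own context file to extract the annotations for your engine.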

Refining Your Search Engine

Now that you have the context and annotation files on your system, there are several ways in which you can refine your search engine. First I'll show you how to personalize your engine by changing the colors on the resulting page.

I made the changes shown below to the downloaded context file. These set the color of the background of the results pages, the border that goes around Google advertisements, and the title at the top of each search result returned.

<?xml version="1.0" encoding="UTF-8"?>
<GoogleCustomizations>
  <CustomSearchEngine version="1.0" volunteers="true" 
  keywords="homecooking &quot;easy to prepare&quot; &quot;simple cooking&quot;" 
  Title="Simple Recipes Search Engine" 
  Description="Easy-to-prepare recipes for home cooks. If you want to have a meal ready in 60 minutes, look here for good recipes that you can use immediately." 
  language="en" visible="true">
    <Context>
      <BackgroundLabels>
        <Label name="_cse_rlplbd3nkfw" mode="FILTER"/>
        <Label name="_cse_exclude_rlplbd3nkfw" mode="ELIMINATE"/>
      </BackgroundLabels>
    </Context>
    <LookAndFeel nonprofit="false">
      <Colors background="#FEFAFF" 
         border="#000033" title="#993300" />
    </LookAndFeel>
  </CustomSearchEngine>
</GoogleCustomizations>

I can then go back to the Advanced tab in the control panel for my search engine and upload this new file (remembering to first save a copy of the original file). Immediately after the upload has completed, I can use the Preview tab to see what the changes look like.

Here are the various color parameters that can be set and what they change on the search engine results page.

Parameter    Defines color for
Background   Background for the results page
Border       Border drawn around advertisements and above results
Title        Title of the URL above each search result
Text         Text shown for each search result
URL          Non-working URL shown below each search result
Visited      Title URL after it has been visited
Light        Other information beside the URL, such as Cached or refinement labels
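
If you change colors often, you can script the edit instead of making it by hand. The sketch below adds or updates a <Colors> node inside <LookAndFeel> using Python's standard library. The background, border, and title attribute names come from my context file above; I'm assuming the remaining parameters use the same lowercase spelling, so check Google's context file documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the original context file (most attributes omitted).
CONTEXT_XML = """<GoogleCustomizations>
  <CustomSearchEngine version="1.0" language="en" visible="true">
    <Context>
      <BackgroundLabels>
        <Label name="_cse_rlplbd3nkfw" mode="FILTER"/>
        <Label name="_cse_exclude_rlplbd3nkfw" mode="ELIMINATE"/>
      </BackgroundLabels>
    </Context>
    <LookAndFeel nonprofit="false"/>
  </CustomSearchEngine>
</GoogleCustomizations>"""

def set_colors(context_xml, **colors):
    """Return the context XML with the given color attributes set on <Colors>."""
    root = ET.fromstring(context_xml)
    look = root.find("./CustomSearchEngine/LookAndFeel")
    colors_node = look.find("Colors")
    if colors_node is None:
        # The engine has no custom colors yet, so create the node.
        colors_node = ET.SubElement(look, "Colors")
    for attribute, value in colors.items():
        colors_node.set(attribute, value)
    return ET.tostring(root, encoding="unicode")

updated = set_colors(
    CONTEXT_XML, background="#FEFAFF", border="#000033", title="#993300"
)
print(updated)
```

The output can be saved to a file and uploaded on the Advanced tab in the same way as a hand-edited context file.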

By making changes to the annotations file you really control the heart of your search engine. As well as defining more sites to include, you can also specify categories for each site and give them a higher or lower priority.

Google calls the categories Refinements. These appear at the top of the results page and are used to focus in on a set of the total results. For the Diabetes Search Engine the results page has several of these including symptoms, complications, and research.

refinements figure
Figure 7. Refinements

You can see that each result is displayed with a set of labels indicating how it was categorized. When you click on one of the label hyperlinks, only the results for sites with that label are returned.

To create a refinement, you must first modify the context file to specify the label names that will appear on the results page.

<?xml version="1.0" encoding="UTF-8"?>
<GoogleCustomizations>
  <CustomSearchEngine version="1.0" volunteers="true" ...>
    <Context>
      <Facet>
        <FacetItem Title="Lunch">
          <Label name="lunch" mode="BOOST" Rewrite="" IgnoreBackgroundLabels="false" weight="0.7"/>
        </FacetItem>
      </Facet>
      <Facet>
        <FacetItem Title="Dinner">
          <Label name="dinner" mode="FILTER" Rewrite="" IgnoreBackgroundLabels="false"/>
        </FacetItem>
      </Facet>
      <BackgroundLabels>
        <Label name="_cse_rlplbd3nkfw" mode="FILTER"/>
        <Label name="_cse_exclude_rlplbd3nkfw" mode="ELIMINATE"/>
      </BackgroundLabels>
    </Context>
    ... 
  </CustomSearchEngine>
</GoogleCustomizations>

I've defined two different labels that can be assigned to each site in the search engine. When the Lunch hyperlink is clicked, all sites with this label assigned have their priority boosted so they appear closer to the top of the search results. In this case, unlabeled sites can also appear in the results. We could boost by a larger amount (up to 1) to emphasize them even more.

When the Dinner refinement is chosen, only sites that are labeled with the Dinner refinement will appear in the results. All others, including unlabeled sites, will not be shown.

Now that we have some refinements, I can change the annotations file to assign these labels to the sites that I want. Each site is already defined in an <Annotation> node, and I simply add new <Label> nodes for each refinement that I want to assign.

    <Annotation about="www.absoluterecipes.com/*">
      <Label name="_cse_rlplbd3nkfw"/>
      <Label name="lunch"/>
      <Label name="dinner"/>
    </Annotation>
    <Annotation about="www.allfood.com/mmeal.cfm*">
      <Label name="_cse_rlplbd3nkfw"/>
      <Label name="dinner"/>
    </Annotation>

Once I've completed these changes I can upload my new annotations file to my search engine. Don't forget to back up the original annotations first!
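
With more than a handful of sites, adding these <Label> nodes by hand gets tedious. Here is a minimal Python sketch that adds refinement labels to one site's <Annotation> node and skips any label the site already has. The function name is my own, and the sample data is cut down from the annotations shown above.

```python
import xml.etree.ElementTree as ET

ANNOTATIONS_XML = """<GoogleCustomizations>
  <Annotations>
    <Annotation about="www.allfood.com/mmeal.cfm*">
      <Label name="_cse_rlplbd3nkfw"/>
    </Annotation>
  </Annotations>
</GoogleCustomizations>"""

def add_labels(annotations_xml, about, new_labels):
    """Return the annotations XML with new_labels added to the matching site."""
    root = ET.fromstring(annotations_xml)
    for annotation in root.iter("Annotation"):
        if annotation.get("about") != about:
            continue
        existing = {label.get("name") for label in annotation.findall("Label")}
        for name in new_labels:
            if name not in existing:
                ET.SubElement(annotation, "Label", name=name)
                existing.add(name)
    return ET.tostring(root, encoding="unicode")

labeled = add_labels(ANNOTATIONS_XML, "www.allfood.com/mmeal.cfm*", ["dinner"])
print(labeled)
```

The label names you pass in must match the name values defined in the <Facet> section of your context file.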

If you'd like to know more about these files, Google has pages documenting both the context file format and the annotations file format.

Conclusion

This article is intended to provide you with enough of an introduction to Google Custom Search Engines that you can create one of your own. This will give you a chance to use your expertise in a specific area to assist people who are looking for answers or help in that area. The detail on the advanced topics will also allow you to customize the data your engine returns and how it's presented to people.

References

Google's CSE FAQ answers some common questions about CSEs.

CSE Documentation will help you figure out some of the more difficult parts of customization.

The Google Custom Search blog will keep you updated on new facilities and changes.

You can get help from other CSE users in the Google Custom Search Help groups.

Google has some featured CSE examples that may inspire you.

Bernard Farrell is a software architect focusing on user experience issues. He has over 30 years' experience in software design and development, including user interfaces for early workstations. He created the diabetes search engine and writes a blog about diabetes technology.



Copyright © 2009 O'Reilly Media, Inc.