O'Reilly Network    
 Published on O'Reilly Network (http://www.oreillynet.com/)
 See this if you're having trouble printing code examples


Analyzing Baseball Stats with R

by Joseph Adler
10/27/2004

The internet is a great resource for the sports fan; there are dozens of sites where one can look up statistics on recent sporting events or on the great players of the past. Baseball statistics appeal to many: some follow their favorite team's seasonal progress, some monitor their fantasy teams, and some are just obsessed by the sheer complexity that is the world of Major League Baseball.

Let's assume for a minute that your love of stats transcends the run-of-the mill obsession and that you know that all of this data can do more than impress your friends at parties (well, sports bars) or tell you that Barry Bonds is the best hitter out there (or is he?). Suppose that you want to be able to calculate defensive ability, or find the best-valued players, or even predict the results of the post-season. Or suppose that you wanted to look at the relationship between player salaries and games won. With all of the raw data available to the stats nut, these calculations are at your fingertips. Below is an introduction to one of my favorite ways to examine the abundance of raw data available on the web: using R to analyze baseball data.

In this article, we focus on salaries. It's playoff time again, and one of the key storylines is team payrolls. In the American League Division Series, the team with the highest payroll ever (the Yankees) played a team with one of the lowest (the Twins). Disappointingly, Moneyball star Billy Beane's budget Oakland A's didn't make the playoffs this year. However, Billy's old assistant, Paul De Podesta, had a team (the LA Dodgers) that did make the playoffs this year. In this article, we'll use R to look at salaries: how are they spread out among players; how do they rate to player performance; and how well do they predict team performance.

1.0 A short introduction to the R project and language

R is a language and environment for statistical computing and graphics. But it's more than that: R is a mature open source software project with support from many developers, an interpreted functional language, and an extensible system for data analysis. A large community of contributors has written libraries of functions, called "packages," for R. You can get more information about R, download sources, documentation, and binaries, from the R Project site. You can get information about R packages from the Comprehensive R Archive Network.

I like to use R to examine baseball statistics because R is very intuitive. A fan can easily calculate formulas without doing any programming. For example, calculating the earned run average (ERA) for a few hundred pitchers is as easy as typing "ERA <- ER/IP".

I tested all the examples in this article on a Windows machine with R 2.0, but you can also run R on Mac OS X, Linux, or many versions of Unix.

1.1 A little about notation

In the examples in this article, I show what you type on the console in R and how the R interpreter responds. The prompt for the command line is ">" if the system is ready for a new statements, and "+" if a command runs over several lines. Commands entered into R are shown in red, and responses from R are shown in black. For some statements and some results, R will post output and errors to the console.

1.2 The R Environment

Let's take a quick look at the R environment in Windows.

The R environment in Windows

R includes a toolbar with some commonly used operations, a console window, and windows showing graphical output, help, edit windows, or other results. You can enter commands into the console window, and R responds with errors and results when appropriate. The R GUI also lets you load packages that are stored locally or install and update packages from the internet.

The R environment is a little different on Mac OS, Linux, and other Unix variants, but the language and tools are the same.

1.3 Objects and evaluation in R

Everything in R is an object. In this article, we'll use a few basic types: scalar values, arrays, functions, and data frames. The simplest objects in R are scalar values, which include numerical, character, and Boolean values. R will evaluate expressions entered onto the command line and return the results. Here are a few simple examples:


> 5
[1] 5
> 5/6
[1] 0.8333333
> "hello"
[1] "hello"
> 1 == 2
[1] FALSE
> 2^3 + (2 * 3)
[1] 14

You can just use R as a calculator, but it can do way more than that. First, you can define compound objects. The c() function is used to build arrays. For example, here is an array object with the earned run totals for five pitchers with the most wins in baseball during 2004.


> c(82, 92, 66, 71, 116)
[1] 82  92  66  71  116

You can do arithmetic (or use other operators or functions) with arrays just like you can with single values. Let's divide the five pitchers' earned run totals by the number of innings pitched.


> c(82, 92, 66, 71, 116) / c(226.2, 237.0, 228.0, 214.1, 208.1)
[1] 0.3625111 0.3881857 0.2894737 0.3316207 0.5574243

Incidentally, you can mix scalar values with arrays, and R will apply the scalar result to every item in the array. We can multiply the expression above by 9 to calculate earned run average.


> c(82, 92, 66, 71, 116) / c(226.2, 237.0, 228.0, 214.1, 208.1) * 9
[1] 3.262599 3.493671 2.605263 2.984587 5.016819

You can create named objects in R using the assignment operator and reference them by name.


> ER <- c(82, 92, 66, 71, 116)
> IP <- c(226.2, 237.0, 228.0, 214.1, 208.1)
> ERA <- ER / IP * 9
> ER
[1]  82  92  66  71 116
> IP
[1] 226.2 237.0 228.0 214.1 208.1
> ERA
[1] 3.262599 3.493671 2.605263 2.984587 5.016819

Names in R are case sensitive; you can define two different objects called "ERA" and "eRA." R also lets you create more complex database-like objects called data frames from a set of columns. Many functions or procedures in R use data frames, and most of the examples in this article use data frames to store and query data. Internally, a data frame is represented as a list of arrays. Here is a simple example of a data frame, using these pitching stats:


> pitchers <- c("C Schilling", "R Oswalt", "J Santana", "R Clemens", "B Colon")
> pitching <- data.frame(pitchers, IP, ER, ERA)
> pitching

     Pitchers    IP  ER      ERA
1 C Schilling 226.2  82 3.262599
2    R Oswalt 237.0  92 3.493671
3   J Santana 228.0  66 2.605263
4   R Clemens 214.1  71 2.984587
5     B Colon 208.1 116 5.016819

A very useful function in R is the edit() function. Given a data frame, R will pop up a spreadsheet-like window that lets you see and change values for variables. When you finish editing, the edit function will return the edited object. (This function doesn't change the original object.)

R allows you to select a subset of observations from a data frame using the subset function. This function takes a data frame and a condition as arguments and returns a data frame. Here are some simple examples of subsets:


> subset(pitching, ERA < 5)
     Pitchers    IP ER      ERA
1 C Schilling 226.2 82 3.262599
2    R Oswalt 237.0 92 3.493671
3   J Santana 228.0 66 2.605263
4   R Clemens 214.1 71 2.984587
> # pick all observations with the minimum ERA:
> subset(pitching, ERA == min(ERA))
   Pitchers  IP ER      ERA
3 J Santana 228 66 2.605263

You can extract individual columns from a data frame by name:


> pitching$IP
[1] 226.2 237.0 228.0 214.1 208.1

Comments in R are preceded by a pound (#) sign. We use some comments in the examples below to describe what each statement does.

1.4 Functions and procedures in R

You probably noticed that a few of the examples above used functions and procedures. Like any modern language, R includes many functions and procedures for common operations and allows you to write your own procedures to extend the functionality of R or to simplify repetitive tasks. Functions return values that can be used like scalars in numerical expressions.

In R, you can list arguments in order or explicitly name arguments:


> pi
[1] 3.141593
> sin(pi)
[1] 1.224606e-16
> sin(x=pi)
[1] 1.224606e-16

In R, the number and order of arguments is flexible. Some procedures assume default values for arguments if you omit them; see each function's help file for more information.

Many functions in R function on multiple data types, including scalar values, arrays, matrices, and data frames. When applied to arrays, some functions return arrays while others return scalar values. Here are a few examples of functions calculated on scalar and array values:


> log(10)
[1] 2.302585
> ERA
[1] 3.262599 3.493671 2.605263 2.984587 5.016819
> log(ERA)
[1] 1.1825243 1.2509530 0.9575337 1.0934613 1.6127960
> mean(ERA)
[1] 3.472588

Some functions have side effects such as popping up windows displaying graphs or allowing values to be edited. For example, you can get help on a function in R by using the help() procedure. To get help on the procedure print, you would type:


> help(print)

In a graphical environment, a window would pop up with help information.

2.0 Importing baseball data

Now, let's try loading some historical baseball data into R. For the examples in this article, I focus on player salaries and team performance. We used Version 5.1 of the comma-delimited data from the Baseball Archive. The Baseball Archive is the result of a collaborative project begun by Sean Lahman to produce a freely available database of major league baseball stats. You can download baseball data from this site for non-commercial use, but the authors request a donation. (See the Baseball Archive for more information about terms of use and download links.)

R includes a function read.table that loads an external text file and returns a data frame. We'll use this function to load in a few of the most important tables. Suppose that you have your data in the directory "D:/lahman51-csv." The Baseball Archive tables are comma-separated value files with no column names in the header. Each call to read.table specifies the separator as a comma, specifies that there is no header line, and includes a list of column names. In our example, we'll load and use only two tables.


> salaries.cols <- c('yearID', 'teamID', 'lgID', 'playerID', 'salary') 
> salaries <- read.table(file="D:/lahman51-csv/Salaries.csv",
+      header=FALSE, sep=",", col.names=salaries.cols)
> teams.cols <- c('yearID', 'lgID', 'teamID', 'franchID', 'divID',
+      'Rank', 'G', 'GHome', 'W', 'L', 'DivWin', 'WCWin', 'LgWin',
+      'WSWin', 'R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO',
+      'SB', 'CS', 'HBP', 'SF', 'RA', 'ER', 'ERA', 'CG', 'SHO',
+      'SV', 'IPOuts', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'DP', 'FP',
+      'name', 'park', 'attendance', 'BPF', 'PPF', 'teamIDBR',
+      'teamIDlahman45', 'teamIDretro')
> teams <- read.table(file="D:/lahman51-csv/Teams.csv",
+      header=FALSE, sep=",", col.names=teams.cols)

The salaries table includes only five columns. The teams table includes 49 different columns, summarizing the season for each team and year. As you can see, the columns include a mix of unique identifiers for teams and players, some information about each baseball team, and some common baseball statistics.

If you'd like more information about what each abbreviation stands for, see the documentation at the Baseball Archive. If you'd like more information about what each statistic means, mlb.com has a good reference page.

2.1 Example: Checking data integrity

Before we begin our analysis, let's check to make sure that the files loaded completely and that all the data makes sense. Many things can go wrong in extracting or loading data, so it's always a good idea to do a few sanity checks on a data mining or data analysis project. In this section, I show a few examples for the salaries table.

Let's start by checking that all the observations have been loaded. We know that there are 15,616 lines and five columns in the salaries file. To check that the file loaded correctly, we'll get the number of rows and columns in the salaries table using the "length" function.


> length(salaries) # to check number of columns
[1] 5
> names(salaries) # now let's look at the column names
[1] "yearID"   "teamID"   "lgID"     "playerID" "salary"
> length(salaries$playerID) # check length of playerID column
[1] 15616

Now, let's look at a few sample observations in each data frame. We'd like to take a look at the loaded data to make sure that nothing obvious looks wrong: columns of missing values, mislabeled columns, etc. In R, you can select a subset of the columns or rows from a data frame by using R's [] notation to select a subset of rows and columns.


> salaries[1:10,1:5] # look at first 10 rows, all 5 columns
   yearID teamID lgID  playerID salary
1    1985    TOR   AL ackerji01 170000
2    1985    CHA   AL agostju01 147500
3    1985    TOR   AL alexado01 875000
4    1985    MIN   AL anderal02  62500
5    1985    BOS   AL armasto01 915000
6    1985    OAK   AL atherke01 107333
7    1985    CLE   AL ayalabe01 303333
8    1985    CHA   AL baineha01 675000
9    1985    OAK   AL bakerdu01 575000
10   1985    KCA   AL balbost01 205000

No surprises so far. As a final check, let's look at some basic statistics for this table to make sure that every value makes sense. R includes a summary() function that provides some basic statistics about each variable. For numeric variables, this procedure returns some distribution information (the minimum value, value at the 25th percentile, median, value at the 75th percentile, and maximum) and the mean. For character variables, this procedure returns (up to) the six most common values. This data is great for sanity-checking a variable. If the minimum or maximum values look too low or high (respectively), you may need to filter outliers or do further checks that the data has loaded correctly. If the quartile, median, or mean values are surprising, or if the distribution is unexpected, you may want to double check your data source to make sure that it's correctly formatted, that it loaded completely, and that it means what you think it means.

Let's check summary information on the salaries data frame:


> summary(salaries)
     yearID             teamID      lgID           playerID
 Min.   :1985.000   CLE    :  586   AL:7804   clemero02:   19
 1st Qu.:1990.000   PHI    :  583   NL:7812   francjo01:   19
 Median :1995.000   SLN    :  583             gaettga01:   19
 Mean   :1994.592   BAL    :  580             bondsba01:   18
 3rd Qu.:1999.000   PIT    :  576             henderi01:   18
 Max.   :2003.000   NYA    :  575             larkiba01:   18
                    (Other):12133             (Other)  :15505
     salary
 Min.   :       0
 1st Qu.:  175000
 Median :  375000
 Mean   : 1202134
 3rd Qu.: 1280000
 Max.   :22000000

We see that the salary data spans 1985 to 2003, as expected. The top team had 586 players over 19 years. The average number of players per year is 386/19 = ~30.8. The official roster has 25 players for most of each season, so that number is not unreasonable. The split between leagues (AL and NL) is about even, which makes sense. Now, let's look at the playerID numbers. First, let's check the maximum value. It looks like no player appears more than 19 times in the table, which is a good thing as there are only 19 years between 1985 and 2003. It also looks like there are only a few players who have played during all 19 years, which is also good as very few players play in the major leagues for 19 years or more. The maximum salary of $22 million looks right (that's A-Rod!), but the minimum looks funny. A salary of 0? Surely that would violate union rules! Let's do a little more investigation:


> s0 <- subset(s, salary==0) # select rows where salary is 0
> length(s0$yearID) # check number of rows where salary is 0
[1] 2
> print(s0) # only two rows, so let's look at them
      yearID teamID lgID  playerID salary
5796    1993    NYA   AL jamesdi01      0
12006   1999    PIT   NL martija02      0

There are only two values that are zero; that's not a huge problem (we can investigate these two cases to correct them or just cut these values from our analysis). Below, we focus on year 2003 statistics, so this doesn't affect our analysis at all.

2.2 Example: Plotting a histogram of 2003 salaries

We're now ready to start doing some analysis. For convenience, let's start by creating a new data frame containing only salary values from 2003.

The easiest way to do this for a data frame is to use the subset function. This function takes a data frame as an argument and returns a new data frame containing a subset of the rows and columns from the input data frame. This function takes many options, but we'll focus on two: the data frame object and the row selection condition. For convenience, let's start by creating a data frame containing only salaries from the year 2003.


> salaries2003 <- subset(salaries, yearID == 2003)
> # verify that only values from 2003 are in the new data frame
> summary(salaries2003$yearID)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2003    2003    2003    2003    2003    2003

Now, let's take a look at the distribution of salaries in 2003 (a histogram), using the hist() command.


> hist(salaries2003$salary)

By default, R will split the salary range into ten evenly spaced segments, then count the number of observation in each segment, and plot the results as a column chart. (That's not completely true. What R will actually do is use a function called "Sturges" to calculate the appropriate number of bins given the data and then use this value. For this data, it works out to about ten bins.) When you type this command, R will pop up a new window showing the following diagram.

It looks like most salaries are pretty low, so let's increase the detail in this diagram. Let's increase the number of bins to 100.


> hist(salaries2003$salary, breaks=100)

After typing this command, R will display a graphic like this in the graphics window:

Hmmm, it looks like a lot of salaries are still bunched up together in the lowest bin. We can use a log scale to show more detail among the low values by plotting the log of salary. As you may remember, you can easily apply a function to a column of values using R. Let's calculate the (base 10) logarithm of each salary in our data and plot the results:


> hist(log10(salaries2003$salary), breaks=100)

R will plot a diagram like this.

It still looks like most salaries are pretty low. Most likely, this is because major league teams have a number of young players under contract for the league minimum (or close to the minimum). Turnover, even within a season, among low paid players is very high, as teams "call up" players from minor league teams and "send down" players to minor league teams. (As an exercise, it would be interesting to filter rookies from the data to focus on free agents or players who have been through salary arbitration.)

2.3 Example: Relationship between wins and team statistics in 2003

Next, let's look at the relationship between some different statistics about each team and the number of wins in the year 2003. Our primary focus is on wins, but we'll look at the relationship between other pairs of statistics as well. Let's start by creating a new data frame for 2003 and looking at the available columns:


> teams2003 <- subset(teams, yearID == 2003)
> names(teams2003)
 [1] "yearID"         "lgID"           "teamID"         "franchID"
 [5] "divID"          "Rank"           "G"              "GHome"
 [9] "W"              "L"              "DivWin"         "WCWin"
[13] "LgWin"          "WSWin"          "R"              "AB"
[17] "H"              "X2B"            "X3B"            "HR"
[21] "BB"             "SO"             "SB"             "CS"
[25] "HBP"            "SF"             "RA"             "ER"
[29] "ERA"            "CG"             "SHO"            "SV"
[33] "IPOuts"         "HA"             "HRA"            "BBA"
[37] "SOA"            "E"              "DP"             "FP"
[41] "name"           "park"           "attendance"     "BPF"
[45] "PPF"            "teamIDBR"       "teamIDlahman45" "teamIDretro"

There's a lot of good stuff here: hits (H), walks (BB), home runs (HR), runs scored (R), strikeouts (SO) by batters, runs allowed (RA), earned run average of pitching staff (ERA), strikeouts by pitching staff (SOA), and other variables. Unfortunately, there is no total salary, so we need to calculate total team salaries and add that column.

We know the salary of each player from the salaries data frame. We'd like to create a new data frame describing the total payroll of each team in 2003 and then merge the teams2003 table above with the salary information. Here's how to do that in R.


> payrolls2003 <- aggregate(salaries2003$salary,
>      list(salaries2003$teamID), FUN=sum)
> payrolls2003[1:5, 1:2] # let's look at the first few observations
  Group.1         x
1     ANA  79031667
2     ARI  80657000
3     ATL 106243667
4     BAL  73877500
5     BOS  99946500
> # looks like the variable Group.1 is the team and x is the salary
> # let's rename the columns in the data frame
> names(payrolls2003) <- c('teamID', 'payroll')
> payrolls2003[1:5, 1:2]
  teamID   payroll
1    ANA  79031667
2    ARI  80657000
3    ATL 106243667
4    BAL  73877500
5    BOS  99946500
> # the column names now make sense, so let's merge
> # (note that we're replacing teams2003 with teams2003)
> teams2003 <- merge(teams2003, payrolls2003, by='teamID')
> names(teams2003)
 [1] "teamID"         "yearID"         "lgID"           "franchID"
 [5] "divID"          "Rank"           "G"              "GHome"
 [9] "W"              "L"              "DivWin"         "WCWin"
[13] "LgWin"          "WSWin"          "R"              "AB"
[17] "H"              "X2B"            "X3B"            "HR"
[21] "BB"             "SO"             "SB"             "CS"
[25] "HBP"            "SF"             "RA"             "ER"
[29] "ERA"            "CG"             "SHO"            "SV"
[33] "IPOuts"         "HA"             "HRA"            "BBA"
[37] "SOA"            "E"              "DP"             "FP"
[41] "name"           "park"           "attendance"     "BPF"
[45] "PPF"            "teamIDBR"       "teamIDlahman45" "teamIDretro"
[49] "salary"
> # good, now we have the salary column; let's do a quick check
> summary(teams2003$payroll)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 19630000  50448130  68979830  70942070  83553040 152749800
 

We now have salary information by team, so we're ready to plot some charts. R includes many functions for graphing values. One of my favorite plots is the pairs() plot. This function prints out a matrix of different scatter plots, one plot for every pair of variables in your data. It's a great way to look at a lot of relationships at once. It's a little hard to describe how to do this without an example, so let's use the pairs() function on a few variables in the 2003 team data.


> # pick 8 columns for plotting, then plot the values
> teams2003.forpairs <- subset(teams2003,
+      select = c("W", "R", "H", "HR", "BB", "ERA", "SOA", "payroll"))
> pairs(teams2003.forpairs)

R will display a diagram like this in the graphics output window:

As you can see, for every combination of variables in wins (w), runs (R), hits (H), home runs (HR), walks ("base on balls", BB), earned run average (ERA), strikeouts (SOA), and payroll (salary), there is a scatter plot. Actually, there are two scatter plots that are mirror images. Here's an example: there are two scatter plots comparing wins and runs. The second scatter plot from the left in the top row shows runs horizontally (the X axis) and wins vertically (the Y axis). The second scatter plot from the left in the top row shows wins horizontally (the X axis) and runs vertically (the Y axis).

This matrix of plots is a very good way to look for correlation between variables. On the diagonal, you can see the names of variables. For each plot (i,j) where i is the column and j is the row, the variable of the Y-axis is given by cell (j,j), and the variable of the X-axis is given by cell (i,i). For example, let's look at the cell in the fifth row and third column, (3,5). In this cell, the X-axis represents hits and the Y-axis represents walks.

That explanation was precise, but could be difficult to understand, so here's an illustration of how this works. I marked up a second copy of this plot to try to make it easier to understand. Here's a procedure for reading these charts. I picked a cell in the grid below and highlighted it in red. Let's start by figuring out what the vertical (Y) dimension means. All plots in a row share the same vertical (Y) dimension, so just look for the box in the row with a name. In this case, it's BB. Now, let's figure out the horizontal (X) dimension. All plots in a column share the same horizontal (X) dimension, so look for the box in the column with a name. Here, it is H.

There is another trick for deciphering these diagrams. Notice that a scale is given for some boxes. If each dimension has a different scale, this can help you read the boxes. For example, see the first label at the top of this pairs plot, showing a dimension from 600 to 900? Well, all plots in that column share that same scale. It turns out that the only measurement with an appropriate range is Runs, which is what that column means.

Now, what can we learn from these plots? Scatter plots are a great way to see relationships between pairs of values: a line or curve is a good indication of a strong relationship. You can see if the relationship is strong or weak depending on how noisy it is. If the points are all clustered together, that can also mean that there is a relationship. Let's focus on just payroll. The far right column has plots with payroll as the X (vertical) dimension, so these plots depict each measurement as a function of payroll. Most of these plots show a fuzzy relationship between payroll and performance. For example, teams with higher payrolls had lower ERAs, and teams with higher payrolls had more wins. However, this isn't as some other relationships in this data. For example, look at the relationship between strikeouts and ERA: the higher the strikeouts, the lower the ERA. Or look at the relationship between hits and runs. Just looking at this data, I'd guess that payroll is a good predictor of wins but that other metrics are much better. Let's use some other tools in R to see if this is true.

2.4 Example: Building a model for wins in 2003

How important is a team's payroll in determining wins? Let's try to fit a simple linear model, and ask R to show us some summary statistics about that model. We'll use an ordinary linear regression model, modeling wins as a function of payroll. The lm() function of R takes a model formula and data frame as arguments and returns a model object. (There are many other options available; see the help file for more information.) Below, we also use the summary() and plot() routines to examine the model object.


> teams2003.payroll.lm <- lm(W ~ payroll, teams2003)
> summary(teams2003.payroll.lm)

Call:
lm(formula = W ~ salary, data = t2003_n)

Residuals:
       Min         1Q     Median         3Q        Max 
-33.616468  -7.513867   2.578131   8.684631  19.165197 

Coefficients:
                Estimate   Std. Error  t value   Pr(>|t|)    
(Intercept) 6.679329e+01 6.231839e+00 10.71807 2.0376e-11 ***
salary      1.997880e-07 8.188955e-08  2.43973   0.021288 *  
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 12.35296 on 28 degrees of freedom
Multiple R-Squared: 0.1753126,  Adjusted R-squared: 0.1458595 
F-statistic:  5.95226 on 1 and 28 DF,  p-value: 0.02128785

When you fit a line in R, the standard lm() routine returns a lot of statistical information about how well the model fits. For a thorough explanation of how this function works, and what all the results mean, please take a look at a statistics book.

Here's my interpretation of a few of these results. First, R shows the "residuals." The residuals are the difference between wins predicted by this model and the actual number of wins for each team. (So, a negative residual is an overestimate, a positive residual is an underestimate.) Ideally, we'd see a symmetrical distribution--median near zero, min and max opposite each other. The median of 2.57 is pretty close to zero; there are 162 games a year, so that implies that the median predicted number of wins is about 2% over the mean number of Wins. In 2003, the Detroit Tigers came close to setting the record for the most losses in one season and ended with 43 losses and a payroll of $49 million. This model predicted 76.61647 wins for the tigers, hence the residual of -33.61647 shown above. At the opposite end of the scale, in 2003, the Oakland A's won 96 games with a payroll of only $50 million (remarkably close to the Tigers' payroll). According to this model, Oakland was expected to win 76.83840 games, hence the residual of 19.165917.

Next, R lists the coefficients and what each one means. The "Estimate" column contains the actual coefficients. In this case, the equation is Wins = 67.9329 + 1.99788 * 10-7 * salary. R also shows statistical values like standard error, t test, and p values that can be used to understand the significance of each variable. The significance codes (shown by asterisks on the right) are intended for quickly ranking the significance of each variable. Next, R lists the residual standard error, the coefficient of determination (R^2 and adjusted R^2), F statistics, and p values (all measurements of how well the model fits).

The coefficient of determination (R^2) represents the proportion of the variation in wins that is explained by the variation in salary. For this model, the R^2 value of .1753126 implies that salary explains some of the variation in wins. To help understand how well salary explains wins, let's look at what happens if we try to predict wins from other values. If we built a model for wins in terms of runs scored and earned from average, the R^2 value would be .8959. If we built a model for wins in terms of at bats, the R^2 value would be .04687. This implies that salary explains more than at bats but less than runs scored and earned run average. I'd say that there is a weak relationship between salary and wins: teams with higher salaries tend to perform better, but the players still have to go out and play baseball.

The analyses shown here demonstrate a weak correlation between payroll and wins. Teams that spend more money tend to do better than teams that spend less money, but not always. What the analysis doesn't show is why this is true. Some of this has to do with long-term contracts: young players are committed to the same team at a fixed salary for a few years, proving a good value to some low payroll teams. Some of this has to do with improperly valuing players with long-term contracts: as good as A-Rod is, a single $22 million player wasn't enough to make Texas a winning team. There's a lot more opportunity to explore what's going on here: what performance measurements teams pay for, and what performance measurements best predict wins.

3.0 Where to go from here

Whether you're a baseball fan who never thought she could calculate her own stats, an experienced stats geek who wants to check out some new tools, or a statistician looking for some fun data to analyze, I hope you learned something and try out some of these tools yourself.

In addition to The Baseball Archive, you may want to look at historical baseball data collected by Retrosheet, which includes much more detailed information about games. For some insight into baseball, take a look at Baseball Prospectus or Baseball Think Factory. If you'd like to take a look at data about NFL football, check out the Pro Football Reference site. For more detail about R, check out the R manuals.

Joseph Adler has many years of experience in data mining and data analysis at companies including DoubleClick, American Express, and VeriSign. He is the inventor of several patents for computer security and cryptography, and the author of Baseball Hacks.


Return to the O'Reilly Network.

Copyright © 2009 O'Reilly Media, Inc.