Analyzing Baseball Stats with R
by Joseph Adler
The internet is a great resource for the sports fan; there are dozens of sites where one can look up statistics on recent sporting events or on the great players of the past. Baseball statistics appeal to many: some follow their favorite team's seasonal progress, some monitor their fantasy teams, and some are just obsessed by the sheer complexity that is the world of Major League Baseball.
Let's assume for a minute that your love of stats transcends the run-of-the-mill obsession and that you know that all of this data can do more than impress your friends at parties (well, sports bars) or tell you that Barry Bonds is the best hitter out there (or is he?). Suppose that you want to be able to calculate defensive ability, or find the best-valued players, or even predict the results of the post-season. Or suppose that you want to look at the relationship between player salaries and games won. With all of the raw data available to the stats nut, these calculations are at your fingertips. Below is an introduction to one of my favorite ways to examine the abundance of raw data available on the web: using R to analyze baseball data.
In this article, we focus on salaries. It's playoff time again, and one of the key storylines is team payrolls. In the American League Division Series, the team with the highest payroll ever (the Yankees) played a team with one of the lowest (the Twins). Disappointingly, Moneyball star Billy Beane's budget Oakland A's didn't make the playoffs this year. However, Billy's old assistant, Paul DePodesta, had a team (the LA Dodgers) that did make the playoffs this year. In this article, we'll use R to look at salaries: how they are spread out among players, how they relate to player performance, and how well they predict team performance.
1.0 A short introduction to the R project and language
R is a language and environment for statistical computing and graphics. But it's more than that: R is a mature open source software project with support from many developers, an interpreted functional language, and an extensible system for data analysis. A large community of contributors has written libraries of functions, called "packages," for R. You can find more information about R, and download sources, documentation, and binaries, from the R Project site. You can get information about R packages from the Comprehensive R Archive Network.
I like to use R to examine baseball statistics because R is very intuitive. A fan can easily calculate formulas without doing any programming. For example, calculating the earned run average (ERA) for a few hundred pitchers is as easy as typing "ERA <- ER / IP * 9".
I tested all the examples in this article on a Windows machine with R 2.0, but you can also run R on Mac OS X, Linux, or many versions of Unix.
1.1 A little about notation
In the examples in this article, I show what you type on the console in R and how the R interpreter responds. The prompt for the command line is ">" if the system is ready for a new statement, and "+" if a command runs over several lines. Commands entered into R are shown in red, and responses from R are shown in black. For some statements and some results, R will post output and errors to the console.
1.2 The R Environment
Let's take a quick look at the R environment in Windows.
R includes a toolbar with some commonly used operations, a console window, and windows showing graphical output, help, edit windows, or other results. You can enter commands into the console window, and R responds with errors and results when appropriate. The R GUI also lets you load packages that are stored locally or install and update packages from the internet.
The R environment is a little different on Mac OS, Linux, and other Unix variants, but the language and tools are the same.
1.3 Objects and evaluation in R
Everything in R is an object. In this article, we'll use a few basic types: scalar values, arrays, functions, and data frames. The simplest objects in R are scalar values, which include numerical, character, and Boolean values. R will evaluate expressions entered on the command line and return the results. Here are a few simple examples:
> 5
[1] 5
> 5/6
[1] 0.8333333
> "hello"
[1] "hello"
> 1 == 2
[1] FALSE
> 2^3 + (2 * 3)
[1] 14
You can just use R as a calculator, but it can do way more than that. First, you can define compound objects. The c() function is used to build arrays. For example, here is an array object with the earned run totals for the five pitchers with the most wins in baseball during 2004.
> c(82, 92, 66, 71, 116)
[1]  82  92  66  71 116
You can do arithmetic (or use other operators or functions) with arrays just like you can with single values. Let's divide the five pitchers' earned run totals by the number of innings pitched.
> c(82, 92, 66, 71, 116) / c(226.2, 237.0, 228.0, 214.1, 208.1)
[1] 0.3625111 0.3881857 0.2894737 0.3316207 0.5574243
Incidentally, you can mix scalar values with arrays, and R will apply the scalar to every item in the array. We can multiply the expression above by 9 to calculate earned run average.
> c(82, 92, 66, 71, 116) / c(226.2, 237.0, 228.0, 214.1, 208.1) * 9
[1] 3.262599 3.493671 2.605263 2.984587 5.016819
You can create named objects in R using the assignment operator and reference them by name.
> ER <- c(82, 92, 66, 71, 116)
> IP <- c(226.2, 237.0, 228.0, 214.1, 208.1)
> ERA <- ER / IP * 9
> ER
[1]  82  92  66  71 116
> IP
[1] 226.2 237.0 228.0 214.1 208.1
> ERA
[1] 3.262599 3.493671 2.605263 2.984587 5.016819
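One caveat about the IP values: box scores conventionally write innings pitched with .1 and .2 standing for one and two thirds of an inning, so 226.2 would mean 226 2/3 innings. The examples in this article treat the numbers as plain decimals. If your data follows the box-score convention, a conversion along these lines may bring the results closer to the official figures. This is only a sketch under that assumption, and the helper name innings_to_decimal is made up for illustration; it is not part of R.

```r
# Hypothetical helper (not part of R): convert box-score innings notation,
# where the digit after the point counts outs (thirds of an inning),
# into true fractional innings.
innings_to_decimal <- function(ip) {
  whole <- floor(ip)                  # completed innings
  outs  <- round((ip - whole) * 10)   # 0, 1, or 2 additional outs
  whole + outs / 3
}

ER <- c(82, 92, 66, 71, 116)
IP <- c(226.2, 237.0, 228.0, 214.1, 208.1)

IP_true  <- innings_to_decimal(IP)    # assumes IP uses box-score notation
ERA_true <- ER / IP_true * 9
round(ERA_true, 2)
```

The differences from the plain-decimal calculation are small, so the rest of the article's examples are unaffected in spirit.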
Names in R are case sensitive; you can define two different objects called "ERA" and "eRA." R also lets you create more complex database-like objects called data frames from a set of columns. Many functions or procedures in R use data frames, and most of the examples in this article use data frames to store and query data. Internally, a data frame is represented as a list of arrays. Here is a simple example of a data frame, using these pitching stats:
> pitchers <- c("C Schilling", "R Oswalt", "J Santana", "R Clemens", "B Colon")
> pitching <- data.frame(pitchers, IP, ER, ERA)
> pitching
     pitchers    IP  ER      ERA
1 C Schilling 226.2  82 3.262599
2    R Oswalt 237.0  92 3.493671
3   J Santana 228.0  66 2.605263
4   R Clemens 214.1  71 2.984587
5     B Colon 208.1 116 5.016819
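To see what "a list of arrays" means in practice, you can poke at the data frame with a few standard inspection functions. The following sketch rebuilds the pitching data frame so that it runs on its own, then looks at its structure.

```r
# Rebuild the running example so this snippet is self-contained.
ER  <- c(82, 92, 66, 71, 116)
IP  <- c(226.2, 237.0, 228.0, 214.1, 208.1)
ERA <- ER / IP * 9
pitchers <- c("C Schilling", "R Oswalt", "J Santana", "R Clemens", "B Colon")
pitching <- data.frame(pitchers, IP, ER, ERA)

is.list(pitching)    # a data frame is a list underneath
names(pitching)      # one name per column
length(pitching)     # number of columns (list elements)
nrow(pitching)       # number of observations (rows)
str(pitching)        # compact summary of each column's type and values
pitching[["ERA"]]    # pull one column out of the list by name
```

Each column keeps its own type, which is why a single data frame can hold both player names and numeric statistics.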
A very useful function in R is the edit() function. Given a data frame, R will pop up a spreadsheet-like window that lets you see and change values for variables. When you finish editing, the edit function will return the edited object. (This function doesn't change the original object.)
R allows you to select a subset of observations from a data frame using the subset function. This function takes a data frame and a condition as arguments and returns a data frame. Here are some simple examples of subsets:
> subset(pitching, ERA < 5)
     pitchers    IP ER      ERA
1 C Schilling 226.2 82 3.262599
2    R Oswalt 237.0 92 3.493671
3   J Santana 228.0 66 2.605263
4   R Clemens 214.1 71 2.984587
> # pick all observations with the minimum ERA:
> subset(pitching, ERA == min(ERA))
   pitchers  IP ER      ERA
3 J Santana 228 66 2.605263
You can extract individual columns from a data frame by name:
> pitching$IP
[1] 226.2 237.0 228.0 214.1 208.1
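Building on the subset() and column examples above, here is a sketch of two common variations: combining several conditions, and keeping only some of the columns. The thresholds and the variable names workhorses, low_era, and low_era_brackets are arbitrary choices for illustration.

```r
# Rebuild the running example so this snippet is self-contained.
ER  <- c(82, 92, 66, 71, 116)
IP  <- c(226.2, 237.0, 228.0, 214.1, 208.1)
ERA <- ER / IP * 9
pitchers <- c("C Schilling", "R Oswalt", "J Santana", "R Clemens", "B Colon")
pitching <- data.frame(pitchers, IP, ER, ERA)

# Combine conditions with & (and) or | (or).
workhorses <- subset(pitching, IP > 220 & ERA < 3.5)
workhorses

# The select argument keeps only the named columns.
low_era <- subset(pitching, ERA < 3, select = c(pitchers, ERA))
low_era

# A bracket form of the same query: a logical row filter, then column names.
low_era_brackets <- pitching[pitching$ERA < 3, c("pitchers", "ERA")]
low_era_brackets
```

The subset() form tends to read more naturally at the console, while the bracket form makes the row and column selection explicit.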
Comments in R are preceded by a pound (#) sign. We use some comments in the examples below to describe what each statement does.
1.4 Functions and procedures in R
You probably noticed that a few of the examples above used functions and procedures. Like any modern language, R includes many functions and procedures for common operations and allows you to write your own procedures to extend the functionality of R or to simplify repetitive tasks. Functions return values that can be used like scalars in numerical expressions.
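As a small illustration of writing your own function, here is a sketch that wraps the ERA formula used earlier in this article. The function name era and its argument names are made up for this example.

```r
# Hypothetical helper for the running example: earned run average
# from earned runs and innings pitched.
era <- function(earned_runs, innings) {
  earned_runs / innings * 9
}

ER <- c(82, 92, 66, 71, 116)
IP <- c(226.2, 237.0, 228.0, 214.1, 208.1)

era(ER, IP)          # works on whole arrays at once
era(66, 228)         # or on single values
mean(era(ER, IP))    # the result can feed other expressions
```

Because the arithmetic operators already work on arrays, the function handles one pitcher or a few hundred without any extra code.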
In R, you can list arguments in order or explicitly name arguments:
> pi
[1] 3.141593
> sin(pi)
[1] 1.224606e-16
> sin(x=pi)
[1] 1.224606e-16
In R, the number and order of arguments are flexible. Some procedures assume default values for arguments if you omit them; see each function's help file for more information.
Many functions in R operate on multiple data types, including scalar values, arrays, matrices, and data frames. When applied to arrays, some functions return arrays while others return scalar values. Here are a few examples of functions calculated on scalar and array values:
> log(10)
[1] 2.302585
> ERA
[1] 3.262599 3.493671 2.605263 2.984587 5.016819
> log(ERA)
[1] 1.1825243 1.2509530 0.9575337 1.0934613 1.6127960
> mean(ERA)
[1] 3.472588
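The built-in log() and round() functions make a convenient illustration of defaults and argument matching, since each has one optional argument (base and digits, respectively). The variable names below are only for this sketch.

```r
natural  <- log(8)                # base omitted: natural log is the default
by_name  <- log(8, base = 2)      # optional argument given by name
by_place <- log(8, 2)             # same call, arguments matched by position
reversed <- log(base = 2, x = 8)  # named arguments can appear in any order

whole_number <- round(3.262599)              # digits defaults to 0
two_places   <- round(3.262599, digits = 2)  # keep two decimal places

c(natural, by_name, by_place, reversed)
c(whole_number, two_places)
```

Naming arguments costs a little typing but makes a call easier to read later, especially for functions with many optional settings.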
Some functions have side effects such as popping up windows displaying graphs or allowing values to be edited. For example, you can get help on a function in R by using the help() procedure. To get help on the procedure print, you would type:
> help(print)
In a graphical environment, a window would pop up with help information.