Appendix: R Reference - R in a Nutshell
boot
This package provides functions for bootstrap resampling.
Functions
| Function | Description |
|---|---|
| EEF.profile | Calculates the log-likelihood for a mean using an empirical exponential family likelihood. |
| EL.profile | Calculates the log-likelihood for a mean using an empirical likelihood. |
| abc.ci | Calculates equitailed two-sided nonparametric approximate bootstrap confidence intervals for a parameter, given a set of data and an estimator of the parameter, using numerical differentiation. |
| boot | Generates R bootstrap replicates of a statistic applied to data. |
| boot.array | Takes a bootstrap object calculated by one of the functions boot, censboot, or tilt.boot and returns the frequency (or index) array for the bootstrap resamples. |
| boot.ci | Generates five different types of equitailed two-sided nonparametric confidence intervals. These are the first-order normal approximation, the basic bootstrap interval, the Studentized bootstrap interval, the bootstrap percentile interval, and the adjusted bootstrap percentile (BCa) interval. All or a subset of these intervals can be generated. |
| censboot | Applies types of bootstrap resampling that have been suggested to deal with right-censored data. It can also perform model-based resampling using a Cox regression model. |
| control | Finds control variate estimates from a bootstrap output object. |
| corr | Calculates the weighted correlation given a data set and a set of weights. |
| cum3 | Calculates an estimate of the third cumulant, or skewness, of a vector. Also, if more than one vector is specified, a product-moment of order 3 is estimated. |
| cv.glm | Calculates the estimated K-fold cross-validation prediction error for generalized linear models. |
| empinf | Calculates the empirical influence values for a statistic applied to a data set. |
| envelope | Calculates overall and pointwise confidence envelopes for a curve based on bootstrap replicates of the curve evaluated at a number of fixed points. |
| exp.tilt | Calculates exponentially tilted multinomial distributions such that the resampling distributions of the linear approximation to a statistic have the required means. |
| freq.array | Takes a matrix of indices for nonparametric bootstrap resamples and returns the frequencies of the original observations in each resample. |
| glm.diag | Calculates jackknife deviance residuals, standardized deviance residuals, standardized Pearson residuals, approximate Cook statistic, leverage, and estimated dispersion. |
| glm.diag.plots | Makes a plot of jackknife deviance residuals against the linear predictor, normal scores plots of standardized deviance residuals, a plot of approximate Cook statistics against leverage/(1 − leverage), and a case plot of the Cook statistic. |
| imp.moments, imp.prob, imp.quantile | Central moment, tail probability, and quantile estimates for a statistic under importance resampling. |
| imp.weights | Calculates the importance sampling weight required to correct for simulation from a distribution with probabilities p when estimates are required assuming that simulation was from an alternative distribution with probabilities q. |
| inv.logit | Given a numeric object, returns the inverse logit of the values. |
| jack.after.boot | Calculates the jackknife influence values from a bootstrap output object and plots the corresponding jackknife-after-bootstrap plot. |
| k3.linear | Estimates the skewness of a statistic from its empirical influence values. |
| lik.CI | Function for use with the practicals in Davison and Hinkley (1997), Bootstrap Methods and Their Application, Cambridge Series in Statistical and Probabilistic Mathematics, No. 1. |
| linear.approx | Takes a bootstrap object and, for each bootstrap replicate, calculates the linear approximation to the statistic of interest for that bootstrap sample. |
| logit | Calculates the logit of proportions. |
| nested.corr | Function for use with the practicals in Davison and Hinkley (1997), Bootstrap Methods and Their Application, Cambridge Series in Statistical and Probabilistic Mathematics, No. 1. |
| norm.ci | Using the normal approximation to a statistic, calculates equitailed two-sided confidence intervals. |
| saddle | Calculates a saddlepoint approximation to the distribution of a linear combination of W at a particular point u, where W is a vector of random variables. |
| saddle.distn | Approximates an entire distribution using saddlepoint methods. |
| simplex | This function will optimize the linear function a %*% x subject to the constraints A1 %*% x <= b1, A2 %*% x >= b2, A3 %*% x = b3, and x >= 0. Either maximization or minimization is possible, but the default is minimization. |
| smooth.f | Uses the method of frequency smoothing to find a distribution on a data set that has a required value, theta, of the statistic of interest. |
| tilt.boot | This function will run an initial bootstrap with equal resampling probabilities (if required) and will use the output of the initial run to find resampling probabilities that put the value of the statistic at required values. It then runs an importance resampling bootstrap using the calculated probabilities as the resampling distribution. |
| tsboot | Generates R bootstrap replicates of a statistic applied to a time series. The replicate time series can be generated using fixed or random block lengths or can be model-based replicates. |
| var.linear | Estimates the variance of a statistic from its empirical influence values. |
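
As a rough illustration of how boot and boot.ci from the table above fit together, the following sketch bootstraps the ratio of 1930 to 1920 populations in the city data set. The statistic function name `ratio`, the seed, and the choice of R = 999 replicates are assumptions made for this example rather than recommendations; the column meanings (u for 1920, x for 1930) follow the package documentation.

```r
library(boot)
data(city, package = "boot")   # 10 rows; columns u (1920) and x (1930)

# The statistic takes the data and a vector of resampled row indices.
# 'ratio' is a name chosen for this sketch.
ratio <- function(d, i) mean(d$x[i]) / mean(d$u[i])

set.seed(1)                    # arbitrary seed, only for reproducibility
city.boot <- boot(city, statistic = ratio, R = 999)

# Request a subset of the five interval types that boot.ci can produce.
city.ci <- boot.ci(city.boot, type = c("norm", "basic", "perc"))
print(city.ci)
```

The component `city.boot$t0` holds the statistic on the original data, and `city.boot$t` holds one row per bootstrap replicate.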
Data Sets
| Data Set | Class | Description |
|---|---|---|
| acme | data.frame | The acme data frame has 60 rows and 3 columns. The excess returns for the Acme Cleveland Corporation, along with those for all stocks listed on the New York and American Stock Exchanges, were recorded over a 5-year period. These excess returns are relative to the return on a riskless investment such as U.S. Treasury bills. |
| aids | data.frame | The aids data frame has 570 rows and 6 columns. Although all cases of AIDS in England and Wales must be reported to the Communicable Disease Surveillance Centre, there is often a considerable delay between the time of diagnosis and the time that it is reported. In estimating the prevalence of AIDS, account must be taken of the unknown number of cases that have been diagnosed but not reported. The data set here records the reported cases of AIDS diagnosed from July 1983 until the end of 1992. The data is cross-classified by the date of diagnosis and the time delay in the reporting of the cases. |
| aircondit | data.frame | Proschan reported on the times between failures of the air-conditioning equipment in 10 Boeing 720 aircraft. The aircondit data frame contains the intervals for the ninth aircraft, while aircondit7 contains those for the seventh aircraft. Both data frames have just one column. Note that the data has been sorted into increasing order. |
| aircondit7 | data.frame | Proschan reported on the times between failures of the air-conditioning equipment in 10 Boeing 720 aircraft. The aircondit data frame contains the intervals for the ninth aircraft, while aircondit7 contains those for the seventh aircraft. Both data frames have just one column. Note that the data has been sorted into increasing order. |
| amis | data.frame | The amis data frame has 8,437 rows and 4 columns. In a study into the effect that warning signs have on speeding patterns, Cambridgeshire County Council considered 14 pairs of locations. The locations were paired to account for factors such as traffic volume and type of road. One site in each pair had a sign erected warning of the dangers of speeding and asking drivers to slow down. No action was taken at the second site. Three sets of measurements were taken at each site. Each set of measurements was nominally of the speeds of 100 cars, but not all sites have exactly 100 measurements. These speed measurements were taken before the erection of the sign, shortly after the erection of the sign, and again after the sign had been in place for some time. |
| aml | data.frame | The aml data frame has 23 rows and 3 columns. A clinical trial to evaluate the efficacy of maintenance chemotherapy for acute myelogenous leukemia was conducted by Embury et al. at Stanford University. After reaching a stage of remission through treatment by chemotherapy, patients were randomized into two groups. The first group received maintenance chemotherapy, and the second group did not. The aim of the study was to see if maintenance chemotherapy increased the length of the remission. The data here formed a preliminary analysis that was conducted in October 1974. |
| beaver | ts | The beaver data frame has 100 rows and 4 columns. It is a multivariate time series of class "ts" and also inherits from class "data.frame". This data set is part of a long study into body temperature regulation in beavers. Four adult female beavers were live-trapped and had a temperature-sensitive radio transmitter surgically implanted. Readings were taken every 10 minutes. The location of the beaver was also recorded, and her activity level was dichotomized by whether she was in the retreat or outside of it, since high-intensity activities only occur outside of the retreat. The data in this data frame comes from those readings for one of the beavers on a day in autumn. |
| bigcity | data.frame | The bigcity data frame has 49 rows and 2 columns. The city data frame has 10 rows and 2 columns. The measurements are the populations (in 1000s) of 49 U.S. cities in 1920 and 1930. The 49 cities are a random sample taken from the 196 largest cities in 1920. The city data frame consists of the first 10 observations in bigcity. |
| brambles | data.frame | The brambles data frame has 823 rows and 3 columns. The location of living bramble canes in a 9-m square plot was recorded. We take 9 m to be the unit of distance so that the plot can be thought of as a unit square. The bramble canes were also classified by their age. |
| breslow | data.frame | The breslow data frame has 10 rows and 5 columns. In 1961, Doll and Hill sent out a questionnaire to all men on the British Medical Register inquiring about their smoking habits. Almost 70% of the men replied. Death certificates were obtained for medical practitioners, and causes of death were assigned on the basis of these certificates. The breslow data set contains the person-years of observations and deaths from coronary artery disease accumulated during the first 10 years of the study. |
| calcium | data.frame | The calcium data frame has 27 rows and 2 columns. Howard Grimes of the Botany Department, North Carolina State University, conducted an experiment for biochemical analysis of intracellular storage and transport of calcium across plasma membrane. Cells were suspended in a solution of radioactive calcium for a certain length of time, and then the amount of radioactive calcium that was absorbed by the cells was measured. The experiment was repeated independently with nine different times of suspension, each replicated three times. |
| cane | data.frame | The cane data frame has 180 rows and 5 columns. The data frame represents a randomized block design with 45 varieties of sugarcane and 4 blocks. The aim of the experiment was to classify the varieties into resistant, intermediate, and susceptible to a disease called "coal of sugarcane" (carvao da cana-de-acucar). This is a disease that is common in sugarcane plantations in certain areas of Brazil. For each plot, 50 pieces of sugarcane stem were put in a solution containing the disease agent, and then some were planted in the plot. After a fixed period of time, the total number of shoots and the number of diseased shoots were recorded. |
| capability | data.frame | The capability data frame has 75 rows and 1 column. The data consists of simulated successive observations from a process in equilibrium. The process is assumed to have specification limits (5.49, 5.79). |
| catsM | data.frame | The catsM data frame has 97 rows and 3 columns. One hundred and forty-four adult (over 2 kg in weight) cats used for experiments with the drug digitalis had their heart and body weight recorded. Forty-seven of the cats were female, and 97 were male. The catsM data frame consists of the data for the male cats. The full data can be found in the data set cats in package MASS. |
| cav | data.frame | The cav data frame has 138 rows and 2 columns. The data gives the positions of the individual caveolae in a square region with sides of length 500 units. This grid was originally on a 2.65 μm square of muscle fiber. The data consist of those points falling in the lower-left quarter of the region used for the data set caveolae.dat. |
| cd4 | data.frame | The cd4 data frame has 20 rows and 2 columns. CD4 cells are carried in the blood as part of the human immune system. One of the effects of the human immunodeficiency virus (HIV) is that these cells die. The count of CD4 cells is used in determining the onset of full-blown AIDS in a patient. In this study of the effectiveness of a new antiviral drug on HIV, 20 HIV-positive patients had their CD4 counts recorded and then were put on a course of treatment with this drug. After using the drug for 1 year, their CD4 counts were again recorded. The aim of the experiment was to show that patients taking the drug had increased CD4 counts, which is not generally seen in HIV-positive patients. |
| cd4.nested | boot | This is an example of a nested bootstrap for the correlation coefficient of the cd4 data frame. |
| channing | data.frame | The channing data frame has 462 rows and 5 columns. Channing House is a retirement center in Palo Alto, California. The data was collected from the opening of the house in 1964 until July 1, 1975. During that time, 97 men and 365 women passed through the center. For each of these, their age on entry and also on leaving or death was recorded. A large number of the observations were censored, mainly due to the resident being alive on July 1, 1975, when the data was collected. Over the course of the study, 130 women and 46 men died at Channing House. Differences between the survival of the sexes, taking age into account, were one of the primary concerns of this study. |
| city | data.frame | The bigcity data frame has 49 rows and 2 columns. The city data frame has 10 rows and 2 columns. The measurements are the populations (in 1000s) of 49 U.S. cities in 1920 and 1930. The 49 cities are a random sample taken from the 196 largest cities in 1920. The city data frame consists of the first 10 observations in bigcity. |
| claridge | data.frame | The claridge data frame has 37 rows and 2 columns. The data comes from an experiment that was designed to look for a relationship between a certain genetic characteristic and handedness. The 37 subjects were women who had a son with mental retardation due to inheriting a defective X-chromosome. For each such mother, a genetic measurement of her DNA was made. Larger values of this measurement are known to be linked to the defective gene, and it was hypothesized that larger values might also be linked to a progressive shift away from right-handedness. Each woman also filled in a questionnaire regarding which hand she used for various tasks. From these questionnaires, a measure of hand preference was found for each mother. The scale of this measure goes from 1, indicating women who always favor their right hand, to 8, indicating women who always favor their left hand. Between these two extremes are women who favor one hand for some tasks and the other for other tasks. |
| cloth | data.frame | The cloth data frame has 32 rows and 2 columns. |
| co.transfer | data.frame | The co.transfer data frame has 7 rows and 2 columns. Seven smokers with chickenpox had their levels of carbon monoxide transfer measured upon being admitted to the hospital and then again after 1 week. The main question was whether 1 week of hospitalization had changed the carbon monoxide transfer factor. |
| coal | data.frame | The coal data frame has 191 rows and 1 column. This data frame gives the dates of 191 explosions in coal mines that resulted in 10 or more fatalities. The time span of the data is from March 15, 1851, until March 22, 1962. |
| darwin | data.frame | The darwin data frame has 15 rows and 1 column. Charles Darwin conducted an experiment to examine the superiority of cross-fertilized plants over self-fertilized plants. Fifteen pairs of plants were used. Each pair consisted of one cross-fertilized plant and one self-fertilized plant that germinated at the same time and grew in the same pot. The plants were measured at a fixed time after planting, and the differences in heights between the cross- and self-fertilized plants were recorded in eighths of an inch. |
| dogs | data.frame | The dogs data frame has 7 rows and 2 columns. Data on the cardiac oxygen consumption and left ventricular pressure was gathered on seven domestic dogs. |
| downs.bc | data.frame | The downs.bc data frame has 30 rows and 3 columns. Down's syndrome is a genetic disorder caused by an extra chromosome 21 or a part of chromosome 21 being translocated to another chromosome. The incidence of Down's syndrome is highly dependent on the mother's age and rises sharply after age 30. In the 1960s, a large-scale study of the effect of maternal age on the incidence of Down's syndrome was conducted at the British Columbia Health Surveillance Registry. This data frame consists of the data that was collected in that study. Mothers were classified by age. Most groups correspond to the age in years, but the first group comprises all mothers aged 15–17 and the last is those aged 46–49. No data for mothers over 50 or below 15 was collected. |
| ducks | data.frame | The ducks data frame has 11 rows and 2 columns. Each row of the data frame represents a male duck that is a second-generation cross between a mallard and a pintail. For 11 such ducks, a behavioral index and plumage index were calculated. These were measured on scales devised for this experiment, which was to examine whether there was any link between which species the ducks resembled physically and which they resembled in behavior. The scale for physical appearance ranged from 0 (identical in appearance to a mallard) to 20 (identical to a pintail). The behavioral traits of the ducks were on a scale of 0 to 15, with lower numbers indicating more mallard-like behavior. |
| fir | data.frame | The fir data frame has 50 rows and 3 columns. The number of balsam-fir seedlings in each quadrant of a grid of 50 five-foot-square quadrants was counted. The grid consisted of 5 rows of 10 quadrants in each row. |
| frets | data.frame | The frets data frame has 25 rows and 4 columns. The data consists of measurements of the length and breadth of the heads of pairs of adult brothers in 25 randomly sampled families. All measurements are expressed in millimeters. |
| grav | data.frame | The gravity data frame has 81 rows and 2 columns. The grav data set has 26 rows and 2 columns. Between May 1934 and July 1935, the U.S. National Bureau of Standards conducted a series of experiments to estimate the acceleration due to gravity, g, at Washington, DC. Each experiment produced a number of replicate estimates of g using the same methodology. Although the basic method remained the same for all experiments, that of the reversible pendulum, there were changes in configuration. The gravity data frame contains the data from all eight experiments. The grav data frame contains the data from experiments 7 and 8. The data is expressed as deviations from 980.000 in centimeters per second squared. |
| gravity | data.frame | The gravity data frame has 81 rows and 2 columns. The grav data set has 26 rows and 2 columns. Between May 1934 and July 1935, the U.S. National Bureau of Standards conducted a series of experiments to estimate the acceleration due to gravity, g, at Washington, DC. Each experiment produced a number of replicate estimates of g using the same methodology. Although the basic method remained the same for all experiments, that of the reversible pendulum, there were changes in configuration. The gravity data frame contains the data from all eight experiments. The grav data frame contains the data from experiments 7 and 8. The data is expressed as deviations from 980.000 in centimeters per second squared. |
| hirose | data.frame | The hirose data frame has 44 rows and 3 columns. PET film is used in electrical insulation. In this accelerated life test, the failure times for 44 samples in gas-insulated transformers were estimated. Four different voltage levels were used. |
| islay | data.frame | The islay data frame has 18 rows and 1 column. Measurements were taken of paleocurrent azimuths from the Jura Quartzite on the Scottish island of Islay. |
| manaus | ts | The manaus time series is of class "ts" and has 1,080 observations on one variable. The data values are monthly averages of the daily stages (heights) of the Rio Negro at Manaus. Manaus is 18 km upstream from the confluence of the Rio Negro with the Amazon, but because of the tiny slope of the water surface and the lower courses of its flatland affluents, they may be regarded as a good approximation of the water level in the Amazon at the confluence. The data here covers 90 years from January 1903 until December 1992. The Manaus gauge is tied in with an arbitrary benchmark of 100 m set in the steps of the Municipal Prefecture; gauge readings are usually referred to sea level, on the basis of a mark on the steps leading to the Parish Church (Matriz), which is assumed to lie at an altitude of 35.874 m according to observations made many years ago under the direction of Samuel Pereira, an engineer in charge of the Manaus Sanitation Committee. Whereas such an altitude cannot, by any means, be considered to be a precise datum point, observations have been provisionally referred to it. The measurements are in meters. |
| melanoma | data.frame | The melanoma data frame has 205 rows and 7 columns. The data consists of measurements made on patients with malignant melanoma. Each patient had his or her tumor surgically removed at the Department of Plastic Surgery, University Hospital of Odense, Denmark, during the period 1962–1977. The surgery consisted of complete removal of the tumor together with about 2.5 cm of the surrounding skin. Among the measurements taken were the thickness of the tumor and whether it was ulcerated or not. These are thought to be important prognostic variables in that patients with a thick and/or ulcerated tumor have an increased chance of death from melanoma. Patients were followed until the end of 1977. |
| motor | data.frame | The motor data frame has 94 rows and 4 columns. The rows were obtained by removing replicate values of time from the data set mcycle. Two extra columns were added to allow for strata with a different residual variance in each stratum. |
| neuro | matrix | neuro is a matrix containing times of observed firing of a neuron in windows of 250 ms either side of the application of a stimulus to a human subject. Each row of the matrix is a replication of the experiment, and there are a total of 469 replicates. |
| nitrofen | data.frame | The nitrofen data frame has 50 rows and 5 columns. Nitrofen is a herbicide that was used extensively for the control of broad-leaved and grass weeds in cereals and rice. Although it is relatively nontoxic to adult mammals, nitrofen is a significant teratogen and mutagen. It is also acutely toxic and reproductively toxic to cladoceran zooplankton. Nitrofen is no longer in commercial use in the United States, having been the first pesticide to be withdrawn due to teratogenic effects. The data here comes from an experiment to measure the reproductive toxicity of nitrofen on a species of zooplankton (Ceriodaphnia dubia). Fifty animals were randomized into batches of 10, and each batch was put in a solution with a measured concentration of nitrofen. Then the number of live offspring in each of the three broods of each animal was recorded. |
| nodal | data.frame | The nodal data frame has 53 rows and 7 columns. The treatment strategy for a patient diagnosed with prostate cancer depends highly on whether the cancer has spread to the surrounding lymph nodes. It is common to operate on the patient to get samples from the nodes, which can then be analyzed under a microscope, but clearly it would be preferable if an accurate assessment of nodal involvement could be made without surgery. For a sample of 53 prostate cancer patients, a number of possible predictor variables were measured before surgery. The patients then had surgery to determine nodal involvement. The point of the study was to see if nodal involvement could be accurately predicted from the predictor variables and which ones were most important. |
| nuclear | data.frame | The nuclear data frame has 32 rows and 11 columns. The data relates to the construction of 32 light-water reactor (LWR) plants constructed in the United States in the late 1960s and early 1970s. The data was collected with the aim of predicting the cost of construction of additional LWR plants. Six of the power plants had partial turnkey guarantees, and it is possible that, for these plants, some manufacturers' subsidies may be hidden in the quoted capital costs. |
| paulsen | data.frame | The paulsen data frame has 346 rows and 1 column. Sections were prepared from the brain of adult guinea pigs. Spontaneous currents that flowed into individual brain cells were then recorded and the peak amplitude of each current measured. The aim of the experiment was to see if the current flow was quantal in nature (i.e., that it is not a single burst but instead is built up of many smaller bursts of current). If the current was indeed quantal, then it would be expected that the distribution of the current amplitude would be multimodal with modes at regular intervals. The modes would be expected to decrease in magnitude for higher current amplitudes. |
| poisons | data.frame | The poisons data frame has 48 rows and 3 columns. The data form a 3 × 4 factorial experiment, the factors being three poisons and four treatments. Each combination of the two factors was used on four animals, the allocation to animals having been completely randomized. |
| polar | data.frame | The polar data frame has 50 rows and 2 columns. The data consists of the pole positions from a paleomagnetic study of New Caledonian laterites. |
| remission | data.frame | The remission data frame has 27 rows and 3 columns. |
| salinity | data.frame | The salinity data frame has 28 rows and 4 columns. Biweekly averages of the water salinity and river discharge in Pamlico Sound, North Carolina, were recorded between the years 1972 and 1977. The data in this set consists only of those measurements in March, April, and May. |
| survival | data.frame | The survival data frame has 14 rows and 2 columns. The data measured the survival percentages of batches of rats who were given varying doses of radiation. At each of six doses there were two or three replications of the experiment. |
| tau | data.frame | The tau data frame has 60 rows and 2 columns. The tau particle is a heavy electron-like particle discovered in the 1970s by Martin Perl at the Stanford Linear Accelerator Center. Soon after its production, the tau particle decays into various collections of more stable particles. About 86% of the time, the decay involves just one charged particle. This rate has been measured independently 13 times. The one-charged-particle event is made up of four major modes of decay as well as a collection of other events. The four main types of decay are denoted rho, pi, e, and mu. These rates have been measured independently 6, 7, 14, and 19 times, respectively. Due to physical constraints, each experiment can only estimate the composite one-charged-particle decay rate or the rate of one of the major modes of decay. Each experiment consists of a major research project involving many years' work. One of the goals of the experiments was to estimate the rate of decay due to events other than the four main modes of decay. These are uncertain events and so cannot themselves be observed directly. |
| tuna | data.frame | The tuna data frame has 64 rows and 1 column. The data comes from an aerial line transect survey of southern bluefin tuna in the Great Australian Bight. An aircraft with two spotters on board flew randomly allocated line transects. Each school of tuna sighted was counted and its perpendicular distance from the transect measured. The survey was conducted in summer when tuna tend to stay on the surface. |
| urine | data.frame | The urine data frame has 79 rows and 7 columns. Seventy-nine urine specimens were analyzed in an effort to determine if certain physical characteristics of the urine might be related to the formation of calcium oxalate crystals. |
| wool | ts | wool is a time series of class "ts" and contains 309 observations. Each week that the market was open, the Australian Wool Corporation set a floor price that determined its policy on intervention and was therefore a reflection of the overall price of wool for the week in question. Actual prices paid varied considerably about the floor price. The series here is the log of the ratio between the price for fine-grade wool and the floor price, each market week between July 1976 and June 1984. |
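
The dimensions quoted in the table can be checked directly once the package is available. The following is a minimal sketch using city and bigcity; the choice of these two data sets is arbitrary, and any other name from the table could be substituted.

```r
library(boot)
data(city, bigcity, package = "boot")

dim(city)      # rows and columns of the smaller data frame
dim(bigcity)   # rows and columns of the full sample
str(city)      # column names and types
```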
class
This package provides functions for classification.
Functions
| Function | Description |
|---|---|
| SOM, batchSOM | Kohonen's self-organizing maps (SOMs) are a crude form of multidimensional scaling. |
| condense | Condenses training set for k-nearest-neighbor (k-NN) classifier. |
| knn | k-nearest-neighbor classification for test set from training set. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote. |
| knn.cv | k-nearest-neighbor cross-validatory classification from training set. |
| knn1 | Nearest-neighbor classification for test set from training set. For each row of the test set, the nearest neighbor (by Euclidean distance) training set vector is found, and its classification used. If there is more than one nearest neighbor, a majority vote is used, with ties broken at random. |
| lvq1, lvq2, lvq3 | Moves examples in a codebook to better represent the training set. |
| lvqinit | Constructs an initial codebook for learning vector quantization (LVQ) methods. |
| lvqtest | Classifies a test set by 1-NN from a specified LVQ codebook. |
| multiedit | Multiedit for k-NN classifier. |
| olvq1 | Moves examples in a codebook to better represent the training set. |
| reduce.nn | Reduces training set for a k-NN classifier. Used after condense. |
| somgrid | Plotting functions for SOM results. |
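
To make the knn entry above concrete, here is a small sketch that classifies held-out rows of the built-in iris data. The 100/50 split, the seed, k = 3, and the variable names are all assumptions chosen for illustration.

```r
library(class)

set.seed(1)                                  # arbitrary, for reproducibility
train.idx <- sample(nrow(iris), 100)         # 100 training rows, 50 test rows

train  <- iris[train.idx, 1:4]               # numeric measurements only
test   <- iris[-train.idx, 1:4]
labels <- iris$Species[train.idx]            # true classes of the training rows

# Each test row is assigned the majority class of its 3 nearest training rows.
pred <- knn(train, test, cl = labels, k = 3)

# Cross-tabulate predictions against the held-out true classes.
table(predicted = pred, actual = iris$Species[-train.idx])
```

The result is a factor with one predicted class per test row.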
cluster
This package provides functions for cluster analysis.
Functions
| Function | Description |
|---|---|
| agnes | Computes agglomerative hierarchical clustering of the data set. |
| bannerplot | Draws a "banner," i.e., a horizontal barplot visualizing the (agglomerative or divisive) hierarchical clustering or another binary dendrogram structure. |
| clara | Computes a "clara" object, a list representing a clustering of the data into k clusters. |
| clusplot | Draws a two-dimensional (2D) "clusplot" on the current graphics device. |
| coef.hclust | Computes the "agglomerative coefficient," measuring the clustering structure of the data set. |
| daisy | Computes all the pairwise dissimilarities (distances) between observations in the data set. |
| diana | Computes a divisive hierarchical clustering of the data set, returning an object of class diana. |
| ellipsoidPoints | Computes points on the ellipsoid boundary, mostly for drawing. |
| ellipsoidhull | Computes the "ellipsoid hull" or "spanning ellipsoid," i.e., the ellipsoid of minimal volume ("area" in 2D) such that all given points lie just inside or on the boundary of the ellipsoid. |
| fanny | Computes a fuzzy clustering of the data into k clusters. |
| lower.to.upper.tri.inds | Computes index vectors for extracting or reordering of lower or upper triangular matrices that are stored as contiguous vectors. |
| mona | Returns a list representing a divisive hierarchical clustering of a data set with binary variables only. |
| pam | Partitioning (clustering) of the data into k clusters "around medoids," a more robust version of k-means clustering. |
| pltree | Generic function drawing a clustering tree ("dendrogram") on the current graphics device. There is a twins method; see pltree.twins for usage and examples. |
| predict.ellipsoid | Computes points on the ellipsoid boundary, mostly for drawing. |
| silhouette | Computes silhouette information according to a given clustering in k clusters. |
| sizeDiss | Returns the number of observations (sample size) corresponding to a dissimilarity-like object or, equivalently, the number of rows or columns of a matrix when only the lower or upper triangular part (without diagonal) is given. It is simply the inverse function of f(n) = n(n − 1)/2. |
| sortSilhouette | Computes silhouette information according to a given clustering in k clusters. |
| upper.to.lower.tri.inds | Computes index vectors for extracting or reordering of lower or upper triangular matrices that are stored as contiguous vectors. |
| volume | Computes the volume of a planar object. This is a generic function and a method for ellipsoid objects. |
Data Sets
| Data Set | Class | Description |
|---|---|---|
| agriculture | data.frame | Gross national product (GNP) per capita and percentage of the population working in agriculture for each country belonging to the European Union in 1993. |
| animals | data.frame | This data set considers 6 binary attributes for 20 animals. |
| chorSub | matrix | This is a small rounded subset of the C-horizon data. |
| flower | data.frame | This data set consists of 8 characteristics for 18 popular flowers. |
| plantTraits | data.frame | This data set constitutes a description of 136 plant species according to biological attributes (morphological or reproductive). |
| pluton | data.frame | The pluton data frame has 45 rows and 4 columns, containing percentages of isotopic composition of 45 plutonium batches. |
| ruspini | data.frame | The Ruspini data set, consisting of 75 points in 4 groups, is popular for illustrating clustering techniques. |
| votes.repub | data.frame | A data frame with the percentages of votes given to the Republican candidates in presidential elections from 1856 to 1976. Rows represent the 50 states, and columns the 31 elections. |
| xclara | data.frame | An artificial data set consisting of 3,000 points in 3 well-separated clusters of size 1,000 each. |
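The following is a short sketch of how a few of the functions listed above fit together, using the ruspini data set. The choice of k = 4 follows the data set description, and method = "average" for agnes is only one of several linkage options, picked here for illustration.

```r
library(cluster)

data(ruspini)                        # 75 points in 4 groups

# Partition the data into 4 clusters around medoids.
fit <- pam(ruspini, k = 4)
print(table(fit$clustering))         # cluster sizes

# Silhouette information for the pam clustering.
sil <- silhouette(fit)
print(summary(sil)$avg.width)        # average silhouette width

# Agglomerative hierarchical clustering and its agglomerative coefficient.
hc <- agnes(ruspini, method = "average")
print(hc$ac)

# daisy stores the n(n - 1)/2 pairwise dissimilarities;
# sizeDiss inverts that length back to n.
d <- daisy(ruspini)
n <- sizeDiss(d)
print(n)
```

Calling plot(fit) or pltree(hc) on these objects draws the corresponding clusplot, silhouette plot, or dendrogram on the current graphics device.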
codetools
This package provides tools for analyzing R code. It is mainly intended to support the other tools in this package and byte code compilation. See the help file for more information.
foreign
This package provides functions for reading data stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, and so forth.
Functions
| Function | Description |
|---|---|
| data.restore | Reads binary data files or data.dump files that were produced in S version 3. |
| lookup.xport | Scans a file as a SAS XPORT format library and returns a list containing information about the SAS library. |
| read.S | Reads binary data files or data.dump files that were produced in S version 3. |
| read.arff | Reads data from Weka Attribute-Relation File Format (ARFF) files. |
| read.dbf | Reads a DBF file into a data frame, converting character fields to factors and trying to respect NULL fields. |
| read.dta | Reads a file in Stata version 5–10 binary format into a data frame. |
| read.epiinfo | Reads data files in the .REC format used by Epi Info versions 6 and earlier and by EpiData. Epi Info is a public-domain database and statistics package produced by the U.S. Centers for Disease Control and Prevention, and EpiData is a freely available data entry and validation system. |
| read.mtp | Returns a list with the data stored in a file as a Minitab Portable Worksheet. |
| read.octave | Reads a file in Octave text data format into a list. |
| read.spss | Reads a file stored by the SPSS save or export commands. |
| read.ssd | Generates a SAS program to convert the ssd contents to SAS transport format and then uses read.xport to obtain a data frame. |
| read.systat | Reads a rectangular data file stored by the Systat SAVE command as (legacy) *.sys or, more recently, *.syd files. |
| read.xport | Reads a file as a SAS XPORT format library and returns a list of data.frames. |
| write.arff | Writes data into Weka Attribute-Relation File Format (ARFF) files. |
| write.dbf | Tries to write a data frame to a DBF file. |
| write.dta | Writes the data frame to file in the Stata binary format. Does not write array variables unless they can be drop-ed to a vector. |
| write.foreign | Exports simple data frames to other statistical packages by writing the data as free-format text and writing a separate file of instructions for the other package to read the data. |
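As a small illustration of the read and write functions above, this sketch writes a data frame to a temporary file in Stata binary format and reads it back. The data frame contents and the temporary file path are made up for the example.

```r
library(foreign)

# Made-up data frame for the round trip.
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.25))

# Write to a temporary file in Stata binary format, then read it back.
path <- tempfile(fileext = ".dta")
write.dta(df, path)
back <- read.dta(path)

print(back)
unlink(path)                         # remove the temporary file
```

The same pattern should apply to write.dbf with read.dbf and write.arff with read.arff, which also take a data frame and a file name.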
grDevices
This package provides functions for graphics devices and support for base and grid graphics.
Functions
Data Sets
| Data Set | Class | Description |
|---|---|---|
| Hershey | list | If the family graphical parameter (see par) has been set to one of the Hershey fonts, Hershey vector fonts are used to render text. When using the text and contour functions, Hershey fonts may be selected via the vfont argument, which is a character vector of length 2. This allows Cyrillic to be selected, which is not available via the font families. |
| blues9 | character | densCols produces a vector containing colors that encode the local densities at each point in a scatter plot. |
| colorspaces | list | Converts colors between standard color space representations. This function is experimental. |
graphics
This package contains functions for base graphics. Base graphics are traditional S graphics, as opposed to the newer grid graphics.
Functions
grid
This package is a low-level graphics system that provides a great deal of control and flexibility in the appearance and arrangement of graphical output. It does not provide high-level functions that create complete plots. What it does provide is a basis for developing such high-level functions (e.g., the lattice package), the facilities for customizing and manipulating lattice output, the ability to produce high-level plots or non-statistical images from scratch, and the ability to add sophisticated annotations to the output from base graphics functions (see the gridBase package). For more information, see the help files for grid.
KernSmooth
This package provides functions for kernel smoothing.
Functions
| Function | Description |
|---|---|
| bkde | Returns x and y coordinates of the binned kernel density estimate of the probability density of the data. |
| bkde2D | Returns the set of grid points in each coordinate direction, and the matrix of density estimates over the mesh induced by the grid points. The kernel is the standard bivariate normal density. |
| bkfe | Returns an estimate of a binned approximation to the kernel estimate of the specified density functional. The kernel is the standard normal density. |
| dpih | Uses direct plug-in methodology to select the bin width of a histogram. |
| dpik | Uses direct plug-in methodology to select the bandwidth of a kernel density estimate. |
| dpill | Uses direct plug-in methodology to select the bandwidth of a local linear Gaussian kernel regression estimate. |
| locpoly | Estimates a probability density function, regression function, or their derivatives using local polynomials. A fast binned implementation over an equally spaced grid is used. |
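The bandwidth selectors and the density estimator above are typically used together. The sketch below assumes simulated standard normal data (not from the original text) and shows dpik feeding a bandwidth to bkde, with dpih giving a histogram bin width for the same sample.

```r
library(KernSmooth)

# Simulated sample; the seed and sample size are arbitrary choices.
set.seed(1)
x <- rnorm(500)

# Direct plug-in bandwidth for a kernel density estimate.
h <- dpik(x)

# Binned kernel density estimate; a list with components x and y.
est <- bkde(x, bandwidth = h)

# Direct plug-in bin width for a histogram of the same data.
bw_hist <- dpih(x)

print(c(bandwidth = h, bin_width = bw_hist))
plot(est, type = "l", xlab = "x", ylab = "density")
```

Passing the result of bkde to plot or lines works because the returned list has x and y components, as the table entry describes.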
lattice
Trellis graphics is a framework for data visualization developed at Bell Labs by Richard Becker, William Cleveland, et al., extending ideas presented in Bill Cleveland's 1993 book Visualizing Data.
Lattice is best thought of as an implementation of Trellis graphics for R. It is built upon the grid graphics engine and requires the grid add-on package. It is not (readily) compatible with traditional R graphics tools. The public interface is based on the implementation in S-PLUS, but features several extensions, in addition to incompatibilities introduced through the use of grid. To the extent possible, care has been taken to ensure that existing Trellis code written for S-PLUS works unchanged (or with minimal change) in lattice. If you are having problems porting S-PLUS code, read the entry for panel in the documentation for xyplot. Most high-level Trellis functions in S-PLUS are implemented, with the exception of piechart.
Functions
Data Sets
| Data Set | Class | Description |
|---|---|---|
| barley | data.frame | Total yield in bushels per acre for 10 varieties at 6 sites in each of 2 years. |
| environmental | data.frame | Daily measurements of ozone concentration, wind speed, temperature, and solar radiation in New York City from May to September of 1973. |
| ethanol | data.frame | Ethanol fuel was burned in a single-cylinder engine. For various settings of the engine compression and equivalence ratio, the emissions of nitrogen oxides were recorded. |
| melanoma | data.frame | This data from the Connecticut Tumor Registry presents age-adjusted numbers of melanoma skin cancer incidences per 100,000 people in Connecticut for the years 1936–1972. |
| singer | data.frame | Heights, in inches, of the singers in the New York Choral Society in 1979. The data is grouped according to voice part. The vocal range for each voice part increases in pitch according to the following order: Bass 2, Bass 1, Tenor 2, Tenor 1, Alto 2, Alto 1, Soprano 2, Soprano 1. |
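To give a sense of how these data sets are used with Trellis-style formulas, here is a brief sketch with barley and singer. The particular plot types, layouts, and arguments are illustrative choices, not taken from the original text.

```r
library(lattice)

# One panel per site, varieties on the vertical axis, years as groups.
p1 <- dotplot(variety ~ yield | site, data = barley, groups = year,
              auto.key = TRUE, layout = c(1, 6))

# One histogram of heights per voice part.
p2 <- histogram(~ height | voice.part, data = singer, layout = c(2, 4))

# Lattice functions return "trellis" objects; printing draws them.
print(p1)
print(p2)
```

Because the high-level functions return objects rather than drawing directly, an explicit print is needed when they are called inside a function or a script run non-interactively.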
