Bug Squashing

For this exercise, I again used the sleep study data from https://vincentarelbundock.github.io/Rdatasets/csv/lme4/sleepstudy.csv.   I assigned the values for the numbers of days and the reaction times to the x and y values to matrix ‘x’.  Next, I created the function for tukey.multiple().

> tukey_multiple <- function(x) {
+ outliers <- array(TRUE,dim=dim(x))
+ for (j in 1:ncol(x))
+ {
+ outliers[,j] <- outliers[,j] && tukey.outlier(x[,j])
+ }
+ outlier.vec <- vector(length=nrow(x))
+ for (i in 1:nrow(x))
+ {
+ outlier.vec[i] <- all(outliers[i,])
+ }
+ return(outlier.vec)
+ }

To start debug mode, you must first initialize it with > debug(tukey_multiple) and then try executing the function itself with > tukey_multiple(x).  The debug mode spits out an error saying that it can’t find the function tukey.outlier, which is called for in line 5.  In order to find the outliers we must first be able to identify why is and is not an outlier.  In order to do so, we have to make a function to define the quartiles in the set of data:

> quartiles = function(data)
+ {
+ quarts = quantile(data, probs=c(.25,.75))
+ IQR = quarts[2] - quarts[1]
+ return(c(quarts,IQR))
+ }

Now that we have the quartiles, we can make a function to pull out the outliers:

>tukey.outlier = function(q)
+ {
+ if(!is.nutll(dim(q))){
+ return(tukey_multiple(q))
+ } else {
+ quarts = quartiles(q)
+ lowerbound = quarts[1] - 1.5*quarts[3]
+ upperbound = quarts[2] + 1.5*quarts[3]
+ outliers = c(rep(FALSE, length(q)))
+ outliers <- ((q < lowerbound) | (q > upperbound))
+ return(outliers)}
+ }

Moving forward from here, we can try to run >tukey_multiple(x) again.  Doing so gives me an error “Error in tukey.outlier(x[, j]) : could not find function “is.nutll””.  This one is due to a simple typo of is.null.  This is fixed and the debug for tukey_multiple is ran again.  The debug makes it past the previous errors and I proceeded through hitting enter through all 180 values of my data, ending in this:

exiting from: tukey_multiple(x)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[177] FALSE FALSE FALSE FALSE

The current function looks for outliers in both dimensions and will display TRUE if there are outliers in both dimensions.  What we want is for a readout if there’s an outlier in either dimension.  Here, we redefine tukey_multiple() to only output a TRUE for having an outleir in either dimension:

> tukey_multiple = function(x) {
+ outliers <-array(TRUE, dim=dim(x))
+ outliers = apply(x,2,tukey.outlier)
+ outliers.vec = apply(outliers,1,any)
+ return(outliers.vec)
+ }

This debugs smoothly and outputs:

exiting from: tukey_multiple(x)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[177] FALSE FALSE FALSE FALSE

It’s a mess since the data I used is so large, but there looks like there’s a couple outleirs in there at values 10 and 100.  All is good and we can end debug mode with undebug(tukey_multiple).

Comparing plot(), lattice(), and ggplot2()

The source for my data is: https://vincentarelbundock.github.io/Rdatasets/csv/lme4/sleepstudy.csv

Plot()

plotsleepstudy

>plot(sleepstudy) yielded this lovely menagerie of graphs.  it looks like it took all the variables and compared them to each other variable and made a metagraph.  For example, you can look at square 3,3 and see that it compares the number of days going without sleep to the reaction time.  Assigning the Days (D) and Reaction (R) to their own objects allows us to do plot(D,R) to get a larger version of that graph.  Using the aggregate function, you can calculate the mean of each reaction measurement per day, assign them to objects, and plot them (>plot(MD,R, type=”b”).

Lattice()

Similar graphs to the plot() graphs can be achieved with Latttice().  A simple >xyplot(Reaction~Days, sleepstudy) will yield the scatterplot below, and >xyplot(R~D, type=”b”) gives the line chart.  The lattice graph is much more pleasing to look at.  Having the y intervals oriented horizontally,  having the intervals for the x and y values on both sides of the graph, and overall spacing give a much nicer look to the graph.  Also, it’s utilization of the space in the frame is much better compared to the plot() graph, which looks like it has large, empty reserved space for a graph title.

 

 

ggplot2()

 

Ggplot2() adds a layer of complexity that is the idea of layers themselves. The bare bones “give me a graph quickly” plots from ggplot give graphs that are nicer than plot().  You get grid laid down across your graphs, with horizontal interval markers on the y axis.  You also get even better space utilization than the quick-and-dirty lattice() graphs.  The downside to the ggplot2() graphs is that the x intervals are less relatable.  It defaulted to 2.5 day intervals for this particular graph, which makes analyzing the data significantly more abstract than full day or every-other-day intervals.

First, I made a new subset for the females from the original data, Dataset, and named it “females”.

>females <- subset(Dataset, Sex == “Female”)

I made a new object, ag, and used ddply to take the subset “Sex” from the object “females” and appended on the average grade by declaring transform as an argument.  Ddply is part of the plyr package and follows a convenient naming convention that dictates the input and output data types.  The first two letters determine if the respective input and outputs are an array, list, or data frame.  They can also consist of multiple inputs or no inputs.

ag = ddply(females, “Sex”, transform, Average.Grade=mean(Grade))

Next, I saved a file of the data with the write function.  Notably, it saves to your default directory for documents, and without a file extension added onto it.  Using the argument sep=”,” will make the file a CSV file, which I found neat since CSV stands for comma separated values.

write.table(ag, “Sorted_average”, sep=”,”)

I made a new subset, namei, of the original dataset and used the grep() function to add the values from the column “Name”.  Adding an “l” onto grep makes defines each value for the vector as either a match or mismatch (logical).

namei <- subset(Dataset, grepl(“[iI]”, Dataset$Name))

Finally, we can recall namei to see the printout of the table of all those i-named folk.

namei