Package Creation: foibles and follies
Bug Squashing
For this exercise, I again used the sleep study data from https://vincentarelbundock.github.io/Rdatasets/csv/lme4/sleepstudy.csv. I assigned the values for the numbers of days and the reaction times to the x and y values to matrix ‘x’. Next, I created the function for tukey.multiple().
> tukey_multiple <- function(x) { + outliers <- array(TRUE,dim=dim(x)) + for (j in 1:ncol(x)) + { + outliers[,j] <- outliers[,j] && tukey.outlier(x[,j]) + } + outlier.vec <- vector(length=nrow(x)) + for (i in 1:nrow(x)) + { + outlier.vec[i] <- all(outliers[i,]) + } + return(outlier.vec) + }
To start debug mode, you must first initialize it with > debug(tukey_multiple) and then try executing the function itself with > tukey_multiple(x). The debug mode spits out an error saying that it can’t find the function tukey.outlier, which is called for in line 5. In order to find the outliers we must first be able to identify why is and is not an outlier. In order to do so, we have to make a function to define the quartiles in the set of data:
> quartiles = function(data) + { + quarts = quantile(data, probs=c(.25,.75)) + IQR = quarts[2] - quarts[1] + return(c(quarts,IQR)) + }
Now that we have the quartiles, we can make a function to pull out the outliers:
>tukey.outlier = function(q) + { + if(!is.nutll(dim(q))){ + return(tukey_multiple(q)) + } else { + quarts = quartiles(q) + lowerbound = quarts[1] - 1.5*quarts[3] + upperbound = quarts[2] + 1.5*quarts[3] + outliers = c(rep(FALSE, length(q))) + outliers <- ((q < lowerbound) | (q > upperbound)) + return(outliers)} + }
Moving forward from here, we can try to run >tukey_multiple(x) again. Doing so gives me an error “Error in tukey.outlier(x[, j]) : could not find function “is.nutll””. This one is due to a simple typo of is.null. This is fixed and the debug for tukey_multiple is ran again. The debug makes it past the previous errors and I proceeded through hitting enter through all 180 values of my data, ending in this:
exiting from: tukey_multiple(x) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [177] FALSE FALSE FALSE FALSE
The current function looks for outliers in both dimensions and will display TRUE if there are outliers in both dimensions. What we want is for a readout if there’s an outlier in either dimension. Here, we redefine tukey_multiple() to only output a TRUE for having an outleir in either dimension:
> tukey_multiple = function(x) { + outliers <-array(TRUE, dim=dim(x)) + outliers = apply(x,2,tukey.outlier) + outliers.vec = apply(outliers,1,any) + return(outliers.vec) + }
This debugs smoothly and outputs:
exiting from: tukey_multiple(x) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [177] FALSE FALSE FALSE FALSE
It’s a mess since the data I used is so large, but there looks like there’s a couple outleirs in there at values 10 and 100. All is good and we can end debug mode with undebug(tukey_multiple).
Comparing plot(), lattice(), and ggplot2()
The source for my data is: https://vincentarelbundock.github.io/Rdatasets/csv/lme4/sleepstudy.csv
Plot()
>plot(sleepstudy) yielded this lovely menagerie of graphs. it looks like it took all the variables and compared them to each other variable and made a metagraph. For example, you can look at square 3,3 and see that it compares the number of days going without sleep to the reaction time. Assigning the Days (D) and Reaction (R) to their own objects allows us to do plot(D,R) to get a larger version of that graph. Using the aggregate function, you can calculate the mean of each reaction measurement per day, assign them to objects, and plot them (>plot(MD,R, type=”b”).
Lattice()
Similar graphs to the plot() graphs can be achieved with Latttice(). A simple >xyplot(Reaction~Days, sleepstudy) will yield the scatterplot below, and >xyplot(R~D, type=”b”) gives the line chart. The lattice graph is much more pleasing to look at. Having the y intervals oriented horizontally, having the intervals for the x and y values on both sides of the graph, and overall spacing give a much nicer look to the graph. Also, it’s utilization of the space in the frame is much better compared to the plot() graph, which looks like it has large, empty reserved space for a graph title.
ggplot2()
Ggplot2() adds a layer of complexity that is the idea of layers themselves. The bare bones “give me a graph quickly” plots from ggplot give graphs that are nicer than plot(). You get grid laid down across your graphs, with horizontal interval markers on the y axis. You also get even better space utilization than the quick-and-dirty lattice() graphs. The downside to the ggplot2() graphs is that the x intervals are less relatable. It defaulted to 2.5 day intervals for this particular graph, which makes analyzing the data significantly more abstract than full day or every-other-day intervals.
First, I made a new subset for the females from the original data, Dataset, and named it “females”.
>females <- subset(Dataset, Sex == “Female”)
I made a new object, ag, and used ddply to take the subset “Sex” from the object “females” and appended on the average grade by declaring transform as an argument. Ddply is part of the plyr package and follows a convenient naming convention that dictates the input and output data types. The first two letters determine if the respective input and outputs are an array, list, or data frame. They can also consist of multiple inputs or no inputs.
ag = ddply(females, “Sex”, transform, Average.Grade=mean(Grade))
Next, I saved a file of the data with the write function. Notably, it saves to your default directory for documents, and without a file extension added onto it. Using the argument sep=”,” will make the file a CSV file, which I found neat since CSV stands for comma separated values.
write.table(ag, “Sorted_average”, sep=”,”)
I made a new subset, namei, of the original dataset and used the grep() function to add the values from the column “Name”. Adding an “l” onto grep makes defines each value for the vector as either a match or mismatch (logical).
namei <- subset(Dataset, grepl(“[iI]”, Dataset$Name))
Finally, we can recall namei to see the printout of the table of all those i-named folk.
Objects
How do you tell what object system (S3 vs. S4) an object is associated with?
To determine what system an object uses, you are able to use a process of elimination with these commands.
is.object(x) – will return true if the thing you are testing is an object.
> is.object(Inventory_520782772_2016.February.07truncated)
[1] TRUE
isS4(object) – will return true if the object is an S4 object and false if it’s an S3 object.
> isS4(Inventory_520782772_2016.February.07truncated)
[1] FALSE
is(object , “refClass”) – will return true if your object is a reference class.
> is(Inventory_520782772_2016.February.07truncated, “refClass”)
[1] FALSE
How do you determine the base type (like integer or list) of an object?
In order to determine what the base type of an object is, you can use the command typeof() and it’ll return the object type
> typeof(Inventory_520782772_2016.February.07truncated)
[1] “list”
What is a generic function?
A generic function determines what method to call from the input it’s given. A S3 generic can take one argument, but a S4 generic can take multiple arguments.
What are the main differences between S3 and S4?
S3 systems are used the vast majority of the time and are simpler, less formal systems. S4 systems are more complicated, adding more formality to the syntax. S4 has the ability to use method dispatch to pass multiple arguments to a generic function and supports multiple inheritance, allowing you to inherit characteristics from multiple classes.
Matrix manipulation
First, I created matrix A by randomly generating 100 numbers with the values between 0 and 100 and with 10 rows to make it square. It’s been ages since I’ve dealt with matrices, so I had to refamiliarize myself with how basic math functions work with them.
The transpose of A flips the values across the diagonal that connects [1,1] to [10,10] and is easily made with 4 short character strokes:
Next, I tried to create a randomized vector with > a=c(runif(10, min=0, max = 1244),nrow=10), but it ended up not working out, so I just made a numerically sequential vector with >a = c(1:10) to multiply with A. Each row of elements in the matrix are multiplied by the corresponding row in the vector.
I multiplied matrix A with matrix B, which was made in the same randomized manner as matrix A:
We test the determinate of A to make sure it is inverseable and, in turn, solvable. > det(A) gives us a non-zero value, which gives us a green light to run > solve(A) and find the inverse of A.
Boxplots and Histograms
Here’s our starting data:
>Frequency_of_Visits<-c(0.6,0.3,0.4,0.4,0.2,0.6,0.3,0.4,0.9,0.2)
> Blood_Pressure<-c(103,87,32,42,59,109,78,205,135,176)
> First_Assessment<-c(1,1,1,1,0,0,0,0,NA,1)
> Second_Assessment<-c(0,0,1,1,0,0,1,1,1,1)
> Final_Decision<-c(0,1,0,1,0,1,0,1,1,1)
The values are the frequency of each patients’ doctor visits, their blood pressure measurements, first assessment (made by a general doctor), a second assessment (made by an external doctor), and a final decision (made by the head of the emergency unit). With all that loaded into the values, we move onto making some box and whisker plots and histograms. The real, unspoken, champ of this exercise is going to be:
par(mfrow=c(1,3))
This command lets us make a matrix with 1 row and 3 columns to which we will put our boxplots.
> boxplot(Blood_Pressure~First_Assessment, names=c(“good”,”bad”))
> boxplot(Blood_Pressure~Second_Assessment, names=c(“low”,”high”))
> boxplot(Blood_Pressure~Final_Decision, names=c(“low”,”high”))
Here, we should note that “good” blood pressure is also “low priority”. The external doctor (plot 2) has a huge range in which they declare blood pressures to be high priority, and the general doctor seems to mostly consider relatively low values of blood pressure to be dangerous. Next up we’re going to compare the frequency of patients’ visits compared with the status of their blood pressure and we’ll see that first two doctors seem to note that people that visit less often are somewhat in more need of care.
> boxplot(Frequency_of_Visits~First_Assessment, names=c(“good”,”bad”))
> boxplot(Frequency_of_Visits~Second_Assessment, names=c(“good”,”bad”))
> boxplot(Frequency_of_Visits~Final_Decision, names=c(“good”,”bad”))
In order to make this next plot I had to reset my to be one row and column with par(mfrow=c(1,1)). This one will compare blood pressure to the frequency of visits and shows that the results are all over the board and that there’s really not that much data to make a good correlation anyways.
> par(mfrow=c(1,1))
> boxplot(Blood_Pressure~Frequency_of_Visits)
Finally, we make a histogram of the Frequency of visits, which shows that people are more likely to make less frequent visits.
> hist(Frequency_of_Visits, col=”#0898d7″)
Functions: Adventures In Trying To Think About What I Want Done
Functions are great. You type up a little code, execute your function, and then watch as your automation takes over the world. Easy enough, right? Wrong. Well, sort of wrong. If you know what you want to do and how to do it, then it’s easy. If you know what you want to do, but don’t know how to do it, then it’s a little hard. And if you don’t know what you want to do (and probably wouldn’t now how to do it even if you did), then it’s hard. Luckily for me, I have a big fat database of Magic the Gathering cards I own and thought it’d be interesting to try to make a function on it. The easiest thing that i thought I could do would be various analyses on the casting cost of each card.
Knowing what I want to do? Check. The problem for me, though was how my database stored the information. Normally a card has some numbers and symbols showing how much it costs to play, and that cost can be converted to alphanumerical representation. For example, the card on the left has a cost of 6RR. For whatever reason, my database would have displayed that as {6}{R}{R}, which I suppose would be usable if I really knew what I was doing. But I don’t. I tried Googling around for how to split columns, but decided that that was over my head. So I cheated and went into Excel to quickly separate out the values into additional columns, then imported my new data back into R Studio. I used the count function to
Make tables of each column and to give a, well, count of each value. From there I figured that I’d need to combine all 7 tables into one, but they had different column headers so I figured that I’d have to find out how to fix that too. I thought that that may take me a while, so I decided to just get started with my function and have it create all of the tables and figure out how to merge them together later. Well creating the function as essentially a batch command processor turned out to not be as straight forward as anticipated. I kept having the function try to call the series of
x <- count(Inventory_520782772_2016.February.07truncated, “X”)…
to make all of the tables from calling just my manasymbols function. Well that didn’t work. I tried making the tables as a data.frame and as.data.frame. I tried making the function with manasymbols <- function, manasymbols <- function(), and manasymbols <- function(x). After a long struggle of trying different combinations of these, I learned 1. that copy/pasting my code into R Studio doesn’t quite make the function properly and 2. that the function won’t make the object a global with just the x <-. I had to use x <<- in order for the object to be passed from inside of the function to the global level.
Matrices, Data Frames, and the Battle Against Nonnumerical Classes
Warning message:
In mean.default(polls.df):
argument is not numerical or logical: returning NA
I hate this message. It has been the bane of me for far too long. From what I understand, this error gets thrown at you when the console tries to compute something that it can’t. In my case, the column for Name is not numeric.
I read that this is a relatively new problem, as R before version 3.0 would just ignore the illogical request. One way to get the right answer was with
colMeans(polls.df[,2:3])
and
lapply(polls.df, mean, na.rm = TRUE)
Neither of these felt like particularly satisfactory ways to produce the mean. Through this process, I came up with some rather weird outputs that are useless. But I’m sure that looking back at these, from the future, will yield a good “What was I thinking?” moment, so I’ll include them.
Getting ggplot working was a tough hike as well. I spent well over an hour trying to get a plot before I realized that, though I had installed the package, the library wasn’t loaded. Once I was able to get a basic plot working, I decided that I wanted to try to overlay both ABC_political_poll_results and NBC_political_poll_results onto one plot to show how the candidates vary between the two. But try as I might, I was not successful, so I have only two separate, basic graphs to show the candidate’s results for each respective poll.
ggplot(polls.df, aes(x = Name, y = _political_poll_results)) + geom_point() + geom_point(data = polls.df, aes(x = Name, y = _political_poll_results))
Unfortunately I wasn’t really able to get anything else constructive done, though I did just realize that summary() worked fine at calculating the means. That confuses me even more on why mean() wasn’t able to muster.
Objects, Functions, and What the Heck are Vectors anyways?
The first thing I did was import the data with the code:
acs <- read.csv(url(“http://stat511.cwick.co.nz/homeworks/acs_or.csv”))
This imported all the data into an object named acs. I thought it was quite neat to be able to pull the information directly from the web, and that it was easier than importing from a text file. To my understanding, data sets like this are the objects of R. Importing this object went smoothly. The place where I had issues and which is the main reason this post is late is vectors. For whatever reason, my brain isn’t clicking with it. I know vectors in physics and vectors in graphics but vectors in R aren’t meshing with me. The assignment had me access the data as a vector with
acs[1,3]
and it returns to me a value of 62, but that number doesn’t mean anything to me. I guess it’s because the data doesn’t exist as a table like I visualize in my mind. Anyways, next up is the subset function. I was able to make a subset of all the data in acs in which the age of the husband is greater than the age of the wife with
a <- subset(acs , age_husband > age_wife)
Functions in R are the tools that are used to manipulate the data objects. There are some, like subset, that are built into the language, but we can also create our own. Using more of the subset function, I used
w <- subset(acs , income_wife > income_husband)
and
h <- subset(acs, income_husband > income_wife)
to create subsets of the data for when the husband makes more than the wife and for the reverse. I then used the mean function to determine the mean number of children for each case.
I tried to quickly graph the difference in the two, but quickly realized that it would be significantly more involved than simply using the plot function, but I was able to make histograms to compare the two.
The formatting for w$number_children is weird for some reason. I don’t see why there should be empty gaps between the x values, but I’ll chalk that up to the simplicity of the hist function for now.
Edit: It looks like the underscore in the URL doesn’t display properly. I’ll try to look into it later because I’m sure it’ll be a recurring issue. Also, I think I just got vectors working in my head. I had my acs table sorted with decreasing household values.