
## Boxplots and Histograms

Here’s our starting data:

> Frequency_of_Visits<-c(0.6,0.3,0.4,0.4,0.2,0.6,0.3,0.4,0.9,0.2)
> Blood_Pressure<-c(103,87,32,42,59,109,78,205,135,176)
> First_Assessment<-c(1,1,1,1,0,0,0,0,NA,1)
> Second_Assessment<-c(0,0,1,1,0,0,1,1,1,1)
> Final_Decision<-c(0,1,0,1,0,1,0,1,1,1)

The values are the frequency of each patient’s doctor visits, their blood pressure measurements, a first assessment (made by a general doctor), a second assessment (made by an external doctor), and a final decision (made by the head of the emergency unit). With all of that loaded into vectors, we move on to making some box-and-whisker plots and histograms.  The real, unsung champ of this exercise is going to be:

par(mfrow=c(1,3))

This command splits the plotting area into a grid with 1 row and 3 columns, into which we will put our boxplots.

> boxplot(Blood_Pressure~Second_Assessment, names=c("low","high"))
> boxplot(Blood_Pressure~Final_Decision, names=c("low","high"))

Here, we should note that “good” blood pressure is also “low priority”.  The external doctor (plot 2) declares blood pressures across a huge range to be high priority, and the general doctor seems to mostly consider relatively low values of blood pressure to be dangerous.  Next up, we’re going to compare the frequency of patients’ visits with the status of their blood pressure, and we’ll see that the first two doctors seem to note that people who visit less often are somewhat more in need of care.
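For reference, here is a self-contained sketch of the three-panel layout described above. The first panel (the general doctor's assessment) and the panel titles are assumptions based on the description, since only two of the boxplot calls are shown. The `tapply` line at the end is an extra check, not part of the original exercise, that puts numbers on the "huge range" visible in plot 2.

```r
# Data copied from the start of this post
Blood_Pressure    <- c(103, 87, 32, 42, 59, 109, 78, 205, 135, 176)
First_Assessment  <- c(1, 1, 1, 1, 0, 0, 0, 0, NA, 1)
Second_Assessment <- c(0, 0, 1, 1, 0, 0, 1, 1, 1, 1)
Final_Decision    <- c(0, 1, 0, 1, 0, 1, 0, 1, 1, 1)

# Split the plotting area into 1 row x 3 columns and remember the old layout
old_par <- par(mfrow = c(1, 3))

# Assumed first panel: the NA in First_Assessment is dropped automatically
boxplot(Blood_Pressure ~ First_Assessment,  names = c("low", "high"),
        main = "First assessment")
boxplot(Blood_Pressure ~ Second_Assessment, names = c("low", "high"),
        main = "Second assessment")
boxplot(Blood_Pressure ~ Final_Decision,    names = c("low", "high"),
        main = "Final decision")

# Restore the previous layout
par(old_par)

# Numeric check: min and max blood pressure per second-assessment group
bp_ranges <- tapply(Blood_Pressure, Second_Assessment, range)
print(bp_ranges)
```

Saving the value returned by `par()` and passing it back afterwards is one way to avoid having to remember the old layout by hand.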

In order to make this next plot, I had to reset my layout to one row and one column with par(mfrow=c(1,1)).  This one compares blood pressure to the frequency of visits and shows that the results are all over the board and that there’s really not enough data to make a good correlation anyway.

> par(mfrow=c(1,1))
> boxplot(Blood_Pressure~Frequency_of_Visits)

Finally, we make a histogram of the frequency of visits, which shows that patients are more likely to visit less frequently.

> hist(Frequency_of_Visits, col="#0898d7")

## Functions: Adventures In Trying To Think About What I Want Done

Functions are great.  You type up a little code, execute your function, and then watch as your automation takes over the world.  Easy enough, right?  Wrong.  Well, sort of wrong.  If you know what you want to do and how to do it, then it’s easy.  If you know what you want to do, but don’t know how to do it, then it’s a little hard.  And if you don’t know what you want to do (and probably wouldn’t know how to do it even if you did), then it’s hard.  Luckily for me, I have a big fat database of the Magic: The Gathering cards I own and thought it’d be interesting to try to make a function on it.  The easiest thing that I thought I could do would be various analyses on the casting cost of each card.

Knowing what I want to do? Check.  The problem for me, though, was how my database stored the information.  Normally a card has some numbers and symbols showing how much it costs to play, and that cost can be converted to an alphanumeric representation.  For example, the card on the left has a cost of 6RR.  For whatever reason, my database would have displayed that as {6}{R}{R}, which I suppose would be usable if I really knew what I was doing.  But I don’t.  I tried Googling around for how to split columns, but decided that that was over my head.  So I cheated and went into Excel to quickly separate out the values into additional columns, then imported my new data back into RStudio.  I used the count function to make tables of each column and to give a, well, count of each value.

Ugly “I don’t know what to do with these” Costs

Less Ugly “I can probably manage these” Costs

From there I figured that I’d need to combine all 7 tables into one, but they had different column headers, so I figured that I’d have to find out how to fix that too. I thought that might take me a while, so I decided to just get started with my function, have it create all of the tables, and figure out how to merge them together later.  Well, creating the function as essentially a batch command processor turned out to not be as straightforward as anticipated.  I kept having the function try to call the series of

x <- count(Inventory_520782772_2016.February.07truncated, "X")…

to make all of the tables from calling just my manasymbols function.  Well, that didn’t work.  I tried making the tables with data.frame and as.data.frame.  I tried making the function with manasymbols <- function, manasymbols <- function(), and manasymbols <- function(x).  After a long struggle of trying different combinations of these, I learned 1. that copy/pasting my code into RStudio doesn’t quite make the function properly and 2. that the function won’t make the object a global with just the x <-.  I had to use x <<- in order for the object to be passed from inside of the function to the global level.
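Looking back, a rough base-R sketch of the same idea might have avoided both the Excel detour and the `<<-`. Everything here is illustrative: the `costs` vector is made-up sample data in the {6}{R}{R} style, and `split_cost` and `mana_symbols` are hypothetical helper names, not the function I actually wrote. The function returns its result, and the caller assigns it, so nothing needs to be pushed into the global environment from inside.

```r
# Made-up sample of cost strings in the {6}{R}{R} style described above
costs <- c("{6}{R}{R}", "{1}{G}", "{R}", "{2}{G}{G}")

# Pull out whatever sits between each pair of braces for one cost string
split_cost <- function(cost) {
  regmatches(cost, gregexpr("[^{}]+", cost))[[1]]
}

# Count every symbol across all cards and return the table.
# Returning the value (instead of assigning with <<- inside the function)
# lets the caller decide what the result is called.
mana_symbols <- function(costs) {
  symbols <- unlist(lapply(costs, split_cost))
  table(symbols)
}

symbol_counts <- mana_symbols(costs)
print(symbol_counts)
```

Because all the symbols end up in one vector before counting, there is only one table to deal with, so the "merge 7 tables with different headers" problem does not come up in this version.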

## Matrices, Data Frames, and the Battle Against Nonnumerical Classes

Warning message:

In mean.default(polls.df):

argument is not numeric or logical: returning NA

I hate this message.  It has been the bane of my existence for far too long.  From what I understand, this warning gets thrown at you when mean() is asked to average something that isn’t numeric or logical.  In my case, I passed it the whole data frame, and the column for Name is not numeric.

I read that this is a relatively new problem, as R before version 3.0 would just ignore the illogical request.  Two ways to get the right answer were

colMeans(polls.df[,2:3])

and

lapply(polls.df, mean, na.rm = TRUE)

Neither of these felt like particularly satisfactory ways to produce the mean.  Through this process, I came up with some rather weird outputs that are useless.  But I’m sure that looking back at these, from the future, will yield a good “What was I thinking?” moment, so I’ll include them.
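A slightly tidier variant, sketched here with made-up numbers, is to pick out the numeric columns first so the Name column never reaches mean() at all. The column names match the ones in my polls.df, but the candidates and values below are placeholders, not my real poll data.

```r
# Placeholder data: same column names as polls.df, invented values
polls.df <- data.frame(
  Name = c("Candidate A", "Candidate B", "Candidate C"),
  ABC_political_poll_results = c(4, 62, 51),
  NBC_political_poll_results = c(12, 75, 43)
)

# TRUE for each column that holds numbers, FALSE for Name
numeric_cols <- sapply(polls.df, is.numeric)

# Column means over the numeric columns only
poll_means <- colMeans(polls.df[, numeric_cols, drop = FALSE], na.rm = TRUE)
print(poll_means)
```

Unlike `polls.df[,2:3]`, this does not depend on knowing which column positions are numeric.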

Getting ggplot working was a tough hike as well.  I spent well over an hour trying to get a plot before I realized that, though I had installed the package, I hadn’t actually loaded it with library(). Once I was able to get a basic plot working, I decided that I wanted to try to overlay both ABC_political_poll_results and NBC_political_poll_results onto one plot to show how the candidates vary between the two.  But try as I might, I was not successful, so I have only two separate, basic graphs to show the candidates’ results for each respective poll.

ggplot(polls.df, aes(x = Name, y = ABC_political_poll_results)) + geom_point() + geom_point(data = polls.df, aes(x = Name, y = NBC_political_poll_results))
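One common way to get both polls onto a single plot is to stack them into "long" format first (one row per candidate per poll) and let colour distinguish the polls. This is only a sketch under assumptions: the data values are invented, `polls_long`, `Poll`, and `Result` are names introduced for this example, and the plot is skipped if ggplot2 is not installed.

```r
# Placeholder data: same column names as polls.df, invented values
polls.df <- data.frame(
  Name = c("Candidate A", "Candidate B", "Candidate C"),
  ABC_political_poll_results = c(4, 62, 51),
  NBC_political_poll_results = c(12, 75, 43)
)

# Long format: each candidate appears once per poll
polls_long <- data.frame(
  Name   = rep(polls.df$Name, times = 2),
  Poll   = rep(c("ABC", "NBC"), each = nrow(polls.df)),
  Result = c(polls.df$ABC_political_poll_results,
             polls.df$NBC_political_poll_results)
)

# Only draw the plot when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  overlay_plot <- ggplot(polls_long, aes(x = Name, y = Result, colour = Poll)) +
    geom_point(size = 3)
  print(overlay_plot)
}
```

Mapping `Poll` to colour also produces a legend, which two separate geom_point() layers without a colour mapping would not.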

Unfortunately, I wasn’t really able to get anything else constructive done, though I did just realize that summary() worked fine at calculating the means.  That confuses me even more about why mean() wasn’t able to pass muster.

## Objects, Functions, and What the Heck are Vectors anyways?

The first thing I did was import the data with the code:

This imported all the data into an object named acs.  I thought it was quite neat to be able to pull the information directly from the web, and that it was easier than importing from a text file.  To my understanding, data sets like this are the objects of R.  Importing this object went smoothly.  The place where I had issues, and the main reason this post is late, is vectors.  For whatever reason, my brain isn’t clicking with them.  I know vectors in physics and vectors in graphics, but vectors in R aren’t meshing with me.  The assignment had me access the data as a vector with

acs[1,3]

and it returns a value of 62, but that number doesn’t mean anything to me.  I guess it’s because the data doesn’t exist as a table like I visualize in my mind.  Anyway, next up is the subset function.  I was able to make a subset of all the data in acs in which the age of the husband is greater than the age of the wife with

a <- subset(acs , age_husband > age_wife)

Functions in R are the tools that are used to manipulate the data objects.  There are some, like subset, that are built into the language, but we can also create our own. Using more of the subset function, I used

w <- subset(acs , income_wife > income_husband)

and

h <- subset(acs, income_husband > income_wife)

to create subsets of the data for when the husband makes more than the wife and for the reverse.  I then used the mean function to determine the mean number of children for each case.
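To make the indexing and the subset-then-mean steps concrete, here is a small sketch on a made-up stand-in for acs. The column names are the ones used above, but the column order and every value are invented, so acs[1,3] here will not match the 62 from the real data.

```r
# Made-up stand-in for acs: the real data has more rows and columns
acs <- data.frame(
  age_husband     = c(45, 30, 52, 38),
  age_wife        = c(43, 33, 50, 38),
  number_children = c(3, 0, 3, 1),
  income_husband  = c(50000, 30000, 0, 42000),
  income_wife     = c(20000, 45000, 61000, 40000)
)

# [row, column]: row 1, column 3 is a single value
first_value <- acs[1, 3]

# Leaving the row blank returns the whole third column as a vector
children_vector <- acs[, 3]

# Same subsets as above: wife earns more, husband earns more
w <- subset(acs, income_wife > income_husband)
h <- subset(acs, income_husband > income_wife)

# Mean number of children in each case
mean_children_w <- mean(w$number_children)
mean_children_h <- mean(h$number_children)
```

Here mean() works without complaint because `w$number_children` is a plain numeric vector rather than a whole data frame.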

I tried to quickly graph the difference between the two, but soon realized that it would be significantly more involved than simply using the plot function.  I was, however, able to make histograms to compare the two.

The formatting for w$number_children is weird for some reason.  I don’t see why there should be empty gaps between the x values, but I’ll chalk that up to the simplicity of the hist function for now.
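My best guess at the gaps: the number of children is always a whole number, so if hist() picks bins narrower than 1, the bins that fall between the integers stay empty. Two possible workarounds are sketched below on invented counts (the real values would come from w$number_children); `child_counts` and `half_breaks` are names introduced for this example.

```r
# Invented child counts standing in for w$number_children
number_children <- c(0, 0, 1, 1, 1, 2, 2, 3, 5)

# Option 1: one bar per distinct count, no binning involved
child_counts <- table(number_children)
barplot(child_counts, xlab = "Number of children", ylab = "Households")

# Option 2: keep hist(), but centre one bin of width 1 on each integer
half_breaks <- seq(-0.5, max(number_children) + 0.5, by = 1)
hist(number_children, breaks = half_breaks,
     main = "", xlab = "Number of children")
```

With the second option, a gap only appears where a count genuinely never occurs (4 in this made-up data).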

Edit: It looks like the underscore in the URL doesn’t display properly. I’ll try to look into it later because I’m sure it’ll be a recurring issue.  Also, I think I just got vectors working in my head.  I had my acs table sorted with decreasing household values.