Dota 2 is a popular multiplayer online game developed by Valve that pits two teams of five players against each other. Each player picks one character, or hero (no two players can pick the same hero, regardless of team), from a pool of 110+ with the goal of destroying the opposing team's base. Since 2011, tournaments hosted by Valve have totaled over $80 million in prize pools (Figure 1), making it far and away the most lucrative e-sports title. Those exploding prize pools have attracted new players over time. However, being a new player in any game is difficult, let alone one with a learning curve as steep as Dota 2's. The sheer number of heroes in the game, combined with how unique each one is, means that some synergies and combinations are more successful than others, which makes the hero picking process difficult to navigate even for professional players. Using this kaggle dataset on Dota 2 matches provided by opendota.com, a stat tracking website for Dota 2 players, I explore hero picks to gain insight into individual hero success, the success of hero combinations, and hero utility.
Let's start with the data itself, for some additional context behind the numbers. The data set is a parse of 50,000 games, roughly the number of games played in an hour. The entire set is a 434 MB compilation of 18 spreadsheets of different information; I mainly used players.csv, match.csv, and hero_names.csv, which together hold 89 attributes. To get the data into a useful form for this project, I had to stitch together several of the provided spreadsheets and clean up values. I tossed out 35 games that were missing data and 7,745 games in which a player left the game, since a player leaving significantly alters the course of a game and it wouldn't be proper to include those games in this analysis. This leaves 42,220 games, covering 211,100 hero picks and their win rates.
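The stitching itself is just a couple of joins plus the filtering described above. Here's a minimal sketch with dplyr, using the Kaggle file names; the exact leaver codes and the hero_id sentinel for a missing pick are my assumptions, so check the dataset's documentation before reusing this:

library(dplyr)

players <- read.csv("players.csv")      # one row per player per match
matches <- read.csv("match.csv")        # one row per match
heroes  <- read.csv("hero_names.csv")   # maps hero_id to a readable name

# Attach match-level info and hero names to every pick
picks <- players %>%
  inner_join(matches, by = "match_id") %>%
  left_join(heroes, by = "hero_id")

# Drop matches with missing picks or leavers. I'm assuming hero_id == 0
# marks a missing pick and leaver_status > 0 marks an abandon.
bad <- picks %>%
  group_by(match_id) %>%
  summarise(drop = any(hero_id == 0 | leaver_status > 0)) %>%
  filter(drop) %>%
  pull(match_id)

picks <- filter(picks, !match_id %in% bad)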
From Figure 2 (I suggest looking at the full resolutions of these visualizations), we can see significant differences in picks across the cast of heroes. The natural assumption is that successful heroes are picked more often: it's easy to conclude that Windranger and Shadow Fiend must have very high win rates and that Chen and Elder Titan have very low ones. That would be a hasty and erroneous conclusion. In Figure 3, we add a secondary axis to show each hero's win rate, a dotted red line to mark the 50% level, and lower the alpha of the bars to make the points easier to see.
Figure 3 makes it much easier to see that assumptions about the relationship between pick rate and win rate aren't straightforward. Among the 12 most picked heroes, there is an even 6-6 split between the top and bottom halves of win rate; among the 12 least picked heroes, 8 have win rates below 50%. Later on we'll see that this isn't even a great measure of hero performance. Dota 2 is first and foremost a team game, so an individual hero's performance isn't a good measure on its own; we need to understand how heroes interact with each other. If you're starting a pickup game of basketball and have a whole bunch of people to pick from, you don't want to pick 5 people who primarily want to play point. You need a balanced team of people who play well together and complement each other. The same concept holds for Dota 2.
To try to uncover hero synergies and hero utility, or how well a hero plays with others, I generated all unique 2-hero combinations for each team (sketched after this paragraph) and made a set of visualizations that pair the heroes together. For example, a 3-person team of ABC yields the combinations AB, AC, and BC. Why focus on hero pairs? Team based games are generally the most fun when you play with friends, so the goal is to give a player a framework for picking a hero that is likely to succeed alongside the hero a friend has already picked. Figure 4 aims to bring out common pairings by comparing them against the median pick count (for reference, the median pick count for hero pairs was 78). Note: the empty spaces are hero ids that were not in use at the time of the data collection. As an example of reading the chart, Earthshaker and Shadow Fiend are paired together roughly 20 times more often than the median. The color scale lets us see streaks of common pick rates: if you look at Windranger's X and Y axes, you'll see that they're significantly lighter than the other squares, a reflection of just how much higher her pick rate is compared to other heroes'.
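Generating the pair list is mostly a job for combn(). A rough sketch, assuming a picks data frame with one row per pick and match_id, team, hero, and win columns (the names, and the 0/1 win flag derived from radiant_win and the player's side, are mine):

library(dplyr)

# Enumerate the choose(5, 2) = 10 unique hero pairs on one team; sorting
# first makes (A, B) and (B, A) collapse into a single pair.
pair_up <- function(heroes) {
  combos <- t(combn(sort(heroes), 2))
  data.frame(hero1 = combos[, 1], hero2 = combos[, 2])
}

pairs <- picks %>%
  group_by(match_id, team, win) %>%
  do(pair_up(.$hero)) %>%
  ungroup()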
If we use the same graph type but chart win rates rather than pick rates, we can see how successful hero pairs are together. Here, in Figure 5, successful pairs are more yellow and unsuccessful pairs are more blue. You'll see some black speckles around (e.g. Jakiro & Oracle, Omniknight & Elder Titan, Chen & Sand King | Warlock | Enigma): these are all pairs that were played together a single time in the data set. At the time this data was collected, there were 110 playable heroes, which means there are 5,995 unique pairs of heroes (the number of unique pairs is n(n-1)/2). It's pretty notable that some of those combinations weren't picked at all when you consider that a single game produces 20 unique hero pairs. Like the black spots, the bright yellow spots are of particular interest. If you remember from earlier, Chen had a very low overall pick rate and his axes in the picks-per-median chart are practically dark, yet here he has 5 bright yellow spots indicating a fantastic win rate. This is because I didn't filter these tables by number of picks: Chen has a 100% win rate with Outworld Devourer, Enchantress, Oracle, Elder Titan, and Techies, although all of those pairings combined amount to only 16 co-occurrences.
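The win rates behind Figure 5 (and the later figures) come from a simple tally over that pair list. Continuing the sketch above:

# Tally games and wins for every pair, then compute its win rate
pair_stats <- pairs %>%
  group_by(hero1, hero2) %>%
  summarise(games = n(), wins = sum(win)) %>%
  ungroup() %>%
  mutate(win_rate = wins / games)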
Another way to visualize hero pairs is by chord diagram. Figure 6 is an unlabeled chord diagram showing every hero pair from the data set. Now, this particular diagram isn’t useful due to the sheer number of hero pairs available, but if we start setting rules for it, patterns will start to emerge.
Figure 6. A chord diagram of all Hero pairs.
I started off by only allowing pairs that were at least as frequent as the first quartile (37 games) of overall pick frequency, and then keeping pairs with a minimum 60% win rate, to get Figure 7. For reference, these combinations represent the top 6% winningest combinations of recorded hero pairs. This figure still mostly looks like a bowl of rainbow spaghetti, but we're at least able to trace some of the links. Additionally, the size of each hero's arc shows how many other heroes they pair with at a 60% win rate or higher. Omniknight, Abaddon, Wraith King, Ursa, and Spectre all have a lot of links, implying that they have more utility: they fit into more successful team combinations than heroes with smaller slices, like Visage. An interesting thing to note here is that Shadow Fiend (near the 11 o'clock position), the second most popular hero, has a tiny sliver, sharing a 60+% win rate with just 2 other heroes. Windranger, the most popular hero, isn't represented in this grouping at all.
Figure 7. Chord diagram of 60+% win rate hero pairs
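The filtering that produced Figure 7 is simple, reusing the pair_stats table from earlier:

# Pairs picked at least as often as the first-quartile frequency
# (37 games here) with a 60%+ win rate -- the input to Figure 7
q1 <- quantile(pair_stats$games, 0.25)
chord_pairs <- subset(pair_stats, games >= q1 & win_rate >= 0.60)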
If we narrow the selection of hero pairs down even further, to win rates of 65% and higher, we get Figure 8, the top 2% winningest hero-pair combinations. Here, the dominance of Omniknight, Ursa, and Wraith King grows, while Abaddon stays about the same and Spectre becomes less common. The heroes displayed here are significantly represented in picks and win a lot with their partners; those with more links win a lot across more, and more varied, teams. Coincidentally, the heroes here with the biggest arc lengths are also, generally, easy to play, which makes them great choices for beginners. If you're already knowledgeable about Dota 2 heroes, you can use the diagram to draw conclusions about why these pairs work well together. For example, Lich is paired with Wraith King, Disruptor, Spectre, and Slardar. Lich synergizes very well with all of these heroes because his abilities slow and stun, either covering a partner's weaknesses or stacking with their abilities for more bang for your buck.
Figure 8. Chord diagram of 65+% win rate hero pairs. Top 2% of hero-pair win rates.
It's important to adjust for hero success when looking for these links. If you follow the same method as for Figure 8 but look at raw pick frequency instead of win rate, you get Figure 9. Windranger and Shadow Fiend reappear, representing nearly a third of the pairs. The mean win rate in Figure 9 is pretty much a coin toss, weighing in at just 50.5%; for comparison, the mean win rate for the pairs in Figure 8 is 67.5%.
Figure 9. Chord Diagram of top 2% of pick frequency.
The chord graph makes it difficult to see how hero pairs are related to each other. Figure 10 solves this by laying each hero out in a network, showing the intra-pairing relationships. The color is scaled from lower (red) to higher (blue) win rates, and the width of each connection is scaled by how many times the pair was picked. Heroes can then be related to each other by who they have in common or how far removed they are from everyone else. For example, Brewmaster is way out on the fringe, connected only to Mirana, who in turn is connected only to Lycan. An example of indirect similarity is Ursa and Wraith King: they have no direct connection, but they are linked to each other through Medusa, Omniknight, Venomancer, and Zeus. It's hard to see in Figure 10, so I singled them out in Figure 11. Relationships like this suggest that indirectly linked heroes fill similar roles.
Figure 10. Network of hero pairs with 65+% win rate.
Figure 11. Relationship between Wraith King and Ursa from Figure 10.
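For reference, Figure 10's network can be sketched with igraph, reusing pair_stats and q1 from earlier; the aesthetic mappings here are my approximations of the description above:

library(igraph)
library(scales)

# The 65%+ pairs as a network: edge color runs red (lower win rate) to
# blue (higher), and edge width scales with how often the pair was picked
strong <- subset(pair_stats, games >= q1 & win_rate >= 0.65)
g <- graph_from_data_frame(strong[, c("hero1", "hero2")], directed = FALSE)

ramp <- colorRampPalette(c("red", "blue"))(100)
E(g)$color <- ramp[cut(strong$win_rate, 100, labels = FALSE)]
E(g)$width <- rescale(strong$games, to = c(1, 8))

plot(g, vertex.size = 4, vertex.label.cex = 0.7)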
Through this project, I've created a framework to explore and assess hero relationships and success with visualizations. Along the way, I've used time series analysis, part-to-whole and ranking analysis, correlation analysis, multivariate analysis, and a couple of different types of network analysis. For future work, I'm also using this data with machine learning techniques to try to predict victory from hero pairs.
The code I used can be seen on my GitHub.
I think that Alan Smith and David McCandless have both had significant impacts on how I view data visualization. In Smith's TEDx talk, he shows how surprisingly lacking our numeracy skills are, how bad we are at perceiving statistics about society, and how he used strategies to represent numbers through icons rather than presenting fractions and percentages. McCandless's TED talk shows how relatively simple charts, given the context of other data, can uncover interesting insights. Both talks suggest that visualization creation should take the audience into consideration. Smith wanted to create an engaging tool to show us how our perceptions of our own areas compare to the actual facts, but he was aware of how weak many people's basic math skills are and adjusted accordingly. McCandless talks about how bad we are at putting big numbers in context and adjusts his examples by making comparisons and normalizing data. As always, context is key. This is true both for extracting meaning from data and for gauging the skills and literacy of your prospective readers.
Ultimately, it seems to me that Tufte's principles revolve around an overarching aim for visualizations to stand on their own as independent sources of information. While discussing his principle of the integration of evidence, he says, "words, numbers, pictures, diagrams, graphics, charts, tables belong together" and explains that all of these tools should be integrated into a comprehensive visualization. This pairs with his last principle, that content counts most of all: information-rich visualizations are the medium for answering his proposed question, "What are the content-reasoning tasks that this display is supposed to help with?"
I think this differs from Few's principles, as Few, so far, has aimed at simplicity and readability. Chapter 4 of Now You See It, for example, is focused almost entirely on those goals: sorting, scaling, and aggregating ensure that visualizations are presented in ways that are logical, intuitive, and easy to interpret. Tufte says, "Perhaps the numbers or data points may stand alone for a while, so we can get a clean look at the data, although techniques of layering and separation may simultaneously allow a clean look as well as bringing other information into the scene". To Tufte, you may temporarily view some data in isolation, but he goes on to say that you can probably just use layering techniques to view that data in the context of everything else. I love beautiful representations of complex data, and the examples Tufte presents in Beautiful Evidence are stunning, but I can't help feeling that sometimes it really is better to isolate some data. Tufte's theme is great and is sure to produce some fantastic visualizations, but parts of his principles seem suited to creating visualizations for the visual analytics crowd, and not necessarily to creating quick, easy-to-grok visualizations for everyday users.
This graph is a good example of what I'm talking about, as it's the centerpiece of Tufte's chapter "The Fundamental Principles of Analytical Design". Many of my classes so far have talked about the lack of rigor or willful ignorance displayed by end users of information: people generally won't read emails, won't read beyond the abstracts of papers, won't follow instructions that are too long, the PEBCAK issue in IT, and so on. These apparently general human tendencies make me question how much effort we can expect end users to put into understanding a visualization. Can we expect them to read a paragraph explaining how the graph works before even looking at the data to understand the information encoded in it? How many won't notice that these troop flows are transposed over the topography of western Russia? What other information will be lost in translation?
This style of graph is exactly what I was imagining for my final project's data; my thought was to have a split matrix showing multiple visualizations. One side note first: in order to use the corrgram package, I had to manually install TSP, registry, and dendextend. I'm not sure why that is, but if anyone has issues running corrgram(), take careful note of the errors in the console.
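What eventually worked for me was just installing those dependencies by hand before the package itself:

install.packages(c("TSP", "registry", "dendextend"))
install.packages("corrgram")
library(corrgram)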
As you can see from this massive image, only half of the matrix is used. I'd like to render another set of visualizations in the other half.
This brings us to correlograms. corrgram() makes it easy to generate a split visualization for multiple variables. The graph below was produced simply with this:
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
         main="Car Milage Data in PC2/PC1 Order")
Using that same code, but with my data instead of mtcars, you get the chart that you see below (I even left the header). The first thing to notice is that the corrgram compares the variables to each other, whereas my big chart plots the value of one attribute (win_percentage) and uses two other attributes as the x and y axes (Hero1 & Hero2).
The Hmisc and corrplot packages may have something more for me to utilize, but it looks like I'll need to reshape my data, as Hmisc seems to only work with matrices.
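The reshape would probably look something like this, pivoting the long hero-pair table from the Dota 2 project into the square shape that matrix-based packages expect:

library(reshape2)

# Pivot the long pair list into a hero1 x hero2 matrix of win rates
win_matrix <- acast(pair_stats, hero1 ~ hero2, value.var = "win_rate")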
Here is a scatter plot by @KenSteif on twitter that shows the distribution of the prices of LEGO sets. On top of the graph, Ken has superimposed a regression line to show the trend in cost. From the graph you can see that the data is ungrouped and that there are a few outliers in the top left and far right areas; the scatterplot is, however, generally homoscedastic.
library(ggplot2)

# Weights on the final day, colored by diet
df <- subset(ChickWeight, Time == max(Time))

ggplot(df, aes(x = weight, fill = Diet)) +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = list(mean = mean(df$weight), sd = sd(df$weight)))
I tried to make a histogram from the ChickWeight dataset by looking at the weights on the final day and coloring them by diet. However, adding the normal curve didn't go well: I wanted the graph to display counts and have the normal curve scaled to an appropriate Y value, but I didn't make any significant progress. Most of the answers I found reset the Y axis to density and then plot the density curve. It really bugs me that I didn't have a quick or easy solution to this given my experience with R. ggplot2 is a system I know I need to work on, and am actively trying to improve at.
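One approach I've since come across (untested against this exact plot, so treat it as a sketch): keep the y axis in counts and scale the curve by hand, since a normal density times n × binwidth approximates the expected count per bin. Reusing df from above, with a binwidth of my own choosing:

bw <- 15  # binwidth -- my arbitrary choice

ggplot(df, aes(x = weight, fill = Diet)) +
  geom_histogram(binwidth = bw) +
  stat_function(inherit.aes = FALSE,  # don't inherit fill = Diet
                fun = function(x) dnorm(x, mean(df$weight), sd(df$weight)) * nrow(df) * bw)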
Here's a Pareto chart (made with this guide) I made with a subset of data from this Kaggle dataset, which I'm using for another project. I'm still working out lookup tables to label the x axis with actual names instead of hero_ids (see the sketch below), but that's for another time. The chart is kind of hard to read and shows that the frequencies for these top 10 most frequently picked heroes are generally very close; to get the exact value for each bar, you have to do some math with the cumulative counts. I think it would be better with cumulative percentages labeled on the right Y axis, regular counts on the left, and the chart rescaled so that the line chart isn't so visually dominating. Making the chart wider would also help a little. Furthermore, the header needs to be fixed to be relevant to the data. The RStudio guide was a good starting point, but it definitely could use some refinements.
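The lookup itself should just be a join against hero_names.csv; pick_counts here stands in for whatever frame backs the Pareto chart:

library(dplyr)

heroes <- read.csv("hero_names.csv")  # has hero_id and localized_name
pick_counts <- left_join(pick_counts, heroes, by = "hero_id")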
Here is a time-series graph I made of some data I collected from automated, neural network runs of Super Mario World. Plot.ly makes this super easy by allowing you to quickly select the axes and add new variables. Here, I will be discussing some of Stephen Few’s best practices for time-series analysis and how they apply to this graph:
- Aggregating to Various Time Intervals
I’ve chosen to display this data with the time interval being each “generation” of the neural net. Increasing the time resolution to each species and genome creates a huge amount of noise in the data, in which trends get lost. Going by each generation makes it much, much easier to see progression.
- Viewing Time Periods in Context
Including the data from the start to the finish of each level allows us to see the entire progression of each generation of the neural network. It's easy to narrow the time scale in a way that makes it look like no progress has been made. Plot.ly makes it very easy to zoom into smaller time periods, while also keeping the full context of the graph a click away.
- Optimizing a Graph’s Aspect Ratio
The problem here might be how I have my website set up (I'm working on a redesign), but this graph is very cramped. Viewing it on Plot.ly's site is much more comfortable. Yoshi's Island 4 was completed relatively quickly, but that's hard to see on my site due to its location.
- Stacking Line Graphs to Compare Multiple Variables
I’ve combined several levels of data to see how progression and the number of generations to completion vary. Stacking the line graphs makes it much easier to read than having a set of individual graphs, though organizing it can create other readability troubles.
- Expressing Time as 0-100% to Compare Asynchronous Process
I did not utilize this in my graph, but I think it would be interesting to see. Doing so could possibly uncover some level design patterns. The Super Mario games are known for being designed in a way that introduces specific level mechanics early on, generally in a safe way, then ramps up their implementation. I suspect this would be evident when viewing the levels as percentages of completion, as in the sketch below.
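The re-expression itself is a per-group normalization. A rough sketch, where runs, level, and frame are hypothetical names for the collected data:

library(dplyr)

# Re-express each level's run as 0-100% complete so runs of different
# lengths line up on one axis
runs <- runs %>%
  group_by(level) %>%
  mutate(pct_complete = 100 * (frame - min(frame)) / (max(frame) - min(frame)))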
Here is my first visualization using plot.ly. I had also wanted to add a pie chart below each user's name to show what percentage of their total followers came from Facebook versus Twitter, but I'm not sure if plot.ly allows that. Importing the data was mostly easy, except that headers in xlsx files were not imported as headers. I solved this by just using a csv version of my data.
I created the Total variable as an aggregate of the Facebook and Twitter values in order to re-express the comparison between users. The data is sorted by the highest number of Total followers, which makes it easier to read. The Y axis is scaled linearly so that the differences are clear and unambiguous. Each value is color coded, which highlights each grouping and makes it easier to follow values across users. Additionally, each axis is labeled and the legend reflects the color coding of each value.
I quite enjoy infographics and regularly see creative visual implementations. The social media outlets I follow for visualizations are the subreddits r/dataisbeautiful, r/infographics, and r/dataisugly. Dataisugly is a great contrast to the other two, showing a ton of great (in the bad way) examples of how bad data visualization can be. I also look at hockeyviz.com regularly, which does a fantastic job of showing advanced visualizations, for example the relationships between player pairings and success. Another website I follow is spaghettimodels.com, which shows visualizations for a huge variety of weather data. The applications I have actual experience with are R and Excel. Others I have little or no experience with, but am aware of, are plot.ly, Gapminder, Tableau, and SAS.
For R, I like how "in charge" of your visualizations you are. Excel can often generate a plot or chart quickly, but many times you aren't able to make specific changes to the visualization. Plot.ly is nice because of its accessibility, but I don't know much about its capability. Gapminder looks great in Hans Rosling's TED Talks, which are some of my favorite TED Talks, but I don't know anything about it in terms of applicability, usability, or capability.
I have a set of 359,043 tweets (in an object named zq) with 27 variables, collected from 8/25 to 9/5, that contain the word "zika". After removing tweets with duplicate tweet_ids and tweets that were scraped/parsed incorrectly, I was left with 358,613 tweets. I would like to determine how different hashtags affect how often a tweet with an image gets retweeted.
From looking at the data previously, I noticed that bees were a hot topic and not something we normally think about within the context of zika. I suspect that, within this data set, using bees as a hashtag is associated with a higher number of retweets per tweet than some other hashtags. I'm going to compare #bees with #zika and #mosquito.
H0: # of retweets per tweet ( bees = zika = mosquito )
H1: # of retweets per tweet ( bees > zika | mosquito )
To start off, we need to isolate the tweets that have embedded images. I have determined that the easiest and most accurate way to identify an embedded image is to look for “photo/1” in the parsed_media_url field of the data set:
zqimage <- zq[grep("photo/1", zq$parsed_media_url),]
This leaves us with 83,266 tweets that have images embedded in them. Now, the hashtags are stuck within the "hashes" column, delimited by semicolons. Using tidyr, we can separate them into their own columns:
library(tidyr)
# Split the semicolon-delimited hashes column into up to 20 hashtag columns
zq1 <- separate(zqimage, col = hashes, into = paste0("hash", 1:20),
                sep = ";", extra = "merge", fill = "right")
This adds 19 more columns, each holding either a hashtag or an NA value. From here, we can use apply to count the frequency of each hashtag and build a table of the most popular ones:
library(dplyr)  # for arrange() and desc()
# Count hashtag frequencies in each of the 20 hash columns
hashes <- apply(zq1[, 24:43], 2, FUN = function(x) plyr::count(x))
# Stack the per-column counts, then sum duplicates across columns
hashes <- do.call(rbind.data.frame, hashes)
hashes <- plyr::ddply(hashes, "x", plyr::numcolwise(sum))
# Drop the NA row and keep the 100 most frequent hashtags
hashes <- hashes[!is.na(hashes[, 1]), ]
hashes <- head(arrange(hashes, desc(freq)), n = 100)
The hashtags I want to compare are "bees" and "mosquito", the 14th and 30th most common hashtags respectively. For further analysis, I need to subset the rows to only the max value of retweet_count for each unique tweet. I'll do this on zqimage rather than the hashes object so that I can still grep the unseparated hashes column later:
# Re-subset from zq so the unseparated hashes column is intact
zqimage <- zq[grep("photo/1", zq$parsed_media_url), ]
# Keep only the most retweeted/favorited copy of each unique tweet text
zqimage <- zqimage %>% group_by(text) %>%
  filter(retweet_count == max(retweet_count),
         favorite_count == max(favorite_count))
zqimage <- arrange(zqimage, desc(retweet_count))
I want to compare the contributions of different hashtags, so I can drop any rows that have no hashtags at all:
zqimage <- zqimage[!is.na(zqimage$hashes), ]
This leaves us with 9,651 unique tweets that have hashtags and embedded images. To conduct an ANOVA test, we need to categorize the tweets, so we create a new column recording whether each tweet's hashtags include "bees", "mosquito", or "zika":
zqimage$hashhash <- ifelse(grepl("bees", zqimage$hashes, ignore.case = TRUE), "bees",
ifelse(grepl("mosquito", zqimage$hashes, ignore.case = TRUE), "mosquito",
ifelse(grepl("zika", zqimage$hashes, ignore.case = TRUE), "zika", "Other")))
From here, we can fit the ANOVA:
fit <- aov(retweet_count ~ hashhash, data = zqimage)
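Inspecting the fit gives the F test and the p value discussed at the end:

summary(fit)  # the Pr(>F) column holds the p value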
Making a boxplot of the data shows something very telling:
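The call is something like base R's formula interface (reconstructing the plot shown):

boxplot(retweet_count ~ hashhash, data = zqimage)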
It looks like a glut of tweets with abnormally high retweet counts is making it hard to see what's going on, so let's look at a log scale of only the tweets that have been retweeted at least once:
zqn0 <- zqimage[zqimage$retweet_count != 0, ]
boxplot(log(zqn0$retweet_count) ~ zqn0$hashhash)
From there, we can see clearly that the bees hashtag performs better than mosquito, though it looks almost neck-and-neck with zika. If we look at the values of fit$coefficients, we can see that bees edges out zika in performance too:
(Intercept) hashhashmosquito hashhashOther hashhashzika
7.1780822 -4.9327290 -4.8003600 -0.9336632
I used the granovagg package to create the above graphic, which again shows that bees outperforms mosquito, even though the mosquito hashtag occurred nearly an order of magnitude less often than case-insensitive "bees", and that bees also outperforms case-insensitive "zika", despite zika hashtags occurring over 22,000 more times.
The p value for the hashes is very small, leading us to reject the null hypothesis in favor of the alternative: the bees hashtag performs better than the mosquito or zika hashtags.