Visual Analytics Final Project: Dota 2 Hero Analysis

Dota 2 is a popular multiplayer online game developed by Valve that pits two teams of five players against each other. Each player picks one character, or hero, from a pool of 110+ (no two players can play the same hero, regardless of team), with the goal of destroying the opposing team’s base. Since 2011, tournaments hosted by Valve have totaled over $80 million in prize pools (Figure 1), making it far and away the most lucrative e-sports title. The prize pools of tournaments have exploded over time, attracting new players. However, being a new player in any game is difficult, let alone one with a learning curve as steep as Dota 2’s. The sheer number of heroes in the game, combined with how unique each one is, means that some synergies and combinations are more successful than others. This makes it extremely difficult for even professional players to navigate the hero-picking process on the way to a victory. Using this Kaggle dataset on Dota 2 matches provided by opendota.com, a stat-tracking website for Dota 2 players, I explore hero picks to gain insight into individual hero success, the success of hero combinations, and hero utility.

Figure 1.

We can start off with the data itself so that we have some context for the numbers. The data set is a parse of 50,000 games, which is roughly the number of games played in an hour. The entire set is a 434 MB compilation of 18 CSV files of different information. I mainly used players.csv, match.csv, and hero_names.csv, which together contain 89 attributes. To get the data into a useful form for this project, I had to stitch together several of the provided files and clean up values. I tossed out 35 games that were missing data and 7,745 games that involved players leaving the game; a player leaving significantly alters the course of a game, and it wouldn’t be proper to include those in this analysis. This leaves 42,220 games that include information about 211,100 hero picks and win rates.
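A minimal sketch of that stitching and filtering in R might look like the following. The column names (match_id, hero_id, leaver_status, radiant_win) follow the Kaggle/opendota CSVs, but treat the details, including hero_id == 0 marking a missing pick, as assumptions and adjust for your copy of the data:

library(dplyr)

players <- read.csv("players.csv")      # one row per player per match
matches <- read.csv("match.csv")        # one row per match
heroes  <- read.csv("hero_names.csv")   # hero_id -> localized_name

# Flag matches with a leaver (leaver_status > 0) or a missing pick (hero_id == 0)
bad_matches <- players %>%
  filter(leaver_status > 0 | hero_id == 0) %>%
  distinct(match_id)

# Stitch the remaining picks to hero names and match outcomes
picks <- players %>%
  anti_join(bad_matches, by = "match_id") %>%
  left_join(heroes, by = "hero_id") %>%
  left_join(select(matches, match_id, radiant_win), by = "match_id")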

From Figure 2 (I suggest looking at the full-resolution versions of these visualizations), we can see that there is a significant difference in picks across the cast of heroes. The natural assumption is that successful heroes are picked more often. It’s easy to jump to the conclusion that Windranger and Shadow Fiend must have very high win rates and that Chen and Elder Titan have very low win rates. This would be a hasty and erroneous conclusion. In Figure 3, we add a secondary axis to show each hero’s win rate, a dotted red line to show the 50% mark, and lower the alpha of the bars to make the points easier to see.

Figure 2.


Figure 3.
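For illustration, here’s a minimal ggplot2 sketch of how a dual-axis chart like Figure 3 can be built. The hero_stats frame and its columns (hero, picks, win_rate) are my own placeholder names, not the project’s actual code:

library(ggplot2)

# hero_stats (hypothetical): one row per hero, with pick counts and win rate (0-1)
scale_f <- max(hero_stats$picks)   # maps win rate onto the pick-count axis

ggplot(hero_stats, aes(x = reorder(hero, -picks))) +
  geom_col(aes(y = picks), alpha = 0.4) +                # lowered alpha on bars
  geom_point(aes(y = win_rate * scale_f)) +              # win rate points
  geom_hline(yintercept = 0.5 * scale_f,                 # the 50% mark
             linetype = "dotted", color = "red") +
  scale_y_continuous("Picks",
                     sec.axis = sec_axis(~ . / scale_f, name = "Win rate")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 5))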


Figure 3 makes it much easier to see that assumptions about the relationship between pick rate and win rate aren’t straightforward. Among the top 12 most picked heroes, there is an even 6-6 split between the top and bottom halves of win rate. On the other hand, of the 12 least picked heroes, 8 have win rates below 50%. Later on we’ll see that this isn’t even a great measure of hero performance. Dota 2 is first and foremost a team game, so an individual hero’s performance isn’t a good measure; we need to see and understand how heroes interact with each other. If you’re starting a pickup game of basketball and have a whole bunch of people to pick from, you don’t want to pick 5 people who primarily want to play point. You need to form a balanced team of people who play well together and complement each other. The same concept is true for Dota 2.

Figure 4.


To try to uncover hero synergies and hero utility, or how well a hero plays with others, I generated all unique 2-hero combinations for each team and made a set of visualizations that pair the heroes together. For example, a 3-person team of ABC would lead to the following pairs: AB, AC, BC. Why focus on hero pairs? Team-based games are generally the most fun when you play with friends. The goal of looking at hero pairs is to give a player a way to pick a hero that is likely to be successful alongside the hero a friend has already picked. Figure 4 aims to bring out common pairings by comparing them against the median pick count (for reference, the median pick count for hero pairs was 78). Note: the empty spaces are hero IDs that were not in use at the time of the data collection. As an example of reading the chart, Earthshaker and Shadow Fiend are paired together roughly 20 times more often than the median. The color scale lets us see streaks of common pick rates. If you look at Windranger’s row and column, you’ll see that they’re significantly lighter than the other squares, a reflection of just how much higher her pick rate is compared to other heroes. If we use the same graph type and chart win rates rather than pick rates, then we can see how successful hero pairs are together.
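For reference, here is a hedged sketch of the pair generation, building on the hypothetical picks frame from earlier. The player_slot convention (below 128 means the Radiant side) and the derived team and win columns are my assumptions:

library(dplyr)
library(tidyr)

# Derive each row's team and whether that team won the match
team_picks <- picks %>%
  mutate(team = if_else(player_slot < 128, "radiant", "dire"),
         win  = (team == "radiant") == radiant_win)

# combn() enumerates the unique 2-hero combinations within each 5-hero team
pair_rows <- team_picks %>%
  group_by(match_id, team) %>%
  summarise(pair = list(as.data.frame(t(combn(sort(hero_id), 2)))),
            win  = first(win), .groups = "drop") %>%
  unnest(cols = pair) %>%
  rename(hero1 = V1, hero2 = V2)   # heroes sorted, so each pair has one form

# Per-pair pick counts and win rates feed Figures 4 and 5
pair_stats <- pair_rows %>%
  group_by(hero1, hero2) %>%
  summarise(games = n(), win_rate = mean(win), .groups = "drop")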

Figure 5.


Here, in Figure 5, we chart the win rate of hero pairs. Successful pairs are more yellow and unsuccessful pairs are more blue. You’ll see some black speckles around (e.g., Jakiro & Oracle, Omniknight & Elder Titan, and Chen with Sand King, Warlock, or Enigma). These are all pairs that were played together a single time in the data set. At the time this data was collected, there were 110 playable heroes, which means there are 5,995 unique 2-hero combinations (the number of unique pairs is n(n-1)/2). It’s pretty notable that some of these combinations weren’t picked at all when you consider that a single game produces 20 unique hero pairs. Like the black spots, the bright yellow spots are of particular interest. If you remember from earlier, Chen had a very low overall pick rate and his row and column in the picks-per-median chart are practically dark, but here he has 5 bright yellow spots indicating a fantastic win rate. This is because I didn’t filter these tables by number of picks. As a result, Chen has a 100% win rate with Outworld Devourer, Enchantress, Oracle, Elder Titan, and Techies, although all of these pairings combined amount to only 16 co-occurrences.
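To check the arithmetic, and to sketch how a Figure 5-style heatmap can be drawn from the hypothetical pair_stats frame above:

library(ggplot2)

# Sanity check: 110 heroes give 110 * 109 / 2 unique pairs
choose(110, 2)   # 5995

# Pair win-rate heatmap in the spirit of Figure 5; axes are hero ids here,
# so join to hero_names.csv for readable labels
ggplot(pair_stats, aes(x = factor(hero1), y = factor(hero2), fill = win_rate)) +
  geom_tile() +
  scale_fill_viridis_c() +   # dark blue (low) through yellow (high)
  theme(axis.text = element_text(size = 4))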

Another way to visualize hero pairs is with a chord diagram. Figure 6 is an unlabeled chord diagram showing every hero pair from the data set. On its own, this particular diagram isn’t useful due to the sheer number of hero pairs, but if we start setting rules for it, patterns begin to emerge.

Figure 6. A chord diagram of all hero pairs.


I started by keeping only pairs that were picked at least as often as the first quartile of overall pick frequency (37 games), and then selecting pairs with a minimum 60% win rate, to get Figure 7. For reference, these combinations represent the top 6% winningest combinations of recorded hero pairs. This figure still mostly looks like a bowl of rainbow spaghetti, but we’re at least able to trace some of the links. Additionally, the size of each hero’s arc shows how many other heroes they pair with at a 60% win rate or higher. Omniknight, Abaddon, Wraith King, Ursa, and Spectre all have a lot of links, implying that they have more utility across successful pairs and can fit into more team combinations than heroes with smaller slices, like Visage. An interesting thing to note here is that Shadow Fiend (near the 11 o’clock position), the second most popular hero, has a tiny sliver: just 2 other heroes with whom he shares a 60+% win rate. Additionally, Windranger, the most popular hero, isn’t represented in this grouping at all.
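The filtering itself is short; here’s a hedged sketch using the circlize package’s chordDiagram() and my pair_stats placeholder from earlier:

library(dplyr)
library(circlize)

# Keep pairs at or above the first-quartile pick count and a 60% win rate
top_pairs <- pair_stats %>%
  filter(games >= 37, win_rate >= 0.60)

# chordDiagram() accepts a from/to/value data frame; sectors are hero ids
# here, so map them to names for a labeled version
chordDiagram(select(top_pairs, hero1, hero2, games))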

Figure 7. Chord diagram of 60+% win rate hero pairs.

If we narrow the selection of hero pairs down even further, to win rates of 65% and higher, we get Figure 8. These are the top 2% winningest combinations for hero pairs. Here, the dominance of Omniknight, Ursa, and Wraith King grows, while Abaddon stays about the same and Spectre becomes less common. The heroes displayed here are significantly represented in picks and win a lot with their partners, and the heroes with more links win a lot across a wider variety of teams. Coincidentally, the heroes here with the biggest arc lengths are also, generally, easy to play, which makes them great choices for beginners. If you’re already knowledgeable about Dota 2 heroes, then you can use the diagram to draw conclusions about why these pairs work well together. For example, Lich is paired with Wraith King, Disruptor, Spectre, and Slardar. Lich synergizes very well with all of these heroes because his abilities slow and stun, either covering some of his partner’s weaknesses or pairing with their abilities to get more bang for your buck.

Figure 8.  Chord diagram of 65+% win rate hero pairs. Top 2% of hero-pair win rates.


It’s important to adjust for hero success when looking for these links. If you go through the same method used for Figure 8 but look at just hero pick frequency instead of win rate, you get Figure 9. Windranger and Shadow Fiend reappear, representing nearly a third of the pairs. The mean win rate of the pairs in Figure 9 is pretty much a coin toss, weighing in at just 50.5%. For comparison, the mean win rate for the pairs in Figure 8 is 67.5%.

Figure 9. Chord diagram of the top 2% of hero pairs by pick frequency.


The chord diagram makes it difficult to see how hero pairs are related to each other. Figure 10 solves this by laying out each hero in a network, showing the intra-pairing relationships. The color is scaled between lower (red) and higher (blue) win rates, and the width of each connection is scaled by how many times each pair was picked. Heroes can then be related to each other by seeing whom they have in common or how far removed they are from everyone else. For example, Brewmaster is way out on the fringe, connected only to Mirana, who in turn is connected only to Lycan. An example of indirect similarity is Ursa and Wraith King: they don’t have any direct connection, but they are linked through Medusa, Omniknight, Venomancer, and Zeus. It’s hard to see in Figure 10, so I singled them out in Figure 11. Relationships like this suggest that indirectly linked heroes fill similar roles.
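For Figure 10’s layout, a hedged igraph sketch along these lines would work, again reusing my hypothetical top_pairs frame:

library(dplyr)
library(igraph)
library(scales)   # rescale()

# The first two columns become edges; the rest become edge attributes
g <- graph_from_data_frame(
  top_pairs %>% filter(win_rate >= 0.65) %>%
    select(hero1, hero2, games, win_rate),
  directed = FALSE
)

# Width by pick count, color from red (lower) to blue (higher win rate)
pal <- colorRampPalette(c("red", "blue"))(100)
E(g)$width <- rescale(E(g)$games, to = c(1, 6))
E(g)$color <- pal[cut(E(g)$win_rate, 100, labels = FALSE)]

plot(g, vertex.size = 5, vertex.label.cex = 0.7)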


Figure 10. Network of hero pairs with 65+% win rate.


Figure 11.  Relationship between Wraith King and Ursa from Figure 10.

Through this project, I’ve created a framework to explore and assess hero relationships and success with visualizations. Along the way, I’ve used time series analysis, part-to-whole and ranking analysis, correlation analysis, multivariate analysis, and a couple of different types of network analysis. For future work, I’m also applying machine learning techniques to this data to try to predict victory based on hero pairs.

The code I used can be seen on my GitHub.


Other voices about visual analytics

I think that Alan Smith and David McCandless have both had significant impacts on how I view data visualization. In Smith’s TEDx talk, he shows how numeracy skills are surprisingly lacking, how bad we are at perceiving statistics about society, and how he used strategies to represent numbers through icons rather than presenting fractions and percentages. McCandless’s TED talk shows how relatively simple charts, given the context of other data, can uncover interesting insights. Both of these talks suggest that visualization creation should take the audience into consideration. Smith wanted to create an engaging tool to show us how our perceptions of our local areas compare to the actual facts about them, but he was aware of how weak many people’s basic math skills are and adjusted accordingly. McCandless talks about how bad we are at putting big numbers in context and adjusts his examples by making comparisons and normalizing data. As always, context is key. This is true both for extracting meaningful data and for gauging the skills and literacy of your prospective readers.

Tufte vs Few and complexity for modern users

Ultimately, it seems to me that Tufte’s principles revolve around an overarching format that aims for visualizations to stand on their own as independent sources of information. While discussing his principle of the integration of evidence, he says, “words, numbers, pictures, diagrams, graphics, charts, tables belong together,” and explains that all of these tools should be integrated to make a comprehensive visualization. This pairs with his last principle, that content counts most of all: the answer to his proposed question, “What are the content-reasoning tasks that this display is supposed to help with?”, is delivered through these information-rich visualizations.

I think this differs from Few’s principles, as Few, so far, has aimed at simplicity and readability. For example, Chapter 4 of Now You See It is focused almost entirely on simplicity and readability: things like sorting, scaling, and aggregating ensure that visualizations are presented to us in ways that are logical, intuitive, and easy to interpret. Tufte says, “Perhaps the numbers or data points may stand alone for a while, so we can get a clean look at the data, although techniques of layering and separation may simultaneously allow a clean look as well as bringing other information into the scene.” To Tufte, you may temporarily view some data in isolation, but he then suggests that layering techniques can let you view that data in the context of everything else anyway. I love beautiful representations of complex data, and the examples Tufte presents in Beautiful Evidence are stunning, but I can’t help feeling that it is sometimes simply better to isolate data. Tufte’s approach is sure to produce some fantastic visualizations, but parts of his principles seem suited to creating visualizations for the visual analytics crowd rather than quick, easy-to-grok visualizations for everyday users.

This graph is a good example of what I’m talking about, as it’s the champion of Tufte’s chapter “The Fundamental Principles of Analytical Design.” A lot of my classes so far have talked about the lack of rigor or willful ignorance displayed by end users of information: people generally won’t read emails, won’t read beyond the abstracts of papers, won’t follow instructions if they’re too long, the PEBCAK issue in IT, and so on. These apparently general human tendencies make me question how much effort we can expect end users to put into understanding a visualization. Can we expect them to read a paragraph explaining how the graph works before even looking at the data to understand the information encoded in it? How many won’t notice that these troop flows are superimposed on the topography of western Russia? What other information will be lost in translation?

Correlation in R

In order to use the corrgram package, I had to manually install TSP, registry, and dendextend. I’m not sure why that is, but if anyone has issues running corrgram(), take careful note of the errors in the console. This style of graph is exactly what I was imagining for my final project data: my thought was to have a split matrix to show multiple visualizations.
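If you hit the same missing-dependency errors, installing those packages by name cleared it for me (your list may differ):

# Dependencies I had to install manually before corrgram() would run
install.packages(c("TSP", "registry", "dendextend"))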

As you can see from this massive image, only half of the matrix is used. I’d like to render another set of visualizations in the other half.

This brings us to correlograms.  corrgram() makes it easy to generate a split visualization for multiple variables.  The graph below was produced simply with this:

library(corrgram)

corrgram(mtcars, order = TRUE, lower.panel = panel.shade,
         upper.panel = panel.pts, text.panel = panel.txt,
         main = "Car Milage Data in PC2/PC1 Order")

Using that same code, but with my data instead of mtcars, you get the chart that you see below (I even left the header). The first thing to notice is that the corrgram compares the variables to each other, whereas my big chart plots the value of one attribute (win_percentage) and uses two other attributes (Hero1 & Hero2) as the x and y axes.

The Hmisc and corrplot packages may have something more for me to utilize, but it looks like I will need to reshape my data, as Hmisc seems to only accept matrices.
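For example, here’s a hedged sketch of feeding data to Hmisc’s rcorr(), which expects a numeric matrix (mtcars stands in for my reshaped data):

library(Hmisc)

# rcorr() wants a numeric matrix, so coerce the data frame first
rc <- rcorr(as.matrix(mtcars[, c("mpg", "hp", "wt")]))
rc$r   # correlation coefficients
rc$P   # p-values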

Distribution Analysis

Here is a scatter plot by @KenSteif on Twitter that shows the distribution of the prices of LEGO sets. On top of the graph, Ken has superimposed a regression line to show the trend of the cost. From the graph you can see that the data is ungrouped and that there are a few outliers in the top-left and far-right areas. However, the scatterplot is generally homoscedastic.


library(ggplot2)

# Final-day weights from the built-in ChickWeight data set
df <- subset(ChickWeight, Time == max(Time))

ggplot(df, aes(x = weight, fill = Diet)) +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm,
                args = list(mean = mean(df$weight), sd = sd(df$weight)))

I tried to make a histogram out of the ChickWeight dataset by looking at the weights on the final day and coloring them by diet. However, adding the normal curve didn’t go well. I wanted the graph to display counts and have the normal curve scaled to an appropriate y value, but I didn’t make any significant progress. Most of the answers I found reset the y-axis to density and then plot the density curve. It really bugs me that I didn’t have a quick or easy solution to this given my experience with R. ggplot2 is a system that I know I need to work on, and I am actively trying to.
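For what it’s worth, one common workaround is to keep counts on the y-axis and scale the normal density by n times the bin width (count ≈ density × n × binwidth); a minimal sketch, with the bin width chosen arbitrarily:

library(ggplot2)

df <- subset(ChickWeight, Time == max(Time))
bw <- 30   # bin width; count = density * n * binwidth

ggplot(df, aes(x = weight, fill = Diet)) +
  geom_histogram(binwidth = bw) +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(df$weight), sd = sd(df$weight)) * nrow(df) * bw)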

Part to Whole and Ranking Analysis

Here’s a Pareto chart (made with this guide) that I made with a subset of data from this Kaggle dataset, which I’m using for another project. I’m working on understanding lookup tables so that I can label the x-axis with actual hero names instead of hero_ids, but that’s for another time. This chart is kind of hard to read, and it shows that the frequencies for these top 10 most frequently picked heroes are generally very close. In order to get the exact value for each bar, you have to do some math with the cumulative counts. I think it would be better with cumulative percentages labeled on the right-side y-axis, regular counts on the left side, and the chart rescaled so that the line chart isn’t so visually dominating; alternatively, making the chart wider would help a little. Furthermore, the header needs to be fixed to be relevant to the data. The RStudio guide was a good starting point, but it definitely could use some refinements.
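As a sketch of both fixes, here’s one hedged way to do the hero_id lookup and put cumulative percentages on a right-side axis. The picks frame is my hypothetical per-player data from the final project, and the cumulative percent here is of the top-10 subset only:

library(dplyr)
library(ggplot2)

heroes <- read.csv("hero_names.csv")   # hero_id, localized_name
hero_lookup <- setNames(heroes$localized_name, heroes$hero_id)

top10 <- picks %>%
  count(hero_id, sort = TRUE) %>%
  slice_head(n = 10) %>%
  mutate(hero = hero_lookup[as.character(hero_id)],
         cum_pct = cumsum(n) / sum(n))

scale_f <- max(top10$n)
ggplot(top10, aes(x = reorder(hero, -n))) +
  geom_col(aes(y = n)) +
  geom_line(aes(y = cum_pct * scale_f, group = 1)) +
  geom_point(aes(y = cum_pct * scale_f)) +
  scale_y_continuous("Pick count",
                     sec.axis = sec_axis(~ . / scale_f * 100,
                                         name = "Cumulative %")) +
  labs(x = NULL)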