Earlier last week, I published a post exploring heat maps for the Portland Thorns’ April 2016 matches, with a focus on Tobin Heath’s performance. Those heat maps are in an Excel spreadsheet, which you can download here.
In this post, I’ll summarize how they work, and how you can create your own for the matches for which we currently have location data. You’ll need a basic understanding of how to use R and how to modify Excel spreadsheets.
How to create your own heat maps
You will only be able to create heat maps for the matches in the WoSo Stats database for which we have location data. You see which ones they are by going to the database.csv file in the WoSo Stats GitHub repository and seeing which matches have a “yes” in the “location.data” column, which will mean they have complete location data for virtually every action that was logged.
If you’d like to help us get more location data logged for more of these matches and you’ve got a couple of hours to spare, you can help!
Getting the data
Anyways, first things first, open up R or R Studio or whatever you use to work in R, and run this code to source the “getting-data.R” and “create-location-stats-table.R” code. The first file will create a data frame in your working directory for the aforementioned database.csv file and a getMatchCsvFiles() function. The second file will create various functions, but the two we’ll be working with will be createMultiLocStatsTabs() and writeFiles().
Now it’s time to pick the matches you want with the getMatchCsvFiles function. This function has the following arguments:
1. competition.string: The name of the competition you want to analyze as it is written in the database’s “competition.string” column. This MUST match exactly what is written in the column, and this argument MUST be written. For the NWSL 2016, you’d write in competition.string = “nwsl-2016”. If you’d like to pick from every single match in the database, then just write in competition.string = “database”
2. The data range you’d like to pick, written as one of several possible arguments. You can pick a specific “round” (such as a week in NWSL play), a set of various “rounds” (such as multiple weeks in NWSL play), or a specific month. These arguments are the following:
- round: The “round” of the competition, written as round = “nameOfRound” For the NWSL 2016 season, “rounds” are weeks of the season; week 1, for example, would be written as round = “week-1”.
- multi_round: A vector of different “rounds” of a competition (for the NWSL this would be weeks), written as multi_round = c(“X”, “Y”, “Z”). If you wanted weeks 1 through 3, and week 4, you’d write this as multi_round = c(“week-1”, “week-2”, “week-3”, “week-4”).
- month_year: The month and year of the matches you’d like, written as MM_YY. For example, matches from May 2016 would be written as month_year = “05_2016”.
- For now, you can only pick one of these at a time. For example, you can only pick April 2016 matches or Week 1 through Week 3 matches, not all matches from Week 1 through Week 3 that happened in April 2016.
- You can also just leave this argument blank, in which case you’ll pull everything in the database, according to any further filters you set based on the next few arguments.
- team: This is optional. This is the abbreviation for the team whose matches are the only ones you want, written as team = “TeamAbbreviation”. The abbreviation is based on our list of abbreviations for club teams and based on FIFA’s country codes. Double-check the database to make sure the team you want is actually in our database – beyond the NWSL 2016 teams, we only have a bunch of international teams and one random PSG-Lyon match (as of this writing).
- location_complete: This is also optional, and is set to default as location_complete = FALSE. What that means is that, by default, you will get all matches, regardless of they have completed location data. For the purposes of this blog post, we will want to set this as location_complete = TRUE
Feel free to play around with this (and let me know if you run into any bugs), but here are some examples of how this function works:
To get all Sky Blue 2016 matches for which we have any data:
getMatchCsvFiles("nwsl-2016", team = "SBFC")
To get all Washington Spirit 2016 matches from the month of June, for which we have complete location data:
getMatchCsvFiles("nwsl-2016", month_year = "06_2016", team = "WAS", location_complete = TRUE)
To get all USWNT matches from 2016 SheBelieves cup, for which we have complete location data:
getMatchCsvFiles("shebelieves-cup-2016", team = "USA", location_complete = TRUE)
For this blog post, we’re going to focus on the code I ran to get all Portland Thorns matches from the first 3 weeks of the season. We already know we have location data for these matches, so specifying location_complete isn’t necessary; however, let’s specify it anyways just in case you weren’t sure.
getMatchCsvFiles("nwsl-2016", multi_round = c("week-1", "week-2", "week-3"), team = "PTFC", location_complete = TRUE)
You should now have a match_list list (a very large one, too) with 3 elements, one for each match spreadsheet, and a match_names vector with 3 elements, one for each matchup name.
Getting the location-based data
The next few steps are pretty simple. Call the createMultiLocStatsTabs() function; set the match_list argument as “match_list” and the match_stat argument as the stat you’re looking for (more on this in the next paragraph); and assign it to variable stats_list. This will create for each match a table with each player in one row and their location-based stats in the columns.
When calling this function, one of the arguments is the match_stat, which is the type of location-based stat you want. As of this writing, you can only get 11 different location-based stats, listed below with the string you need to write in as the argument shown in parentheses. If you wanted to get the largest table possible with columns for each stat (this creates a table with 181 columns), just write match_stat = everything
Or, assign one of these to the match_stat argument:
1. Attempted pases (attempted-passes)
2. Completed passes (completed-passes)
3. Passing completion percentage (pass-comp-pct)
4. Take ons won (take-ons-won)
5. Take ons lost (take-ons-lost)
6. Aerial duels won (aerial_duels-won)
7. Aerial duels lost (aerial-duels-lost)
8. Tackles (tackles)
9. Dispossessions of Opp (opp-dispossess)
10. Opp Poss Disrupted (opp-poss-disrupted)
11. Pressure/Challenges (pressure)
12. Recoveries (recoveries)
13. Interceptions (interceptions)
14. Blocks (blocks)
15. Clearances (clearances)
16. Opp Ball Disrupted (opp-ball-disrupted)
For the set of Portland Thorns matches we are working with, this is the code we would run:
createMultiLocStatsTabs(match_list, match_stat = "everything")
Once you run this, you’ll have a list assigned to the variable stats_list that will have a stats table for each of the three Portland Thorns matches.
Then, write these stats tables as .csv files in your working directory, by running the following. Each stats table’s file name will be determined by the match_stat, which you have to specify again (the data won’t be affected, so you could really name this whatever you want) and by the string values in the match_names vector that was created when we ran the getMatchCsvFiles() function .
writeFiles(stats_list, match_names = match_names, match_stat = "everything")
Run this and, staying with our Portland Thorns April 2016 example, your working directory will now have three .csv files.
To review, here is all of the code that was run since the beginning of this blog post to create the three .csv files that are now in your working directory (the code can also be found here:
source("https://raw.githubusercontent.com/amj2012/wosostats/master/code/version-2/getting-data.R") source("https://raw.githubusercontent.com/amj2012/wosostats/master/code/version-2/create-location-stats-table.R") getMatchCsvFiles("nwsl-2016", multi_round = c("week-1", "week-2", "week-3"), team = "PTFC", location_complete = TRUE) createMultiLocStatsTabs(match_list, match_stat = "everything") writeFiles(stats_list, match_names = match_names, match_stat = "everything")
Create the heat maps
Now the tricky part: creating the heat maps with the data in the .csv files. First, download the Excel template for the heat maps (click on “View Raw” to download), which is just the Portland Thorns April 2016 heat maps, and open it.
Let’s pretend we had this Excel spreadsheet but without the data that’s shown to the right of the heat map, starting with the PTFC-ORL match. Highlight everything in columns “AC” through “HA” from row 1 down to row 29 (as shown in the images below) and clear the contents (DO NOT delete the columns, though).
The heat map will be blank, regardless of what you write into the “Enter name here:” and “Enter stat here:” cells, and the stat info to the right of the heat map and below the cells where you enter the Player and Stat you want will either be zeroes of NAs. This means that the formulas in all those different cells, including the ones that make up the heat map, are looking for data in those columns that we just cleared, but it’s calculating nothing but blanks and errors as there’s nothing there anymore, for now.
Let’s say we didn’t want to re-create the PTFC-ORL match, but instead wanted to use that space we cleared for a BOS-PTFC heat map. Open the “BOS-PTFC-everything.csv” file that you created in your working directory (the following will only work with the “everything” versions of the stats tables) and highlight only the cells that aren’t blank (for this match it’s 27 player rows plus the header row for a total of 29 rows, times the 181 columns, for a total of 5,068 cells you have to highlight). This will look like this in Excel.
Copy those highlighted cells and paste them into the cell at row 1 and column AC, which will fill in the space that was previously taken up by the PTFC-ORL stats. But wait, you’re not done yet! One thing is left to correct, and that’s the team totals.
See that “PTFC” and “ORL” row of numbers in the lower right below the player stats? Those are the total for those stats for each column, which are referred to when creating heat maps for an overall team view. I like to keep the home team on top, so in this example, change “PTFC” in cell AC31 to “BOS” and change “ORL” in cell AC32 to “PTFC”. Then, highlight cells ADH31 (where the totals start) through cell HA31 and search and replace “PTFC” with “BOS”; this changes the formulas in each cell so that they’re now looking for stats from the right team. Do the same for the rows below, searching and replacing “ORL” with “PTFC”. The totals should now be correct.
Finally, in cell Y17 under “Name entered is a team?” is a formula that reads what’s being written into the “Enter name here:” cell and determines, based on the team abbreviations you’ve given it in an OR() formula, if it’s a team that’s been input. Right now this formula is still looking “ORL” as one of the two teams. Change the cell contents from =OR(B5=”PTFC”,B5=”ORL”) to =OR(B5=”PTFC”,B5=”BOS”).
And you’re done! The heat map should work now.
- It’s easiest to create the heat maps with the template I’ve provided if you had created stats tables with match_stat set as “everything.” I added in the option to create smaller stats tables for the future when the “everything” version of a stats table is far, far bigger than 181 columns. For now, though, it makes more sense to work with the “everything” stats tables as 181-column spreadsheets shouldn’t slow down your computer.
- You can ignore the passing percentage heat maps for the overall team views, as those are the sum of the percentages for each player. I haven’t yet figured out a way to get the average for the percentages that can account for whether a 0.0 passing pct is there because there were no attempts at all.
- Creating an “NWSL Season” heat map for “per 90 minute” stats. This would be calculating from a table close to 200 rows long, and would also require creating that table somehow.
- Read more about how to create heat maps for “per 90” stats here.
Made it this far? Maybe you can help us out a little more. We need help logging this data. This data only happens because of fans like you who have put hours of their free time into logging data onto Excel spreadsheets. But we need more people helping out, as right now we are very low on volunteers and will be lucky to finish the 2016 season by the time the 2017 season even starts! If you’re interested, read more here about how to help and either send a DM on Twitter to @WoSoStats or email me at firstname.lastname@example.org to get started. All it takes is a couple of hours of your free time, a willingness to learn, and knowing a thing or two about Excel.