Putting this project on hiatus

Hi everyone,

After much reflection, I have decided to put this WoSo Stats project on indefinite hiatus.

There are a few minor things related to ongoing tasks that I will finish in the next few weeks such as adding data for some older matches that are in the middle of being logged. After that, I will no longer be logging any other matches; looking for new volunteers; creating new posts, data, or visualizations; or creating new code to analyze our match data. All the data currently in the WoSo Stats GitHub repository and the WoSo Stats Shiny app will remain available to anyone for free – as it always has been and always will be. In the future if anyone has questions about the data we’ve logged, I will still be reachable at wosostats.team@gmail.com.

This was not an easy decision for me, but it was a necessary one. I had more free time in the past to dedicate to this project. There was the training and keeping up with volunteers, the developing of the match-logging workflow, the developing of the code for extracting data from the match spreadsheets, the maintenance and updating of documentation on the GitHub repo, the creation of content that went up on this blog and on the Twitter account, and the actual logging of matches. Unfortunately, after trying to convince myself otherwise for the longest time, there is now much less free time in my personal life than when I first started this project, due to far more pressing and important matters in my life, and it was not going to be enough to focus on even one of those tasks mentioned above. So, instead of doing this half-assedly and dragging my emotional wellbeing down by wondering when I was going to be able to get to the next thing I needed to do for this project, I’m going to let it go and give myself a break.

I’ve been humbled by the innumerable hours of work dozens of volunteers put into this project over the past 2+ years, and I’m incredibly proud of the work we’ve been able to do. We did something no one else had done for women’s soccer, and made it free and publicly accessible. We logged data and managed to extract advanced stats out of, by my count, 151 matches, including the entire NWSL 2016 season. Without the volunteers, none of that would have been possible, and for that I am deeply grateful.

I might return to this project some time in the future. Hopefully, in time, the sport will be big enough that this won’t be needed as the only public resource for women’s soccer advanced stats. I started following this sport seven years ago thanks to the 2011 Women’s World Cup (and that’s another story…), and for someone who grew up with soccer it’s been like watching the sport as a little kid all over again. Every year, the sport keeps growing, and my hope is that it’ll one day give back to its players, coaches, staff, fans, and writers much like it has for the men’s game. I hope that, with each passing year, fewer and fewer people are getting left out of this beautiful game. I’ll be watching.

-Alfredo

Advanced Passing Stats: USWNT vs. Germany – SheBelieves Cup 2018

For this summary of passing stats from the USA-Germany SheBelieves 2018 match, I’m only going to look at open play passes. Open play passes exclude passes from dead ball scenarios – throw-ins, free kick passes, goal kicks, and corner kick passes are all discounted.

Formations

USA-433

The United States lined up in a 4-3-3, with O’Hara and Smith as the fullbacks; Davidson and Dahlkemper as the centerbacks; Ertz as the defensive midfielder; Horan and Lloyd as the two other center midfielders; Rapinoe as the left forward; Morgan as the center forward; and Pugh as the right forward.

Germany lined up in a 4-2-3-1, with Faisst and Maier as the fullbacks; Peter and Hendrich as the centerbacks; a midfield trio of Kemme, Dabritz, and Marozsan; Dallmann as the left winger; Popp as the center forward; and Huth as the right winger.

GER-4231

Germany’s midfield, and even Popp’s role, was fluid throughout the match, with Kemme’s role being the most solidified as a defensive midfielder for most of the game (until she played as a fullback later in the match). Dabritz and Marozsan would often switch roles, with Popp dropping deep several times.

The Centerbacks

The two USA centerbacks – Davidson and Dahlkemper – had very high open play passing completion percentages and a high number of open play passes attempted. Davidson finished with the highest open play passing completion percentage of the game (minimum 10 passes) at 92.6%. Dahlkemper had a lower open play passing completion percentage, 82.4%, but a larger share of her open play passes came under pressure – 17.6% compared to 7.4% for Davidson. Whether that was because passes played to Dahlkemper arrived while she was already under pressure, or because German players closed her down before she could get a pass off, requires further analysis. Sonnett, in her 10 minutes on the field, did not register an open play pass attempt.

The three German centerbacks – Hendrich, Peter, and Goessling – each had high passing completion percentages and a higher percentage of their passes going forward. Hendrich, Peter, and Goessling’s open play pass attempts went forward 64.4%, 53.6%, and 73.7% of the time, respectively, compared to Dahlkemper’s 58.8% and Davidson’s 44.4%. They each were also under pressure much more often.

Open play passing stats for USA & GER centerbacks

The Fullbacks

Taylor Smith – matched up on the right wing against Germany’s Dallmann and Faisst – found her open play pass attempts under pressure more often than O’Hara, 68.4% of the time compared to 35.7%. O’Hara – matched up on the left against Germany’s Huth and Maier – had a higher passing completion percentage of 78.6% compared to Smith’s 73.7%. The two combined for 3 open play cross attempts that were not completed. Short only registered 5 open play pass attempts in her 16 minutes on the field.

As for the Germans, the starting fullbacks were Faisst and Maier, with Kemme playing as a rightback late in the match. However, because I’m unable (for now) to split Kemme’s stats as a midfielder from her stats as a fullback, I’ll treat her as a midfielder later on. Compared to their American counterparts, Faisst and Maier were more involved in the German passing game, with 37.5 and 37.4 open play passes attempted per 90 minutes, respectively, compared to O’Hara’s 31.5 and Smith’s 17.8. Their completion percentages were both lower, though, with Faisst completing 72.5% of her open play passes and Maier completing the lowest share of any fullback, at 65.6%. The two combined for 4 open play cross attempts, which, just like O’Hara’s and Smith’s, and likely thanks to the strong winds that night, went nowhere.
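The "per 90 minutes" figures used throughout are a raw count scaled to a full match. Here is a minimal sketch in R of that scaling; the helper name `per90` is made up for illustration and is not a function from the WoSo Stats code. The example numbers are Short's 5 open play pass attempts in 16 minutes, mentioned above.

```r
# Scale a raw count to a per-90-minutes rate.
# `per90` is a hypothetical helper, not part of the WoSo Stats repo.
per90 <- function(count, minutes) {
  stopifnot(all(minutes > 0))
  count / minutes * 90
}

# Short: 5 open play pass attempts in 16 minutes (figures from the text).
short_per90 <- per90(5, 16)
short_per90
```

That works out to roughly 28 attempts per 90, which shows why short stints can produce large per-90 numbers from only a handful of passes.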

Open play passing stats for USA & GER fullbacks

The Midfielders

Ertz’s passing game was stellar in the midfield, with 28.8 open play passes attempted per 90 minutes and a 91.3% completion percentage – the second-highest in the game. The two other USA midfielders with significant open play passing numbers (at least 10 attempts), Horan and Lloyd, had lower passing completion percentages (75.6% and 78.9%, respectively), but were also under pressure far more than Ertz (61.0% and 52.6% of all open play pass attempts, respectively, compared to Ertz’s 39.1%) due to their higher position up the field.

For the Germans, Dabritz and Magull stood out for their high open play passing completion percentages, 86.1% and 86.7% respectively. Magull only played for 27 minutes but finished with 50.0 open play passes attempted per 90 mins, the highest in the game. Marozsan was the most involved in Germany’s passing game throughout the entire game, with the most open play passes attempted, 47, out of anyone on the field, although she finished with a completion percentage of only 76.6%. Kemme, meanwhile, struggled with an open play passing completion percentage of only 65.9%.

Open play passing stats for USA & GER midfielders

The Wingers & Forwards

I considered Rapinoe and Pugh more as forward wingers, and Huth and Dallmann more as midfield wingers in a slightly deeper role, but I figured it would be worthwhile combining the two roles together in this part, including Alex Morgan, too, who primarily was a center forward for the entire match.

Out of all the wingers, Rapinoe’s open play pass attempts were under the most pressure, at 68.2%, compared to everyone else who was between 62% and 64%. Her passing game, matched up against Maier, struggled even more, completing only 58.3% of her open play pass attempts, compared to Pugh on the other side who completed 88.0%. Both Rapinoe and Pugh attempted a similar number of open play passes per 90 mins (23.0 and 24.7) and a similar number of crosses completed/attempted (1/2).

Huth’s passing game struggled like Rapinoe’s, completing only 58.3% of her open play pass attempts, compared to Dallmann’s 72.4%. Dallmann had a slightly higher number of open play passes attempted per 90 mins, at 37.8 compared to Huth’s 33.8. The large difference in completion percentages can partially be explained by Huth’s persistent yet ineffective crossing game, completing only 1 cross attempt out of 7, compared to Dallmann’s 1 cross completion out of only 2.

Finally, the two forwards, Morgan and Popp, had the highest percentage of open play pass attempts under pressure out of anyone in the game, at 79.3% and 76.3%, respectively. Popp attempted more passes, 38, compared to Morgan’s 29, and finished with a significantly higher completion percentage of 76.3% compared to 69.0%. Popp, however, was not as fixed in her role as the center forward as Morgan was, dropping back into her half several times to help defend and receive the ball. Morgan’s more constant presence higher up the field might be reflected in her percentage of pass attempts that went backwards – 41.4%, the second-highest in the game to Dabritz – suggesting numerous instances where she was holding up the ball and dropping it back for a teammate facing the German goal.

Open play passing stats for USA & GER wingers and forwards

 

Passing networks for the Portland Thorns’ 2016 season

We have all 21 of the Portland Thorns’ 2016 matches logged in match spreadsheets like these, and where they include location data I’ve already combed through them for some valuable insights. I wanted to see how much more passing data I could get out of these match spreadsheets, even without any location data, which only a few of our matches have. Below is a quick look at the numbers for a sort of “passing network”, but without the graphics and lines and instead with just tables and some useful formatting.

The R code I created to generate a table of shared passes and a table of shared minutes is here on the WoSo Stats GitHub repo. There are comments in the code that are hopefully enough to explain how it works, but I’ll delve into that in greater detail in a future blog post. For now, let’s look at that data for the Portland Thorns to get a better look at how the ball was being passed around.

The Excel spreadsheet shown below, based on tables you can create from the R code mentioned above, can be downloaded here from the WoSo Stats GitHub repo (click on the “Download” button).

I’m just going to briefly go over what we see when we use some Excel formulas and conditional formatting, and what it can quickly tell us about how the Thorns were passing the ball around.

Screen Shot 2017-05-08 at 9.04.07 PM

Here’s the first sheet of the Excel workbook, and there are a few important things to understand that will be true for the following sheets as well.

First of all, the rows are the players passing the ball, and the columns are the players receiving a completed pass. So, let’s look at the bottom-left cell. “Weber” is the row, so she’s the player passing the ball, and “Betos” is the column, so she’s the player receiving the pass; therefore, that cell represents the number of passes that Weber completed to Betos during the entire 2016 season. So, just one.
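To make the row/column convention concrete, here is a minimal sketch in R of how a passer-by-receiver table could be built from a log of completed passes. This is not the code from the repo; the `passes` data frame, its column names, and its contents are made up purely for illustration.

```r
# Made-up log of completed passes: one row per completed pass.
# Column names and values are illustrative, not real match data.
passes <- data.frame(
  passer   = c("Weber", "Betos", "Betos", "Menges", "Betos"),
  receiver = c("Betos", "Menges", "Menges", "Betos", "Weber"),
  stringsAsFactors = FALSE
)

players <- sort(unique(c(passes$passer, passes$receiver)))

# Rows are passers, columns are receivers. Using factors with a shared
# set of levels keeps pairings with zero passes in the table.
shared_passes <- table(
  passer   = factor(passes$passer,   levels = players),
  receiver = factor(passes$receiver, levels = players)
)

shared_passes["Weber", "Betos"]  # completed passes from Weber to Betos
```

Reading a cell as `shared_passes[passer, receiver]` mirrors reading the spreadsheet by row first, then by column.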

Second, each cell only represents completed passes. This is extremely important, because we’re missing out on data about how many times a player was actually targeted by another teammate. This data is missing because, well, it can get extremely hard, if not outright impossible, to determine both from looking at the match spreadsheet and even during a match where a missed/blocked/cleared/intercepted pass was supposed to go. Maybe in the future we, or someone else, can go back through all these matches or future matches and figure out how to do that, but for now we’re going to have to go without that. But at the very least understand that these passing numbers only represent completed passes. So, remember that value of 1 that was where the “Weber” row meets the “Betos” column? For all we know, maybe Weber tried passing the ball back to Betos another 10 times and they were missed (probably not, because forwards usually aren’t passing the ball back to their goalie that much, but you get the idea).

Finally, the darker the green, the higher the value of the cell, just in case it isn’t obvious. The whiter the cell, the closer to zero it is. The darker the cell, the closer to the highest value it is.

Okay, now that we’ve got all that out of the way, what’s going on here? There are some extremely dark pockets in this spreadsheet, but they’re not taking into account the fact that some players were on the field together way more than other pairings. Take Amandine Henry, for example: she finished the season with 48.4 passes attempted per 90 minutes and 38.3 passes completed per 90 minutes, but her row and column of shared passes are way lighter than those of other Thorns players, simply because those players logged more minutes and had more time to pass to each other.

We need another table that has the number of minutes a player shared with each teammate, which is below. Writing up the code to generate this was a pain in the ass, so please admire it just for a few seconds.

Screen Shot 2017-05-08 at 9.14.44 PM

This table is diagonally symmetric and, for the purposes of this analysis, will mainly be used to calculate the per 90 passing numbers below.
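For a sense of how a shared-minutes table can be computed, here is a minimal sketch in R for a single match. It is not the code from the repo: it assumes each player has one continuous stint described by hypothetical `on` and `off` minutes, and the three players are placeholders.

```r
# Hypothetical lineup for one match: the minute each player came on and went off.
lineup <- data.frame(
  player = c("A", "B", "C"),
  on     = c(0, 0, 60),
  off    = c(90, 60, 90),
  stringsAsFactors = FALSE
)

# Shared minutes for a pair = length of the overlap of their two stints.
start  <- outer(lineup$on,  lineup$on,  pmax)  # later of the two entry times
end    <- outer(lineup$off, lineup$off, pmin)  # earlier of the two exit times
shared <- pmax(end - start, 0)                 # no overlap counts as zero
dimnames(shared) <- list(lineup$player, lineup$player)

shared
```

Because the overlap of A with B is the same as the overlap of B with A, the result is symmetric across the diagonal, and the diagonal holds each player's own minutes. Season totals would then be the sum of these matrices across matches, provided every matrix lists the same players in the same order.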

Screen Shot 2017-05-08 at 9.23.35 PM.png

You may have noticed the following players are missing: Berryhill, Lofton, Pratt, Skogerboe, Williamson, and Fitzgerald. This is because for this spreadsheet I hid the columns for players who never were on the field with any teammates for 270 or more minutes. This is to exclude any extremely high passing per 90 numbers that may show up merely because a few passes were exchanged during very limited minutes.

So, now we’re looking at, for lack of a better term, the “passes completed by the row player to the column player per 90 minutes.” Remember that “Weber to Betos” cell we were looking at, the one in the bottom left? Now it reads as 0.13 passes completed by Weber to Betos every 90 minutes.

I also added each player’s overall passing completion percentage for the season at the end of each row and column, and the black lines are meant to block out different position players. Finally, the grey boxes are values for pairings who shared fewer than 270 minutes on the field. For example, look back to Weber – she was on the field with Betos for at least 270 minutes, so that 0.13 value appears, but she was only on the field with Franch for 91 minutes, so that cell value gets greyed out.
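The per-90 table and the greying-out rule can be sketched in a few lines of R. This is an illustration under stated assumptions, not the spreadsheet's actual formulas: the 692 shared minutes for Weber and Betos is back-calculated from the 0.13 figure rather than taken from the data, and the single Weber-to-Franch pass is a placeholder. Only the 1 completed pass to Betos, the 91 minutes with Franch, and the 270-minute cutoff come from the text.

```r
players <- c("Betos", "Franch", "Weber")
passes  <- matrix(0, 3, 3, dimnames = list(players, players))
minutes <- matrix(0, 3, 3, dimnames = list(players, players))

passes["Weber", "Betos"]  <- 1  # from the text
passes["Weber", "Franch"] <- 1  # placeholder value

# Shared minutes are symmetric, so fill both triangles.
minutes["Weber", "Betos"]  <- minutes["Betos", "Weber"]  <- 692  # assumed
minutes["Weber", "Franch"] <- minutes["Franch", "Weber"] <- 91   # from the text

# Completed passes from row player to column player per 90 shared minutes.
per90 <- passes / minutes * 90

# Blank out pairings with fewer than 270 shared minutes (the grey boxes).
per90[minutes < 270] <- NA

round(per90["Weber", "Betos"], 2)  # 0.13
per90["Weber", "Franch"]           # NA
```

Masking with `NA` instead of zero keeps "not enough minutes to say" distinct from "shared plenty of minutes but never completed a pass".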

There’s a lot to dig into here, but one thing I like looking at is how defenders move the ball to the midfielders, how midfielders move the ball to the forwards, and how the goalkeepers and defenders try to get straight to the forwards. By looking at the defender rows, it looks like Klingenberg-to-Heath and Klingenberg-to-Horan are by far the most fruitful defender-to-midfielder passing relationships. The only other defender-to-midfielder relationship that happens as much is Sonnett-to-Henry, and keep in mind Henry only played half the season.

In the midfielder rows, where they meet the forward columns, there are fewer dark colors because it’s just harder to pass the ball to the forwards, so that section of the table is just naturally going to be a lighter shade most of the time. One stat that stands out to me is the high number of passes Shim completed to Raso, 5.08, higher than any other midfielder-to-forward combo, especially considering they were only on the field together for 536 minutes.

Now, let’s look at this table with the highlighting done a little differently. Below are the same numbers as above, but with each row highlighted individually.

Screen Shot 2017-05-08 at 9.43.30 PM.png

Look at the Betos row, for starters. The highest value in that row is the 7.19 completed passes to Menges, so that’s going to be the darkest cell in the row. Meanwhile, the lowest value, 0.13 completed passes to Weber, is the whitest cell. A few rows down, Sonnett’s highest value of 5.62 completed passes to Betos is the darkest cell, while her 0.85 completed passes to Heath is the lowest.
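Highlighting each row on its own scale amounts to rescaling every row so its lowest value maps to 0 (white) and its highest to 1 (darkest). Here is a minimal sketch of that idea in R; Excel's conditional formatting does this internally, so the function below is only an illustration, and `scale_rows` is a made-up name. In the toy matrix, only 7.19, 0.13, and 0.85 come from the text; the other values are placeholders.

```r
# Rescale each row of a numeric matrix to the range [0, 1].
# `scale_rows` is a hypothetical helper for illustration.
scale_rows <- function(m) {
  scaled <- apply(m, 1, function(x) {
    rng <- range(x, na.rm = TRUE)
    if (rng[1] == rng[2]) return(rep(0, length(x)))  # flat row: no contrast
    (x - rng[1]) / (rng[2] - rng[1])
  })
  t(scaled)  # apply() returns rows as columns, so transpose back
}

m <- rbind(
  Betos   = c(7.19, 2.00, 0.13),  # 2.00 is a placeholder
  Sonnett = c(3.00, 0.85, 1.50)   # 3.00 and 1.50 are placeholders
)
colnames(m) <- c("Menges", "Heath", "Weber")

shade <- scale_rows(m)
shade["Betos", "Menges"]  # 1, the darkest cell in the Betos row
shade["Betos", "Weber"]   # 0, the whitest cell in the Betos row
```

Scaling columns instead of rows can reuse the same helper as `t(scale_rows(t(m)))`.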

This table will probably make the most sense if you look at the columns and look for which players have a high number of very dark cells. Menges appears to have been a very frequent passing target for almost every defender. Heath and Henry had a relatively high number of completed passes from defenders, midfielders, and forwards. Nadim had a high number of completed passes from midfielders and other forwards, and Sinclair looks like she was deeper down the field and had a relatively high number of completed passes from midfielders and defenders.

Finally, let’s look at this highlighting flipped around. Now, each column’s highest values are highlighted.

Screen Shot 2017-05-08 at 9.58.20 PM.png

Take a quick look at the rows and see which players were more likely to be the origin of a completed pass. Klingenberg, across the board from goalkeepers all the way up to forwards, appears to have been the origin of a relatively high number of completed passes for many teammates. Farther down the table, Allie Long and Amandine Henry were the origin for a great deal of completed passes for several defenders, midfielders and forwards.

There’s more to dig into here, especially when we compare these raw numbers to another team’s passing network. There are three other ones I’ve created for the Seattle Reign, Western New York Flash, and the Houston Dash that can be found here on the WoSo Stats GitHub repo. In a later blog post, I’ll compare these to each other to see just how wildly differently teams can pass the ball around. For now, I hope you’ve enjoyed seeing the rich insights into passing relationships we can glean from the data we’ve got.

Morgan Brian and Sarah Killion: Using stats to differentiate midfielders

Two weeks ago, I touched a bit on open play passing stats for Ali Krieger by breaking down attempts and completion percentage by thirds of the field. Since then, I challenged myself to see how much I could dig into passing stats to try to find some differences between two players who on the face of it look very similar – Morgan Brian and Sarah Killion. They’ve both played primarily as defensive midfielders, they both pass the ball a similar amount of times, and they have almost the same passing completion percentage.

The following data is also only for 40 out of 103 NWSL 2016 matches that we’ve logged with complete location data. To see the list of matches this data represents, see the database in the WoSo Stats Github and look for all the matches with “yes” in the “location.complete” column.

As you read through the post below, please consider that this data is only possible thanks to hard work from fans like you who have been logging matches over the past year. The WoSo Stats project needs your help to log more stats and location data for the NWSL 2016 season, for USWNT matches, and beyond. The more data we get, the better we’ll be able to understand the sport. If you’re interested in logging data for matches, read more here and email me at wosostats.team@gmail.com or send me a DM at @WoSoStats on Twitter. All the data logged will be publicly available on the WoSo Stats Github repo.

Getting the passing stats

If you’re not interested in the coding aspect of this or how to get this data yourself, feel free to skip ahead to the next section. All the data used is available to download from this Tableau visualization.

The instructions for how to use the creating-stats.R file are here in the WoSo Stats Github repo. If you’re familiar with R, first things first, source this R file and then run the getStatsInBulk function with the arguments shown below:

your_stats_list <- getStatsInBulk(competition.slug = "nwsl-2016", location = "thirds", location_complete = TRUE, section = "passing")

This will take about a minute. Then run the mergeStatsList function with the following arguments to get the stats table as a data frame named “your_stats”:

your_stats <- mergeStatsList(stats_list = your_stats_list, add_per90 = TRUE, location = "thirds", section = "passing")

The resulting table has columns for open play passes, which are labeled “opPass” in the column names. Open play passes are defined as all passes that aren’t one of the following dead ball plays:

  • Throw-ins
  • Corner kicks
  • Goal kicks
  • Free kicks
  • Drop kicks or throws by the goalkeeper
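As a rough illustration of that definition, here is a minimal sketch in R that filters dead ball plays out of a pass log before computing attempts and completion percentage. The `events` data frame, its column names, and the play-type labels are all made up for this example; the real match spreadsheets and the creating-stats.R code use their own structure.

```r
# Made-up pass log; column names and labels are illustrative only.
events <- data.frame(
  play_type = c("pass", "throw-in", "corner kick", "pass",
                "goal kick", "free kick", "pass"),
  completed = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Dead ball plays that do not count as open play passes.
dead_ball <- c("throw-in", "corner kick", "goal kick", "free kick",
               "goalkeeper drop kick", "goalkeeper throw")

open_play <- events[!(events$play_type %in% dead_ball), ]

op_attempts   <- nrow(open_play)            # open play passes attempted
op_completion <- mean(open_play$completed)  # share of those completed
```

With this toy log, three of the seven passes are open play passes and two of those three were completed.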

A change from previous posts is the “section” argument. Instead of creating a massive stat table with all sorts of stats you may not be interested in, you can now just create a stats table for a specific type of stats (attacking, passing, possession, defense, goalkeeping). For this analysis, we’ll only need to look at passing stats, so we can just assign “passing” to the section argument.

The “your_stats” data frame is the stats table that is behind the Tableau visualization that has all the charts shown below. The Tableau viz was created with Tableau Public, and you should be able to download it yourself. For now, let’s have a look at the data.

Overall Passing Stats

For starters, let’s look at how Brian and Killion look if we just look at two very basic stats – open play passes attempted per 90 and open play passing completion percentage, sorted by total open play passes attempted per 90.

Screen Shot 2017-03-12 at 4.15.54 PM

Both Brian and Killion have nearly the same stats. Brian has 52.1 open play passes attempted per 90 minutes with an 82.3% passing completion percentage. Killion has 53.3 open play passes attempted per 90 minutes with an 84.7% passing completion percentage.

There’s a lot that could be happening deeper underneath those stats, so let’s look at that bar chart, broken down by open play passes attempted per 90 for each third of the field (defensive, middle, and attacking). Here, we begin to see some differences in where Brian and Killion’s passes are happening, and some big similarities as well compared to the players around them.

Screen Shot 2017-03-12 at 4.03.00 PM.png

Killion, per 90 minutes, attempts a couple more open play passes in the middle 3rd. Brian, meanwhile, per 90 minutes, has a few more open play passes in the attacking 3rd. Brian seems slightly more attacking-minded and Killion attempts more of her passes out of the midfield. Killion, quite simply, across the matches we have with location data logged, attempts more open play passes out of the middle 3rd of the field, per 90 minutes, than anyone else in the league.

Compared to almost every other player visible here, they pass the ball in open play out of the middle 3rd more often; the only exception is Barnes, who is ahead of Brian but not Killion. They both have a very high percentage of their passes coming out of the midfield.

Now, what about the passing percentages? Below is a chart stacking, for each player, their open play passing completion percentages in each third of the field. Almost everyone’s passing completion percentage drops as they get closer to the opponent’s goal, so here relative differences are what’s interesting to look at.

Screen Shot 2017-03-12 at 4.37.37 PM

Recall that Killion had more open play passes attempted out of the middle 3rd. Now we can see that she also has a significantly higher passing completion out of the middle 3rd, 85.7%, than Brian – and almost everyone else in this list of top-16 most open play passes attempted per 90, except for Little, who has an astonishing 90.1%, and Fletcher, with whom she’s tied.

Brian, on the other hand, has a significantly higher passing completion percentage out of the attacking 3rd, 77.5% and nearly 12 points higher than Killion – and also tied with Buczkowski for highest out of everyone visible here. Do the math against Brian’s 8.2 open play passes attempted per 90 out of the attacking 3rd, and she’s good for at least 6 completed passes in that third of the field for any given game.
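Spelled out in R, using only the two figures quoted above for Brian in the attacking 3rd (the variable names are just for illustration):

```r
# Brian, attacking 3rd, figures from the text.
brian_attempts_per90 <- 8.2    # open play passes attempted per 90
brian_completion     <- 0.775  # open play passing completion percentage

# Expected completed open play passes per 90 in the attacking 3rd.
brian_completed_per90 <- brian_attempts_per90 * brian_completion
brian_completed_per90  # a little over 6.3
```

This is an average rate across the logged matches, so "at least 6 in any given game" is best read as a typical figure and not a guarantee.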

We’ll break down these middle 3rd and attacking 3rd passes further by breaking them down in two different ways – by the direction of the pass (backwards, sideways, or forwards) and by how many were through balls, launch balls, or crosses. That’ll help us better understand what might be behind the differences in passing percentages and how they might differ in the types of passes they attempt.

Open Play Passes by Direction

Below are bar charts now for only Killion and Brian, showing the percentage of their open play pass attempts that went forward, sideways, and backwards, for each third of the field.

Screen Shot 2017-03-12 at 6.23.57 PM

Brian and Killion have virtually the same distribution of open play passes by direction in the middle 3rd, so any differences we can glean from our stats aren’t quite going to be found here. Killion’s open play passing direction in the attacking 3rd, however, is massively different. 71% of her open play passing attempts in the attacking 3rd are going forward, compared to Brian’s 40%. It’s not clear yet, although it might be a smart guess, if these forward pass attempts are what’s bringing down her passing completion percentage. Also recall that this represents about 5.4 and 8.2 open play pass attempts per 90 in the attacking 3rd for Killion and Brian, respectively. Do the math and this means that, even with fewer attempts in the attacking 3rd, Killion comes out at about 3.8 forward open play pass attempts per 90 compared to Brian’s 3.3. It’s a difference of 1 more forward pass attempt every other game for Killion.
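Here is that arithmetic as a short R sketch, using only the attempts per 90 and forward shares quoted in the paragraph above (the variable names are just for illustration):

```r
# Open play pass attempts per 90 in the attacking 3rd (from the text).
att_third <- c(Killion = 5.4, Brian = 8.2)

# Share of those attempts that went forward (from the text).
fwd_share <- c(Killion = 0.71, Brian = 0.40)

# Forward open play pass attempts per 90 in the attacking 3rd.
fwd_attempts <- att_third * fwd_share
fwd_attempts

# Gap between the two players, per 90 minutes.
diff_per90 <- fwd_attempts[["Killion"]] - fwd_attempts[["Brian"]]
diff_per90
```

The gap comes out to a bit more than half a forward attempt per 90, which is where "1 more forward pass attempt every other game" comes from.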

Numbers for attempts by direction are good and give insight into how Brian and Killion are trying to move the ball around but we also have data on passing completion percentages. Below are bar charts breaking down open play pass attempts by direction in the middle 3rd. Each pair of bar charts is for a different direction – backwards, sideways, and forward. The red is incomplete pass attempts, and the orange is complete pass attempts.

Screen Shot 2017-03-12 at 6.39.06 PM.png

Recall that Killion had a couple more pass attempts per 90 in this third of the field, and a significantly higher passing completion percentage, but as far as distribution of direction of passes (the previous chart) they were both very similar. Now Killion and Brian have very similar numbers of passes completed per 90 minutes for backwards and sideways passes, but there’s a significant change for forward passes. Killion is good for almost 3 more completed forward passes in the middle third.

Now let’s look at this same chart, but for the attacking 3rd where there were big differences in the distribution of passes by direction and where Brian had a significantly higher passing completion percentage.

Screen Shot 2017-03-12 at 6.48.31 PM.png

The differences in completed passes are barely above 1, but they do add up, especially considering the total number of pass attempts in this third of the field for both players are in the single digits. So that difference of 0.9 more forward pass incompletions per 90 isn’t massive, but it is chipping away at Killion’s passing completion percentage.

At this point it’s worth noting that the past few charts mean different things depending on how much a “forward pass completion,” a higher “passing completion percentage,” or more “pass attempts per 90” means to you. It intuitively seems to make sense that more of each is good, but with these two players they’ve each had higher numbers in different areas – no one appears to be significantly higher across all stats. Killion in the middle 3rd has a few more forward passes completed, a higher completion percentage, and more open play pass attempts per 90. Brian in the attacking 3rd, however, has slightly more forward passes completed, a higher completion percentage, and more open play pass attempts per 90. If you’re going to get into a discussion about which midfielder is “better” based on these stats, you also need to talk about what you expect out of a defensive midfielder. How good do you expect them to be at passing in the midfield, and – assuming attacking duties aren’t their primary responsibilities – how good do they have to be in the attacking 3rd to make up for a difference compared to someone else in the middle 3rd?

And then there’s the question of how much passing numbers should be adjusted given a team’s players, formation, tactics, and overall performance. If Killion’s passing numbers in the middle 3rd on the face of it are good enough, is there something about the way Brian’s team, the Houston Dash, plays and performs that may forgive lower numbers? The same goes for the attacking 3rd – Brian’s numbers look better, but is there something about Killion’s team, Sky Blue FC, that when taken into consideration makes her a more valuable player than Brian in the attacking 3rd? And, as far as this project is concerned, how much of this extra information is in all the data we’ve already tracked and can thus analyze ourselves?

Some of this additional information is likely sitting in all the match spreadsheets that have been logged for this WoSo Stats project – there’s the potential for further insights if we could get data on passing networks, on situations such as when a team is trailing, on matchups based on the type of players and teams a player is going up against, and likely much more.

For now, let’s look at two more types of passing data. We’ll look at completed passes that go across different thirds of the field, and special types of passes – launch balls, through balls, and crosses in the middle 3rd and attacking 3rd.

Passing Range

The chart below shows the top players by open play passes attempted per 90, with passes completed from the middle 3rd into different thirds of the field (and within the middle third) and with passes completed from the attacking 3rd back into the middle 3rd and within that attacking 3rd. We only have data for completed passes because sometimes it’s not reliably possible to figure out where an incomplete pass was trying to go – such as when it’s blocked right in front of a player trying to pass the ball and it’s not clear just how far down the field the ball was supposed to go.

Screen Shot 2017-03-12 at 7.32.28 PM

Killion overall is completing more passes within and out of the midfield, close to 5 more. The great majority of those are passes that stay within the middle 3rd, and the same is true for Brian. Brian has a few more passes completed within the attacking 3rd. Overall, there doesn’t appear to be a whole lot here to differentiate the two. They’re both obviously distinct from a lot of other players visible here, but it looks like all we can tell from this is that Killion completes more passes per 90 minutes within the middle 3rd than Brian.

Through Balls, Launch Balls, and Crosses

Finally, a look at through balls and launch balls out of the middle 3rd, and through balls and crosses out of the attacking 3rd. Numbers for both players here per 90 minutes end up being small. Incomplete open play pass attempts are in red, and completed open play pass attempts are in orange.

Screen Shot 2017-03-12 at 7.52.30 PM

Screen Shot 2017-03-12 at 7.52.37 PM

Killion in the middle 3rd appears better at launching the ball forward and completing a through pass, with more completions per 90 and a higher completion percentage for each type of pass.

There’s less to see in the attacking 3rd for either player. Killion and Brian barely complete any through balls from the attacking 3rd, likely because by the time they’re in the attacking 3rd from deep in the midfield most of the opposing team’s defense is already well situated in front of the goal. Killion attempts a negligible number of crosses, and Brian completes about one cross every other game.

Next steps

These two players were an interesting case study because of how similar they are in playing style and how good they are. I had to explore quite a few stats because, on the face of it, they were quite similar with regard to passing attempts and completion percentage, even when broken down by thirds of the field.

In the future, I’d like to do this with other NWSL players who are also considered defensive midfielders – players like Buczkowski and Winters, and others – to see just how alike everyone who plays this type of midfielder role really is. I touched on this briefly, but something like a visualized passing network, showing just who is getting all these passes, could also shed light on not just where players like Killion and Brian distribute the ball, but who they’re passing it to. Are they passing it off to mostly defenders, wingers, attacking midfielders, or straight to the forwards? There are also curious cases where each player has lined up not quite in the defensive midfielder role but maybe somewhere further up the midfield or out on the wing – it could be possible to account for those matches. And I haven’t even added any stats related to defending, which is a whole ‘nother aspect of being a defensive midfielder that is arguably just as important as how well they pass the ball.

This is all beyond the scope of this blog post, and I hope to revisit it another time. Or feel free to go after it yourself, as the data is all there in the WoSo Stats GitHub repo. For now, I hope you’ve enjoyed a look at how the data we’ve logged can dig into the differences – and similarities – between two very good players who, with very few goals and assists, don’t show up prominently on traditional stats sheets but, with the stats we’ve got, show up as vital parts of the midfield.

One last thing, and one last time: the WoSo Stats project needs your help! If you’re interested in logging data for matches, read more here and email me at wosostats.team@gmail.com or send me a DM at @WoSoStats on Twitter. All the data logged will be publicly available on the WoSo Stats GitHub repo.

It’s been a year!

It’s been a year since this WoSo Stats project went live! To be precise, it’s been a year and two days since this tweet, when I first went public with this project.

A year later, we have over 100 matches in this project’s database, and are 75% complete with logging the NWSL 2016 season.

None of this would have been possible without the incredible, hard work of the dedicated volunteers behind this project. There have been dozens who have helped out, some for one or two matches, and some for far more, but each of them has helped us better understand this beautiful game.

It’s been a humbling experience seeing how eager fans have been to help do something that hasn’t been done before in women’s soccer. I truly believe that the growth of women’s soccer could be one of the next generation’s most interesting, fascinating stories in sports. In the meantime, we’ve got an NWSL season to finish, and even bigger hopes for this year.

-Alfredo

How to create “per 90” heat maps

I added a new sheet to the “heat-maps-template.xlsx” Excel spreadsheet for “per 90” heat maps, which is going to make it way easier to look at how a player performed across different matches without having to scroll through various heat maps. With a few quick fixes, described below, you can also account for how many minutes the players in the database you’re looking at have played.

If you’re not already familiar with heat maps based on WoSo Stats data, read more about how to create your own heat maps here. Following those instructions, so long as you assign “TRUE” to the “per_90” argument in the createMultiLocStatsTabs() function, you should have a .csv file in your working directory named “overall-p90-everything.csv” (if you also assigned “everything” to “match_stat”, which is what the rest of this blog post will assume).

In the heat-maps-template.xlsx Excel spreadsheet, like with the stats tables in the “match tables” sheet, in the “per 90 table” sheet you’ll need to copy and paste your “overall-p90-everything.csv” table over the large stats table to the right of the heat map. For now, let’s just look at the per 90 stats table that’s already in the template. It’s a stats table for Sky Blue FC’s 2016 matches from Weeks 1 through 5, and Week 7 (Week 6 is incomplete, for now), and it has some differences with the individual match heat maps beyond the number of players it represents.

Look at the cells I’ve highlighted below in orange. You’ll see that the individual maximum for open play pass attempts (“opPass.Att”) is set at 10.09. But over on the right, Caroline Casey has 11.4 open play pass attempts in her own 18-yard box (the D18 zone). This is higher than the individual maximum shown below, but that’s by design.

Screen Shot 2016-11-16 at 8.14.35 PM.png

The formula in that “Ind. Max” column only covers all the rows with players that have played more than 270 minutes. Casey only played in two games, so she missed the cut.

Screen Shot 2016-11-16 at 8.19.23 PM.png

This is a quick and inelegant way to account for some way-too-high per 90 stats for players who played very few minutes. This way, a player who only played for 10 minutes and passed the ball three times out of her defensive third’s left wing (the DL zone) doesn’t jack up the individual maximum with her 27 pass attempts per 90.
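The per 90 arithmetic behind that example is easy to check by hand. Here is a minimal sketch in R, where per90() is a hypothetical helper written just for this illustration and not a function from the WoSo Stats code:

```r
# Hypothetical helper: scale a raw count to a per-90-minutes rate.
per90 <- function(count, minutes) {
  count / minutes * 90
}

# Three pass attempts in 10 minutes on the field.
short_stint <- per90(3, 10)
print(short_stint)
```

Three attempts in 10 minutes scales up to 27 per 90, which is why very short stints can produce such inflated numbers.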

This is important because the heat map’s color spectrum is determined by the individual maximum. It starts at zero and ends at the maximum, and if outliers weren’t accounted for then the map would look very light for some very good players. Take Sarah Killion, who at 571 minutes is tied with Rampone for the most minutes played out of any player in this set of Sky Blue FC matches. If the individual maximum were calculated from all players, regardless of minutes, her open play pass attempts per 90 heat map would look very, very light. She has 10.09 open play pass attempts per 90 out of the defensive middle’s center, but it barely stands out because there’s a player with 33.75 open play pass attempts per 90 who’s throwing everything off!

screen-shot-2016-11-16-at-8-26-11-pm

Go back to setting the individual maximum based on players who played at least 270 minutes (where now the individual maximum for open play pass attempts is 10.09), and Killion’s volume of passing attempts per 90 from the middle of the field stands out way more.

Screen Shot 2016-11-16 at 8.29.30 PM.png
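To make the effect of the minutes cutoff concrete, here is a rough sketch in R of the same idea outside of Excel. The table is hypothetical: only Killion’s 571 minutes, her 10.09 value, and the 33.75 outlier come from this post, and the other rows and the column layout are made up. It also assumes the cutoff means “at least 270 minutes”.

```r
# Hypothetical per 90 values for a single zone. Only Killion's row and the
# 33.75 outlier value come from the post; the rest is invented.
p90 <- data.frame(
  player = c("Killion", "Player B", "Short-stint player"),
  minutes = c(571, 400, 8),
  opPass.Att = c(10.09, 6.50, 33.75),
  stringsAsFactors = FALSE
)

# Assumed cutoff: players with at least this many minutes count.
min_minutes <- 270

# Individual maximum over every player, regardless of minutes.
max_all <- max(p90$opPass.Att)

# Individual maximum over players at or above the minutes cutoff.
max_cut <- max(p90$opPass.Att[p90$minutes >= min_minutes])

# How far along the 0-to-maximum color scale Killion's value lands.
killion <- p90$opPass.Att[p90$player == "Killion"]
share_all <- killion / max_all
share_cut <- killion / max_cut
print(round(c(all = share_all, cut = share_cut), 2))
```

With the outlier included, Killion’s value reaches only about 30% of the scale; with the cutoff applied, it sits at the very top.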

The above example is for a very specific dataset. What if you created a per 90 stats table for every Portland Thorns match with location data, and what if that table had many more rows? And what if, unlike in the example above where there were 11 players who had played at least 270 minutes, there were 13 players and you had to change the number of rows the “Ind. Max” column is looking up?

The solution, for now, until I find a better one, is a huge pain in the ass but it works. First, find the row number for the last player above the 270 minute threshold. For the example above, it was row 12 – but let’s say it was actually at row 14. Then, highlight the “Ind. Max” column, search for the number adjacent to the “$AW__” value (in this case it’s 12), and replace that with the row number, which in our hypothetical scenario would be 14.

Screen Shot 2016-11-16 at 8.55.23 PM.png
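If you’d rather not count rows by eye, the row number can be worked out in R before you touch the spreadsheet. This is only a sketch: the minutes below are made up, and it assumes the table is sorted from most to fewest minutes with the header in row 1 and the first player in row 2.

```r
# Hypothetical minutes column, sorted from most to fewest minutes played.
minutes <- c(571, 571, 540, 495, 450, 450, 400, 360,
             330, 300, 290, 280, 270, 180, 45, 10)

# Number of players at or above the 270 minute threshold.
n_above <- sum(minutes >= 270)

# Spreadsheet row of the last of those players, assuming row 1 is the
# header and the players start in row 2.
last_row <- n_above + 1
print(last_row)
```

With 13 players at or above the threshold, the last of them sits in row 14, matching the hypothetical scenario above.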

That’s the short of the per 90 heat map. I haven’t yet touched the “Team Max” column, but I will in a later post. Coming soon, I will just make one per 90 heat map for the entire season and update it as I get more location data. I will also work on making it easier to copy and paste over stats tables so that you won’t have to manually change any formulas ever.

How to create your own heat maps for NWSL advanced stats

Earlier last week, I published a post exploring heat maps for the Portland Thorns’ April 2016 matches, with a focus on Tobin Heath’s performance. Those heat maps are in an Excel spreadsheet, which you can download here.

In this post, I’ll summarize how they work, and how you can create your own for the matches for which we currently have location data. You’ll need a basic understanding of how to use R and how to modify Excel spreadsheets.

How to create your own heat maps

You will only be able to create heat maps for the matches in the WoSo Stats database for which we have location data. You can see which ones they are by going to the database.csv file in the WoSo Stats GitHub repository and seeing which matches have a “yes” in the “location.data” column, which means they have complete location data for virtually every action that was logged.
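If you already have the database loaded as a data frame in R, that check could look something like the sketch below. The data frame here is a small hypothetical stand-in: only the “location.data” and “competition.string” column names come from this post, and the “matchup” column and all of the rows are invented for illustration.

```r
# Hypothetical stand-in for database.csv. Only the "location.data" and
# "competition.string" column names come from the post.
database <- data.frame(
  matchup = c("match-a", "match-b", "match-c"),
  competition.string = c("nwsl-2016", "nwsl-2016", "shebelieves-cup-2016"),
  location.data = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Keep only the matches flagged as having complete location data.
with_location <- database[database$location.data == "yes", ]
print(with_location$matchup)
```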

If you’d like to help us get more location data logged for more of these matches and you’ve got a couple of hours to spare, you can help!

Getting the data

Anyways, first things first, open up R or RStudio or whatever you use to work in R, and run this code to source the “getting-data.R” and “create-location-stats-table.R” code. The first file will create a data frame in your R session for the aforementioned database.csv file and a getMatchCsvFiles() function. The second file will create various functions, but the two we’ll be working with will be createMultiLocStatsTabs() and writeFiles().

source("https://raw.githubusercontent.com/amj2012/wosostats/master/code/version-2/getting-data.R")
source("https://raw.githubusercontent.com/amj2012/wosostats/master/code/version-2/create-location-stats-table.R")

Now it’s time to pick the matches you want with the getMatchCsvFiles() function. This function has the following arguments:
1. competition.string: The name of the competition you want to analyze as it is written in the database’s “competition.string” column. This MUST match exactly what is written in the column, and this argument MUST be written. For the NWSL 2016, you’d write in competition.string = “nwsl-2016”. If you’d like to pick from every single match in the database, then just write in competition.string = “database”.
2. The data range you’d like to pick, written as one of several possible arguments. You can pick a specific “round” (such as a week in NWSL play), a set of various “rounds” (such as multiple weeks in NWSL play), or a specific month. These arguments are the following:

  • round: The “round” of the competition, written as round = “nameOfRound”. For the NWSL 2016 season, “rounds” are weeks of the season; week 1, for example, would be written as round = “week-1”.
  • multi_round: A vector of different “rounds” of a competition (for the NWSL this would be weeks), written as multi_round = c(“X”, “Y”, “Z”). If you wanted weeks 1 through 3, and week 4, you’d write this as multi_round = c(“week-1”, “week-2”, “week-3”, “week-4”).
  • month_year: The month and year of the matches you’d like, written as MM_YYYY. For example, matches from May 2016 would be written as month_year = “05_2016”.
  • For now, you can only pick one of these at a time. For example, you can only pick April 2016 matches or Week 1 through Week 3 matches, not all matches from Week 1 through Week 3 that happened in April 2016.
  • You can also just leave this argument blank, in which case you’ll pull everything in the database, according to any further filters you set based on the next few arguments.
3. team: This is optional. This is the abbreviation for the team whose matches are the only ones you want, written as team = “TeamAbbreviation”. The abbreviation is based on our list of abbreviations for club teams and based on FIFA’s country codes. Double-check the database to make sure the team you want is actually in our database – beyond the NWSL 2016 teams, we only have a bunch of international teams and one random PSG-Lyon match (as of this writing).
4. location_complete: This is also optional, and is set to default as location_complete = FALSE. What that means is that, by default, you will get all matches, regardless of whether they have complete location data. For the purposes of this blog post, we will want to set this as location_complete = TRUE.

Feel free to play around with this (and let me know if you run into any bugs), but here are some examples of how this function works:

To get all Sky Blue 2016 matches for which we have any data:

getMatchCsvFiles("nwsl-2016", team = "SBFC")

To get all Washington Spirit 2016 matches from the month of June, for which we have complete location data:

getMatchCsvFiles("nwsl-2016", month_year = "06_2016", team = "WAS", location_complete = TRUE)

To get all USWNT matches from the 2016 SheBelieves Cup, for which we have complete location data:

getMatchCsvFiles("shebelieves-cup-2016", team = "USA", location_complete = TRUE)

For this blog post, we’re going to focus on the code I ran to get all Portland Thorns matches from the first 3 weeks of the season. We already know we have location data for these matches, so specifying location_complete isn’t necessary; however, let’s specify it anyways just in case you weren’t sure.

getMatchCsvFiles("nwsl-2016", multi_round = c("week-1", "week-2", "week-3"), team = "PTFC", location_complete = TRUE)

You should now have a match_list list (a very large one, too) with 3 elements, one for each match spreadsheet, and a match_names vector with 3 elements, one for each matchup name.

Getting the location-based data

The next few steps are pretty simple. Call the createMultiLocStatsTabs() function; set the match_list argument as “match_list” and the match_stat argument as the stat you’re looking for (more on this in the next paragraph); and assign it to variable stats_list. This will create for each match a table with each player in one row and their location-based stats in the columns.

When calling this function, one of the arguments is the match_stat, which is the type of location-based stat you want. As of this writing, you can get 16 different location-based stats, listed below with the string you need to write in as the argument shown in parentheses. If you wanted to get the largest table possible with columns for each stat (this creates a table with 181 columns), just write match_stat = “everything”.

Or, assign one of these to the match_stat argument:
1. Attempted passes (attempted-passes)
2. Completed passes (completed-passes)
3. Passing completion percentage (pass-comp-pct)
4. Take ons won (take-ons-won)
5. Take ons lost (take-ons-lost)
6. Aerial duels won (aerial-duels-won)
7. Aerial duels lost (aerial-duels-lost)
8. Tackles (tackles)
9. Dispossessions of Opp (opp-dispossess)
10. Opp Poss Disrupted (opp-poss-disrupted)
11. Pressure/Challenges (pressure)
12. Recoveries (recoveries)
13. Interceptions (interceptions)
14. Blocks (blocks)
15. Clearances (clearances)
16. Opp Ball Disrupted (opp-ball-disrupted)

For the set of Portland Thorns matches we are working with, this is the code we would run:

stats_list <- createMultiLocStatsTabs(match_list, match_stat = "everything")

Once you run this, you’ll have a list assigned to the variable stats_list that will have a stats table for each of the three Portland Thorns matches.

Then, write these stats tables as .csv files in your working directory, by running the following. Each stats table’s file name will be determined by the match_stat, which you have to specify again (the data won’t be affected, so you could really name this whatever you want) and by the string values in the match_names vector that was created when we ran the getMatchCsvFiles() function.

writeFiles(stats_list, match_names = match_names, match_stat = "everything")

Run this and, staying with our Portland Thorns April 2016 example, your working directory will now have three .csv files.
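As a side note, if you want a feel for what that writing step amounts to without touching the real data, here is a rough base R sketch. It is not the writeFiles() code: the tables and match names are hypothetical stand-ins, and the file name pattern is assumed from the “BOS-PTFC-everything.csv” file name that comes up later in this post.

```r
# Hypothetical stand-ins for stats_list and match_names; the real objects
# come from createMultiLocStatsTabs() and getMatchCsvFiles().
stats_list <- list(
  data.frame(Player = c("A", "B"), opPass.Att = c(12, 7)),
  data.frame(Player = c("C", "D"), opPass.Att = c(9, 15))
)
match_names <- c("match-one", "match-two")
match_stat <- "everything"

# Write into a temporary folder so nothing lands in your working directory.
out_dir <- tempfile("heat-map-tables-")
dir.create(out_dir)

for (i in seq_along(stats_list)) {
  # Assumed file name pattern: <match name>-<match_stat>.csv
  file_name <- paste0(match_names[i], "-", match_stat, ".csv")
  write.csv(stats_list[[i]], file.path(out_dir, file_name), row.names = FALSE)
}

written <- sort(list.files(out_dir))
print(written)
```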

To review, here is all of the code that was run since the beginning of this blog post to create the three .csv files that are now in your working directory (the code can also be found here):

source("https://raw.githubusercontent.com/amj2012/wosostats/master/code/version-2/getting-data.R")
source("https://raw.githubusercontent.com/amj2012/wosostats/master/code/version-2/create-location-stats-table.R")
getMatchCsvFiles("nwsl-2016", multi_round = c("week-1", "week-2", "week-3"), team = "PTFC", location_complete = TRUE)
stats_list <- createMultiLocStatsTabs(match_list, match_stat = "everything")
writeFiles(stats_list, match_names = match_names, match_stat = "everything")

Create the heat maps

Now the tricky part: creating the heat maps with the data in the .csv files. First, download the Excel template for the heat maps (click on “View Raw” to download), which is just the Portland Thorns April 2016 heat maps, and open it.

Let’s pretend we had this Excel spreadsheet but without the data that’s shown to the right of the heat map, starting with the PTFC-ORL match. Highlight everything in columns “AC” through “HA” from row 1 down to row 29 (as shown in the images below) and clear the contents (DO NOT delete the columns, though).


The heat map will be blank, regardless of what you write into the “Enter name here:” and “Enter stat here:” cells, and the stat info to the right of the heat map and below the cells where you enter the Player and Stat you want will either be zeroes or NAs. This means that the formulas in all those different cells, including the ones that make up the heat map, are looking for data in the columns that we just cleared, but they’re calculating nothing but blanks and errors as there’s nothing there anymore, for now.

Let’s say we didn’t want to re-create the PTFC-ORL match, but instead wanted to use that space we cleared for a BOS-PTFC heat map. Open the “BOS-PTFC-everything.csv” file that you created in your working directory (the following will only work with the “everything” versions of the stats tables) and highlight only the cells that aren’t blank (for this match it’s 27 player rows plus the header row for a total of 28 rows, times the 181 columns, for a total of 5,068 cells you have to highlight). This will look like this in Excel.


Copy those highlighted cells and paste them into the cell at row 1 and column AC, which will fill in the space that was previously taken up by the PTFC-ORL stats. But wait, you’re not done yet! One thing is left to correct, and that’s the team totals.

See those “PTFC” and “ORL” rows of numbers in the lower right below the player stats? Those are the totals for each stat column, which are referred to when creating heat maps for an overall team view. I like to keep the home team on top, so in this example, change “PTFC” in cell AC31 to “BOS” and change “ORL” in cell AC32 to “PTFC”. Then, highlight cells ADH31 (where the totals start) through cell HA31 and search and replace “PTFC” with “BOS”; this changes the formulas in each cell so that they’re now looking for stats from the right team. Do the same for the row below, searching and replacing “ORL” with “PTFC”. The totals should now be correct.

Finally, in cell Y17 under “Name entered is a team?” is a formula that reads what’s being written into the “Enter name here:” cell and determines, based on the team abbreviations you’ve given it in an OR() formula, if it’s a team that’s been input. Right now this formula is still looking for “ORL” as one of the two teams. Change the cell contents from =OR(B5="PTFC",B5="ORL") to =OR(B5="PTFC",B5="BOS").

And you’re done! The heat map should work now.

Warnings

  • It’s easiest to create the heat maps with the template I’ve provided if you had created stats tables with match_stat set as “everything.” I added in the option to create smaller stats tables for the future when the “everything” version of a stats table is far, far bigger than 181 columns. For now, though, it makes more sense to work with the “everything” stats tables as 181-column spreadsheets shouldn’t slow down your computer.
  • You can ignore the passing percentage heat maps for the overall team views, as those are the sum of the percentages for each player. I haven’t yet figured out a way to get the average for the percentages that can account for whether a 0.0 passing pct is there because there were no attempts at all.
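One possible way around the passing percentage problem, at least in R rather than in the Excel template, is to skip averaging the per-player percentages and compute the team figure from the totals instead. This is only a sketch with made-up numbers and hypothetical column names, not something the template currently does:

```r
# Hypothetical player rows for one zone; names and numbers are made up.
zone <- data.frame(
  player = c("A", "B", "C"),
  attempted = c(10, 0, 5),
  completed = c(8, 0, 3)
)

# Per-player percentages; players with no attempts get NA instead of 0.
zone$pct <- ifelse(zone$attempted > 0, zone$completed / zone$attempted, NA)

# Team percentage from the totals, so zero-attempt players drop out.
team_pct <- sum(zone$completed) / sum(zone$attempted)
print(round(team_pct, 3))
```

Dividing total completions by total attempts means a player with no attempts in a zone adds nothing to either sum, so there is no need to tell a true 0.0 apart from “no attempts”. If the whole team has zero attempts in a zone, the result is NaN and would need its own handling.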

Next steps

Help us!

Made it this far? Maybe you can help us out a little more. We need help logging this data. This data only happens because of fans like you who have put hours of their free time into logging data onto Excel spreadsheets. But we need more people helping out, as right now we are very low on volunteers and will be lucky to finish the 2016 season by the time the 2017 season even starts! If you’re interested, read more here about how to help and either send a DM on Twitter to @WoSoStats or email me at wosostats.team@gmail.com to get started. All it takes is a couple of hours of your free time, a willingness to learn, and knowing a thing or two about Excel.