Previously I’ve written about using shots and conversion rates to identify which areas an offense or defense excels at or is struggling with. Shots can be a crude estimate of opportunities, and conversion rate an estimate of how well a team executes on those opportunities. I had looked at offense and defense separately in the past, but decided to combine the two to see if any interesting patterns emerged. I represented the difference as a vector, with the magnitude (length of the line) representing how much of an advantage a team had and the angle representing how much opportunities or execution contributed to that advantage. Since one of the emerging stories of the season is the high number of shots Manchester United is conceding, I thought it would be interesting to see how this season stacks up against the previous one.
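As a rough sketch of the vector idea, the difference between a team's offensive and defensive numbers can be turned into a magnitude and an angle. The function name and the two scale parameters are my own assumptions for illustration; shots and conversion rate live on very different scales, and the excerpt does not say how (or whether) the axes were scaled.

```python
import math

def advantage_vector(shots_for, conv_for, shots_against, conv_against,
                     shot_scale=1.0, conv_scale=1.0):
    """Return (magnitude, angle in degrees) of a team's advantage.

    x-axis: opportunities (shots taken minus shots conceded)
    y-axis: execution (conversion rate minus opponents' conversion rate)
    shot_scale and conv_scale are hypothetical normalizers, not from the post.
    """
    dx = (shots_for - shots_against) / shot_scale
    dy = (conv_for - conv_against) / conv_scale
    magnitude = math.hypot(dx, dy)                 # length of the line
    angle_deg = math.degrees(math.atan2(dy, dx))   # 0 = all opportunities, 90 = all execution
    return magnitude, angle_deg

# Hypothetical team: same shot volume as its opponents, better finishing.
mag, angle = advantage_vector(10, 0.15, 10, 0.10, conv_scale=0.05)
print(mag, angle)
```

An angle near 0 degrees means the advantage comes mostly from creating more chances, while an angle near 90 degrees means it comes mostly from finishing them better.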
Archive for Offensive Production
As I’ve previously posted, I had the chance to speak at the New England Symposium on Statistics in Sports. They’ve now posted the videos and slides from all the presentations. I’ve posted my video below as well as the slides and original blog post so that all the content is in one place. Originally I wanted to title my talk “Cool Shit You Can Do With Markov Chains in Soccer” but toned it down a bit to “A framework for tactical analysis and individual offensive production assessment in soccer using Markov chains”.
This weekend I had the privilege of speaking at the New England Symposium on Statistics in Sports. It is a much more technical conference than the Sloan Sports Analytics Conference so I felt a bit like a duck out of water given my background in computer science and not hardcore statistical methods (and these guys were hardcore!). Originally I had planned to do a write up, similar to the one I did for SSAC, but there was too much going on for me to take adequate notes. I really enjoyed chatting with a lot of people who are similarly passionate about their respective sports and take the time to sit down and produce cool stuff. The panel discussion was also fascinating. Some of the themes that were discussed during SSAC carried over such as:
I am thrilled to announce that I will be speaking at this year’s New England Symposium on Statistics in Sports (NESSIS) on September 24th. Earlier this year, StatDNA announced a Soccer Analytics research competition and my paper was selected as the winning entry. I’ll be giving a talk titled “A framework for tactical analysis and individual offensive production assessment in soccer using Markov chains”. Catchy, right? Well, if that didn’t grab your attention, Chris Stride from the University of Sheffield will be giving a talk called “Cheating in football: Team culture, player behavior, or question of circumstance?” and there are several soccer-related posters as well. If you’re attending NESSIS, drop me a line at firstname.lastname@example.org or come say hi after my talk. I’ll also be attending the post-conference drink-up at Porter Square’s Tavern in the Square. For those who can’t attend the conference, below is my abstract. You can find the others here.
A FRAMEWORK FOR TACTICAL ANALYSIS AND INDIVIDUAL OFFENSIVE PRODUCTION ASSESSMENT IN SOCCER USING MARKOV CHAINS
Markov Chains are an effective way to model transitions between states. Assuming that the next state depends only on the current state and not on the states that came before it, Markov Chains can be used to model the set of state transitions that make up a possession in soccer. The transitions are used to determine the probability that a possession ends in one of two final states: scoring a goal or relinquishing possession to the opposing team. Once the final probabilities are known for each state, they can be used to determine game situations from which goals are more likely to develop, team strengths and weaknesses, and metrics for assessing the offensive contributions of players.
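To make the idea concrete, here is a minimal sketch of how the probability of ending in each final state can be computed for an absorbing Markov chain. The three pitch-zone states and every transition probability below are made up for illustration; they are not the states or numbers used in the paper.

```python
import numpy as np

# Hypothetical transient states (NOT the states from the paper).
transient = ["defensive third", "middle third", "final third"]

# Q[i, j]: probability of moving from transient state i to transient state j.
Q = np.array([
    [0.30, 0.50, 0.05],
    [0.15, 0.35, 0.30],
    [0.02, 0.18, 0.40],
])

# R[i, k]: probability of moving from transient state i to a final state,
# with columns [goal, possession lost]. Each row of Q and R together sums to 1.
R = np.array([
    [0.00, 0.15],
    [0.00, 0.20],
    [0.05, 0.35],
])

# Absorption probabilities B = (I - Q)^-1 R, solved without an explicit inverse.
B = np.linalg.solve(np.eye(len(transient)) - Q, R)

for state, (p_goal, p_lost) in zip(transient, B):
    print(f"{state}: P(goal)={p_goal:.3f}, P(lost)={p_lost:.3f}")
```

Each row of `B` gives, for a possession currently in that state, the chance it eventually ends in a goal versus a loss of possession.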
Using this framework on the sample data set, we found that teams are more likely to score from taking long corners than short corners, with the notable exception of Tottenham Hotspur, who excel at short corners. The top 3 teams most likely to score from a long corner are Arsenal, Newcastle and Stoke. The top 3 teams most likely to concede from a long corner are Everton, Arsenal and Newcastle. The framework can also be used to look at various other game situations, such as building from the back, counter-attacks, free kicks, and entries into the final third.
Additionally, the transition probabilities can be used to determine which individuals are best at receiving the ball in situations with a high probability of scoring and which individuals are best at moving the ball to an improved state with a higher probability of scoring than their current state. The top 3 players for increasing the probability of scoring are Tim Cahill, Yaya Toure and Cesc Fabregas. The 3 most wasteful players, who decrease their team’s probability of scoring the most, are Darren Bent, Peter Odemwingie and Gael Clichy. The top 3 players who receive the ball in the most advantageous states are Dimitar Berbatov, Nile Ranger and Benjani Mwaruwari.
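One simple way to turn state scoring probabilities into a player metric is to add up, for each player, the change in scoring probability produced by every action they take. This is only a sketch of that idea under my own assumptions: the abstract does not spell out the exact metric, and the state names, probabilities and events below are hypothetical.

```python
from collections import defaultdict

# Hypothetical P(possession ends in a goal) for each state.
p_goal = {
    "defensive third": 0.05,
    "middle third": 0.06,
    "final third": 0.10,
    "lost": 0.00,
}

# Hypothetical event log: (player, state before the action, state after it).
events = [
    ("Player A", "middle third", "final third"),
    ("Player A", "final third", "lost"),
    ("Player B", "defensive third", "middle third"),
    ("Player B", "middle third", "final third"),
]

def scoring_probability_added(events, p_goal):
    """Sum the change in scoring probability over each player's actions."""
    totals = defaultdict(float)
    for player, from_state, to_state in events:
        totals[player] += p_goal[to_state] - p_goal[from_state]
    return dict(totals)

added = scoring_probability_added(events, p_goal)
for player, value in sorted(added.items(), key=lambda kv: kv[1], reverse=True):
    print(player, round(value, 3))
```

A positive total means the player tends to move the ball into better states, and a negative total marks a player who tends to waste promising situations.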
Soccer By the Numbers has a great post on the value of corners which raised an interesting point about the importance of different statistical measures. One of the problems with trying to build regression models for soccer is that few of the variables are independent. For example, if you want to look at the relationship corners and shots have with wins, it gets a bit tricky. It’s likely a corner was awarded because of a deflected shot or goalkeeper tip, and it’s also likely that the corner itself will lead to a shot. As the number of shots goes up, you can expect the number of corners to go up. As the number of corners goes up, you can expect the number of shots to go up. It’s a mess. How can you determine the effect of corners and shots on a team’s success when they are so intertwined?
If you build a regression model, you’ll arrive at a coefficient for each feature in your model and you can look at the incremental effect of each feature. Using the previous 5 years of data from the EPL, I built a linear regression model to predict the number of points a team earns in a season based on Offensive and Defensive Production.
Points = 64.39 + 0.06095 * Shots + 26.16 * ConversionRate - 0.0797 * ShotsConceded - 29.73 * OpponentsConversionRate
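For anyone who wants to play with the fitted model, here is the formula as a small function. The coefficients are copied from the equation above; the function and argument names are mine, and the inputs in the example are hypothetical placeholders. The post does not state the units of the shot counts, so use whatever units the model was fitted on.

```python
def predicted_points(shots, conversion_rate, shots_conceded, opp_conversion_rate):
    """Evaluate the fitted linear model for season points."""
    return (64.39
            + 0.06095 * shots
            + 26.16 * conversion_rate
            - 0.0797 * shots_conceded
            - 29.73 * opp_conversion_rate)

# Hypothetical inputs, only to show how the terms combine.
base = predicted_points(10, 0.10, 10, 0.10)
better_finishing = predicted_points(10, 0.11, 10, 0.10)

# One extra percentage point of conversion rate is worth 26.16 * 0.01 points.
print(better_finishing - base)
```

Because the model is linear, each coefficient is the change in predicted points for a one-unit change in that feature with the others held fixed.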
The R-squared for this model is 0.9469. Incredible, but it raises some questions. What is more important, offense or defense? Creating chances or finishing chances? You can look at the coefficients of the features and sort of say that the defensive coefficients are larger than the offensive ones, so maybe defense wins more games, but what about chances versus finishing?
There is a technique called LMG (named for Lindeman, Merenda and Gold) that quantifies each feature’s relative importance in a linear model.
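As a sketch of how LMG works: each feature's share is its increase in R-squared when it is added to the model, averaged over every possible order in which the features could be added. The shares add up to the full model's R-squared. The code below is a brute-force illustration on synthetic data (the data and function names are assumptions, not the EPL data or the tooling used for this post), and it is only practical for a handful of features because it loops over every ordering.

```python
from itertools import permutations
import numpy as np

def r_squared(X, y, cols):
    """R-squared of an ordinary least squares fit (with intercept) on the given columns."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def lmg(X, y):
    """Average each feature's incremental R-squared over all feature orderings."""
    p = X.shape[1]
    shares = np.zeros(p)
    orderings = list(permutations(range(p)))
    for order in orderings:
        included, previous = [], 0.0
        for j in order:
            included.append(j)
            current = r_squared(X, y, included)
            shares[j] += current - previous
            previous = current
    return shares / len(orderings)

# Synthetic stand-in data: 4 features, 100 observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 2.0, -1.5, -2.0]) + rng.normal(size=100)

shares = lmg(X, y)
print(shares, shares.sum(), r_squared(X, y, range(4)))
```

Dividing `shares` by `shares.sum()` gives each feature's percentage of the explained variance, which is the kind of number quoted as relative importance.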
Using LMG on the model you see the features are similarly important, with defensive features slightly more important than offensive (whether or not it’s a significant difference is another story). Fair enough, but there are lots of factors that contribute to shots and goals, so the next question is, can we create a model that includes some of these and what will that tell us?
I created a kitchen sink model with a handful of features that I intuitively thought would impact a team’s success (remember we are looking at points earned over a season and not the results of individual matches). I included (both for the team and its opponents):
The R-squared of the model is 0.9531. In models with only one feature, all of the coefficients had the expected directionality (positive for the team in question and negative for their opponent, meaning taking a shot is good and conceding a shot is bad).
Goals scored and conceded are the most important features in the model, which makes sense, but what was surprising is that clean sheets are almost as important as goals scored. When you think about it, though, it isn’t as surprising. A clean sheet means a team is guaranteed at least a point and being held to a clean sheet means a team can earn at most a point. Clean sheets explain 13.35% of a team’s points earned in a season. Corners? They are almost as important as shots. Soccer By The Numbers looked at the number of goals scored from corners (not a lot) so I was surprised to see corners with such a high relative importance. It might be that there is a missing feature from the model that better explains points earned in a season and is related to corners. If this missing feature were to be added, LMG would decompose the relationship accordingly and corners would have a lower relative importance. Cards? Cards are practically insignificant. Yellow cards’ impact on matches is fairly small. You could argue that perhaps a defender on a yellow card is more cautious, but the number of events that are altered because of that is pretty small. Red cards are extremely infrequent, so while their impact on a single match is high, their impact on the entire season is insignificant.
Similar to the graph showing the offensive production of teams, this graph shows their defensive production. I’ve reversed the direction of the axes so that the upper right quadrant in both graphs is the preferred location for teams. A key difference between the two is that I haven’t generated a model to determine estimated points based on defensive production, so both color and bubble size indicate a team’s actual points. Again, we see the top teams congregating in the upper right corner and Toronto seems to be the only team that is somewhat out of place based on the number of points they’ve earned so far this season. NY and Philadelphia have both had their offenses stuttering this season but they’ve been able to pick up points thanks to their ability to keep clean sheets in most matches.
An interesting match up this week is DC United hosting the Seattle Sounders. Seattle is second in the league in Shots Per Game while DC has the worst opposing conversion rate. New York recently put 4 past DC (including this nice one from Juan Agudelo). Can Seattle do something similar?
Most teams have now played 7 matches and should be settling into what their offensive production will look like for the rest of the season. What is odd is that there are very few teams in the magic quadrant. Previously we had seen the best teams in the upper right quadrant, but currently only Sporting KC is firmly established there, with Chicago, Vancouver and the NY Red Bulls on the border. Is this going to be a season of extreme parity in MLS?
One of my projects for the season is to track the evolution of MLS teams’ offensive production. The last MLS update can be found here. I’ve changed how I am presenting the results slightly. In the previous version, size and color were for Points Per Game earned. From here on out, size is actual Points Per Game earned and color is estimated Points Per Game based on a simple model (the color change from red to blue is set to 1.1 PPG which is a rough, rough estimate for making the playoffs).
When I published the offensive production numbers for the MLS season so far, a lot of people brought up the issue that we are only a few matches into the season and it is too soon to draw conclusions. First, the point of publishing offensive production numbers isn’t to predict where a team will finish at the end of the season (it is far too simple a model to do that and yes, we only have 2 data points for most teams). The value in it is that it teases apart some of the factors that contribute to a team’s success. It’s obvious that the Sounders have had trouble finishing, which has cost them points, but what about a team like the Red Bulls? As of last week they had picked up a respectable 4 points from 2 matches, but their offensive production shows they have trouble both creating chances and finishing. If that trend continues, they could find themselves in real trouble.
However, the question of when you have enough data to make decisions is a crucial one. At this stage in the season, a good game or two can push a team into the “magic quadrant”. So how do you know when you have enough data?
I ran my previous offensive production model over 5 years of EPL data and the results are displayed in the diagram below. Teams with larger text and bluer color earned more points and smaller text and red color fewer points. The big four end up in the upper right quadrant and the teams with the lowest points in the bottom left which is what we’d expect. The one thing I found interesting about this plot is that while the big four tend to separate themselves from the rest of the league, the bottom teams do not show much of a separation.
What’s interesting about the clustering at the bottom of the league is that it highlights the shortcomings of this very simple model. If you look at conversion rate=0.13, shots=9, Bolton and Birmingham have almost the exact same offensive production, yet their point totals varied by 21 points.