As I’ve previously posted, I had the chance to speak at the New England Symposium on Statistics in Sports. They’ve now posted the videos and slides from all the presentations. I’ve posted my video below as well as the slides and original blog post so that all the content is in one place. Originally I wanted to title my talk “Cool Shit You Can Do With Markov Chains in Soccer” but toned it down a bit to “A framework for tactical analysis and individual offensive production assessment in soccer using Markov chains”.

**A Framework for Tactical Analysis and Individual Offensive Production Assessment in Soccer using Markov Chains**

Charlie Adam is a fantastic player who, for some reason, insists on taking a shot from 40 yards out every game. From a fan’s perspective, it drives me crazy because in almost every instance all it accomplishes is giving the ball back to the other team. He never scores and rarely comes close to even troubling the keeper from these long-range shots. From an analytics perspective, it got me thinking: how much of an opportunity is Charlie Adam wasting with these shots? Can we estimate how likely a team is to score from a given game state (position of the ball, defensive pressure and defensive shape)? Given those estimates, what does that tell us about teams’ tendencies and individual performances?

With the ball at midfield, a team is very unlikely to score from a shot, but they could pass it around searching for a better opportunity, and eventually the team will either score or turn the ball over to the other team. My aim was to determine how likely those two outcomes are.

I decided to use Markov Chains with absorption states to model possessions. Drive by Football has a good explanation of Markov Chains if you aren’t familiar with them. Basically, they are a way of modeling an outcome based on the probability of transitioning from one state to another. In this example, the states are a combination of position on the field, defensive pressure and the shape of the defense. The transitions are actions performed by the players (pass, shoot, dribble, tackle, etc.). One of the keys to Markov Chains is that what happens next must depend only on the current state, not on the states that came before it. In other words, it doesn’t matter how we got here: every time we are in a given state, the transition probabilities should be the same. This is a big assumption to make in soccer, but the defensive metadata that StatDNA provides lets us group situations that are more similar than if we were just using position. For example, we can isolate situations where the player is 1-on-1 with the keeper in the box, rather than only knowing that the player was in the box without knowing whether several defenders were in the way.

The first order of business was determining what my game states were going to be. I *wanted* to divide the field up into a fine grid, but that meant my transition matrix was going to contain several million elements. Instead I settled on the following grid system based on the different characteristics of events that happen (see diagram below). Most shots occur in Zones 2+5, most goals come from Zone 5, Zones 1+3 are early crosses, etc. Along with a zone, each state also has defensive pressure and defensive shape associated with it. For example, two states could be “Zone 5, behind the defense, no pressure” and “Zone 5, behind the defense, under lots of pressure”.
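To make the size argument concrete, here is a minimal sketch of how a state could be represented and how quickly the transition matrix grows. The `GameState` type, the `matrix_elements` helper and the fine-grid dimensions (a 30 x 20 grid with two pressure levels and two defensive shapes) are my own illustrative assumptions, not the actual model; only the 37 + 2 state count comes from the post.

```python
from typing import NamedTuple


class GameState(NamedTuple):
    """One state of the chain: where the ball is and what the defense looks like."""
    zone: str      # e.g. "Zone 5"
    shape: str     # defensive shape, e.g. "behind the defense"
    pressure: str  # defensive pressure, e.g. "no pressure"


def matrix_elements(n_states: int) -> int:
    """A transition matrix has one entry per (from_state, to_state) pair."""
    return n_states * n_states


# The two example states mentioned in the text.
a = GameState("Zone 5", "behind the defense", "no pressure")
b = GameState("Zone 5", "behind the defense", "lots of pressure")

# Coarse model from the post: 37 states plus 2 absorbing states.
coarse = matrix_elements(37 + 2)

# Hypothetical fine grid (assumption): 30 x 20 cells, 2 pressure levels, 2 shapes.
fine = matrix_elements(30 * 20 * 2 * 2)

print(coarse, fine)
```

Because the matrix grows with the square of the number of states, even a modest grid pushes it into the millions, and most of those entries would have little or no data behind them.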

Additionally, I defined states for set pieces because of their unique characteristics in the game: long and short corners, long and short free kicks, deep and shallow throw-ins, and penalties. Overall there were 37 different states the ball could be in, plus the two absorbing states: goal and turnover to the other team.

With the states defined, the next step was to calculate the transition probabilities. For each state, I wanted to know how likely the ball was to be moved to each one of the other states. The great thing about Markov Chains is that once we have the transition probabilities, we can calculate the probability of the ball ending up in one of the absorbing states after an infinite number of moves. The states are called absorption states because once the ball is in that state it doesn’t leave: the possession is over. By looking at an infinite number of moves, it makes no difference if the ball ends up in the absorbing state after 1, 5, 10 or 100 transitions. Possessions of arbitrary length are handled nicely because of this trait. We can easily look at all the different possible ways the possession can unfold and calculate how likely a team is to score from a given starting state. I did this not just for the entire league to see general trends, but also for each individual team’s offense and defense.
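The two steps above (estimate transition probabilities, then find the chance of ending in the goal state) can be sketched on a toy chain. Everything here is hypothetical: the two transient states `"midfield"` and `"box"`, the event counts, and the function names `transition_probs` and `scoring_probs` are made up for illustration and are not the real 37-state model or its data. The sketch repeatedly applies the transition probabilities until the values stop changing, which is one simple way to approximate "an infinite number of moves".

```python
from collections import Counter, defaultdict

GOAL, TURNOVER = "goal", "turnover"  # the two absorbing states


def transition_probs(events):
    """Turn observed (from_state, to_state) pairs into row-normalised probabilities."""
    counts = defaultdict(Counter)
    for src, dst in events:
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


def scoring_probs(probs, tol=1e-12, max_iter=10_000):
    """Probability of eventually reaching GOAL from each transient state."""
    p = {s: 0.0 for s in probs}
    p[GOAL], p[TURNOVER] = 1.0, 0.0
    for _ in range(max_iter):
        delta = 0.0
        for s, row in probs.items():
            # Scoring chance of s = weighted average of where the ball goes next.
            new = sum(pr * p[dst] for dst, pr in row.items())
            delta = max(delta, abs(new - p[s]))
            p[s] = new
        if delta < tol:  # values have stopped changing
            break
    return {s: p[s] for s in probs}


# Made-up event log: 10 events starting at midfield, 10 starting in the box.
events = (
    [("midfield", "midfield")] * 2
    + [("midfield", "box")] * 3
    + [("midfield", TURNOVER)] * 5
    + [("box", GOAL)] * 2
    + [("box", TURNOVER)] * 6
    + [("box", "midfield")] * 2
)

result = scoring_probs(transition_probs(events))
print(result)
```

For this toy chain the exact answers are 3/37 from midfield and 8/37 from the box. The same quantities can also be obtained in closed form from the standard absorbing-chain result, where the absorption probabilities are (I - Q)^-1 R for the transient-to-transient block Q and the transient-to-absorbing block R; the iterative version is used here only because it needs no linear algebra library.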

**Short versus Long Set Pieces**

The same technique can be used to examine how teams defend corners. Below is a graph that shows each team’s probability of conceding from both types of corners. Not surprisingly, Arsenal is one of the worst teams at defending long corners. Manchester United is notably worse at defending short corners than they are at defending long corners. These bits of info could be valuable when planning a team’s in-game strategy.

This type of analysis can be done for any of the game states that were defined. For example, it can be used to look at whether a team is good at counter attacking, whether they are better under pressure or need more space to operate, or whether throw-ins are advantageous.

**Individual Offensive Contribution**

We can also examine who is the most wasteful with the ball by looking at who has the lowest offensive contributions. Goalkeepers are colored in grey in the diagram below. The strong presence of goalkeepers among the worst contributors should be a red flag for most teams, as it possibly indicates significant room for improvement in the keepers’ distribution.

Darren Bent is far and away the most wasteful outfield player in the dataset. The sample isn’t representative of his season: he scored 17 goals last year, but only one of those goals is in the sample. However, in the sample he had 19 opportunities where he received the ball in a state with a probability of scoring greater than 10% (the average probability of these chances was 22%). He converted only one of these chances, and his offensive contribution for these high-probability chances was -0.263. Imagine how high he’d be ranked if he had finished some of these chances.

There are loads of additional questions that you can start to try to answer using this framework. The data can be sliced and diced in all sorts of interesting ways. Currently the model doesn’t account for the quality of the opposition, which would be a good next step in developing this framework further.
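As a rough sanity check on the Darren Bent numbers above, the quoted figures can be turned into an expected goal count. This is a back-of-envelope sketch only: it assumes each of the 19 chances can be treated at the average probability of 22%, and it is not the offensive contribution metric itself, whose formula is not given in this section. The variable names are mine.

```python
chances = 19             # high-probability opportunities quoted in the post
avg_scoring_prob = 0.22  # average scoring probability of those chances (from the post)
goals_scored = 1         # chances actually converted in the sample

# Expected goals if every chance were worth the average probability (assumption).
expected_goals = chances * avg_scoring_prob
shortfall = expected_goals - goals_scored

print(f"expected {expected_goals:.2f}, scored {goals_scored}, shortfall {shortfall:.2f}")
```

Under that simplification, those possessions would be expected to produce a little over four goals, against the one that was actually scored.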