
The Importance of Data Quality

Over the weekend, controversy was abundant in Spain: Dani Alves was in a questionable offside position before his pass to David Villa, which led to Barca’s first goal.  Was this another example of Villarato?  Of course, all of the papers had to weigh in on whether or not he was offside.  Thanks to modern technology and the ability to analyze the video, it should be fairly easy to come to the correct conclusion, right?  Nope.

Is Dani Alves offside? Depends on who you ask.

This is an extreme example, but it highlights an important point.  Humans can view the same event and come to different conclusions.

There are numerous hurdles to making progress in soccer analytics. Ask anyone involved and they will give you a laundry list of reasons why it’s difficult, if not impossible. I’m in the camp that it’s difficult but not impossible. I agree with most of the usual reasons given (lack of data, hard to isolate discrete events, etc.), but I think one reason that isn’t given enough importance is the quality of the data. My day job involves dealing with a lot of data, and we spend a lot of time and energy assessing its quality and making sure it’s good enough to use. Garbage in, garbage out, right?

When you look at the current systems for collecting stats, they fall into two categories: automated video analysis (e.g., ProZone) or human judgment (e.g., Opta).  For the automated systems, the stats are collected by software with minimal human interaction.  Data quality issues could arise from limits on the software’s precision, player collisions, equipment malfunction, etc.  Equipment malfunction should be easy to detect, and relabeling players after a collision should be a straightforward task.  That leaves most of the potential error as systematic, meaning that whatever precision issues exist in the software would affect all of the data in the same way.
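To make the point about systematic error concrete, here is a minimal Python sketch. Every number in it is invented purely for illustration (the player names, distances, 5% scale factor, and per-player offsets are all assumptions, not measurements from any real tracking system). It shows that an error applied to every player in the same way leaves comparisons between players intact, while inconsistent errors can reorder them.

```python
# Hypothetical "true" distances covered, in km. These values are made up
# for illustration and do not come from any real tracking system.
true_km = {"Player A": 10.0, "Player B": 10.4, "Player C": 11.0}


def ranking(values):
    """Return player names ordered from most to least distance."""
    return sorted(values, key=values.get, reverse=True)


# Systematic error: every reading is inflated by the same (assumed) 5%.
systematic_km = {name: km * 1.05 for name, km in true_km.items()}

# Inconsistent error: each reading is off by a different (assumed) amount,
# as might happen when different people judge different events.
offsets_km = {"Player A": 0.6, "Player B": -0.3, "Player C": 0.0}
inconsistent_km = {name: km + offsets_km[name] for name, km in true_km.items()}

print(ranking(true_km))          # baseline ordering
print(ranking(systematic_km))    # same ordering as the baseline
print(ranking(inconsistent_km))  # Player A and Player B swap places
```

The absolute numbers are wrong in both cases, but only the inconsistent errors change who looks better than whom.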

The human-collected stats services are much more prone to the types of errors present in the Alves offside decision.  Looking at how Opta collects its stats, you can see that there is huge potential for human error and bias.

There are ways to combat some of these issues, like looking at inter-rater reliability, but how much of that is occurring? It’s a problem that can be managed, but it needs to be addressed.  AS apologized for their error, but only because it was obviously incorrect. The danger lies in situations where people don’t realize there could be a problem.
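As a rough illustration of what checking inter-rater reliability could look like, here is a minimal Python sketch of Cohen's kappa, one common agreement statistic for two raters. I don't know which measure, if any, the stats providers actually use, so treat the choice of kappa as an assumption. The function name `cohens_kappa` and the two coders' labels are hypothetical, invented for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same events.

    Kappa compares the observed agreement with the agreement expected
    by chance given how often each rater uses each label.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of events")

    n = len(rater_a)
    # Fraction of events on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two coders reviewing the same ten offside calls.
coder_1 = ["off", "off", "on", "on", "off", "on", "on", "off", "on", "on"]
coder_2 = ["off", "on", "on", "on", "off", "on", "off", "off", "on", "on"]

print(round(cohens_kappa(coder_1, coder_2), 3))  # 0.583
```

In this made-up example the coders agree on 8 of 10 calls, but because about half of that agreement would be expected by chance, kappa comes out noticeably lower than the raw 80%.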

One comment

  1. Chris Baker says:

    An interesting article, but ProZone is by no means automated.

    The coding is done in a very similar way to Opta.

    For companies with the full ProZone install (8 cameras around the stadium, I believe), fitness data on each of the players is gathered automatically, but as far as I know all the on-the-ball data is gathered the old-fashioned way!