I decided to write this blog post while thinking about a project idea I had: analyzing soccer statistics and making predictions. I ran into a problem, which is that soccer statistics in general are not great. The base statistics that everyone talks about and argues over are things like goals and assists, which are not a great measure of value by themselves. Obviously, the whole point of the game is to score, so goals are not entirely dismissible, but there is so much more that goes into the value of a player than just how many goals they score. I wanted statistics that go beyond the basic stats entirely, not just goals, assists, tackles, etc. Let us say an attacker makes a great run and drags a defender along with him, which opens space for another player to score. That first player would technically have no statistical impact on the game, but his move led to a goal. The idea of sports analytics in general is to capture those impacts to properly judge the value of players.
While reading, I came across the Shapley value, a mathematical concept from game theory that may be able to help with soccer statistics. But before getting into that, I want to explain what the Shapley value is. I think the best way to do this is with an example. Let’s say there are 3 people who are trying to get home, and we want to find a fair way to split the taxi fare based on how long each of them takes to get home.
We can see that person A gets home the fastest, and the ride would cost them $6; then comes person B, at $12; and finally person C, whose ride costs $42. Using the Shapley value, we can determine what each person should pay to make the split fair. We do that by looking at every order (permutation) in which the three of them could join a shared taxi. In each order, a person is charged their marginal contribution, meaning how much the fare goes up when they are added to the people already in the cab. The Shapley value for each person is then the average of their marginal contributions over all of those orders.
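To make the averaging concrete, here is a minimal sketch of the taxi example. It rests on one assumption that the text does not state outright: a shared taxi costs as much as the most expensive individual trip among its riders (everyone lives along the same route). The function names are mine, not from any library.

```python
from itertools import permutations

# Individual fares from the example above.
fares = {"A": 6, "B": 12, "C": 42}

def coalition_cost(riders):
    """Assumed cost of one shared taxi: the fare of the rider who goes farthest."""
    return max((fares[r] for r in riders), default=0)

def shapley_values(players, cost):
    """Average each player's marginal cost over every possible joining order."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        in_taxi = []
        for p in order:
            before = cost(in_taxi)
            in_taxi.append(p)
            # Marginal contribution: how much the fare rises when p joins.
            totals[p] += cost(in_taxi) - before
    return {p: totals[p] / len(orderings) for p in players}

taxi_shares = shapley_values(list(fares), coalition_cost)
print(taxi_shares)  # {'A': 2.0, 'B': 5.0, 'C': 35.0}
```

Under that assumption the three shares add up to the full $42 fare, and each person pays less than they would riding alone.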
You may be wondering why this bit of game theory matters to machine learning. The reason is that the Shapley value can really help with the interpretability of features in a model. So I am going to use something we as a class have done before, which is modeling housing prices. In the case below it’s an apartment, but it is the same concept.
Looking at the picture above, you can see the predictions for 2 different apartments using 3 features. The only difference is that one allows a cat and costs $320,000 and the other does not and costs $310,000. What the Shapley value does is take the data it has and try to explain the difference between the model’s output for one apartment and the mean of the predictions, by working out which features caused that difference. In the case above we only have 2 data points, so we take the mean of those 2 points, which is $315,000, and check the differences to find where the discrepancy is. For something like this it is simple. The only difference between the 2 is the cat, which means that allowing cats moves the prediction $5,000 above the mean and not allowing them moves it $5,000 below (a $10,000 gap between the two apartments). You can use this idea on a massive dataset with a ton of features. The only problem is that it takes very long as the number of features increases. The exact Shapley value looks at every single combination of features and their outputs to find the impact of each feature on the final answer, and with n features there are 2^n combinations.
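Here is a rough sketch of that calculation for the two apartments. The pricing function, the feature names other than the cat, and the dollar amounts inside the model are all made up so that the two predictions come out to $320,000 and $310,000; this is not a real model. A feature that is "left out" of a combination is filled in from the background data and the predictions are averaged, which is one common way to define the value of a combination.

```python
from itertools import combinations
from math import factorial

def toy_price_model(apartment):
    """Hypothetical pricing rule, invented purely for illustration."""
    price = 250_000 + 1_000 * apartment["size_m2"]
    if apartment["near_park"]:
        price += 10_000
    if apartment["cat_allowed"]:
        price += 10_000
    return price

# The two apartments from the example (feature names are assumptions).
background = [
    {"size_m2": 50, "near_park": True, "cat_allowed": True},   # $320,000
    {"size_m2": 50, "near_park": True, "cat_allowed": False},  # $310,000
]

def coalition_value(instance, known_features):
    """Mean prediction when only `known_features` come from the instance
    and every other feature is taken from each background row in turn."""
    predictions = []
    for row in background:
        mixed = dict(row)
        for name in known_features:
            mixed[name] = instance[name]
        predictions.append(toy_price_model(mixed))
    return sum(predictions) / len(predictions)

def feature_shapley(instance):
    """Exact Shapley values: weighted marginal gains over all feature subsets."""
    names = list(instance)
    n = len(names)
    result = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                gain = (coalition_value(instance, subset + (name,))
                        - coalition_value(instance, subset))
                total += weight * gain
        result[name] = total
    return result

contributions = feature_shapley(background[0])
print(contributions)  # cat_allowed is about 5000, the other two are 0
```

The nested loop over subsets is where the cost comes from: every extra feature doubles the number of combinations to evaluate.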
So now we get back to our original question: how can we use this for soccer statistics? The idea is simple. Just like we can use it in models to find the value of a specific feature, we can use it for players on a team to see which players contribute the most to goal difference. Take Barcelona, for example. Most people can tell you Messi is the most important player on the team, and don’t get me wrong, I agree, but the question is why, outside of just goals and assists. Using the Shapley value, we can take a bunch of combinations of the 11 players on the field and see what the goal difference of each combination is. So if the goal difference whenever Messi is on the field is 1.2 and whenever he is not on the field it is -0.3, then we can say that Messi has a very large impact on the game. We can do this for all types of players, and it would be most useful for finding players who have a very strong impact where the stats don’t show it.
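As a sketch of what "a bunch of combinations" could look like in code, the example below estimates Shapley values by sampling random orderings of players instead of enumerating all of them (11 players already have 11! = 39,916,800 orderings). Everything about the squad is hypothetical: the four positions, the numbers, and the `lineup_value` function stand in for an expected goal difference that would really have to be estimated from match data.

```python
import random

# Made-up solo impacts on goal difference (not real data).
solo_impact = {"striker": 0.6, "playmaker": 0.3, "defender": 0.2, "keeper": 0.1}
SYNERGY = 0.4  # assumed bonus when striker and playmaker are on the field together

def lineup_value(lineup):
    """Hypothetical expected goal difference for a set of players."""
    value = sum(solo_impact[p] for p in lineup)
    if "striker" in lineup and "playmaker" in lineup:
        value += SYNERGY
    return value

def sampled_shapley(players, value, n_samples=2000, seed=0):
    """Monte Carlo estimate: average marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = list(players)
        rng.shuffle(order)
        on_field = set()
        previous = value(on_field)
        for p in order:
            on_field.add(p)
            current = value(on_field)
            totals[p] += current - previous  # what p adds to this lineup
            previous = current
    return {p: totals[p] / n_samples for p in players}

player_values = sampled_shapley(list(solo_impact), lineup_value)
print(player_values)
```

In this toy setup the synergy bonus ends up split roughly evenly between the striker and the playmaker, which is the kind of shared credit that goals and assists alone would not show.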
The idea is good in theory and could give us a metric for the value of players. But the issue, most importantly for soccer, is variance. Soccer, unlike other sports, has minimal substitutions: only 3 players are allowed to sub in during a game. It also has a very low scoring rate. These 2 factors make it very hard to observe enough different combinations of players to compute the Shapley value. What I mean by this is, say Barca win their games by an average of 1 goal, which is very good. In a season where they play 40 games, if Messi misses only 1 game and in that game they win by 4 because the team they played was not good, then the Shapley value for Messi’s impact could decrease when in reality it was likely just 1 anomalous game. This is obviously an issue in sports prediction in general, but especially so because of the variance in soccer. This is why 2 things are important. Firstly, getting as much data as possible: having more data points will always help reduce variance in models. Secondly, we could add a penalization method similar to what is used in ridge regression. This would help smooth out our estimates of player values and likely help avoid overfitting to wild variance in the data. Then, when applying the Shapley value to get a value for each player, we can hopefully avoid being misled by bad data.
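One simple way to picture the penalization idea is a ridge-style shrunken average, sketched below. It is only a one-parameter analogue of ridge regression, not a full model: the estimate is pulled toward a prior value, and the fewer games we have, the harder it is pulled. The penalty of 5 is an arbitrary choice for illustration, and the prior of 1 goal is the team's typical winning margin from the example above.

```python
def shrunk_mean(values, prior_mean, penalty):
    """Minimizes sum((v - m)**2) + penalty * (m - prior_mean)**2 over m."""
    n = len(values)
    return (sum(values) + penalty * prior_mean) / (n + penalty)

# The single game Messi missed, which the team won by 4.
games_without_messi = [4]

raw_without = sum(games_without_messi) / len(games_without_messi)
shrunk_without = shrunk_mean(games_without_messi, prior_mean=1.0, penalty=5.0)

print(raw_without)     # 4.0
print(shrunk_without)  # 1.5
```

With one anomalous game, the raw average says the team is 4 goals better than its opponents without Messi, while the shrunken estimate stays close to the usual margin until more games back it up.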