Basics of Probability and Linear Regression in Sports Betting

Antonio Hila
Apr 26, 2020

Sports betting has always been a big part of sports, whether it’s just between friends or betting on Vegas lines. I’m not a big gambler myself, but I always like to predict which teams will win against the spread. Before I get into that, here is some background on Vegas lines for those who don’t know. The spread is the margin by which Vegas predicts one team will beat another. So if the Knicks are favored (just an example, obviously) by 3.5 points, then to win back the same amount of money you put in, the Knicks would have to win the game by more than 3.5 points. Knowing that, I have always wondered: how does Vegas make its lines so well? It’s very hard to guess who is going to beat the Vegas spread because the lines are so well set. It was not always like this, but with the boom in data science the numbers and lines have continued to get better, and oddsmakers use linear regression to do it. They use data they have collected about a team from games it has played that season or in prior seasons (if the team is similar) to judge how it will do in the future.
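To make the spread rule concrete, here is a minimal Python sketch. The function name `covers_spread` and the final scores are made up for illustration; nothing here comes from an actual sportsbook.

```python
def covers_spread(favorite_score, underdog_score, spread):
    """Return True if the favorite won by more than the spread."""
    return (favorite_score - underdog_score) > spread


# Hypothetical final scores with the Knicks favored by 3.5 points.
print(covers_spread(105, 100, 3.5))  # Knicks win by 5 -> True
print(covers_spread(103, 100, 3.5))  # Knicks win by 3 -> False
```

Note that in the second case the Knicks still win the game, but a bet on them against the spread loses.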

Creating predictions

Creating predictions in other forms of betting is simple. If you want to guess exactly which card is coming next from a full deck, you know you have a 1/52 chance. But if you want to know which team will win a certain game, it’s a lot less black and white. That’s where linear regression and probability come in. The most basic method is to use each team’s current win percentage as the model. So if team A has won 50% of its games and team B has won 55%, then you would pick team B. Obviously it’s not that simple, because there is much more to a game than which team has won more in the past.
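As a rough sketch of these two ideas in Python: the card probability is exact, while `pick_by_win_pct` is a hypothetical helper name for the naive win-percentage baseline described above.

```python
from fractions import Fraction

# Chance of naming the exact next card from a full 52-card deck.
next_card_chance = Fraction(1, 52)


def pick_by_win_pct(win_pct_a, win_pct_b):
    """Naive baseline: pick whichever team has the higher win percentage."""
    if win_pct_a == win_pct_b:
        return "toss-up"
    return "A" if win_pct_a > win_pct_b else "B"


print(next_card_chance)             # 1/52
print(pick_by_win_pct(0.50, 0.55))  # B
```

The baseline ignores everything except past record, which is exactly why it is only a starting point.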

The base model that is most commonly used is logistic regression analysis. It outputs a probability for a given outcome (here, a win), and for sports the input is MOV, the margin of victory. Margin of victory has long been considered one of the best indicators of a good team. It has been adapted further into net rating, which is a team’s margin of victory per 100 possessions, so teams that play faster or slower get rated on the same scale.
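Here is a minimal sketch of both ideas, assuming a one-variable logistic model fit by plain gradient descent. The data points are invented for illustration, and using the net-rating gap between two teams as the input is my assumption, not necessarily what oddsmakers do.

```python
import math


def net_rating(points_for, points_against, possessions):
    """Point differential per 100 possessions."""
    return 100 * (points_for - points_against) / possessions


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.05, steps=2000):
    """Fit P(win) = sigmoid(a + b*x) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(a + b * x) - y for x, y in zip(xs, ys)]
        a -= lr * sum(errors) / n
        b -= lr * sum(e * x for e, x in zip(errors, xs)) / n
    return a, b


# Made-up data: net-rating gap before a game, and whether the team won (1) or lost (0).
rating_gaps = [-9, -6, -4, -2, -1, 1, 2, 4, 6, 9]
won = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_logistic(rating_gaps, won)
print(net_rating(1100, 1000, 1000))  # 10.0
print(sigmoid(a + b * 5))            # estimated win probability at a +5 gap
```

The fitted curve turns a single number into a win probability, which is all this basic model can do.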

The problem with the logistic regression system as described here is that it only accounts for one variable, margin of victory. But an actual game depends on many variables, such as matchups, players in the game, players who might be injured, history between the two teams prior to the season, home-field advantage, etc. A lot of the time these things have an impact on the game, and finding out how big that impact is what the oddsmakers are trying to do. The method used to cover this is multiple regression analysis. As the name suggests, it uses multiple pieces of information about a team’s past to estimate how they will affect outcomes in the future. To use multiple regression analysis you need a dependent variable you are trying to predict and multiple independent variables that relate to the dependent variable. The information that someone passes to the regression model might be the teams’ current win percentages, their record against each other, and the point differential. Using the new model with multiple parameters, you can get a much better sense of the data and of who will win.
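A minimal sketch of multiple regression, assuming ordinary least squares solved through the normal equations in plain Python. The three features mirror the ones named above, but every number in `games` is invented and the helper names `fit_ols` and `predict_margin` are hypothetical.

```python
def fit_ols(rows, targets):
    """Least-squares fit: solve (X^T X) beta = X^T y with an intercept column."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict_margin(coefs, features):
    """Intercept plus the weighted sum of the features."""
    return coefs[0] + sum(c * f for c, f in zip(coefs[1:], features))


# Made-up rows: (win % gap, head-to-head win gap, point differential gap) -> margin.
games = [
    ((0.10, 1, 3.0), 6), ((-0.05, -1, -1.5), -4), ((0.20, 2, 5.5), 11),
    ((-0.15, 0, -4.0), -7), ((0.00, 1, 0.5), 2), ((0.05, -2, 1.0), -1),
    ((-0.10, 1, -2.5), -3), ((0.15, 0, 4.5), 8),
]
coefs = fit_ols([g[0] for g in games], [g[1] for g in games])
predicted_margin = predict_margin(coefs, (0.05, 1, 2.0))
print(predicted_margin)  # predicted margin for a hypothetical upcoming game
```

Here the dependent variable is the margin of victory, and the predicted margin plays the role of a candidate spread.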

But there are problems with multiple regression analysis. One large problem is that when analyzing the data, it is important to know whether the variables being used reflect correlation or causation. The multiple regression model needs the independent variables to genuinely relate to the dependent variable. If the user feeds in a variable that correlates with the dependent variable but doesn’t actually have a large effect on the outcome, it can throw off the prediction model.

Take for example a matchup between the Celtics and the Lakers from earlier this year. In this game Kemba Walker of the Boston Celtics was playing against LeBron James of the Lakers. Before this game LeBron had a 28–0 record against Kemba in their career matchups. Looking at that record at face value would lead almost anyone to believe it is a very important stat, one you would add to a multiple regression model for the game. And honestly, even beyond face value it’s a little hard to argue with, because a 28-game sample size is fairly large in the NBA. But when you dig into it, you realize that for almost all of those games Kemba Walker was on a much worse Charlotte Bobcats/Hornets team that would be a heavy underdog against whatever team LeBron was on. So now that he’s on a much better team, the record may have less of an impact on the predicted result than one would first have thought. These are the types of things oddsmakers look out for when using regression models for games. Every game is different and everything has an effect on the game, no matter how small; the model just takes the most important information and tries to get a fair number out of it that betting companies can then use as the spread for a game.
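The Kemba and LeBron point can be illustrated with a toy simulation. This is only a sketch under strong assumptions: the result of each game depends solely on the gap in team strength, the logistic link and its `scale` value are invented, and there is no player-versus-player effect at all.

```python
import math
import random


def win_prob(strength_gap, scale=0.15):
    """Assumed logistic link from team-strength gap (in points) to win probability."""
    return 1.0 / (1.0 + math.exp(-scale * strength_gap))


def simulate_win_rate(strength_gap, n_games, rng):
    """Share of simulated games won by a team with the given strength gap."""
    wins = sum(rng.random() < win_prob(strength_gap) for _ in range(n_games))
    return wins / n_games


rng = random.Random(0)  # fixed seed so the run is repeatable

# Player's old team is 8 points weaker than the opponent (hypothetical gap).
weak_team_rate = simulate_win_rate(-8, 10_000, rng)
# Same player, now on a team of equal strength.
even_team_rate = simulate_win_rate(0, 10_000, rng)

print(round(weak_team_rate, 2), round(even_team_rate, 2))
```

In this setup the lopsided head-to-head record comes entirely from team quality, so it says little about the matchup once the teams are evenly matched.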

Closing thoughts

As with almost anything in life, these predictions can always get better, producing even more balanced lines in the future. A lot of the time it’s when the betting companies lose money on a certain type of bet that they begin to adapt and improve. People are always trying to find variables that can be used to get more accurate predictions. But in the end there is only so far they can go, as sports have a variability to them that is, at least in my opinion, impossible to fully predict.
