March Madness Analysis Part 2

Wow, the tournament has been absolutely wild so far, with exciting games and ruined brackets everywhere. It was an exciting year for me: I had a perfect bracket for all of an hour and a half, then the first game ended and with it my perfect bracket dream. So what can we pull out of the first two rounds statistically?

As I said in Part 1, sports in general are very hard to predict because they are played by people, not robots, and people are impossible to predict. But millions of people try anyway, myself included. This weekend was probably the best example of why it is so hard. Of the millions of brackets that were sent in, only 21 were still perfect after the first 16 games, and that was down to 0 by the time the UConn game was over. This year has the highest average seed ever going into the Sweet 16, at 5.88. The question is why this year is so crazy in college basketball. My best answer is that the year is crazy everywhere.

Covid-19 has had an effect on all of these teams in some way. Players caught it and missed games, or felt aftereffects that are impossible to quantify. All the teams were forced into a bubble, which some teams would handle better than others. There were no large crowds at the games, which again is hard to quantify. Some players were dealing with it more generally, through a family member and so on. Whatever the case, I think all of this will affect these games in ways that are hard to evaluate. Nonetheless, I tried to do it anyway with my similarity matrix in the first edition. Now, having seen how crazy the games were, I wanted to see how effective the similarity matrix was before diving into modeling for next week.

First, I wanted to go over the two similarity matrices I generated and visualize them using Tableau.

For reference:

  • Championship = 7
  • 2nd Place = 6
  • Final 4 = 5
  • Elite 8 = 4
  • Sweet 16 = 3
  • Round of 32 = 2
  • Round of 64 = 1

The first one above, in my opinion the weaker of the two, compares each team from this season to its most similar team (by Euclidean distance) from all seasons between 2013 and 2019. As you can see, the predicted finishes had very little variance, which made me skeptical. The problem with this data is a massive imbalance between the number of teams at the high end and the number at the low end. Since 32 teams lose in round 1 every year, the dataset has 222 teams that lost in round 1 compared to only 28 that made the Final 4 or better. So chances are that most teams will fall into the first-round exit category, which is a bad representation of the overall strength of the teams. It should work better for modeling, where I can balance the dataset across some different models.
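To make the nearest-team idea concrete, here is a minimal sketch of how a lookup like this could work. It is not my exact pipeline: the two stat columns, the toy numbers, and the function name `nearest_historical_finish` are made up for illustration, and standardizing each stat before measuring distance is an assumption so that no single column dominates.

```python
import numpy as np

def nearest_historical_finish(history_stats, history_finish, current_stats):
    """For each current team, return the finish code of the closest
    historical team by Euclidean distance, plus that team's row index."""
    # Assumption: put every stat on a comparable scale using the historical data.
    mean = history_stats.mean(axis=0)
    std = history_stats.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    hist_z = (history_stats - mean) / std
    cur_z = (current_stats - mean) / std

    # Pairwise distances, shape (n_current_teams, n_historical_teams).
    diffs = cur_z[:, None, :] - hist_z[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    nearest = dists.argmin(axis=1)
    return history_finish[nearest], nearest

# Toy, made-up data: columns are points per game and field goal percentage.
history_stats = np.array([[70.0, 0.45], [80.0, 0.50], [60.0, 0.40]])
history_finish = np.array([1, 7, 1])  # 1 = Round of 64, 7 = Championship
current_stats = np.array([[79.0, 0.49]])

matched_finish, matched_index = nearest_historical_finish(
    history_stats, history_finish, current_stats
)
print(matched_finish, matched_index)
```

Because each team inherits the finish of a single historical team, a label set dominated by first-round exits will pull most predictions toward a code of 1.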

The second matrix, which I think is the more effective one, compares each active team to the average statistics of teams from 2013–2019 grouped by their postseason finish. That means averaging all teams who finished in the Round of 64, the Round of 32, and so on, all the way up to the average of all the champions. This was my way of accounting for outliers like Virginia, a 1 seed that lost in round 1: without the averaging, a 1 seed from this year could match that single team and be predicted to lose in the Round of 64. This one gave me results that were at least more reasonable.
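A rough sketch of that averaging approach is below, again with made-up numbers and hypothetical names (`finish_centroids`, `closest_finish_profile`). It assumes the stats are already on comparable scales, which this toy example does not enforce. The third historical row plays the role of the Virginia-style outlier: strong stats, but a round 1 exit.

```python
import numpy as np

def finish_centroids(history_stats, history_finish):
    """Average the stats of all historical teams sharing a finish code."""
    codes = np.unique(history_finish)
    centroids = np.vstack(
        [history_stats[history_finish == code].mean(axis=0) for code in codes]
    )
    return codes, centroids

def closest_finish_profile(history_stats, history_finish, current_stats):
    """Assign each current team the finish code whose average profile is closest."""
    codes, centroids = finish_centroids(history_stats, history_finish)
    # Distances, shape (n_current_teams, n_finish_codes).
    diffs = current_stats[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return codes[dists.argmin(axis=1)], dists

# Toy, made-up data: columns are points per game and field goal percentage.
history_stats = np.array([
    [60.0, 0.40],
    [62.0, 0.41],
    [85.0, 0.52],  # strong team that was upset in round 1
    [84.0, 0.51],
    [86.0, 0.53],
])
history_finish = np.array([1, 1, 1, 7, 7])  # 1 = Round of 64, 7 = Championship
current_stats = np.array([[85.2, 0.52]])

predicted, dists = closest_finish_profile(history_stats, history_finish, current_stats)
print(predicted)
```

In this toy data the single closest historical team is the upset victim, so a nearest-team lookup would predict a round 1 exit, while the averaged profiles put the same team closest to the champions.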

So where did I go wrong and where did I go right compared to the actual results? I'm going to start with the good because there is so little of it. Loyola Chicago was by far the best result, in both matrices actually. For whatever reason, they match very strongly to the profile of a team that finishes 2nd. They clearly have some strong statistics that didn't show up in their regular season record, because they were put in as an 8 seed. But they have played to the level of a finalist, beating one of the more highly rated teams in Illinois. The other team that has done well is Creighton. I take that one with a grain of salt, since they have only beaten a 12 seed and a 13 seed, which according to the data they should beat, even if that 13 seed had beaten a highly ranked 4 seed, which we will get to next.

So where did it do poorly? There are 3 bad picks here. The first is Ohio State, a 2 seed that lost to a 15. The one point of credit I will give the matrix is that it rated them a bit lower than the teams around them, so it did see something in the stats that it didn't like, but nonetheless a loss to a 15 seed is a hard one to take. The second is Iowa, which by most statistics should have at least reached the Elite 8 matchup against Gonzaga, since no team in their path other than Gonzaga rated highly. But they lost to Oregon in round 2. The last one is the most easily explained: Virginia. The problem with Virginia was injury. They lost their best player going into the tournament, so whatever the stats said about them beforehand was out the window.

So yeah, this was definitely not the best predictive method I could use, but in the end there were some positives in it, especially with Loyola Chicago but also with teams like Kansas, Texas and Ohio State, who were all rated lower than most of their peers in the same seeding group. Doing better would require taking into account outside factors beyond just the stats. For the next part, I hope to build some models that give me another set of predictions from the data. I also want to go over some more of the games and see whether teams like Loyola Chicago and Creighton can live up to their rating in the similarity matrix.

https://www.linkedin.com/in/antonio-hila/