March Madness Analysis Pt. 1

March Madness: so many people’s favorite time of the year. Everyone who enjoys college basketball, and even some who don’t, has something to root for. Maybe it’s the great underdog stories, maybe it’s your alma mater, or maybe, if you’re like me, it’s trying to beat your friends in the March Madness bracket challenge without having really watched any college basketball all year. Not to worry, that’s where data analysis comes into play. That’s the route I took, and I’m going to walk through it over a series of posts here.

First, some quick background: I’m just some guy trying to get a tiny bit of an edge. The odds of a perfect bracket are about 1 in 9.2 quintillion if you pick every game at random, or about 1 in 120 billion if you’re knowledgeable about college basketball, which I am not. What I do know is basketball in general, and data analysis.
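For anyone curious where the 9.2 quintillion comes from, here is a quick sketch of the arithmetic. It assumes the usual 63-game bracket (the First Four play-in games are not counted) and treats every game as a coin flip. The variable names are my own.

```python
# A 64-team single-elimination bracket has 63 games, each with 2 possible winners.
GAMES_IN_BRACKET = 63  # assumption: First Four play-in games are not part of the bracket

# Number of distinct ways to fill out the bracket if every game is a coin flip.
coin_flip_brackets = 2 ** GAMES_IN_BRACKET

print(f"1 in {coin_flip_brackets:,}")  # 1 in 9,223,372,036,854,775,808
```

The 1 in 120 billion figure is a separate estimate for knowledgeable fans and does not fall out of this calculation.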

I took multiple approaches throughout all this to figure out what I could do with the data I had. The data comes from Kaggle: the College Basketball Dataset by Andrew Sundberg. It covers teams from 2013 to the present, with their stats, their tournament seed, how far they went, and so on. That was the primary data source for my analysis.

The first thing I wanted was a general similarity matrix. The basic idea is to take every team that made the NCAA tournament from 2013–2019 (2020 is excluded because there was no tournament). Using SQLite3, I organized the data and grouped the teams by where they finished in the tournament. I ignored the four teams eliminated in the First Four play-in games, since they have no bearing on the actual bracket I’m targeting. I then averaged every stat within each group. That is one side of the similarity matrix: 7 rows, one per postseason finish, each holding the average of each stat for that finish.
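To make the grouping step concrete, here is a minimal sketch of that kind of query using Python’s built-in sqlite3 module. It is not my actual project code: the table name `teams`, the column names, the `R68` label for First Four exits, and every number in the rows are placeholders made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical schema; the real dataset has many more stats.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE teams (team TEXT, year INTEGER, postseason TEXT, adj_oe REAL, adj_de REAL)"
)

# Placeholder rows, NOT real team data.
rows = [
    ("Team A", 2018, "R68", 100.0, 100.0),  # First Four exit, excluded below
    ("Team B", 2018, "R64", 108.0, 98.0),
    ("Team C", 2019, "R64", 110.0, 96.0),
    ("Team D", 2018, "R32", 112.0, 95.0),
    ("Team E", 2019, "R32", 114.0, 93.0),
    ("Team F", 2021, None, 120.0, 90.0),    # current team, no finish yet
]
conn.executemany("INSERT INTO teams VALUES (?, ?, ?, ?, ?)", rows)

# Average each stat per postseason finish, skipping First Four exits
# and anything outside the 2013-2019 tournaments.
query = """
SELECT postseason,
       AVG(adj_oe) AS avg_adj_oe,
       AVG(adj_de) AS avg_adj_de,
       COUNT(*)    AS n_teams
FROM teams
WHERE year BETWEEN 2013 AND 2019
  AND postseason IS NOT NULL
  AND postseason <> 'R68'
GROUP BY postseason
ORDER BY postseason
"""
profiles = conn.execute(query).fetchall()

for finish, avg_oe, avg_de, n_teams in profiles:
    print(finish, avg_oe, avg_de, n_teams)
```

Keeping the team count next to each average is handy, because it shows at a glance how few teams sit behind the later rounds.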

This gives me a good baseline to compare the current teams against, and those teams are the other side of the matrix. My idea is simple: for every team in the current tournament, which postseason finish most closely resembles them? Is this a perfect measure of a team? Definitely not; there are so many more factors that go into it. Also, postseason ranks 5, 6 and 7, which correspond to the Final Four or better, have far fewer teams behind them than the earlier rounds, so those averages may be less reliable. But this is my starting point, which I will later try to back up or refute.
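The matching idea can be sketched in a few lines. This assumes plain Euclidean distance over two stats as the measure of closeness; the team names, profile values, and the helper `closest_finish` are all hypothetical and only there to show the shape of the comparison.

```python
import math

# Hypothetical average profiles: postseason finish -> (offense, defense).
# Placeholder numbers, not real averages.
profiles = {
    "R64": (109.0, 97.0),
    "R32": (113.0, 94.0),
    "S16": (115.0, 92.0),
}

# Hypothetical current-tournament teams with the same two stats.
current_teams = {
    "Team X": (112.0, 95.0),
    "Team Y": (108.0, 98.0),
}

def closest_finish(stats, profiles):
    """Return the finish whose average profile is nearest to this team's stats."""
    return min(profiles, key=lambda finish: math.dist(stats, profiles[finish]))

matches = {team: closest_finish(stats, profiles) for team, stats in current_teams.items()}
print(matches)
```

One caveat with this kind of comparison: stats measured on different scales contribute unequally to the distance, so putting them on a common scale first may be worth considering.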

For the similarity I used a score based on Euclidean distance, with values running from 0 to 1. When I ran my final code, I found that no match was especially strong. The highest score was Texas matched to a Round of 32 team (which is odd, as college basketball analysts generally rate them very highly), at .1325. So while this gave me some interesting insights, the gaps between a team’s scores for different finishes were sometimes negligible. For example, Virginia matched most closely to a championship-game loser, aka the 2nd best team in the tournament. But their score as a first-round exit and their score as the 2nd best team differed by only .006.
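Here is one way a 0-to-1 score can be built from Euclidean distance, along with a check on how close the top two matches are. The formula 1 / (1 + distance) is an assumption for this sketch (it is a common choice, but not necessarily the exact one in my code), and the profiles and stats are made-up placeholders.

```python
import math

def similarity(a, b):
    """Map Euclidean distance to (0, 1]: identical points score 1, far points approach 0.
    Assumed formula for illustration: 1 / (1 + distance)."""
    return 1.0 / (1.0 + math.dist(a, b))

def ranked_matches(stats, profiles):
    """Score a team against every profile, best match first."""
    scores = {finish: similarity(stats, profile) for finish, profile in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Placeholder profiles and team stats, NOT real data.
profiles = {
    "R64": (3.0, 4.0),
    "2ND": (3.0, 3.9),
    "Champions": (30.0, 40.0),
}
team_stats = (0.0, 0.0)

ranking = ranked_matches(team_stats, profiles)
best, runner_up = ranking[0], ranking[1]
margin = best[1] - runner_up[1]

print(f"best match: {best[0]} ({best[1]:.4f})")
print(f"runner-up:  {runner_up[0]} ({runner_up[1]:.4f})")
print(f"margin:     {margin:.4f}")
```

Looking at the margin rather than only the top label makes near-ties like the Virginia case easy to spot: when the margin is tiny, the "best" match is not telling you much.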

So while this is nice to see and gives me some fun information, like Loyola Chicago, an 8 seed, most closely matching a Final Four team, I don’t think it’s the best evaluator for the data yet.

For more information, including a look at the actual data, please feel free to check out the GitHub repo for this mini project. Obviously this won’t affect my bracket much this year, since I’ll be updating and working on it during the tournament, but I want to use it as a guide to see how well I can predict the winners, or at least the teams that may succeed in the tournament, using data. Thanks to everyone for reading. I’m hoping to update this weekly (and the GitHub more frequently).

https://www.linkedin.com/in/antonio-hila/