Final 4 time in the NCAA Tournament and my next step was to create a model to give me another set of predictions for this year's tournament and see how it does with the data I have. The last part will come next week, after the tournament ends, where I will try to find the best model, in some cases using fewer statistics based on feature analysis, and then tune the best parameters for that model. For this part, though, I just wanted to get a baseline model down.
For my model I went with XGBoost. It is a strong model in general, but the main reason I chose it is that it handles class imbalance well, which is a big problem with the current data. As explained previously, I took stats from all the teams over the last 7 years of tournaments. That meant I ended up with 7 champions and 7 runners-up but 224 teams who lost in the Round of 64. That large class imbalance would make it harder to use a different model without adjusting its parameters.
After creating my model I trained it on the data from those 7 years of tournaments. When applying the model back to that entire 7-year set, the results looked OK. I know this doesn't show much, since the model was trained on most of that data, but I just wanted to see whether the distribution of predictions for each round was reasonable, and for the most part it was.
Then came time to test it on new data: this year's results. Before I get into individual teams' results, we can see that the distribution, while not perfect by any means, is not bad. The distribution should go as follows: 36 Round of 64 teams (the data set was built before the First Four games were completed, so the original field of 68 is used), 16 Round of 32 teams, then 8, 4, 2, 1 runner-up, and 1 champion. Almost all of these were fairly close. The biggest discrepancies were the Sweet 16 teams, with only 5 predicted instead of 8, and the fact that no team was predicted as the runner-up.
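The bracket math above can be written down as a quick sanity check for any set of predictions. The stage names here are just my own labels, not the data set's:

```python
# Expected number of teams whose tournament run ends at each stage,
# starting from the full 68-team field (First Four losers counted
# among the Round of 64 exits, as described above).
expected = {
    "Round of 64": 36,   # 4 First Four losers + 32 Round of 64 losers
    "Round of 32": 16,
    "Sweet 16": 8,
    "Elite 8": 4,
    "Final Four": 2,
    "Runner-up": 1,
    "Champion": 1,
}

# Every team exits at exactly one stage, so the counts must cover the field.
assert sum(expected.values()) == 68
```

Comparing a model's predicted counts against this dictionary is how the Sweet 16 shortfall (5 predicted vs. 8 expected) and the missing runner-up show up immediately.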
Then we get to the actual distribution by team, and we can see a fairly appropriate curve in the data: most of the lower seeds are knocked out early, and most of the higher seeds make it further. Some further analysis is needed on why teams like Michigan, who reached the Elite 8, didn't perform well in the model. One thing this model and the similarity matrix from part 1 agree on is Gonzaga. They are by far the best performer in the model: no one else even lands in the runner-up spot, and only one other team, Houston (who did in fact make the Final 4), lands in the Final 4 spot.
Gonzaga have been great all season: they rank highly on both sides of the ball, especially offensively, and went undefeated, so they aren't a surprise. Houston is a bit of a surprise, as they were a very strong defensive team but many analysts had them as one of the higher seeds that wouldn't do well. Houston has done well, beating everyone in front of them, but in fairness to the college analysts, one argument against Houston is that they haven't had to beat a single-digit seed yet in the tournament. And even though those teams played well in their tournament games to get to Houston, they were seeded as double-digit teams for a reason.
In the end I think the model is definitely lacking the nuance to separate these teams, even if it is doing OK for the most part. In the next part I will work toward a better-performing model with more readable analysis of the team rankings, analyzing feature importance to find which statistics have the largest effect on the rankings. This will help me better understand what statistics should be taken into account when analyzing these teams for future tournaments.