So the NCAA tournament has rolled around again and I have entered into a small pool with some family and friends (its a tradition in my family). However there are two things that set this year’s tournament apart from others.
1. Binghamton made the big dance. It was pretty cool to see us play on national television, but we had an unfortunate first round matchup against Duke, who almost always makes it to my final four. Regardless of my school spirit, I had to go against them.
2. I wrote an algorithm to help me choose teams. I got a bunch of statistics on all division 1 teams from the NCAA website (they provide 16 stats, like points scored, points against, average margin of victory, average turn overs, etc) and was planning on using those statistics and a schedule of win/losses for the year to create a Naieve Bayes Classifier. The basic idea is I train the classifier on all of the regular season games, where each data point is a vector of pairs, where each pair is the ith statistic for each team. So for Duke vs. Binghamton it would have been a data point like this
< (ppg for Duke, ppg for Bing), (apg for Duke, apg for Bing)….>
and then include a win or lose label for each point. Based on that training data, for the games in the tournament, I create the data point and then see if that point is more likely given a win (this is the bayesian part of it) or more likely given a loss. I choose the label which maximizes the probability of that data point. There would have been some more details to work out, but that is the basic idea. But only after I had written all the code to parse the statistic files, I found that I could not find a schedule for the season in plain text, so I couldn’t parse that information and making everything I did pretty much useless.
Being that I had already collected all the statistics for each team, I decided that I would simply cluster all the teams using k-means just to get an idea of what teams were statistically similar. Then using those clusters and my limited knowledge of the NCAA teams, I could infer the quality of a team. For instance, I know that Duke is always a good team (and were also ranked #2), so I can reasonably infer that teams in the same cluster as them will have similar stats and will probably be good teams as well. There are a number of flaws in that reasoning, but I did this all in an afternoon and this algorithm was meant to provide basis to my picks, which would otherwise be arbitrary. So how did this work out? I set k = 33 since there are roughly 330 division 1 teams and I figured that if it were evenly distributed, 10 teams per cluster was a good amount. Well 7/8 of the top ranked teams in the tournament were placed in the same cluster (UConn being the odd one out), along with some other teams in the tournament, notably gonzaga and LSU. Once I saw the resulting clusters, I ordered them based on their euclidean distance from the “best” cluster, or the one that contained all the top seeds. Then, when making picks, I would choose the team that was in a higher-ranked cluster.
So what are the problems with this algorithm? Most obviously, I have no way to choose between teams in the same clusters, which is a problem since I should expect that the later rounds will come down to teams in the same cluster (assuming that a 1 or 2 seed will be the winner of each regional conference, this algorithm is useless in choosing final four games). The next problem, which is less obvious but more important, is that a cluster being closer to the best cluster does not actually imply that is a better cluster. For instance, lets assume that we only consider average points scored and average points allowed. Lets say the best cluster has an average of 80 points scored per game and lets in an average of 70. Well a cluster whose median team scored 75 points per game and allows 75 points per game will be “closer” than a cluster whose median team scores 90 points a game and allows 60 points against. Clearly, the second cluster would be a better choice for winning, but the first cluster would be ranked higher since it has a smaller euclidean distance. I haven’t yet thought of how to fix this problem, but I assume its possible if I used a different distance metric. Of course, this whole algorithm is extremely ad hoc and based upon weak theoretical foundations, so trying to fix just that problem would be like rearranging deck chairs on the titanic.
So, in practical terms, how is this algorithm doing? My algorithm didn’t choose too many upsets (an upset being defined as a higher seed beating a lower seed, so a 9 seed beating an 8 seed), but about as many as a human would. Well I am near last in my bracket, but there were some true upsets and there were two games that my algorithm choose a 12 seed over a 5 seed and got it right (I choose Wisconsin over Florida St and West Kentucky over Illinois). Many people had those 5 seeds advancing to the sweet sixteen, so I will make up some points there hopefully. In terms of the 8-9 games, I got two right and two wrong, which is about what most people did. It also has some irony that I did no better than flipping coins since I created this algorithm for the purposes of choosing games like that. The big triumph of my algorithm will be if LSU beats UNC, since an 8 seed would knock out a favorite to win the entire tournament. Looking at it after I made that choice, it might make some sense that LSU has a chance since UNC is an entirely offense based team (averaging about 90 points a game), but doesn’t seem to do too much on defense (let’s in about 72 points again which puts them somewhere in 200’s in terms of defensive rankings).
Whether this whole thing turns out good or bad, it was a fun project to do for an afternoon, and I plan on expanding the project for the next season. One fundamental difference between my algorithm and others that I have seen is that my algorithm doesn’t really look to take a chance factor into play. Since I always chose the better ranked team, there is not much difference between what I did and what someone who always chooses the better seeded team to win. So by following this algorithm, I basiscally assume that I have come up with a better ranking than the NCAA selection committee, which is an interesting idea. I feel that better way to do it would be to estimate probabilities of each team winning and then running a simulation, so that I can actually take into account the chance factor. There is also a lot I could do with making the statistics more meaningfully, such as weighting more recent statistics more heavily. Whatever I do, its a cool application for some stuff that I have learned and maybe I can make a research project out of it next year.
Let’s go LSU!