March Madness and ClusterLove Alpha

The NCAA men’s basketball tournament (lovingly called March Madness) has been great this year. There have been a ton of upsets, really great games, and one of the least likely set of Final Four teams in history (2 seed, 3 seed, 8 seed, 11 seed). It has been a lot of fun to watch.

And adding to the fun for me is a bracket-prediction algorithms competition I have going on with two of my labmates. We downloaded a bunch of statistics that we could find for all the NCAA teams and built statistical models that would predict the outcomes of each game. Part of the fun is being able to share what we did, so here is my write-up.

Name: ClusterLove Alpha

Why the Name: Because I use a technique called clustering, a variable called alpha, and the term clusterf*ck is too obscene.

The idea: In my mind, the NCAA tournament presents two challenges…

1. Compute the probability that team A beats team B. Pretty standard.

2. Given those probabilities, choose some sort of “optimal” bracket.

To explain how I computed the probabilities, imagine the following scenario: The only two teams in the entire league are Duke and UNC. They play each other 100 times per season and Duke won 70 of those times. Based on that information, I would guess that Duke has a 70% chance of winning their next game against UNC. Sounds reasonable, right? Ideally, this is how I would compute every probability but there are problems with this idea in reality. But this is at the heart of how I get the probabilities.

Now what if Duke and UNC had only played once and Duke won. Would we say that Duke has a 100% chance of winning in their next game? Definitely not, the sample size is simply not large enough to draw the conclusion. And in fact, very few of the college teams that will play each other during the tournament actually played each other during the regular season. So I will need some way to expand the sample size for each match-up.

Now imagine that Kentucky is playing UNC in the first round but they have never played each other before. However, Duke played UNC and won 70 out of 100 games. I also know that Duke and Kentucky are extremely similar teams. They score about as many points on average, they sink as many 3 point shots on average, they block the same number of shots, etc. Since Kentucky and Duke are so similar and Duke beats UNC with a 70% chance, I will say that Kentucky will beat UNC with a 70% chance. This is the second idea that is at the heart of my algorithm. I want to know what teams are very similar to each other, label them by arbitrary types (Kentucky and Duke would both be type 1 teams, UNC a type 2 team, etc) and then with that information, I will compute the above probabilities by saying “type 1 teams beat type 2 teams 70% of the time in the regular season.”

I used a technique called clustering (I used k-means specifically) to sort all the teams into different types so that any two teams with the same type were very statistically similar. The statistics that I used were the 16 aggregate stats the NCAA collects for all men’s basketball teams (you can find them here). In my program, I ended deciding that there were 14 different types of teams. I arrived at this number through trial-and-error (by comparing how well my predictions would have done during the regular season). Once my algorithm assigned each team a label, my program would go back through the regular season and compute how many times each type beat every other type. I would then apply that type data to compute the probability of any college team beating any other college team. For example, if I wanted to see compute the probability of Gonzaga beating Wisconsin, my algorithm would see that Gonzaga was type 3 and Wisconsin was type 10 and type 3 teams only beat type 10 teams 39% of the time during the regular season. Thus I would conclude that Gonzaga only had a 39% chance of beating Wisconsin.

The real magic in this part is trying to get the clustering algorithm to really work right. I am far from being good at this part and there is probably a lot I don’t quite understand yet. I have a feeling that I had too many useless variables that were mucking up the results of the clustering algorithm. There is also a problem in the distance metric I was using given the different types of data. For example, say Kentucky and Kansas were similar in every way except Kansas scored 3 more points on average per game. To you and me, that is pretty insignificant. Now say Kansas and Pitt were similar in every way except Kansas shot 80% from the field and Pitt shot 50%. To you and me, that is a huge difference.  But according to my clustering algorithm, Kansas and Pitt are 10 times more similar than Kansas and Kentucky. Finally, I am looking for a better way to include strength-of-schedule. I had a lot of problems with bad teams that came from very non-competitive divisions being declared similar to really good teams that came from competitive divisions just because they scored similar points per game, etc. One way to get around that is strength-of-schedule but I still have not found a good way to add that in. One idea is to use SOS as a weighting factor for every other stat, another idea is to make another stat that I cluster by but I really have no idea which is better.

In fact there are problem many many more problems. But this is essentially how I computed my game probabilities.

 

The next problem I tackled is how to actually choose a good bracket (given these probabilities). The straightforward way to go about it is to choose the team that has the better chance of winning in each game and just continue doing that until you have filled out the entire bracket. But consider the following example; Duke and Kentucky play in the first round as do Pitt and Kansas and the winners of those games play each other. Imagine Duke has a 60% chance of beating Kentucky but only a 10% chance of beating Pitt or Kansas. Meanwhile, Kentucky has a 40% shot of beating Duke but a 75% shot of eating either Pitt or Kansas. Since the later round game is worth more, I would rather choose Kentucky over Duke in this scenario because they have a better shot at winning both games. So to choose what team will win each game, I compute the average chance they have against all teams that they could possibly play in that game and then choose the team that maximizes that average times the chance that the team will make it to that round.

Disclaimer: I am not claiming this actually satisfies any meaningful type of optimality or that this is even a particularly good idea. I have yet to really think about this theoretically (although it is on my to-do list). I’m not sure it even makes logical sense if I really thouht about it. My only reason for following this algorithm is that I happen to like it.

And that about sums up the high-level overview. Last year, the previous iteration of this algorithm placed 2nd in family’s pool (out of 10 or so I believe). This year, it placed 2nd out of the three computer models that my labmates and I built and 5th amongst my family (out of 12 or 13). Technically, the tournament isn’t over yet but most people have 0 potential points left. I did manage to call the Kentucky over Ohio upset. My algorithm got pretty destroyed in the first two rounds but did manage to do well in sweet 16 round, where I picked up most of my marginal points (the points that other people didn’t get). I actually would have won the computer competition and taken 1st in my family’s pool if UNC had managed to beat Kentucky. Oh well, there is always more Madness next year.

 

Advertisement

About thisaintnogame

I'm a first year PhD student in algorithmic game theory at Northwestern University.
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s