Archive for March, 2009

More Madness…

March 22, 2009

I found some articles having to do with strategies for choosing the NCAA basketball tournament brackets. In particular, I thought this article was pretty good. The article basically states that a contrarian strategy is a solid strategy if all you care about is making money on pool entries in the long run. With the contrarian strategy, you find teams which have a higher chance of winning and moving into later rounds than most people give them credit for and choose them. Notice, this doesn’t imply that they are favorites, even by any statistic, but as long as most people will choose against them, there will be a high payoff if they do win, since you will move to the top of your pool.

This makes the most difference when choosing teams that will make a win over a favorite and move on the top the Final Four. For an example from this year, Gonzaga will be playing UNC in the sweet sixteen. Many people have UNC winning the entire tournament and a large percentage of entries on ESPN have UNC making it to the final four. However, Gonzaga has a higher chance than those numbers suggest. Based on this fact, choosing Gonzaga over UNC will offer a huge advantage if Gonzaga does win since all those people just lost the opportunity to make points in the later rounds, where each game is worth more points (each game is worth twice the amount of points they were worth than the previous round, ensuring that in every round there is 320 possible points, assuming that the first round games were worth 10).

One of the interesting parts of this approach is that it completely contradicts the basic wisdom I was taught, which was to randomly choose some upsets in the first rounds, but then to go with who you thought the winner was going to be in the later rounds. This advice makes sense because of the relatively small information we usually know about the more obscure teams in the 8-9, 7-10 games, etc compared to the fact that we usually recognize the names of the #1 and #2 seeds and might even know some of their stats. But if we assume that everyone else in our pool follows that strategy, then the winner out of them would be largely based on the random happenings of the first round games, because their later round brackets would be too similar for one person to gain any significant point advantage. We also did this since we knew that losing a Final Four team almost guaranteed a near-last place finishing. But in reality, if you’re not in the money, does it really matter where you were?  Knowing that,  you would probably be better off simply betting the better seeded teams in the early rounds and then starting to make some “upsets” in the sweet sixteen when the differences between seeded teams will be smaller. By doing this, we increase the likelihood and the relative values of our upsets.

Looking at it this briefly with some math, choosing Gonzaga to win their division and make the Final four is worth total of 150 points (10 for their first round win, then 20 for the next, then 40, then 80). If they do make it, then against someone who chose UNC to make it to the Final Four (which most people are betting on), you would have a 120 point advantage, which is the equivalent of calling 12 first round upsets correctly. So basically, if you are looking to make some ridiculous calls and go against the grain, it makes more sense to call a later round game than an earlier round game. But it is important to note that this situation only applies when you have good reason to believe that the popular chances of the underdog winning are lower than they really are, which goes back to the root of problem of estimating probabilities of teams winning.

In the article I linked to above, there is another link to an academic paper that someone published on the matter and there was some other discussion on the blog of the guy who wrote Freakonomics. Two things there…

1. When I was looking early for literature on NCAA strategies, I should have known to look to economists rather than mathematicians since economists are much more likely to be cool and care about this type of thing.

2. All this literature illustrates one of the reasons that I want to remain around thinkers for the rest of my life; Even common events become extremely interesting when approached in an intelligent manner and reading the ideas of others just makes it one thousand times more interesting.

p.s. Apparently the problem of determing whether or not you still have a chance in the NCAA pool is NP-complete if we don’t fix the number of teams that enter it. Pretty cool.

March Madness

March 21, 2009

So the NCAA tournament has rolled around again and I have entered into a small pool with some family and friends (its a tradition in my family). However there are two things that set this year’s tournament apart from others.

1. Binghamton made the big dance. It was pretty cool to see us play on national television, but we had an unfortunate first round matchup against Duke, who almost always makes it to my final four. Regardless of my school spirit, I had to go against them.

2. I wrote an algorithm to help me choose teams. I got a bunch of statistics on all division 1 teams from the NCAA website (they provide 16 stats, like points scored, points against, average margin of victory, average turn overs, etc) and was planning on using those statistics and a schedule of win/losses for the year to create a Naieve Bayes Classifier. The basic idea is I train the classifier on all of the regular season games, where each data point is a vector of pairs, where each pair is the ith statistic for each team. So for Duke vs. Binghamton it would have been a data point like this

< (ppg for Duke, ppg for Bing), (apg for Duke, apg for Bing)….>

and then include a win or lose label for each point. Based on that training data, for the games in the tournament, I create the data point and then see if that point is more likely given a win (this is the bayesian part of it) or more likely given a loss. I choose the label which maximizes the probability of that data point. There would have been some more details to work out, but that is the basic idea. But only after I had written all the code to parse the statistic files, I found that I could not find a schedule for the season in plain text, so I couldn’t parse that information and making everything I did pretty much useless.

Being that I had already collected all the statistics for each team, I decided that I would simply cluster all the teams using k-means just to get an idea of what teams were statistically similar. Then using those clusters and my limited knowledge of the NCAA teams, I could infer the quality of a team. For instance, I know that Duke is always a good team (and were also ranked #2), so I can reasonably infer that teams in the same cluster as them will have similar stats and will probably be good teams as well. There are a number of flaws in that reasoning, but I did this all in an afternoon and this algorithm was meant to provide basis to my picks, which would otherwise be arbitrary. So how did this work out? I set k = 33 since there are roughly 330 division 1 teams and I figured that if it were evenly distributed, 10 teams per cluster was a good amount.  Well 7/8 of the top ranked teams in the tournament were placed in the same cluster (UConn being the odd one out), along with some other teams in the tournament, notably gonzaga and LSU. Once I saw the resulting clusters, I ordered them based on their euclidean distance from the “best” cluster, or the one that contained all the top seeds. Then, when making picks, I would choose the team that was in a higher-ranked cluster.

So what are the problems with this algorithm? Most obviously, I have no way to choose between teams in the same clusters, which is a problem since I should expect that the later rounds will come down to teams in the same cluster (assuming that a 1 or 2 seed will be the winner of each regional conference, this algorithm is useless in choosing final four games). The next problem, which is less obvious but more important, is that a cluster being closer to the best cluster does not actually imply that is a better cluster. For instance, lets assume that we only consider average points scored and average points allowed. Lets say the best cluster has an average of 80 points scored per game and lets in an average of 70. Well a cluster whose median team scored 75 points per game and allows 75 points per game will be “closer” than a cluster whose median team scores 90 points a game and allows 60 points against. Clearly, the second cluster would be a better choice for winning, but the first cluster would be ranked higher since it has a smaller euclidean distance. I haven’t yet thought of how to fix this problem, but I assume its possible if I used a different distance metric. Of course, this whole algorithm is extremely ad hoc and based upon weak theoretical foundations, so trying to fix just that problem would be like rearranging deck chairs on the titanic.

So, in practical terms, how is this algorithm doing? My algorithm didn’t choose too many upsets (an upset being defined as a higher seed beating a lower seed, so a 9 seed beating an 8 seed), but about as many as a human would. Well I am near last in my bracket, but there were some true upsets and there were two games that my algorithm choose a 12 seed over a 5 seed and got it right (I choose Wisconsin over Florida St and West Kentucky over Illinois). Many people had those 5 seeds advancing to the sweet sixteen, so I will make up some points there hopefully. In terms of the 8-9 games, I got two right and two wrong, which is about what most people did. It also has some irony that I did no better than flipping coins since I created this algorithm for the purposes of choosing games like that. The big triumph of my algorithm will be if LSU beats UNC, since an 8 seed would knock out a favorite to win the entire tournament. Looking at it after I made that choice, it might make some sense that LSU has a chance since UNC is an entirely offense based team (averaging about 90 points a game), but doesn’t seem to do too much on defense (let’s in about 72 points again which puts them somewhere in 200’s in terms of defensive rankings).

Whether this whole thing turns out good or bad, it was a fun project to do for an afternoon, and I plan on expanding the project for the next season. One fundamental difference between my algorithm and others that I have seen is that my algorithm doesn’t really look to take a chance factor into play. Since I always chose the better ranked team, there is not much difference between what I did and what someone who always chooses the better seeded team to win. So by following this algorithm, I basiscally assume that I have come up with a better ranking than the NCAA selection committee, which is an interesting idea. I feel that better way to do it would be to estimate probabilities of each team winning and then running a simulation, so that I can actually take into account the chance factor. There is also a lot I could do with making the statistics more meaningfully, such as weighting more recent statistics more heavily. Whatever I do, its a cool application for some stuff that I have learned and maybe I can make a research project out of it next year.

Let’s go LSU!