Mid-Semester Updates

October 23, 2009 by thisaintnogame

Well, where to start after so long?

Unfortunately, my ACM ICPC (International collegiate programming contest) team missed out on a free trip to China by one place. The top 2 teams advanced and alas, we were third (a team from Cornell placed first and team from Columbia placed second). However, it is hard to really be angry at ourselves considering that we performed our best and were simply outbested, as opposed to our performance at the regional last year where we did not perform our best and could have taken second if we had. Contestants are ranked by number of questions correct and then by least accumalated time amongst teams that all have the same number correct. There were 9 questions this year and I believe that we were in first place after submitting our 7th question. But we stumbled on the last two questions and Cornell/Columbia kept going strong to take the win. We did as well as we could have hoped to do and only encountered minor problems with two questions, but that proved to be the determining factor. It is dissapointing that we won’t be heading to the International competition held in Harbin, China but we were certainly able to walk away with our heads held high. I still have one more chance to compete as a first year grad student (both teams that beat us had a grad student), but I am not sure that I will find a team that had such synergy as we had.

I am competing in the IEEE X-treme programming contest tonight, but I really have no idea what to expect for that contest. My teammates are both really strong coders and problem-solvers, so we should be able to do pretty well, but I dont know how fast we have to be to get a nice prize. At the very least, I am hoping to get a T-shirt out of this business. Plus I got an IEEE membership sponsored by the school.

Now that the ICPC is out of the way, I should really really buckle down and get my grad school stuff done. On my list right now is studying for the general GRE (which happens to be less than a week away), writing my personal statement (or at least a rough draft), applying for the NSF fellowship (or any other that might apply) and then finally mail out the apps (which I have a bit of time for).

I also got a cool new trackball mouse. It is very good for when you are lazy and want to be able to scroll with minimal movement from your body, but its a little imprecise when you have a lot of caffeine in you and your hands are shaking terribly. But clicking things has never been so interesting as it is now.

Arg, so much shit to do so little time.  I dropped the “research” that I was doing this semester because I really should be devoting my spare time to grad school prep and the grad student I was working with was expecting too much from me (things that I would have had difficulty delivering on if I worked on the project full time; no less with a full course load and other stuff). Also, I dont find data mining to be particularly interesting, especially the stuff that we are doing.

Wow, I really have to stop writing now and do something. What a crappy, substance-less post. Hopefully I will have a clearer mind when I write next time (which will be circa 2012).

August 1, 2009 by thisaintnogame

Well I have hit a new low in my life; I am sitting at a coffee shop, updating my blog. I swore to myself that I would never do this, but I guess I gave in. I’m pretty disappointed in myself, but it seemed like such a convenient solution to the problem of wanting to drink coffee and surf the interwebz.

So I am nearing the end of this summer’s REU and what will be my last one. I feel like I have almost made a mini-career out of these programs, especially when I consider the fact that most people only do one of these. To some extent, I feel I may be able to justify this summer’s program. My first REU at Binghamton was good, but it came at a time where I didn’t really know much of anything in computer science. As a result, it was difficult to really take anything advanced away from the program. The summer that I spent at UMass-Amherst had a potential to be a good experience, but turned out to be a not-so-good experience.  During this program (at USC) I had a very good relationship with my advisor, I enjoyed my work, and I believe I finally did something that is above the undergraduate level (maybe not incredibly above it, but still above it).

One decision that I will need to come to very soon is whether or not I would like to pursue pure CS theory, which is what I did this summer. From my understanding, there are three main topics under CS theory: computability, complexity, and algorithms. I am really not sure how much research is done in computability currently, but I do know that there are thriving communities in both complexity and algorithms. Both are interesting and challenging with many open questions, but I certainly lead more towards algorithms. However, that doesn’t really answer the question of whether or not I would like to pursue theory. Here are some of my thoughts on that.

- It is interesting and challenging. That kind of speaks for itself.

- The people in theory are (generally) very intelligent, so there is a lot to be learned from being around other theoreticians.

- The work can be frustrating. In programming, progress on a problem is usually a function of how much effort is spent on that problem, so one can always feel like they are making progress. It has been my experience this summer that I have spent a lot of time being stuck on a problem, regardless of the large amounts of time I devoted to it.

- The “so what?” factor. While I enjoyed my work, I realize that the results really don’t matter all that much. It is cool, enjoyable, and I get a lot of satisfaction when I prove something, etc but maybe there could be something more to it. Ideally, I would like to enjoy what I do and help humanity out while I am doing it. I know there is the whole “adding to human knowledge” argument, but I would like something more tangible.

- Getting a job. This concern is certainly a long way off, but I would guess that there is probably less jobs and less money floating around for theoretical people then there is for other types of research.

This summer, I saw a talk by a caltech professor who was examining some theoretical aspects of power-saving algorithms. The idea behind a power-saving algorithm is that a computer can adjust the speed at which it runs at in order to cut down on power. This is particularly important for huge data centers since the cost of power dominates the cost of running the facilities (about 60% according to his talk). So if there are times of the day where we can loosen the constraints on throughput for jobs, we can reduce the power needed to run the system. In his talk, he showed a relationship between this problem and CPU scheduling and proved that an online algorithm is 2-competitive (meaning it can approximate the offline algorithm within a factor of 2). This type of work is of a theoretical nature, but motivated by applications and was pretty cool. I am sure that there are examples of this all over the place in CS, so perhaps it is a matter of finding which fields look cool and have open theoretical problems.

In particular, I have developed this notion that it would be really cool to work in computational biology. To be honest, I have a very limited basis for developing this idea, but I am certainly going to look into it and talk to a few people in the field who have a CS background. Its a young field (which means its easier to make significant progress), there are many open algorithmic problems, and there might be some benefit towards humanity. Also, I have become a little more interested in some of the sub-fields of biology such as genetics and evolution (admittedly, those interests mostly come from a visit to the San Diego Zoo), so there is definitely a potential to learn a lot more about them. One thing that has me worried is having to learn the biology side of the field. It has been about eight years since I have looked at bio stuff and I dont remember liking it very much.

I realize that the decision of what field I want to enter is a little ways off, but I should have some idea of it to serve as a basis of what grad schools I apply to and what I write my statement about. I would not want to decrease my chance at some grad school because they accept less theory people than they do comp bio people or something like that.

That is enough rambling for one day, hopefully I will have a better topic to write about next time.

Friday Mornings…

July 10, 2009 by thisaintnogame

For some reason, I am more tired than usual this morning and it seems that I am not able to wake myself up and actually do work. So far I have gotten iced coffee (which never tastes as good as regular coffee but it is too hot to purchase a very hot drink), checked email five thousand times, and started writing this post. I did change something in a program that I am working on, but that change caused it to explode with bugs which I am too tired/lazy to look at it. What a good use of a Friday morning.

In other news, the people at the Freakonomics blog have had their say on Swoopo, which is cool since I wrote about it a couple days before they did. A commenter (the only one) on my last post advertised this site called pricedrip.com as an alternative to Swoopo. The mechanism for this site is that you place a single bid, naming a price that you would like to pay for the item and then the price is lowered periodically until someone’s bid is met. Of course, the site charges you for each bid that you placed, but it would seem that an individual would be less likely to place more than a couple of bids per item with this method. This type of auction is known as a Dutch auction, where the auctioneer starts high and then lowers the price until someone is willing to buy it, with the difference that you must pay for your bid on this site. Admittedly, I haven’t really watched any of these auctions, but I guess the real question is how far does the price drop below the market price (the best price that you could purchase it for on the internet). It seems like you have a decent chance of getting a marginal savings on a product. As long as your savings is greater than the price of the bid, you would probably make money. Of course, if everyone uses that same strategy, then everyone gets screwed because everyone will have to bid higher to get that discount, thus decreasing the value of the discount. The thing that makes this less “evil” than Swoopo is that you do not gain any more information over the course of the auction that would cause you to change your bid. Say your bid is X and the auctioneer announces the current price is Y such that X < Y. If no one buys it at Y and the price moves to Y-1, then you gained no information because you implicitly assumed that no one was going to buy it at Y (if you had assumed that, you would have never spent money placing a losing bid of X). If someone does win it at Y, you gained information, but the auction is over so that information is useless now. Since you gain no new information, it never really makes any sense to place more than one bid.

You might conclude that if everyone else is bidding such as you are, that is trying to get a small discount from the item, then they might outbid you on it, so you should increase your bid by a little bit to win the item. Of course then they might have the same idea and increase their bid, and so on and so forth. Eventually you might reach a conclusion that you should bid more than the market value of the item to win the auction, but that is irrational. While that might happen at Swoopo (you would continue bidding so as to not lose the investment that you had already put in), that wont happen here since all that calculation would go in before you place your bid. Once you realize that the only winning strategy is to overbid the item, you would just go to Amazon and buy it and thus never lose any money on the auction. Thus my claim is that this site is probably less evil than Swoopo. I think some work has been done on auctions where there is an entry fee to the auction (not per bid), but considering it never makes any sense to bid more than once in this auction, the same strategies can be applied. I will try to look into that.

I started reading “Linked” by Barabasi which is a popSci book about the science of networks. Barabasi is a pretty big guy in the field but based on some videos that I have seen with him, I think he is prone to overestimating the importance of network theory. Still, it should be an interesting read and is only about 200 pages. I kind of want to do some type of empirical study in network theory (maybe a project like this or this), but I have yet to come up with a good, feasible idea (good means it has to be interesting and its conclusions have to be worth the effort and feasbile means that data has to be relatively easy collect). Oh well, back to doing some work (more likely, reading random news stories before lunch).

Swoopo

July 7, 2009 by thisaintnogame

After having Google Ad-sense advertise this site to me approximately 100 times, I finally visited swoopo.com the other day. The site advertises that it sells electronics for absurdly cheap prices, such as 20 dollars for an ipod touch or 200 dollars for a new macbook, so I was obviously curious if this was true and how could they afford to undercharge for such items. The first thing that struck me was that the site was an auction site, so there is no notion of directly purchasing something. After I looked around a while more, I discovered the trick for why they can afford to charge very low prices.

Here is the rundown of a standard swoopo auction: The item, lets say a PS3, starts off at some very low price, say 30 dollars. Then every time a user submits a bid, the price is increased by a predetermined amount, usually 15 cents and the time until the auction closes increases by a certain amount, usually 15 seconds (at some points, the time left jumped up dramatically, like by 20 minutes, but I couldn’t figure out the pattern there). At some point, someone will have placed an unchallenged bid and the auction will end. This part is relatively simple, just the standard ascending bid English auction (that is some moderator will name the next price and bidders will agree to pay that price, as opposed to having bidders name their own prices), but this doesn’t explain how this site makes any money.

The real trick is that every time you bid, you must spend one of your “bid tickets” and you have to buy bid tickets from the site. In other words, they charge you $1 for the right to submit a bid. So while one person might get an item that is way below market price, there are many more people who have essentially spent money and have nothing to show for it. So if the site is selling an item worth $1000 dollars and the bid price starts at $0, they must get 1000 bids on the item to make it a profitable venture for them (let’s ignore overhead costs right now). Those 1000 bids will only drive the price up to $150 which still makes it an amazing price to get that product at, so it is pretty likely that people will continue to bid on the item until it hits a point where it is reaching some reasonable approximation of market value. Even though Swoopo probably lost money on selling the laptop to the winner, they have made money off of all the users who submitted bids and lost.

Although this format strikes me as kind of gimmicky, the site is very upfront about the process of buying bid tickets and that you will lose a ticket every time you place a bid, regardless of whether or not it was the winning bid. Also, it seems that representatives from the site have been very good about responding to interviews and being honest about what they do. In one of the interviews that I read with a PR rep from Swoopo, he said something like “we don’t want people to feel like it is impossible, we want people to win” which sounds very reminiscent of rhetoric from casinos or from “Thank You for Smoking.”

One thing that I do find to be questionable about the site is this automatic bidding bot that it provides free of charge. Essentially, you tell the software what item you would like to bid on, how much you are willing to pay, and how many bids you are willing to use towards that item and then it will automatically submit bids very close to the end of the auction on your behalf. After watching an auction on a PS3 that started off at 37 dollars (or rather that was when I started watching the auction), it became very clear that many people were using this bidding robot since with about 5 seconds left in the auction, four or five bids would go in automatically (it would also say if the bid was placed by their bid butler as it is called). That PS3 ended up selling for somewhere in the 200 range, so I cant even imagine how many people lost a ton of their bidding tickets because of the straightforward strategy used by this bot.

Looking at this from an analytic point of view, we can see that we only really need one user to “bite the bullet” and submit the bid that will prevent the current bidder from winning and prolong the auction, so there is no need to submit a bid if you are sure that someone else is going to submit one. In fact, not submitting the bid there is to your advantage since you will have more bids later in the auction if we assume that everyone has finite resources. One could probably do some statistical analysis to estimate the probability that someone will submit a bid given what percentage of market value the current price is. If we had access to that, we could layback and wait for all the “noise traders” (people who use no intelligent strategy) to waste their bids when it is extremely improbable that anyone will win the auction, and then enter the auction later on when its more likely the auction will end within a reasonable number of bids. For instance, if we knew that product X has a 90% chance of getting a bid at price Y, then we would submit a bid at that price only 10% of the time (determined randomly). Following such a strategy would certainly not guarantee a win, but it would definitely result in using less bid tickets (in the above case, assuming that the 90% chance held for a while, we could save 90% of our bids) than using their bid butler. Obviously, Swoopo has no incentive to improve their software since their current bid butler is probably making them tons by wasting users’ bids, but it might provide a market for a third party software company to make a bidding bot that saves users lots of bids over times. Of course, if everyone used such a process, it would come down to the random numbers generated by the software, making it more of an explicit gambling game, but would save everyone money. I would also guess that it would decrease Swoopo’s profit by a decent amount since there wouldn’t be as many traders submitting simultaneous bids and thus keep the price a little bit lower.

I highly doubt that I will be making any bids on Swoopo in the future, but it does offer an interesting application of auction theory (according to my advisor, this “pay per bid” auction has not been studied too much). And if I am ever given a bunch of money and told to start a company, I might pursue that more intelligent bidding software option as a product (would it be better to charge a subscription fee or a certain percentage of the price of an item that a client of ours has won?).

Getting Past Brute Force

July 1, 2009 by thisaintnogame

I have been playing around with some questions on Project Euler over the past couple of days. In general, I like PE for a number of reasons. First, its a site that challenges people to think mathematically, which is always good. It appeals to a couple of my friends who aren’t very strong coders, probably because you dont need a huge coding or algorithms background to attack first 20 problems or so. Hopefully it will lead to them getting more comfortable with brain teaser/coding questions and they can start doing things like Topcoder or the problems on the UVA judge (can’t find a working link right now). Second, not everything requires code to answer, sometimes its just a simple equation evaluation (if you know the right equation) and in general it forces coders to get closer to the math. A lot of times I feel that we lose sight of the fact that we are applied mathematicians and should look at some of the mathematical properties of a problem before we start banging away at the keys or looking for a recurrence relation that we can DP on. Actually, in my Combinatorics class in the fall, I suggested we could solve the lattice path counting problem using a recurrence relation and received a response of “ah, yes that is how a computer scientist would think, but we mathematicians can do better” (it was said in a loving and European way). However, I have some reservations in the fact that most of the problems, at least on the first page, seem to involve simple computations, but adds in difficulty by putting in very large numbers. I guess it’s just a little annoying to have the right algorithm, but have to change data types from “int” to “long long” or just give up on the problem because it requires the use of some big int class.

Anyway, I found one problem that I really did like (problem 24 for those who are interested). The generalized version of the problem is as follows: Given N digits, determine the Kth permutation of the digits (that number K is determined by its position in the lexicographic ordering of all the permutations). So for example, if we were given 1 2 3 and told to find the 4th permutation, we would get 123 as the first, 132  as the second, 213 as the third and 231 as the 4th (the 5th would be 312 and 321 for the 6th). Now, I would like to give this problem at one of our programming competitions and make N semi-large (but small enough such that N! fits with in a 64-bit int). This is one of those problems that has a very obvious brute force solution, but the lack of a small bound on N destroys that approach, thus forcing the programmers to actually think about the problem. The BF approach would be to iterate through every permutation until you hit K (some programmers may not know a standard library call to a next_permutation function, so the problem might be a little bit tougher). Still, it isn’t that hard to come up with a good solution for this problem, so let me talk about that.

One of the nice things about permutation questions is that they usually lend nice recursive solutions, which is my favorite thing to work with. We notice that if we have N elements and we fix the first element, then we have (N-1)! permutations that we can construct. So if K <= (N-1)! then we know that the first element in the solution is the first element in the list. So now it really comes down to finding the Kth permutation of the remaining N-1 digits. If K > (N-1)! , then we know that the first element in the list cannot be the first element in the solution, so we must now switch elements. Since digit D can have (N-1)! permutations constructed with D as the first element, we must find the smallest position i such that i*(N-1)! >= K, and then switch the first element with that ith element. Now its a matter of finding the (K- ((i-1)*(N-1)!))th permutation of the remaining N-1 digits. That expression is a little scary, so let me explain it. Notice that if we had the digits 12345, the first 4! = 24 permutations would all begin with 1. So if we wanted to find the 25th permutation, we realize that 1 cant be the first permutation since it only heads the first 24 permutations, so we switch with the next lexicographic digit, which is 2 in this case. Since we already covered 24 permutations, we really only need to find the 1st permutation with 2 at the beginning, or the first permutation of the digits 1345.

So we modified the permutation number we were looking for by subtracting (i-1)*(N-1)! or the number of permutations that we already ruled out cant be the solution permutation, from the permutation that we are looking for. Again, i is the smallest position such that i*(N-1)! >= K. I am much better at explaining things in person, I swear.

Anyway, if we had put all of this mess together in an algorithm, we get a running time of O(nlgn) if we search for that i cleverly (binary search), which significantly kicks the ass of the brute force O(n!) algorithm. Regardless of this convoluted explanation, its a pretty simple algorithm but it makes you feel nice and special after you are done. I think its a pretty gentle problem to get people to think beyond the obvious, which is something that I strive for with the problems that I write. I have previously done that by beating them over the head with 2 dimensional DP problems, but then that gives people with no algorithms experience no chance at them. This type of problem gives anyone with a knowledge of permutations and basic recursion a decent shot. Since no one reads this blog anyway, I think I should be safe in giving it as a problem in one of this year’s competitions.

Why we do algorithms wrong

June 14, 2009 by thisaintnogame

So by we, I mean my school’s CS department. After only a few meetings with my advisor here, I see the difference between the teaching/advising styles and why one is far better than the other. In our department, the algorithms class pretty much proceeds in a manner than an algorithm is taught, maybe there is some discussion about the run time of it, and then the students usually have to implement it as an assignment. The textbook used is the Cormen textbook, which is pretty much like the webster’s dictionary of algorithms. So what are we missing?

First, we are missing the very important step of the proof of the algorithm. Yes I know that most students don’t care for proofs and find them difficult and useless, but too bad. The fact of the matter is that it is hard to prove the correctness of algorithms, but that is exactly what we need. The proofs are not necessarily important in the content of the proof in itself, but the process that one would go about reasoning the correctness of an algorithm. If you do enough proofs, you start to develop an intuition for the correctness of an algorithm long before you even finish it. We could all benefit by saving time on an approach that wont yield a correct answer.

But the most important thing that we are missing is the emphasis that the algorithms that are taught should help us develop an intuition on how to go about coming up with an algorithm. I know plenty of people who have gone through our algorithms course but still have little idea about how to start developing an algorithm for a Topcoder problem. This is because we teach our class more as a reference class for algorithms, not to teach the paradigm. A good manifest of this concept is the textbook used. Sure, the Cormen book is an excellent reference if you already have a great understanding of all the math and you are looking to read some material on an individual problem. But it doesn’t do that well actually teaching the concepts when compared to the Kleinberg algorithms book or the Papadimitrou/Vazirani book. Those books actually teach the concepts and walk through the development of these algorithms. The Kleinberg book also does a great job of showing how to arrive at correctness proofs (it even teaches about commonly used proof techniques such as exchange arguments). Plus the books are generally easier to read, which is a big plus when starting to teach more advanced concepts such as approximation and randomized algorithms.

I realize that our department isn’t hoping to raise the next generations of algorithmists or send people to do work in CS theory, but we should be trying for this approach. Even if it does go over the heads of most students, there is no harm to trying it this way since they can always go back to reading the book for all the answers they need. If our students are too lazy/incompetent to get all their answers from the book and still need a professor to hold their hand in the class, we have a bigger problem than we know.

Hopefully I end up with a really solid foundation for doing advanced work in algorithms after this summer. I am interested to see how our advanced algorithms seminar is going to turn out. I have some doubts, but I’ll keep those private (as if publishing a thought really constitutes it as a public thought).

REU 3.0

June 5, 2009 by thisaintnogame

Well, here I am at my third and final shot at making an REU worthwhile. My first one was a great learning experience, but I really didn’t know enough at that point to really take advantage of the “research” opportunity. I think I was ready to do some serious stuff for the second one, but it turned out to be a terrible experience in almost all facets. This one is starting strong. We have 13 students so far (we have one more to come) and we seem to have a good social dynamic going on. Im excited to see where the summer leads us around LA.

My project this year is infinitely better than what it was last year. My chief complaint from last summer is that my advisor treated me like a small child with minimal knowledge of what programming was; my advisor this year says he’s going to treat me like a 5th year PhD student who has passed all exams and has read all the literature unless I tell him to slow down and explain something. From our early meetings, its pretty clear that I will be taking advantage of the “slow down and explain something” option quite frequently.

So I’m getting my start in CS theory research this summer. We are considering the problem of maximizing the spread of your influence on a graph. We have a model where at each time step, a node will choose one of its neighbors at random and change to whatever color that node is in the next time step. Before this process begins, you are given an amount of money M that you can spend on nodes to color them your color, assuming that each node has its own activation cost. We would like to find a set of nodes S, such that the total initial cost of those nodes < M that will maximize the expected number of nodes that will be your color after T timesteps. This problem is motivated by the case where a company is looking to pay some people to initially adopt their product in order to spread the adoption of their product throughout the network. Two natural examples of this would be publishers giving out free textbooks to professors in hopes that they will adopt the textbook for their class and other professors will do the same after seeing the original professor use that textbook or paying certain people to start using a social networking site (like Twitter) in hopes that others will join because of the influence of those people.

When we are done researching that simple version, we will explore what happens when there are two companies, A and B, with the same finite amount of resources and nodes will adopt technology A or B with probability proportional to how much money each company has spent on them. In this scenario, I have already looked at a simple probability function where A has already made all of its moves and found an optimal algorithm for B (of course, it is not a polynomial time algorithm, but its better than the “try everything” approach). In the future, we are going to look at equilibria in the game. That is, if A knows how B will react given a strategy for A, what can player A do to preempt B and choose a strategy that will be worse for B when B reacts to it.

There is a fair bit more technical detail to what is going on and I have much more to learn, but I am really enjoying what I have done so far. Hopefully, it will remain that way.

For Loops are kind of essential to programming, if you know what I mean…

May 14, 2009 by thisaintnogame

More Madness…

March 22, 2009 by thisaintnogame

I found some articles having to do with strategies for choosing the NCAA basketball tournament brackets. In particular, I thought this article was pretty good. The article basically states that a contrarian strategy is a solid strategy if all you care about is making money on pool entries in the long run. With the contrarian strategy, you find teams which have a higher chance of winning and moving into later rounds than most people give them credit for and choose them. Notice, this doesn’t imply that they are favorites, even by any statistic, but as long as most people will choose against them, there will be a high payoff if they do win, since you will move to the top of your pool.

This makes the most difference when choosing teams that will make a win over a favorite and move on the top the Final Four. For an example from this year, Gonzaga will be playing UNC in the sweet sixteen. Many people have UNC winning the entire tournament and a large percentage of entries on ESPN have UNC making it to the final four. However, Gonzaga has a higher chance than those numbers suggest. Based on this fact, choosing Gonzaga over UNC will offer a huge advantage if Gonzaga does win since all those people just lost the opportunity to make points in the later rounds, where each game is worth more points (each game is worth twice the amount of points they were worth than the previous round, ensuring that in every round there is 320 possible points, assuming that the first round games were worth 10).

One of the interesting parts of this approach is that it completely contradicts the basic wisdom I was taught, which was to randomly choose some upsets in the first rounds, but then to go with who you thought the winner was going to be in the later rounds. This advice makes sense because of the relatively small information we usually know about the more obscure teams in the 8-9, 7-10 games, etc compared to the fact that we usually recognize the names of the #1 and #2 seeds and might even know some of their stats. But if we assume that everyone else in our pool follows that strategy, then the winner out of them would be largely based on the random happenings of the first round games, because their later round brackets would be too similar for one person to gain any significant point advantage. We also did this since we knew that losing a Final Four team almost guaranteed a near-last place finishing. But in reality, if you’re not in the money, does it really matter where you were?  Knowing that,  you would probably be better off simply betting the better seeded teams in the early rounds and then starting to make some “upsets” in the sweet sixteen when the differences between seeded teams will be smaller. By doing this, we increase the likelihood and the relative values of our upsets.

Looking at it this briefly with some math, choosing Gonzaga to win their division and make the Final four is worth total of 150 points (10 for their first round win, then 20 for the next, then 40, then 80). If they do make it, then against someone who chose UNC to make it to the Final Four (which most people are betting on), you would have a 120 point advantage, which is the equivalent of calling 12 first round upsets correctly. So basically, if you are looking to make some ridiculous calls and go against the grain, it makes more sense to call a later round game than an earlier round game. But it is important to note that this situation only applies when you have good reason to believe that the popular chances of the underdog winning are lower than they really are, which goes back to the root of problem of estimating probabilities of teams winning.

In the article I linked to above, there is another link to an academic paper that someone published on the matter and there was some other discussion on the blog of the guy who wrote Freakonomics. Two things there…

1. When I was looking early for literature on NCAA strategies, I should have known to look to economists rather than mathematicians since economists are much more likely to be cool and care about this type of thing.

2. All this literature illustrates one of the reasons that I want to remain around thinkers for the rest of my life; Even common events become extremely interesting when approached in an intelligent manner and reading the ideas of others just makes it one thousand times more interesting.

p.s. Apparently the problem of determing whether or not you still have a chance in the NCAA pool is NP-complete if we don’t fix the number of teams that enter it. Pretty cool.

March Madness

March 21, 2009 by thisaintnogame

So the NCAA tournament has rolled around again and I have entered into a small pool with some family and friends (its a tradition in my family). However there are two things that set this year’s tournament apart from others.

1. Binghamton made the big dance. It was pretty cool to see us play on national television, but we had an unfortunate first round matchup against Duke, who almost always makes it to my final four. Regardless of my school spirit, I had to go against them.

2. I wrote an algorithm to help me choose teams. I got a bunch of statistics on all division 1 teams from the NCAA website (they provide 16 stats, like points scored, points against, average margin of victory, average turn overs, etc) and was planning on using those statistics and a schedule of win/losses for the year to create a Naieve Bayes Classifier. The basic idea is I train the classifier on all of the regular season games, where each data point is a vector of pairs, where each pair is the ith statistic for each team. So for Duke vs. Binghamton it would have been a data point like this

< (ppg for Duke, ppg for Bing), (apg for Duke, apg for Bing)….>

and then include a win or lose label for each point. Based on that training data, for the games in the tournament, I create the data point and then see if that point is more likely given a win (this is the bayesian part of it) or more likely given a loss. I choose the label which maximizes the probability of that data point. There would have been some more details to work out, but that is the basic idea. But only after I had written all the code to parse the statistic files, I found that I could not find a schedule for the season in plain text, so I couldn’t parse that information and making everything I did pretty much useless.

Being that I had already collected all the statistics for each team, I decided that I would simply cluster all the teams using k-means just to get an idea of what teams were statistically similar. Then using those clusters and my limited knowledge of the NCAA teams, I could infer the quality of a team. For instance, I know that Duke is always a good team (and were also ranked #2), so I can reasonably infer that teams in the same cluster as them will have similar stats and will probably be good teams as well. There are a number of flaws in that reasoning, but I did this all in an afternoon and this algorithm was meant to provide basis to my picks, which would otherwise be arbitrary. So how did this work out? I set k = 33 since there are roughly 330 division 1 teams and I figured that if it were evenly distributed, 10 teams per cluster was a good amount.  Well 7/8 of the top ranked teams in the tournament were placed in the same cluster (UConn being the odd one out), along with some other teams in the tournament, notably gonzaga and LSU. Once I saw the resulting clusters, I ordered them based on their euclidean distance from the “best” cluster, or the one that contained all the top seeds. Then, when making picks, I would choose the team that was in a higher-ranked cluster.

So what are the problems with this algorithm? Most obviously, I have no way to choose between teams in the same clusters, which is a problem since I should expect that the later rounds will come down to teams in the same cluster (assuming that a 1 or 2 seed will be the winner of each regional conference, this algorithm is useless in choosing final four games). The next problem, which is less obvious but more important, is that a cluster being closer to the best cluster does not actually imply that is a better cluster. For instance, lets assume that we only consider average points scored and average points allowed. Lets say the best cluster has an average of 80 points scored per game and lets in an average of 70. Well a cluster whose median team scored 75 points per game and allows 75 points per game will be “closer” than a cluster whose median team scores 90 points a game and allows 60 points against. Clearly, the second cluster would be a better choice for winning, but the first cluster would be ranked higher since it has a smaller euclidean distance. I haven’t yet thought of how to fix this problem, but I assume its possible if I used a different distance metric. Of course, this whole algorithm is extremely ad hoc and based upon weak theoretical foundations, so trying to fix just that problem would be like rearranging deck chairs on the titanic.

So, in practical terms, how is this algorithm doing? My algorithm didn’t choose too many upsets (an upset being defined as a higher seed beating a lower seed, so a 9 seed beating an 8 seed), but about as many as a human would. Well I am near last in my bracket, but there were some true upsets and there were two games that my algorithm choose a 12 seed over a 5 seed and got it right (I choose Wisconsin over Florida St and West Kentucky over Illinois). Many people had those 5 seeds advancing to the sweet sixteen, so I will make up some points there hopefully. In terms of the 8-9 games, I got two right and two wrong, which is about what most people did. It also has some irony that I did no better than flipping coins since I created this algorithm for the purposes of choosing games like that. The big triumph of my algorithm will be if LSU beats UNC, since an 8 seed would knock out a favorite to win the entire tournament. Looking at it after I made that choice, it might make some sense that LSU has a chance since UNC is an entirely offense based team (averaging about 90 points a game), but doesn’t seem to do too much on defense (let’s in about 72 points again which puts them somewhere in 200’s in terms of defensive rankings).

Whether this whole thing turns out good or bad, it was a fun project to do for an afternoon, and I plan on expanding the project for the next season. One fundamental difference between my algorithm and others that I have seen is that my algorithm doesn’t really look to take a chance factor into play. Since I always chose the better ranked team, there is not much difference between what I did and what someone who always chooses the better seeded team to win. So by following this algorithm, I basiscally assume that I have come up with a better ranking than the NCAA selection committee, which is an interesting idea. I feel that better way to do it would be to estimate probabilities of each team winning and then running a simulation, so that I can actually take into account the chance factor. There is also a lot I could do with making the statistics more meaningfully, such as weighting more recent statistics more heavily. Whatever I do, its a cool application for some stuff that I have learned and maybe I can make a research project out of it next year.

Let’s go LSU!