Jake Scruggs

writes ruby/wears crazy shirts

In a previous post I discussed generating a unique id number for an order in our database. Read the original post for the full story, but to sum up, I decided that playing around with rand(100_000_000) was good enough. Martin Ankerl posted a comment pointing out that I was vulnerable to the Birthday Problem. So after a quick trip to wikipedia, I found out that if you have 23 people in a room you might think you have a 23/365 chance of having 2 people with the same birthdays, but no, your chances are closer to 50%! This is because in a group of 23 people there are 253 pairs each of which have a 1/365ish chance to have the same birthday. The formula to calculate how many are needed to give you a 50% chance of having a birthday collision is:

.5 + sqrt(.25 + 2 * 365 * ln(2) ) = 22.9999

Which means the 50% chance of collision mark for 100,000,000 is:
.5 + sqrt(.25 + 2 * 100,000,000 * ln(2) ) = 11774.600235771

Holy crap! I'm screwed if I can only have 12,000 orders before I get a 50% chance of collision. But wait, I think I'm ok because every time I insert a key into the database I check to make sure it isn't already in there. This means that when I'm at 12,000 orders the next key has only a 1 / (100,000,000 - 12,000) chance of being a collision. I think I've changed from the "Birthday problem" to the "Same birthday as you" problem, which is way more intuitive. If I'm wrong I rely on you, the blogoshpere, to tell me so.

If you haven't already, go read the wikipedia article on the Birthday problem. It's pretty cool.

5 comments :

Martin Ankerl said...

Hi! You are right and I was wrong, it's not a birthday problem. When you already use half of your numbers (50_000_000), the chance to find a free key is actually 50%, so that should be no problem.

josh said...

Jake,

I'm not much on doing the math unless I have to, so I thought I'd just try your algorithm with different numbers of carts. Here are my results:

15,000 carts only generated a couple of collisions on average (0.013%).

150,000 carts generated about 90 collisions on average (0.06%).

1,500,000 carts generated 10,000 to 12,000 collisions (0.73%).

15,000,000 carts generated 1,180,524 collisions (7.87%).

The most collisions I saw while trying to find a unique key was 5, and that was quite exceptional, more often there was one or no collisions before a unique key was found. It gets lots worse as you approach your ramdom factor, so my guess is that this thing will get inefficient as you get close to 100,000,000 carts, but it will still work. I tried generating 150,000 carts when the random factor and it was working, but it would take sometimes more than 10 attempts to find a unique key.

I also tried replacing your algo with truncated MD5. At 150,000 carts there were no collisions, but generating 150,000 MD5 hashes is really, really, really slow, In terms of performance, your system is much faster and imho good enough.

AkitaOnRails said...

Why can't you simply use UUID?? Also called GUID? It's the standard way to create (virtually) collision free 128-bit identifiers. There are literally hundreds of implementations out there for almost every programming language, from C to Ruby. I can't think of any reason not to use UUID if collision is an issue.

Jake Scruggs said...

Akita, the requirements of the business state that we can only have 11 digits in this key. GUIDs are way longer. See my first post on this for more background.

Michael said...

I've successfully used a "truncated GUID" solution before. I looked into the java GUID generator I was using and determined that none of the digits were more or less random or significant than others, so I figured truncating was ok.

But with only 11 digits you're definitely giving up a lot of the 32 hex digits that make up a normal GUID.

If the business accepts any alphanumeric character for this ID, I would use a random generator that selects from all the allowed characters, that will give you the least chance of collision. The implementation would be very similar to a Random Password Generator solution (generate random characters from an allowed character set).