Weird requirement at work today: We needed a number that was alpha-numeric, exactly 11 digits, unique, and non-sequential. At first we thought of using a hash, but MD5 and Sha1 give you way too many digits. We could truncate to 11, but not knowing much about hashing, that made me pretty nervous that we'd have collisions. After a bunch of discussion, we decided on this:

I was thinking about using hex, but I was worried about what would happen if the number got too big and went to 12 digits. So I did this calculation:

MAXIMUM_FOR_10_DIGITS = 0xffffffffff

(MAXIMUM_FOR_10_DIGITS + cart_id + Time.now.to_i + rand(100_000_000)).to_s(16)

Turns out there's all sorts of room between the lowest and highest 11 digit hex number. Paul and Schubert did some quick calculations assuming lots of carts per day and figured out that we have about 250,000 years before we run out of space. Now since we use a random number, there is a chance that we could get two matching numbers so we do a database lookup to make sure that a generated number hasn't been used before. There is still a small chance that two identical numbers could be generated in the time between generation and saving to the database, but I think we can live with that.

irb(main):001:0> 0xfffffffffff - 0x10000000000

=> 16492674416639

Update: After some interesting comments, I wrote a follow up post. Check it out for more math goodness.

## 7 comments :

If your only requirement is to have unique non-sequential value it would be much easier to just to multiply it by a large prime.

PRIME = 2654435761

MAX = 2**44-1

"%.11X" % (VALUE * PRIME & MAX)"

I dare say that using 11 first chars of an hash would have given you the same result. That is, with the same collision probability.

I'm not quite sure about that, but I think the average number of steps after which you get a collision in a set of N elements is sqrt(N). So that you'll probably need to wait till you got more than 300.000 ids to find one!

You may run out of space in 250 thousand years, but you might be having too many collisions before then.

Why not just start at 0 and count up sequentially (or the opposite)?

We, of course, argued for a sequential number but the business said no. They don't want anyone figuring out how many orders they get in a day. We also tried to talk them into a Globally Unique Identifier but they didn't want that many digits. 11 was as high as they were willing to go. I thought that truncating a hash would probably work, but I wasn't sure enough (and neither was anyone on the team) so we went with the algorithm described above. The prime thing sounds cool, though.

1) Hash distribution is (theoretically) even. Put in terms of bits, the fequency of any bit being 1 or 0 is the same as any other bit. Therefore, a truncated version is just as evenly distributed as the longer version (you're just picking a subset of the bits, all of whom have equal frequency of 1-ness.)

2) You do not need a hash, you need a random number. As you are already checking for collisions, you could just rand(max) and then convert to your base of choice (which leads me to mynext point)

Hi, I am not sure if you don't run into the birthday problem with this. Time.now.to_is is a 31 bit number, rand(100_000_000) gives a 27 bit number. I assume your cart ID is pretty low to. You just add these numbers, so basically the part that changes are only the last 27 bits.

When you have about 15000 keys in your database, the probability of a collision is over 50%. When you have 50000 keys, you have a probability of 99.999% of a match! So it gets almost impossible to find a free key.

I think you should try to use the maximum possible randomness to get around this problem. The largest number is

largest = ("Z"*11).to_i(26+10)

id = rand(largest).to_s(26+11)

Then you have about 56 bit of randomness which is a lot more. The chance to get a collision here is about 50% when you have 400 million keys. With 800 million keys the probability is 91%.

an alpha numeric gives you 26*2 characters but 10 digits. If you can squeeze in another 2 legal symbols in your number you have each alpha numeric having 1 of 64 possible states (base64 encoding?). That gives you 73,786,976,294,838,206,464 unique 11 possible IDs. All you need to do is use the following forumula:

((current Unix Time to the nanosecond precision) * 1000) + rand(1000)

You're only chance of collision with this number 1/1000 if two requests for an ID number is made within the same nanosecond. You will have several thousand years before this value will wrap around but even then you have very little chance of collision. This will not correlate your order id to a sequential number and should satisfy the business requirement. You can also avoid any chance of predictability by obfuscating the result by picking a secret prime number < MAX_ORDER_ID and do:

(PRIME**orderID) % PRIME.

Check out : http://en.wikipedia.org/wiki/Coprime

Post a Comment