7

How do I convert an unsigned integer (representing a user ID) to a random-looking but deterministically repeatable choice? Each choice must be selected with equal probability (irrespective of the distribution of the input integers). For example, if I have 3 choices, i.e. [0, 1, 2], the user ID 123 may always be assigned choice 2, whereas the user ID 234 may always be assigned choice 1.

Cross-language and cross-platform algorithmic reproducibility is desirable. I'm inclined to use a hash function and modulo unless there is a better way. Here is what I have:

>>> import hashlib
>>> num_choices = 3
>>> id_num = 123
>>> int(hashlib.sha256(str(id_num).encode()).hexdigest(), 16) % num_choices
2

I'm using the latest stable Python 3. Note that this question is similar, but not identical, to the related question about converting a string to a random but deterministically repeatable choice with uniform probability.

Acumenus
  • 1
    In your real application, what's the domain of the user ID integers, and how big is the choice set? And how secure do you want this randomization to be? Does it just have to look random-ish, or would you like something that's cryptographically strong? – PM 2Ring Dec 21 '16 at 05:32
  • @PM2Ring I do not know the domain of the user ID integers, but they could be a 32 or 64 bit unsigned int obtained from a database. The choice set length is 2 to 10. Cryptographic randomness is not necessary, but repeatability and equiprobability are. – Acumenus Dec 21 '16 at 05:36
  • I assume that "choice set length is 2 to 10" means that there are at most 10 choices, rather than that the number of choices could be a 10 digit number. If the former is true, then it'd be pretty hard to make it cryptographically strong. :) But you may still be interested in the topic of [format-preserving encryption](https://en.wikipedia.org/wiki/Format-preserving_encryption). – PM 2Ring Dec 21 '16 at 05:41
  • @PM2Ring To clarify the domain of choices, yes, there are at most 10 choices that are to be used in a statistical A/B testing or similar experiment. I do not understand what relation this has to cryptography. – Acumenus Dec 21 '16 at 05:47
  • The condition *The choices must be selected with equal probability* is not well defined unless you explain this in terms of random variables. Which quantities need to be equal ? I think this ambiguity hides an impossible task. – Gribouillis Dec 21 '16 at 06:19
  • The output, e.g. one of `[0, 1, 2]` when `num_choices = 3`, must be with equal probability. This of course doesn't mean that the choice selection frequency must be exactly equal; it must merely converge to equal frequency. – Acumenus Dec 21 '16 at 06:24
  • Converting an ID to a deterministically repeatable choice simply means defining a fixed function from the set A of all possible IDs to the set C of all choices. The probability of a choice is the sum of the probabilities of user ID's corresponding to that choice. It means that you must say something about the user IDs distribution. There is something missing in the problem. I think what you really want is that the function *looks* random although it's not. – Gribouillis Dec 21 '16 at 07:38
  • @Gribouillis Yes, indeed. I want a function that looks random but is deterministic. I have edited the question to mention it. – Acumenus Dec 21 '16 at 12:47

4 Answers

7

Using hash and modulo

import hashlib

def id_to_choice(id_num, num_choices):
    id_bytes = id_num.to_bytes((id_num.bit_length() + 7) // 8, 'big')
    id_hash = hashlib.sha512(id_bytes)
    id_hash_int = int.from_bytes(id_hash.digest(), 'big')  # Uses explicit byteorder for system-agnostic reproducibility
    choice = id_hash_int % num_choices  # Use with small num_choices only
    return choice

>>> id_to_choice(123, 3)
0
>>> id_to_choice(456, 3)
1

Notes:

  • The built-in hash function must not be used. For small integers it returns the input itself (e.g. hash(123) is 123), so it preserves the input's distribution. For strings it is salted per process, so hash('123') can return a different value each time Python is restarted.

  • For converting an int to bytes, bytes(id_num) must not be used: it returns id_num null bytes, which technically works but is grossly inefficient. Using int.to_bytes is better. Using str(id_num).encode() also works but wastes a few bytes.

  • Admittedly, using modulo doesn't offer exactly uniform probability,[1][2] but the bias is negligible for this application: id_hash_int is a 512-bit value and num_choices is assumed to be small, so if the hash output is treated as uniform, the probabilities of any two choices differ by at most 2^-512.
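
If exact uniformity matters, the modulo bias can be removed with rejection sampling. The following is only a sketch of that idea and not part of the original answer: `id_to_choice_unbiased`, the `/` separator and the 4-byte retry counter are hypothetical choices, and its outputs will generally differ from those of `id_to_choice` above.

```python
import hashlib

def id_to_choice_unbiased(id_num, num_choices):
    """Hypothetical variant: deterministic choice without modulo bias."""
    id_bytes = id_num.to_bytes((id_num.bit_length() + 7) // 8, 'big')
    # Largest multiple of num_choices that fits in the 64-bit draw range.
    limit = (1 << 64) - ((1 << 64) % num_choices)
    counter = 0
    while True:
        # A fixed-width counter makes every retry deterministic and unambiguous.
        data = id_bytes + b'/' + counter.to_bytes(4, 'big')
        draw = int.from_bytes(hashlib.sha512(data).digest()[:8], 'big')
        if draw < limit:  # accept only draws from the evenly divisible range
            return draw % num_choices
        counter += 1  # rejected (very rare for small num_choices): try again
```

For num_choices of 2 to 10 a retry is extremely unlikely, so the cost is essentially one hash per ID.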

Using random

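The random module can be used with id_num as its seed. Creating a separate random.Random instance, rather than seeding the module-level generator, addresses concerns surrounding both thread safety and continuity, because the shared global generator is left untouched. Using randrange in this manner is comparable to, and simpler than, hashing the seed and taking modulo.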

With this approach, not only is cross-language reproducibility a concern, but reproducibility across multiple future versions of Python could also be a concern. It is therefore not recommended.

import random

def id_to_choice(id_num, num_choices):
    localrandom = random.Random(id_num)
    choice = localrandom.randrange(num_choices)
    return choice

>>> id_to_choice(123, 3)
0
>>> id_to_choice(456, 3)
2
Acumenus
0

An alternative is to encrypt the user ID. If you keep the encryption key the same, then each input number will encrypt to a different output number, up to the block size of the cipher you use. DES uses 64-bit blocks, which cover IDs 0 to 18446744073709551615. That will give a random-appearing replacement for the user ID, which is guaranteed not to give two different user IDs the same 'random' number because encryption is a one-to-one permutation of the block values.
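
Python's standard library has no DES, so the following is only a minimal sketch of the same idea using a hand-rolled Feistel network with SHA-256 as the round function. It illustrates a keyed one-to-one mapping on 64-bit values; it is not DES and not a vetted cipher. The names `permute_id` and `unpermute_id`, the key and the round count are all assumptions made for illustration.

```python
import hashlib

def _round_value(half, key, round_no):
    # Keyed round function: 32-bit half in, 32-bit pseudorandom value out.
    data = key + bytes([round_no]) + half.to_bytes(4, 'big')
    return int.from_bytes(hashlib.sha256(data).digest()[:4], 'big')

def permute_id(id_num, key, rounds=4):
    """Map a 64-bit ID to another 64-bit value, one-to-one for a fixed key."""
    left, right = id_num >> 32, id_num & 0xFFFFFFFF
    for r in range(rounds):
        left, right = right, left ^ _round_value(right, key, r)
    return (left << 32) | right

def unpermute_id(value, key, rounds=4):
    """Inverse of permute_id, shown only to demonstrate the mapping is one-to-one."""
    left, right = value >> 32, value & 0xFFFFFFFF
    for r in reversed(range(rounds)):
        left, right = right ^ _round_value(left, key, r), left
    return (left << 32) | right
```

A choice could then be derived as `permute_id(user_id, key) % num_choices`, assuming the ID fits in 64 bits.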

rossum
0

I apologize that I don't have a Python implementation, but I do have a clear, readable implementation in Java which should be easy to translate into Python with minimal effort. The following generators produce long, predictable, evenly distributed sequences covering the whole range except zero:

XorShift ( http://www.arklyffe.com/main/2010/08/29/xorshift-pseudorandom-number-generator )

public int nextQuickInt(int number) {
    number ^= number << 11;
    number ^= number >>> 7;
    number ^= number << 16;
    return number;
}

public short nextQuickShort(short number) {
    number ^= number << 11;
    number ^= (number & 0xFFFF) >>> 5;  // mask first: a short is sign-extended to int before >>>
    number ^= number << 3;
    return number;
}

public long nextQuickLong(long number) {
    number ^= number << 21;
    number ^= number >>> 35;
    number ^= number << 4;
    return number;
}

or XorShift128Plus (need to re-seed state0 and state1 to non-zero values before using, http://xoroshiro.di.unimi.it/xorshift128plus.c)

public class XorShift128Plus {

private long state0, state1; // At least one of these must be non-zero

public long nextLong() {
    long state1 = this.state0;
    long state0 = this.state0 = this.state1;
    state1 ^= state1 << 23;
    return (this.state1 = state1 ^ state0 ^ (state1 >>> 18) ^ (state0 >>> 5)) + state0;  // >>> matches the unsigned shifts of the C original
}

public void reseed(...) {
    this.state0 = ...;
    this.state1 = ...;
}

}

or XorOshiro128Plus (http://xoroshiro.di.unimi.it/)

public class XorOshiro128Plus {

private long state0, state1; // At least one of these must be non-zero

public long nextLong() {
    long state0 = this.state0;
    long state1 = this.state1;
    long result = state0 + state1;
    state1 ^= state0;
    this.state0 = Long.rotateLeft(state0, 55) ^ state1 ^ (state1 << 14);
    this.state1 = Long.rotateLeft(state1, 36);
    return result;
}

public void reseed() {

}

}

or SplitMix64 (http://xoroshiro.di.unimi.it/splitmix64.c)

public class SplitMix64 {

private long state;

public long nextLong() {
    long result = (state += 0x9E3779B97F4A7C15L);
    // >>> (unsigned shift) matches the uint64_t shifts of the C original
    result = (result ^ (result >>> 30)) * 0xBF58476D1CE4E5B9L;
    result = (result ^ (result >>> 27)) * 0x94D049BB133111EBL;
    return result ^ (result >>> 31);
}

public void reseed() {
    this.state = ...;
}
}
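
As an illustration of how such a translation might look, here is a sketch of the SplitMix64 step in Python, used as a stateless mixer with the user ID taken as the state. Python integers are unbounded, so 64-bit wraparound is emulated with a mask. The names `splitmix64_mix` and `id_to_choice_splitmix` are hypothetical, and IDs are assumed to fit in 64 bits.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned arithmetic

def splitmix64_mix(state):
    """One SplitMix64 step: advance the state by the constant, then scramble it."""
    z = (state + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def id_to_choice_splitmix(id_num, num_choices):
    # Use the ID itself as the state, then reduce the mixed value to a choice.
    return splitmix64_mix(id_num) % num_choices
```

Because only integer arithmetic is involved, the same result should be reproducible in any language with 64-bit unsigned (or masked) integers.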

or XorShift1024Mult (http://xoroshiro.di.unimi.it/xorshift1024star.c) or Pcg64_32 (http://www.pcg-random.org/, http://www.pcg-random.org/download.html)

oᴉɹǝɥɔ
  • So what do these four (non-Python) PRNGs provide that's better than A-B-B's answer, to the point where OP would want to port them? – pjs Jan 04 '17 at 00:17
  • 1
    Options sir. This answer is not better, it rather complements the first answer with alternatives. The focus is not even on these 6 specific alternatives but rather on a direction to look and explore. – oᴉɹǝɥɔ Jan 05 '17 at 01:51
-1

The simplest method is to take user_id modulo the number of options:

choice = user_id % number_of_options

It's very easy and fast. However, anyone who knows the user IDs may be able to guess the algorithm.
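
One caveat worth checking against the question's requirement of equal probability irrespective of the input distribution: plain modulo passes any pattern in the IDs straight through to the choices. A small sketch, where `plain_modulo_choice` and the even-only ID set are made up for illustration:

```python
from collections import Counter

def plain_modulo_choice(user_id, number_of_options):
    return user_id % number_of_options

# Hypothetical skewed input: a system that only ever issues even user IDs.
even_ids = range(0, 1000, 2)
counts = Counter(plain_modulo_choice(i, 2) for i in even_ids)
# Every even ID maps to choice 0, so choice 1 is never selected.
```

Here `counts` is `Counter({0: 500})`, which is why a hash or other mixing step is applied before the modulo in the other answers.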

Alternatively, pseudorandom values can be obtained from random seeded with a per-user constant (e.g. user_id):

>>> import random
>>> def generate_random_value(user_id):
...     random.seed(user_id)
...     return random.randint(1, 10000)
...
>>> [generate_random_value(x) for x in range(20)]
[6312, 2202, 927, 3899, 3868, 4186, 9402, 5306, 3715, 7586, 9362, 7412, 7776, 4244, 1751, 3424, 5924, 8553, 2970, 709]
>>> [generate_random_value(x) for x in range(20)]
[6312, 2202, 927, 3899, 3868, 4186, 9402, 5306, 3715, 7586, 9362, 7412, 7776, 4244, 1751, 3424, 5924, 8553, 2970, 709]
>>>
Acumenus
Eugene Lisitsky