7

I'm interested in creating TinyURL-like links. My idea was to simply store an incrementing identifier for every long URL posted and then convert this id to its base-36 representation, like the following in PHP:

$tinyurl = base_convert($id, 10, 36);

The problem is that the result is guessable. It has to be hard to guess what the next URL is going to be, while still being short (tiny). E.g., at the moment, if my last tiny URL was a1, the next one will be a2. This is a bad thing for me.

So, how would I make sure that the resulting tiny URL is hard to guess but still short?
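
To make the problem concrete, here is a minimal sketch of the sequential scheme, written in Python rather than PHP purely for illustration. The helper name `to_base36` is hypothetical; it is meant to mirror what `base_convert($id, 10, 36)` does for non-negative integers.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Encode a non-negative integer in base 36 (hypothetical helper)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


# Consecutive ids give consecutive codes, which is the guessability problem.
print(to_base36(361), to_base36(362))  # a1 a2
```

Anyone who sees `a1` can simply try `a2` next.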

Charles
Tom

10 Answers

9

What you are asking for is a balance between reduction of information (URLs to their indexes in your database), and artificial increase of information (to create holes in your sequence).

You have to decide how important each is to you. Another question is whether you only want to stop sequential URLs from being guessable, or whether the ids should be random enough that guessing any valid URL is difficult.

Basically, you want to declare n out of N possible ids as valid. Choose a smaller N to make the URLs shorter, and a smaller n (relative to N) to make valid URLs harder to guess. Increase both n and N to generate more URLs once the shorter ones are taken.

To assign the ids, take any kind of random generator or hash function and reduce its output to your target range N. If you detect a collision, draw the next random value. Once you have reached a count of n unique ids, you must enlarge your ID set (n and N).
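
The assignment step described above could be sketched roughly as follows, in Python for illustration. This is only a sketch under assumptions: the `used` set stands in for a unique index in the database, and `assign_code`, `space_size` (the N above) and `to_base36` are hypothetical names.

```python
import secrets

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Encode a non-negative integer in base 36 (hypothetical helper)."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def assign_code(used: set, space_size: int) -> str:
    """Draw random ids below space_size until an unused one turns up."""
    if len(used) >= space_size:
        raise RuntimeError("id space exhausted; increase space_size")
    while True:
        candidate = secrets.randbelow(space_size)
        if candidate not in used:  # collision check; a DB unique index in practice
            used.add(candidate)
            return to_base36(candidate)


used_ids = set()
code = assign_code(used_ids, 36 ** 4)  # at most 4 base-36 characters
```

The fuller the id space gets, the more retries are needed, which is one reason to keep n well below N.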

relet
  • Regarding your last paragraph. I think he wants a value he can reverse, i.e., he wants an injective function. – Artefacto Aug 06 '10 at 21:51
  • No, he wants an unguessable function, really. ;) As he has to store the URLs in a database anyway, he can use the random number as an index. Reversal achieved. – relet Aug 06 '10 at 21:59
  • True, does not have to be injective. – Tom Aug 06 '10 at 22:09
5

I would simply take the CRC32 of the URL:

$url = 'http://www.google.com';
$tinyurl = hash('crc32', $url ); // db85f073

Cons: the identifier is always 8 characters long.
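
If the 8 hex characters are too long, one option is to re-encode the 32-bit checksum in base 36, which needs at most 7 characters. A rough Python sketch follows; `crc32_code` and `to_base36` are hypothetical names, and Python's `zlib.crc32` is not guaranteed to produce the same value as the PHP call above.

```python
import zlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Encode a non-negative integer in base 36 (hypothetical helper)."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def crc32_code(url: str) -> str:
    """Base-36 form of the URL's CRC32 checksum (an unsigned 32-bit integer)."""
    checksum = zlib.crc32(url.encode("utf-8"))
    return to_base36(checksum)


code = crc32_code("http://www.google.com")  # between 1 and 7 characters
```

Two different URLs can still share a checksum, so a duplicate check on insert is needed either way.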

dev-null-dweller
  • I like this idea, but the 8-character code is kind of a problem - with URL shorteners these days, every character counts, and 8 is a little high. – Joe Enos Aug 06 '10 at 21:55
4

This is a really cheap trick, but as long as users don't know it's happening, the result is much harder to guess: prefix and suffix the actual id with 2 or 3 random letters or digits.

If I saw 9d2a1me3 I wouldn't guess that dm2a2dq2 was the next in the series.
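
A minimal sketch of this padding idea, in Python for illustration. The pad length of 3 matches the example above; `pad_code` and `unpad_code` are hypothetical names.

```python
import secrets

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
PAD = 3  # random characters added on each side (assumed value)


def pad_code(code: str) -> str:
    """Wrap the real code in PAD random characters on each side."""
    prefix = "".join(secrets.choice(ALPHABET) for _ in range(PAD))
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(PAD))
    return prefix + code + suffix


def unpad_code(padded: str) -> str:
    """Strip the random padding to recover the real code."""
    return padded[PAD:-PAD]


padded = pad_code("a1")       # 8 characters, e.g. of the form xxxa1yyy
original = unpad_code(padded)  # "a1"
```

If the lookup only strips the padding, any padding resolves to the same id, so the padded form would have to be stored and compared for the padding to act as more than disguise.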

BarrettJ
2

Try XORing the $id with some value, e.g. $id ^ 46418. To convert back to your original id, you just perform the same XOR again, i.e. $mungedId ^ 46418. Stack this together with your base_convert, and perhaps some swapping of characters in the resulting string, and it'll get quite tricky to guess a URL.
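
The XOR round trip can be sketched like this, in Python for illustration. The mask 46418 is the value from the answer; `encode_id`, `decode_id` and `to_base36` are hypothetical names.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
MASK = 46418  # secret value from the answer; any fixed integer works


def to_base36(n: int) -> str:
    """Encode a non-negative integer in base 36 (hypothetical helper)."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def encode_id(n: int) -> str:
    """XOR the id with the mask, then base-36 encode the result."""
    return to_base36(n ^ MASK)


def decode_id(code: str) -> int:
    """Reverse encode_id: base-36 decode, then XOR with the same mask."""
    return int(code, 36) ^ MASK


code = encode_id(361)
assert decode_id(code) == 361
```

XOR alone keeps neighbouring ids close together numerically, which is why the answer suggests swapping characters on top of it.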

Will A
2

Another way would be to set the maximum number of characters for the URL (let's say it's n). You could then choose a random number between 1 and n!, which would be your permutation number.

On each new URL, you would increment the id and use the permutation number to derive the actual id that would be used. Finally, you would base 32 (or whatever) encode your URL. This would be perfectly random and perfectly reversible.

Artefacto
  • Duplicate IDs are possible though in this way, so you'd have to check for that and increment again if duplicate. – Tom Aug 06 '10 at 23:39
1

If you want an injective function, you can use any form of encryption. For instance:

<?php
$key = "my secret";
$enc = mcrypt_ecb (MCRYPT_3DES, $key, "42", MCRYPT_ENCRYPT);
$f = unpack("H*", $enc);
$value = reset($f);
var_dump($value); //string(16) "1399e6a37a6e9870"

To reverse:

$rf = pack("H*", $value);
$dec = rtrim(mcrypt_ecb (MCRYPT_3DES, $key, $rf, MCRYPT_DECRYPT), "\x00");
var_dump($dec); //string(2) "42"

This will not give you a number in base 32; it will give you the encrypted data with each byte converted to base 16 (i.e., the conversion is global). If you really need to, you can trivially convert this to base 10 and then to base 32 with any library that supports big integers.
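
The re-encoding step can be sketched in Python, whose integers are arbitrary precision, so no extra library is needed. The sketch uses base 36 as in the question (base 32 would work the same way); `hex_to_base36`, `base36_to_hex` and `to_base36` are hypothetical names.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Encode a non-negative integer in base 36 (hypothetical helper)."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def hex_to_base36(hex_str: str) -> str:
    """Treat the whole hex string as one big integer and re-encode it."""
    return to_base36(int(hex_str, 16))


def base36_to_hex(code: str, width: int = 16) -> str:
    """Reverse the conversion, restoring leading zeros up to `width` digits."""
    return format(int(code, 36), "x").zfill(width)


short = hex_to_base36("1399e6a37a6e9870")  # at most 13 characters for 64 bits
assert base36_to_hex(short) == "1399e6a37a6e9870"
```

The width has to be fixed when decoding, because leading zero bytes of the ciphertext disappear in the integer form.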

Artefacto
0

You can pre-define the 4-character codes in advance (all possible combinations), then randomize that list and store it in this random order in a data table. When you want a new value, just grab the first one off the top and remove it from the list. It's fast, no on-the-fly calculation, and guarantees pseudo-randomness to the end-user.
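
As a rough illustration of the pre-generated pool, here is a Python sketch. An in-memory list stands in for the data table, and `build_pool` is a hypothetical name; the demo uses 2-character codes to stay small, while 4 characters over this alphabet would give 36^4 = 1,679,616 entries.

```python
import itertools
import random

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def build_pool(length: int) -> list:
    """Generate every code of the given length, then shuffle the list."""
    pool = ["".join(chars) for chars in itertools.product(ALPHABET, repeat=length)]
    random.SystemRandom().shuffle(pool)  # OS-provided randomness for the order
    return pool


pool = build_pool(2)    # 36^2 = 1,296 codes for the demo
next_code = pool.pop()  # take one code off the list and never reuse it
```

In a real table the same effect could come from storing the shuffled position alongside each code and deleting rows as they are handed out.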

Joe Enos
  • I should point out that this is exactly what I did for a URL shortener, and it's a bit of a pain to get started. There are an awful lot of possible combinations, which means you start out with a huge database file for such a simple concept. – Joe Enos Aug 06 '10 at 21:53
  • @relet What exactly are you referring to? The fact that there's a limited number that cannot increase? If that's it, then once you start running out of 4-character codes, then calculate all the 5-character codes and insert that into your queue table. – Joe Enos Aug 06 '10 at 21:54
0

Hashids is an open-source library that generates short, unique, non-sequential, YouTube-like ids from one or many numbers. You can think of it as an algorithm to obfuscate numbers.

It converts numbers like 347 into strings like "yr8", or arrays like [27, 986] into "3kTMd". You can also decode those ids back. This is useful for bundling several parameters into one, or simply for using them as short UIDs.

Use it when you don't want to expose your database ids to the user.

It allows custom alphabet as well as salt, so ids are unique only to you.

Incremental input is mangled to stay unguessable.

There are no collisions because the method is based on integer to hex conversion.

It was written with the intent of placing created ids in visible places, like the URL. Therefore, the algorithm avoids generating most common English curse words.

Code example

$hashids = new Hashids();
$id = $hashids->encode(1, 2, 3); // o2fXhV
$numbers = $hashids->decode($id); // [1, 2, 3]
Demis Palma ツ
0

You can use this approach to generate strings that may not be random but may appear difficult to predict.

Consider this example for 62 characters and for generating URL strings of length LEN=6

  1. Create a string map that maps numbers (used as indices) to characters, say "abc...zABC...Z012...9" for (here) 62 characters. You may shuffle them to make the sequence appear random.

  2. Find the largest integer N such that 2^N is less than 62^LEN. For this case N=35 (2^35 ≈ 3.4 × 10^10, while 62^6 ≈ 5.7 × 10^10).

  3. Now start a counter from 1. To generate a new URL, convert this counter to a binary string and swap a few bits with each other (the same swap must be applied to every binary string you generate; you may simply reverse the binary string).

  4. Convert that binary number back to an integer, then convert that integer to a base-62 number, mapping every remainder to a character in the character map created in the first step.

This is a simple implementation in Java. I referred to this for an easy way to pad with 0s:

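import java.util.concurrent.atomic.AtomicLong;

public class TinyUrlGenerator { // class name added only so the snippet compiles

    private static final AtomicLong counter = new AtomicLong(0);
    private static final String MAP = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz"; // 62 characters
    private static final int BITS = 35; // largest N with 2^N < 62^6
    private static final int LEN = 6;   // length of the generated string

    public static String generateNextURL() {
        long num = counter.incrementAndGet();
        // Set bit 35 as a sentinel, then drop it: leaves the counter zero-padded to 35 bits
        String binary = Long.toBinaryString((1L << BITS) | num).substring(1);
        binary = new StringBuilder(binary).reverse().toString(); // swap bit i with bit 35-1-i
        num = Long.parseLong(binary, 2);                         // back to integer

        // Base-62 encode, least significant digit first
        StringBuilder url = new StringBuilder();
        for (int i = 0; i < LEN; i++) {
            url.append(MAP.charAt((int) (num % MAP.length())));
            num /= MAP.length();
        }

        String newURL = url.reverse().toString();
        System.out.println(newURL);
        return newURL;
    }
}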

Have a look at the URLs generated below. This code can generate about 2^35 strings without any duplicates, and they look fairly random from the first counter value onward. You can shuffle the characters in the character map as well.

SkeymQ
JXU4YI
2Hz3KY
EgfPME
X1UNyU
ODzjaM
-1

I ended up creating an MD5 sum of the identifier and using its first 4 characters. If that is a duplicate, I simply increase the length until it is no longer a duplicate.

function idToTinyurl($id) {
    $md5 = md5($id);
    for ($i = 4; $i < strlen($md5); $i++) {
        $possibleTinyurl = substr($md5, 0, $i);
        $res = mysql_query("SELECT id FROM tabke WHERE tinyurl='".$possibleTinyurl."' LIMIT 1");
        if (mysql_num_rows($res) == 0) return $possibleTinyurl;
    }
    return $md5;
}
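
The same strategy can be sketched without a database, in Python for illustration. This is a sketch under assumptions, not a port of the function above: the `taken` set stands in for the table lookup, `id_to_tinyurl` and `min_len` are hypothetical names, and unlike the PHP version it raises instead of returning the full digest when every prefix is taken.

```python
import hashlib


def id_to_tinyurl(id_: int, taken: set, min_len: int = 4) -> str:
    """Return the shortest unused prefix of md5(id), starting at min_len characters."""
    digest = hashlib.md5(str(id_).encode("ascii")).hexdigest()
    for length in range(min_len, len(digest) + 1):
        candidate = digest[:length]
        if candidate not in taken:  # stands in for the SELECT ... LIMIT 1 check
            taken.add(candidate)
            return candidate
    raise RuntimeError("every prefix of the digest is already taken")


taken_codes = set()
code = id_to_tinyurl(42, taken_codes)  # 4 hex characters while no duplicate exists
```

With 4 hex characters there are only 16^4 = 65,536 possible codes, so prefixes start growing fairly early.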

Accepted relet's answer as it led me to this strategy.

Tom