1

I have a Perl DBM hash containing a list of URLs that I want to pick randomly from to load balance spidering sites. As a result I want to pick a key at random, or select the nth element (so I can randomise n).

I'm aware this goes against the concept of a hash, but is this possible?

NOTE: missed a valuable point that hash size will be too large to load all the keys to randomly select.

brian d foy
  • 121,466
  • 31
  • 192
  • 551
Paul
  • 1,618
  • 1
  • 17
  • 24

4 Answers4

3

I don't think any of the DBM packages have an API for retrieving a random key, or for retrieving keys by index number. You can look up a particular key, or you can read through all the keys in whatever order the database chooses to return them in (which may change if the database is modified, and may or may not be "random" enough for whatever you want to do).

You could read through all the keys and pick one, but that would require reading the entire database each time (or at least a sizable chunk of it), and that's probably too slow.

I think you'll need to rearrange your data structure.

  1. You could use a real SQL database (like SQLite), so you could look up rows both by a sequential row number and by URL. This would be the most flexible.

  2. You could use a sequential integer as the key for your DBM file. That would make picking a random one easy, but you could no longer look up entries by URL.

  3. You could use two DBM files: the one you have now and a second keyed by sequential integer with the URL as value. (Actually, since URLs don't look like integers, you could store both sets of records in the same DBM file, but that would complicate any code that uses each.) This would use twice the disk space, and would make inserting/removing entries a bit more complicated. You'd probably be better off with approach #1, unless you can't install SQLite for some reason.

cjm
  • 59,511
  • 9
  • 121
  • 166
  • related question for 'real SQL databases': http://stackoverflow.com/questions/19412/how-to-request-a-random-row-in-sql – plusplus Jan 28 '10 at 09:34
2

Picking a random element from an array is simpler so you can use keys(%foo) to get the array of keys and pick randomly from that.

I believe this will return a random element $x from an array:

$x = $array[rand @array];

If you want to shuffle an array, consider List::Util::shuffle. See http://search.cpan.org/perldoc/List::Util#shuffle_LIST

dreeves
  • 25,132
  • 42
  • 147
  • 226
  • 1
    There is already `List::Util::shuffle`. See http://search.cpan.org/perldoc/List::Util#shuffle_LIST – Sinan Ünür Jan 27 '10 at 22:19
  • 1
    There's no need to write your own `shuffle` when the one from List::Util is more robust (and faster). – friedo Jan 27 '10 at 22:19
  • Thanks for the advice it's a spot on solution where hash size is manageable. Is there an alternative for hashes too large to load all the keys? – Paul Jan 27 '10 at 22:26
1

Of course, it is possible. First, get a list of the keys. Then, randomize the list, using shuffle from List::Util.

Then, loop over the keys.

If there are too many keys (so keeping them all in a list and shuffling is not possible), just remember that you are using tied hashes. Just use each to iterate over key value pairs.

The order will be deterministic but AFAIK, it will not be alphabetical or order of insertion. That, by itself, might be able to get you what you want.

Sinan Ünür
  • 113,391
  • 15
  • 187
  • 326
  • Thanks for the advice. Unfortunately my hash is too big to load all the keys. – Paul Jan 27 '10 at 22:31
  • PaulMdx - `each` returns a key and a value, iteratively over the hash. So, you can deal with them one at a time. – daotoad Jan 28 '10 at 02:04
1

You could use DBM::Deep instead of a traditional DB file to keep your data.

tie %hash, "DBM::Deep", {
    file => "foo.db",
    locking => 1,
    autoflush => 1
};

# $hash{keys} = [ ... ]
# $hash{urls} = { ... } <- same as your current DB file.

my $like_old = $hash{urls}; # a ref to a hash you can use like your old hashref.
my $count = @{$hash{keys}};

With that you can pull out random values as needed.

daotoad
  • 25,670
  • 7
  • 54
  • 99