
Due to some limitations I want to switch my current project from EventMachine/EM-Synchrony to Celluloid, but I'm having trouble getting to grips with it. The project I'm working on is a web harvester that should crawl tons of pages as fast as possible.

To get a basic understanding of Celluloid I've generated 10,000 dummy pages on a local web server and want to crawl them with this simple Celluloid snippet:

#!/usr/bin/env jruby --1.9

require 'celluloid'
require 'open-uri'

IDS = 1..9999
BASE_URL = "http://192.168.0.20/files"

class Crawler
  include Celluloid
  def read(id)
    url = "#{BASE_URL}/#{id}"
    puts "URL: " + url
    open(url) { |x| x.read }
  end
end

# Pool of 100 crawler actors; each actor runs in its own thread
pool = Crawler.pool(size: 100)

# Fire one future per URL (the results are never collected here)
IDS.to_a.map do |id|
  pool.future(:read, id)
end

As far as I understand Celluloid, futures are the way to go to get the response of a fired request (comparable to callbacks in EventMachine), right? The other thing is that every actor runs in its own thread, so I need some way of batching the requests, because 10,000 threads would cause errors on my OS X dev machine.
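To illustrate what I mean by futures, here is a minimal sketch of the semantics as I understand it (the Worker class and its double method are made up purely for the example):

require 'celluloid'

class Worker
  include Celluloid
  def double(n)
    n * 2
  end
end

w = Worker.new
future = w.future(:double, 21) # returns immediately; the work runs on the actor's thread
puts future.value              # blocks until the result is available => 42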

So creating a pool is the way to go, right? BUT: the code above iterates over the 9999 URLs, yet only about 1300 HTTP requests reach the web server. So something goes wrong with limiting the requests and iterating over all the URLs.


1 Answer


Likely your program is exiting as soon as all of your futures are created. With Celluloid a future will start execution but you can't be assured of it finishing until you call #value on the future object. This holds true for futures in pools as well. Probably what you need to do is change it to something like this:

  crawlers = IDS.to_a.map do |id|
    begin
      pool.future(:read, id)
    rescue Celluloid::DeadActorError, Celluloid::MailboxError
    end
  end

  # Block on each future; #value forces the request to run to completion
  crawlers.compact.each { |crawler| crawler.value rescue nil }
  • Hm, next issue :) Previously I coded the crawler in EventMachine. Every time a callback delivered the HTML data, I fired the next request. In the example above I have to iterate over all crawlers to get the callback data, and each of the actors returns 100 KB of HTML data. So it works fine for a few hundred pages, but how do I handle requests to hundreds of thousands of pages? Every request should have a max timeout of 10 seconds, for example, so batching thousands of times over bunches of 100 requests with such a timeout doesn't seem to be a good approach. Any recommendations? – ctp Sep 06 '12 at 11:38
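One possible direction for this follow-up question (not part of the original answer, just a sketch): instead of fixed batches, keep a bounded window of futures in flight and harvest the oldest one before submitting the next URL, with a per-request timeout inside the actor. The WINDOW constant, the urls array, and the Timeout wrapper below are assumptions for illustration:

require 'celluloid'
require 'open-uri'
require 'timeout'

class Crawler
  include Celluloid

  # Each fetch is wrapped in a 10-second timeout so one slow page
  # cannot stall its worker thread indefinitely.
  def read(url)
    Timeout.timeout(10) { open(url) { |f| f.read } }
  rescue Timeout::Error, OpenURI::HTTPError
    nil
  end
end

WINDOW = 100
pool   = Crawler.pool(size: WINDOW)
urls   = (1..9999).map { |id| "http://192.168.0.20/files/#{id}" }

in_flight = []
urls.each do |url|
  in_flight << pool.future(:read, url)
  next if in_flight.size < WINDOW
  html = in_flight.shift.value rescue nil  # harvest the oldest request
  # process html here (parse, store, enqueue follow-up URLs, ...)
end

# Drain whatever is still outstanding at the end
in_flight.each { |f| f.value rescue nil }

Because results are harvested one at a time as the window fills, memory stays bounded at roughly WINDOW responses, and a single slow request delays the loop by at most the 10-second timeout rather than holding up an entire batch.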