
I want to scrape a large number of webpages (1000/second) and save 1-2 numbers from each of these pages into a database. I want to manage the workers with RabbitMQ, but I also have to write the data somewhere.

Heroku PostgreSQL has a concurrency limit of 60 requests in their cheapest production tier. Is PostgreSQL the best solution for this job? Is it possible to set up a Postgres database to perform 1000 writes per second in development on my local machine?
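
For context, each worker would look roughly like this - a minimal sketch assuming Python with pika and psycopg2, where the queue name, table and connection details are just placeholders:

```python
# Rough worker sketch: pull a URL from RabbitMQ, scrape 1-2 numbers from the
# page, and insert them into PostgreSQL. All names here are placeholders.
import re

import pika
import psycopg2
import requests

pg = psycopg2.connect("dbname=scraper")  # placeholder connection string
cur = pg.cursor()

def handle(ch, method, properties, body):
    url = body.decode()
    page = requests.get(url, timeout=5).text
    numbers = re.findall(r"\d+(?:\.\d+)?", page)[:2]  # the 1-2 numbers I care about
    cur.execute(
        "INSERT INTO results (url, value_a, value_b) VALUES (%s, %s, %s)",
        (url,
         numbers[0] if len(numbers) > 0 else None,
         numbers[1] if len(numbers) > 1 else None),
    )
    pg.commit()  # one commit per page - is this realistic at 1000 pages/second?
    ch.basic_ack(delivery_tag=method.delivery_tag)

mq = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = mq.channel()
channel.queue_declare(queue="urls", durable=True)
channel.basic_qos(prefetch_count=10)
channel.basic_consume(queue="urls", on_message_callback=handle)
channel.start_consuming()
```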

nohayeye
  • What are your durability requirements? Do you mind if you lose the last 100 records on crash? The last hour's worth? The whole data set since the last backup? This affects what you can do, and how, a lot. – Craig Ringer Jan 12 '14 at 10:57
  • How fast is your local machine? How much RAM does it have? Do you have multiple drives? You haven't told us anything necessary for us to answer in a meaningful way. – the Tin Man Jan 13 '14 at 01:48

2 Answers


Is it possible to setup a Postgres Database to perform 1000 writes per second in development on my local machine?

Try it and see. If you've got an SSD, or don't require crash safety, then you almost certainly can.

You will find that with anything you choose, you have to make trade-offs between durability and write latency.

If you want to commit each record individually, in strict order, you should be able to achieve that on a laptop with a decent SSD. You will not get it on something like a cheap AWS instance or a server with a spinning-rust hard drive, though, as they don't have good enough disk flush rates (pg_test_fsync is a handy tool for checking this). This is true of anything that does genuine atomic commits of individual records to durable storage, not just PostgreSQL - about the best rate you're going to get is the maximum disk flush rate / 2, unless it's a purely append-only system, in which case the commit rate can equal the disk flush rate.
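
To put a number on that, a quick way to measure your machine's individual-commit rate is a tight insert-and-commit loop like this (a rough sketch rather than a proper benchmark; the table name and connection string are placeholders):

```python
# Measure individual-commit throughput: one INSERT plus one COMMIT per record,
# so with default settings the rate is bounded by how fast the disk can flush.
import time

import psycopg2

conn = psycopg2.connect("dbname=scraper")  # placeholder connection string
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS bench (url text, value numeric)")
conn.commit()

n = 2000
start = time.time()
for i in range(n):
    cur.execute("INSERT INTO bench (url, value) VALUES (%s, %s)",
                ("http://example.com/%d" % i, i))
    conn.commit()  # each commit waits for a WAL flush to disk
elapsed = time.time() - start
print("%.0f individual commits/second" % (n / elapsed))
```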

If you want higher throughput, you'll need to batch writes together and commit them in groups to spread out the disk sync overhead. In the case of PostgreSQL, the commit_delay option can be useful for batching commits together. Better still, buffer a few changes client-side and do multi-valued inserts. Turn off synchronous_commit for a transaction if you don't need a hard guarantee that it's committed before returning control to your program.
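
For example, buffering rows client-side and writing them as one multi-valued insert per transaction might look like this with psycopg2 (a sketch; the table and connection string are placeholders, and SET LOCAL limits the setting to the current transaction):

```python
# Batched writes: collect rows client-side, insert them with one multi-valued
# INSERT, and relax synchronous_commit for just this transaction so the commit
# doesn't wait for the WAL flush.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=scraper")  # placeholder connection string
cur = conn.cursor()

rows = [("http://example.com/%d" % i, i) for i in range(500)]  # buffered results

cur.execute("SET LOCAL synchronous_commit = off")  # this transaction only
execute_values(cur, "INSERT INTO bench (url, value) VALUES %s", rows)
conn.commit()  # one commit (and one eventual flush) for the whole batch
```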

I haven't tested it, but I expect Heroku will allow you to set both of these parameters on your sessions using SET synchronous_commit = off or SET commit_delay = .... You should test it and see. In fact, you should run a simulated workload benchmark and see if you can make it go fast enough for your needs.
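
A quick check of whether a plain session is allowed to change these settings could look like this (a sketch; DATABASE_URL is whatever your provider hands you, and on newer PostgreSQL versions commit_delay can only be set by a superuser, so it may well be rejected):

```python
# Check whether an ordinary session may change these settings on the server.
import os

import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])  # e.g. the Heroku connection URL
conn.autocommit = True
cur = conn.cursor()

for stmt in ("SET synchronous_commit = off", "SET commit_delay = 10"):
    try:
        cur.execute(stmt)
        print("accepted:", stmt)
    except psycopg2.Error as exc:
        print("rejected:", stmt, "-", exc)

cur.execute("SHOW synchronous_commit")
print("synchronous_commit is now:", cur.fetchone()[0])
```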

If you can't, you'll be able to use alternative hosting that can handle it with appropriate configuration.

See also: How to speed up insertion performance in PostgreSQL

Craig Ringer

PostgreSQL is perfectly capable of handling such a job. Just to give you an idea, PostgreSQL 9.2 is expected to handle up to 14,000 writes per second, but this largely depends on how you configure, design and manage the database, and on the available hardware (disk performance, RAM, etc.).

I assume the limit imposed by Heroku is there to avoid potential overloads. You may want to consider installing PostgreSQL on a custom server or using alternative solutions. For instance, Amazon recently announced support for PostgreSQL on RDS.

Finally, I just want to mention that for the majority of standard tasks, the "best solution" largely depends on your knowledge. An efficiently configured MySQL is better than a badly configured PostgreSQL, and vice versa.

I know of companies that were able to achieve unexpected results with a specific database by heavily optimizing the setup and configuration of the engine. There are exceptions, of course, but I don't think they apply to your case.

Simone Carletti
    Totally agree that the best config is often dependent on knowledge, once core feature requirements are satisfied. Is a PostgreSQL-experienced user better off using Pg as a graph database, or learning a graph database? Probably not. – Craig Ringer Jan 12 '14 at 10:56
  • That Postgres can do it isn't the real question; it's whether it can do it on the OP's local machine, and we have no idea what that machine is or what else it'll be doing. 1000/sec on a machine dedicated to the DB is fine, but not if it's also running MQ and processing 1000 page downloads and scrapes/second. That's a LOT of nodes running at once with about 1 sec latency/request at best; I think the result will be a lot less than 1000/sec. – the Tin Man Jan 13 '14 at 01:53