
I have a table that has to be populated every hour from data pulled out of Greenplum. This table is stored on the Greenplum server.

So, what I want to know is what method (Python script, Windows scheduler, or anything else) would be suitable for scheduling the query (written in PostgreSQL) to run every hour, given that my data can be as large as 60 GB or more.

Can someone attach sample code for this?

Nisha Singh

1 Answer


You will want to do a parallel COPY of ranges of the data out of Greenplum and into PostgreSQL. Make sure PostgreSQL is set up for fast data loading. If possible, use an UNLOGGED table; otherwise, at least set wal_level = 'minimal'.
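
For example, a minimal sketch of preparing the target-side staging table with psycopg2; the DSN, the table name mytable_load, and its two-column schema are placeholders for your real setup:

# Recreate an UNLOGGED staging table before each hourly load.
# DSN, table name and columns are placeholders for your real schema.
import psycopg2

conn = psycopg2.connect("host=postgres-box dbname=target user=loader")
with conn, conn.cursor() as cur:   # one transaction, committed on success
    cur.execute("DROP TABLE IF EXISTS mytable_load")
    cur.execute("CREATE UNLOGGED TABLE mytable_load (id bigint, payload text)")
conn.close()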

How many parallel workers depends on the PostgreSQL server's I/O subsystem more than anything. Test and see.

I would recommend using Python with psycopg2 and the copy_expert cursor function. See the docs. Use multiprocessing with a pipe to share a file-like object between a reader and a writer worker, with the reader connected to Greenplum and the writer to PostgreSQL.

So effectively each worker does something a bit like the following shell pseudo-code:

psql -h greenplum-box \
     -c "COPY (SELECT * FROM mytable WHERE id BETWEEN 1 AND 10000) TO STDOUT" \
  | \
psql -h postgres-box \
     -c "COPY mytable FROM STDIN";

(but you connect the two up using psycopg2, copy_expert, multiprocessing, and a pipe, as in the sketch below).
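
In Python that might look roughly like this. It is only a sketch: the connection strings, the table name mytable, and the id ranges are placeholders, and it assumes psycopg2 on a Unix-like system where multiprocessing forks, so the pipe file descriptors are inherited by the child processes.

# Minimal sketch: stream COPY TO STDOUT from Greenplum into COPY FROM STDIN on
# PostgreSQL through an OS pipe, one (reader, writer) pair per id range.
# Assumes psycopg2 and a Unix-like OS (multiprocessing must fork so the pipe
# file descriptors are inherited). DSNs, table name and ranges are placeholders.
import os
import multiprocessing

import psycopg2

GREENPLUM_DSN = "host=greenplum-box dbname=source user=loader"  # placeholder
POSTGRES_DSN = "host=postgres-box dbname=target user=loader"    # placeholder


def reader(read_fd, write_fd, lo, hi):
    """COPY one id range out of Greenplum into the write end of the pipe."""
    os.close(read_fd)  # this process only writes to the pipe
    conn = psycopg2.connect(GREENPLUM_DSN)
    try:
        sql = ("COPY (SELECT * FROM mytable "
               "WHERE id BETWEEN {} AND {}) TO STDOUT".format(lo, hi))  # lo/hi are our own ints
        with os.fdopen(write_fd, "wb") as pipe_out, conn.cursor() as cur:
            cur.copy_expert(sql, pipe_out)
    finally:
        conn.close()


def writer(read_fd, write_fd):
    """COPY rows from the read end of the pipe into PostgreSQL."""
    os.close(write_fd)  # must close, or we never see EOF when the reader finishes
    conn = psycopg2.connect(POSTGRES_DSN)
    try:
        with os.fdopen(read_fd, "rb") as pipe_in, conn.cursor() as cur:
            cur.copy_expert("COPY mytable FROM STDIN", pipe_in)
        conn.commit()
    finally:
        conn.close()


def copy_range(lo, hi):
    """One worker: a reader and a writer joined by an OS pipe."""
    read_fd, write_fd = os.pipe()
    procs = [
        multiprocessing.Process(target=writer, args=(read_fd, write_fd)),
        multiprocessing.Process(target=reader, args=(read_fd, write_fd, lo, hi)),
    ]
    for p in procs:
        p.start()
    os.close(read_fd)   # the parent holds copies too; close them so EOF propagates
    os.close(write_fd)
    for p in procs:
        p.join()


if __name__ == "__main__":
    # Run a few ranges in parallel; tune the ranges and the number of workers
    # against your data volume and the PostgreSQL server's I/O capacity.
    ranges = [(1, 10000), (10001, 20000), (20001, 30000)]
    workers = [multiprocessing.Process(target=copy_range, args=r) for r in ranges]
    for w in workers:
        w.start()
    for w in workers:
        w.join()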

Do all the usual fast loading work like creating indexes afterwards. See how to speed up insertion performance in PostgreSQL.

If you have the disk space, create a table like dataload_temp, populate it, then in one transaction drop the old one and rename the new one to the old one's name. That way there's minimal disruption.
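
A sketch of that swap with psycopg2, assuming the freshly loaded table is called dataload_temp and the live table is mytable (both placeholder names):

# Swap the freshly loaded table in under the live name in a single transaction.
# Table names and DSN are placeholders.
import psycopg2

conn = psycopg2.connect("host=postgres-box dbname=target user=loader")
with conn:                       # one transaction: commit or roll back as a unit
    with conn.cursor() as cur:
        cur.execute("DROP TABLE IF EXISTS mytable")
        cur.execute("ALTER TABLE dataload_temp RENAME TO mytable")
conn.close()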

Alternatively, take a look at pg_bulkload for off-line (but non-streaming) bulk data loading.

Craig Ringer
  • @craig..I am learning GP, PostgreSQL and Python all at once and I have been working on it for the past 20 hours only. The links are helpful but I have one confusion, i.e. do I need to use copy_to and copy_from if I have queries of the following kind: `cur.execute(create table abc.....); cur.execute(insert into table abc values...(select....));` – Nisha Singh Feb 26 '14 at 05:59
  • @user3336746 You use `COPY ... FROM STDOUT` *instead of* `INSERT INTO ...`. It's a more efficient way to load data. See the PostgreSQL user manual. http://www.postgresql.org/docs/current/static/sql-copy.html . And you can use `COPY ... TO STDOUT` to extract data from Greenplum. Hence my recommendation to connect the two together using a pipe to stream data between the DBs. – Craig Ringer Feb 26 '14 at 06:21
  • Added some pseudo code that might help you understand. You will need to do some study to learn how to do this quickly and in parallel. – Craig Ringer Feb 26 '14 at 06:26
  • I don't need a file! What I need is just a subset of the data on Greenplum, and this subset is to be stored on Greenplum only. So why do I need a file? – Nisha Singh Feb 26 '14 at 13:19
  • @user3336746 I think you must've misunderstood. Read again. There's a pipe between two `COPY`s. Do you know what a pipe is in unix? – Craig Ringer Feb 26 '14 at 23:09