
I am rewriting Python code in PL/pgSQL. The code slices a table and imports the slices back into the database as separate tables. Assume Func(..) is a function that returns some INT and Condition(..., ...) is some conditional expression.

Python

  • run copy_expert with COPY (SELECT *, (Func(B)) FROM TABLE A ORDER BY (SELECT Func(B) FROM B WHERE Condition(A,B)) TO STDOUT and write the output to a file
  • slice the resulting dump into n parts in Python by writing each part to a separate slice file
  • for each slice, open a new cursor and run copy_expert with COPY some_table FROM STDIN

PL/pgSQL

  • run something like

        DECLARE
           ...
           part_name VARCHAR[] = ARRAY[{part_names}];
           prep_stmt VARCHAR[] = ARRAY[{parts}];
        BEGIN 
            FOR i in 1..{parts} LOOP
                prep_stmt[i] := $$ INSERT INTO $$ || part_name[i] || $$
                       ({list_cols}) VALUES ({args}); $$;
            END LOOP;
            FOR row IN (SELECT *, (Func(B)) FROM TABLE A ORDER BY      
                          (SELECT Func(B) FROM B WHERE Condition(A,B))
            LOOP
                chosen_part := 0;
                FOR i IN 1..{parts} LOOP
                    -- choose some part based on Func(B)
                    -- this takes negligible amount of time
                    ... 
                END LOOP;
    
                EXECUTE prep_stmt[chosen_part] USING row;
            END LOOP;
        END
    

The Python approach performs this task orders of magnitude faster, even though the client and the database are on different machines. The tables have ~16M rows with 5-7 columns each.

Is there a way to make this run faster using PL/pgSQL?

Context

The slices would then stay on a single machine and be processed using a separate cursor and processor for each. The reason for slicing is that the further processing is CPU-bound rather than IO-bound (it works with polygon intersection), and slicing gave a significant performance advantage over the non-sliced implementation.

EDIT

I would really prefer answers that actually answer the question instead of proposing that the approach is wrong. Sometimes poor design prevents changing the approach easily, and this might just be the case here. The question is why the Python approach is faster and how to make the PL/pgSQL approach faster.

EDIT 2

Is it possible to say something like WRITE TO FILE ... instead of INSERT ... and then bulk load it in the end?
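Something along these lines is what I have in mind: PostgreSQL's server-side COPY can write a query result to a file on the database server and load it back later, given the right file-system permissions. A minimal sketch, with placeholder paths and table names:

    -- Dump part of the data to a file on the database server (needs superuser).
    COPY (SELECT * FROM some_table WHERE some_condition) TO '/tmp/slice_1.dump';

    -- Bulk-load it back into a slice table at the end.
    COPY slice_1 FROM '/tmp/slice_1.dump';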

EDIT 3

I'd like to reiterate that, given the setup, I cannot avoid creating the slices as separate tables.

The reason I cannot use CREATE TABLE AS for each slice is that I am "balancing" the slices based on the output of Func(B) for each row in A. The approach tries to imitate a min-makespan problem, and it is not possible to express that kind of logic within CREATE TABLE AS.
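To make the balancing step concrete, here is a self-contained toy sketch of the kind of greedy assignment I mean. Everything in it (the number of parts, the dummy rows, the cost expression standing in for Func(B)) is made up for illustration; it only shows why the assignment has to happen row by row.

    DO $do$
    DECLARE
        n_parts     CONSTANT INT := 4;  -- hypothetical number of slices
        load        BIGINT[] := array_fill(0::BIGINT, ARRAY[n_parts]);
        chosen_part INT;
        r           RECORD;
    BEGIN
        -- Dummy source rows: each carries a cost, standing in for Func(B).
        FOR r IN
            SELECT g AS id, (random() * 100)::INT AS cost
            FROM   generate_series(1, 20) AS g
            ORDER  BY cost DESC
        LOOP
            -- Greedy min-makespan: assign the row to the currently least-loaded part.
            SELECT i INTO chosen_part
            FROM   generate_subscripts(load, 1) AS i
            ORDER  BY load[i], i
            LIMIT  1;

            load[chosen_part] := load[chosen_part] + r.cost;
            -- In the real code this is where row r would go into slice "chosen_part".
        END LOOP;

        RAISE NOTICE 'final per-part loads: %', load;
    END
    $do$;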

Possible solutions:

Here is the list of things I stumbled upon that could be a solution; are you aware of others?

  • The DBLINK extension, to insert into each slice over a separate connection?
  • Writing to an unlogged temporary table first, and then importing it into a slice (sketched below)
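A minimal sketch of the second option, assuming the row-by-row loop only records the chosen part number in an unlogged staging table and the actual data movement then happens in one set-based statement per slice (table, column and key names are placeholders):

    -- Unlogged staging table: the balancing loop writes only (row key, chosen part).
    CREATE UNLOGGED TABLE part_assignment (
        row_id INT PRIMARY KEY,
        part   INT NOT NULL
    );

    -- ... the row-by-row loop fills part_assignment instead of running one INSERT per row ...

    -- Then one bulk statement per slice.
    INSERT INTO slice_1
    SELECT a.*
    FROM   a
    JOIN   part_assignment p ON p.row_id = a.id
    WHERE  p.part = 1;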
iggy
  • Why do you want to slice the table? Some context will give you better answers. – Jon Clements Mar 11 '15 at 22:21
  • @JonClements I want to process each slice in parallel afterwards – iggy Mar 11 '15 at 22:22
  • And what do you mean "in parallel"? – Jon Clements Mar 11 '15 at 22:23
  • @JonClements by "in parallel" I mean using different cursor for each slice and a different processor – iggy Mar 11 '15 at 22:23
  • @JonClements basically the question is not whether one can optimize the way the table is processed as if it were sliced, but how to actually slice it faster than the Python approach does, strictly within plpgsql or maybe plpython (but tests show that the actual insertion is slow) – iggy Mar 11 '15 at 22:26
  • Is it a single server? – Jon Clements Mar 11 '15 at 22:29
  • Sorry - just can't help but feel this is an XY problem... – Jon Clements Mar 11 '15 at 22:30
  • @JonClements yes, I have added that to the context section – iggy Mar 11 '15 at 22:31
  • so, err, why not just a table partition (if really required), or an index, and an update statement to create the new column (if really required)? What you're doing here is making postgres store exactly the same data *twice* - I would focus on making it more accessible in the first place depending on what queries you plan on running. postgres will sort out its own CPU usage etc... – Jon Clements Mar 11 '15 at 22:32
  • If trying to access it from multiple cursors, then the bottleneck is still going to be the server when trying to access it... etc. - e.g. if you're trying to access a slice of tables, and the server's limited to just one core, then that's your lot regardless – Jon Clements Mar 11 '15 at 22:35
  • @JonClements alright ) I see that you think the approach is wrong. Let me restate the question: why is extracting the whole table, transferring it to another machine, slicing it there and transferring it all back faster than doing it all on the same (very powerful) machine? As I said, my benchmarks show that the `EXECUTE INSERT...` part is the bottleneck. – iggy Mar 11 '15 at 22:39
  • Oh, if that's the question, then it's because using a LOAD FILE is a bulk operation, with a single "BEGIN TRANSACTION;" and "COMMIT;" block, while the pgsql you've written is doing that on a row by row basis. – Jon Clements Mar 11 '15 at 22:42
  • @JonClements alright, how do I make it faster in plpgsql then? thanks for bearing with me btw – iggy Mar 11 '15 at 22:43
  • Make it faster: 1) prefer plain SQL over plpgsql, avoid functions, 2) avoid row-at-a-time processing and script languages, 3) avoid cursors and loops, 4) when possible: avoid dynamic SQL, or concentrate it on handling larger batches, 0) use a sane data model. – wildplasser Mar 12 '15 at 18:45
  • @wildplasser 1) I can't, as I need rather complex load balancing for `min-make-span`, 2) I can't because of 1), so I definitely need to insert row by row, 3) I can't, because of 1) and 2), 4) well.. I can't do that either.. 0) it is arguable what exactly is sane. Sometimes one needs to stay within certain limits and a change of the data model is not possible. – iggy Mar 13 '15 at 09:28
  • I appreciate all comments, but I find it funny that people here would rather point out that someone is completely wrong and suggest redoing everything than stay within the limits of the question – iggy Mar 13 '15 at 09:30
  • I find it funny that you do not disclose your `func(B)`. What does it do, and based on which input? Does it need a complete row of data to operate on, or only a few fields? Does it *need* to be a function, or is it actually just an expression? – wildplasser Mar 14 '15 at 13:05
  • @wildplasser as I mentioned, func(B) is not relevant; what is relevant is the part that populates the slice tables. I don't want to disclose it further because even if you suggest something concerning func(B), I wouldn't be able to make that change (I need to stay within certain limits myself). – iggy Mar 14 '15 at 23:54
  • So you are new to databases and SQL, I figure? /rest-my-case – wildplasser Mar 15 '15 at 00:18
  • @wildplasser I am, but how does that help to answer the question? If you insist that disclosing `func(B)` is important, let's say that table `A` stores `geom_id` and table `B` stores `geom_id` and a postgis `geometry` type. `func(B)` returns `ST_NPoints(..)` for that geometry. The task then becomes to create slices that are more or less uniform in the number of points. – iggy Mar 15 '15 at 00:26
  • OMG. I rest my case. (it starts to look like the XYZ problem ...) – wildplasser Mar 15 '15 at 00:43
  • @wildplasser There is really no need to be impolite and make me look like a complete dummy without explaining yourself. It is really unclear to me why you would need to know what `func(B)` does, and now that I have provided this function you are being even more arrogant. Erwin suggested something quite useful I didn't know before based on what I provided. – iggy Mar 15 '15 at 00:52
  • Yes, there is a real need. Because there is no solution without the real facts being revealed. The fact that there is a GIS component to your XYZ problem, for instance. Show your cards or go away. – wildplasser Mar 15 '15 at 01:25
  • @wildplasser This is supposed to be a generic solution for any `func(B)` that takes a row of `A` and returns an `INTEGER` based on `Condition(A, B)`; this is why I didn't disclose it (and mentioned that it is important to stay within the question limits). And it's actually pretty simple: this is a question and answer website. I gave all the details I can, and if you don't have an answer someone else will. If you can't keep your anger in check then you should be the one who goes away. – iggy Mar 15 '15 at 01:44
  • Functions are terrible, in almost every case. Think about: `select * from A join B on sin(a.x) = cos(b.y)`. – wildplasser Mar 15 '15 at 01:51
  • Sure, I get your point. Still, there is a large class of functions where this will work, like `SELECT ST_NPoints(B.geom) from B where A.geom_id = B.geom_id`. I just can't do this in any other way, because there is a large system that is written and uses this approach (with hundreds of different `A`, `B`, `func` and `condition`). What I need to do is make splitting (and later importing) faster. – iggy Mar 15 '15 at 02:03

1 Answer


The question is why the Python approach is faster and how to make the PL/pgSQL approach faster.

That's not due to Python but to COPY vs. INSERT. Bulk-loading with COPY is much faster than doing the same with individual INSERTs.
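To illustrate the difference (table and column names are placeholders):

    -- One statement per row: statement overhead for every single row.
    INSERT INTO slice_1 (id, val) VALUES (1, 'a');
    INSERT INTO slice_1 (id, val) VALUES (2, 'b');

    -- One bulk operation for many rows, which is what COPY does:
    COPY slice_1 (id, val) FROM STDIN;

    -- A multi-row INSERT sits somewhere in between:
    INSERT INTO slice_1 (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');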

That aside, your given query doesn't seem to make sense:

SELECT *, (Func(B))
FROM TABLE A
ORDER BY (SELECT Func(B) FROM B WHERE Condition(A,B)

Syntactically invalid in multiple ways. It's hard to say anything based on this.

As long as you are staying in the same DB, my first idea would be CREATE TABLE AS:

CREATE TABLE slice_n AS
SELECT col1, col2, ...
FROM   big_table
WHERE  whatever
ORDER  BY whatever;  -- cluster data while being at it

Can't say much more based on the information given.
Craig assembled a list of general advice for speeding up bulk inserts.

Erwin Brandstetter
  • as I mentioned in comments before, I know that COPY is faster than INSERT, so the particular query example is not relevant. It was meant to give an impression of what is actually happening. Is it possible to make the inserts faster, by declaring the slices temporary or unlogged, or with some other technique? Or by writing to a file from plpgsql? – iggy Mar 12 '15 at 15:40
  • @iggy: Unlogged tables are faster, but there is *currently* no way to convert unlogged tables to regular tables. ([Probably going to be in pg 9.5!](http://www.postgresql.org/docs/devel/static/sql-altertable.html)) You would have to create another table. "Writing to a file" means `COPY`. Separate inserts are much slower than inserting all rows at once. The fastest way is `CREATE TABLE AS`. I don't follow why that shouldn't be possible. I think it is. You can `CREATE TABLE AS SELECT * FROM myfunc()`. But running a function per row (inside the table function) is going to be expensive. – Erwin Brandstetter Mar 14 '15 at 18:09
  • I am using the output of `func(B)` to do something similar to the first-fit rule of the bin packing problem (http://en.wikipedia.org/wiki/Bin_packing_problem). I can't come up with a way to code that logic within the limits of `CREATE TABLE AS`, so I guess I need to process row by row, but inserting into a file first and then loading is possible, as is writing to each table over a separate `DBLINK` connection. – iggy Mar 14 '15 at 23:57
  • In the link that you posted, multi-row inserts seem promising. But the insert's execution plan is cached in plpgsql, so preparing them shouldn't make much difference, I guess? – iggy Mar 15 '15 at 00:34