I am rewriting Python code in PL/pgSQL. The code slices a table and imports the slices back into the database as separate tables. Assume `Func(..)` is a function that returns some INT and `Condition(..., ...)` is some boolean expression.
Python

- run `copy_expert` with

  ```sql
  COPY (
      SELECT *, Func(B)
      FROM A
      ORDER BY (SELECT Func(B) FROM B WHERE Condition(A, B))
  ) TO STDOUT
  ```

  writing the output to a file
- slice the resulting table into `n` parts in Python by writing to different dump slice files
- for each slice, get a new cursor and run `copy_expert` with `COPY some_table FROM STDIN`
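The middle step (splitting the dump into `n` slice files) can be sketched in plain Python. This is only a sketch under assumptions not stated above: each line is one row of the `COPY ... TO STDOUT` output, in-memory `StringIO` buffers stand in for the dump slice files, and the round-robin policy is a placeholder for the actual `Func(B)`-based balancing.

```python
import io


def split_dump(dump_lines, n):
    """Distribute dump lines across n slice buffers.

    Round-robin assignment is used here purely for illustration; the
    real code would pick the slice based on the Func(B) column.
    """
    slices = [io.StringIO() for _ in range(n)]
    for i, line in enumerate(dump_lines):
        slices[i % n].write(line)
    return slices


# Example: 6 rows split into 3 slices of 2 rows each.
dump = [f"row{i}\tsome_payload\n" for i in range(6)]
slices = split_dump(dump, 3)
```

Each buffer would then be rewound with `seek(0)` and handed to a fresh cursor via `copy_expert("COPY some_table FROM STDIN", buf)` (psycopg2's API).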
PL/pgSQL

run something like

```plpgsql
DECLARE
    ...
    part_name   VARCHAR[] := ARRAY[{part_names}];
    prep_stmt   VARCHAR[];
    row         RECORD;
    chosen_part INT;
BEGIN
    FOR i IN 1..{parts} LOOP
        prep_stmt[i] := $$ INSERT INTO $$ || part_name[i]
                     || $$ ({list_cols}) VALUES ({args}); $$;
    END LOOP;

    FOR row IN (
        SELECT *, Func(B)
        FROM A
        ORDER BY (SELECT Func(B) FROM B WHERE Condition(A, B))
    ) LOOP
        chosen_part := 0;
        FOR i IN 1..{parts} LOOP
            -- choose some part based on Func(B)
            -- this takes a negligible amount of time
            ...
        END LOOP;
        EXECUTE prep_stmt[chosen_part] USING row;
    END LOOP;
END
```

(`{part_names}`, `{parts}`, `{list_cols}`, and `{args}` are filled in by string formatting before the block is executed.)
The Python approach performs this task orders of magnitude faster, even though the client and the database are on different machines. The tables have ~16M rows, each with 5-7 columns.

Is there a way to make the PL/pgSQL approach run faster?
Context

The slices then stay on a single machine and are processed using a separate cursor and a processor per slice. The reason for slicing is that the further processing is CPU- rather than IO-intensive: it works with polygon intersections, and slicing gave a significant performance advantage (at least over the non-sliced implementation).
EDIT

I would really prefer answers that actually answer the question instead of proposing that the approach is wrong. Sometimes poor design prevents changing the approach easily, and this might just be such a case. The question is why the Python approach is faster and how to make the PL/pgSQL approach faster.
EDIT 2

Is it possible to say something like `WRITE TO FILE ...` instead of `INSERT ...` and then bulk-load the file at the end?
EDIT 3

I'd like to reiterate that, in this setup, I cannot avoid creating the slices as separate tables. The reason I cannot use `CREATE TABLE AS` for each slice is that I am "balancing" the slices based on the output of `Func(B)` for each row in `A`. The approach tries to imitate the min-makespan problem, and that kind of balancing is not possible within `CREATE TABLE AS`.
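The balancing described above amounts to the classic greedy list-scheduling heuristic for min-makespan: assign each incoming row (weighted by `Func(B)`) to the currently lightest slice. A minimal sketch in Python, with a made-up weight list standing in for the real `Func(B)` outputs:

```python
import heapq


def balance(weights, n):
    """Greedy list scheduling: send each weight to the lightest of n slices.

    A heap of (current_load, slice_index) pairs keeps the lightest slice
    on top; returns the chosen slice index for each row in order.
    """
    heap = [(0, i) for i in range(n)]
    heapq.heapify(heap)
    assignment = []
    for w in weights:
        load, idx = heapq.heappop(heap)
        assignment.append(idx)
        heapq.heappush(heap, (load + w, idx))
    return assignment


# Hypothetical Func(B) outputs for six rows of A, split over two slices:
print(balance([5, 4, 3, 3, 2, 1], 2))  # → [0, 1, 1, 0, 1, 0]; both slices total 9
```

Because the outer query already orders rows by the subquery on `Func(B)`, the heaviest rows arrive first, which makes this online greedy behave like the LPT (longest processing time first) variant and tends to give well-balanced slices.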
Possible solutions:

Here is the list of things I stumbled upon that could be a solution; are you aware of others?

- the DBLINK extension, to insert into each slice over a separate connection
- writing to an unlogged (or temporary) table first, and then importing it into a slice