
I am moving data from MySQL to Postgres, and my code is below:

import os, re, time, codecs, glob, sqlite3
from StringIO import StringIO
import psycopg2, MySQLdb, datetime, decimal
from datetime  import date
import gc

tables = (['table1', 27],)  # (table name, number of columns)
conn = psycopg2.connect("dbname='xxx' user='xxx' host='localhost' password='xxx' ")
curpost = conn.cursor()
db = MySQLdb.connect(host="127.0.0.1", user="root", passwd="root" , unix_socket='/var/mysql/mysql.sock', port=3306 )
cur = db.cursor() 
cur.execute('use xxx;')

for t in tables:
    print t
    curpost.execute( "truncate table " + t[0] )
    cur.execute("select * from "+ t[0] )
    a = ','.join( '%s' for i in range(t[1]) )
    qry = "insert into " + t[0]  + " values ( " + a +" )" 
    print qry
    i = 0
    while True:
        rows = cur.fetchmany(5000)
        if not rows: break
        string = ''
        for row in rows:
            string = string +  ('|'.join([str(x) for x in row])) + "\n"                
        curpost.copy_from(StringIO(string),  t[0], sep="|", null="None" )
        i += curpost.rowcount
        print i  , " loaded"
        curpost.connection.commit()        
        del string, row, rows
        gc.collect()        

curpost.close()
cur.close()

For small tables, the code runs fine. However, for the larger ones (3.6 million records), the moment the MySQL execute (`cur.execute("select * from " + t[0])`) runs, the memory utilization on the machine shoots up. This happens even though I have used fetchmany, so records should only come in batches of 5000. I tried with 500 records as well and it is the same. For large tables it seems that fetchmany is not working as documented.

Edit - I added garbage collection and del statements. Still, the memory keeps bloating until all the records have been processed.

Any ideas?
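For reference, MySQLdb's default cursor is client-side buffered: `execute()` pulls the whole result set into Python memory and `fetchmany()` only slices that buffer. Below is a minimal sketch of the same fetch loop using a server-side cursor (`MySQLdb.cursors.SSCursor`), which streams rows from the server so that `fetchmany()` really does hold only one batch at a time; the connection details are the ones from the code above, and the rest of the loop is assumed unchanged:

import MySQLdb
import MySQLdb.cursors

# Server-side (unbuffered) cursor: rows stay on the MySQL server until fetched
db = MySQLdb.connect(host="127.0.0.1", user="root", passwd="root",
                     unix_socket='/var/mysql/mysql.sock', port=3306,
                     cursorclass=MySQLdb.cursors.SSCursor)
cur = db.cursor()
cur.execute('use xxx;')
cur.execute("select * from table1")
while True:
    rows = cur.fetchmany(5000)     # only about 5000 rows are in client memory at once
    if not rows:
        break
    # ... build the '|' separated buffer and copy_from() exactly as in the code above ...
cur.close()                        # exhaust or close the cursor before reusing this connection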

myloginid
  • It's been 20 minutes now, RAM usage is approx 4 GB, and not even 5000 records have been processed :-( – myloginid Feb 20 '15 at 17:26
  • What do you mean by **is not working as documented**? Could you give me a link please? – Alex Feb 20 '15 at 17:26
  • http://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-fetchmany.html – myloginid Feb 20 '15 at 17:34
  • What I mean by "as documented" is that memory should only be used in batches of 5000 records. The Python process accumulates more and more memory. It seems like it's fetching all 3.6 million records and not doing a fetchmany – myloginid Feb 20 '15 at 17:35
  • I saw that link before - there is nothing about memory usage; the function just returns the asked-for number of rows. So if you care about memory usage and have found this bug, it is probably better to just change the query to use a `LIMIT` clause – Alex Feb 20 '15 at 17:40
  • Can't use LIMIT as I want to process all records. It's a Python issue and not MySQL. Python does not release the intermediate memory after processing 5000 records. – myloginid Feb 20 '15 at 17:46

1 Answer


Sorry if I am wrong - you've said that you don't want to change the query.

But just in case you have no choice, you can try this:

replace this fragment:

cur.execute("select * from "+ t[0] )
a = ','.join( '%s' for i in range(t[1]) )
qry = "insert into " + t[0]  + " values ( " + a +" )" 
print qry
i = 0
while True:
    rows = cur.fetchmany(5000)

to this one:

a = ','.join( '%s' for i in range(t[1]) )
qry = "insert into " + t[0]  + " values ( " + a +" )" 
print qry
i = 0
while True:
    cur.execute("select * from " + t[0] + " LIMIT " + str(i) + ", 5000")
    rows = cur.fetchall()
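
For completeness, here is a sketch of how the whole loop might look with this change, keeping the `copy_from()` part from the question. Note the `str(i)` conversion, since `i` is an integer, and the `ORDER BY` on the table's primary key (called `id` here purely as an assumption) so that each page stays stable:

i = 0
while True:
    # page through the source table; "id" is a placeholder for the primary key column
    cur.execute("select * from " + t[0] + " order by id limit " + str(i) + ", 5000")
    rows = cur.fetchall()
    if not rows:
        break
    buf = '\n'.join('|'.join(str(x) for x in row) for row in rows) + '\n'
    curpost.copy_from(StringIO(buf), t[0], sep="|", null="None")
    i += curpost.rowcount          # rowcount after copy_from gives the offset for the next page
    print i, " loaded"
    curpost.connection.commit()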
Alex
  • Will MySQL maintain the sanctity of the row order? I have done lots of Oracle before and never rely on any ordering provided by the db. Otherwise I will transfer dupes. – myloginid Feb 20 '15 at 18:16
  • To be safe you can add "ORDER BY", but usually it is not needed if your table contents are not being changed by other users during the loop. If you need to, you can start a transaction first... but looking at your code, I think you are trying to convert some data, so the data source is pretty much static. Just try my way – Alex Feb 20 '15 at 18:20
  • Just got the results. LIMIT worked. I got 5000 records with average memory of approx 100 MB, and it has reached 45000 committed records with the same memory – myloginid Feb 20 '15 at 18:23
  • One question - why does Python not release the fetchmany records once they are processed? The old way, the memory kept on bloating. Also, ORDER BY can be avoided only with fetchmany. If it were an Oracle-like DB and I had used 50 ORDER BY queries with row-number filters in each, the DBA would have shot me in the head. – myloginid Feb 20 '15 at 18:26
  • I don't 100% understand what you are trying to say. LIMIT and ORDER BY are regular SQL statements; I am not very familiar with Oracle, but I am pretty sure there is nothing wrong with using those statements as many times as you want. If we are talking about `fetchmany` low-level memory usage, I guess your memory is used not by `fetchmany` but just by `cur.execute`, when MySQL returns ALL the data to Python. To check that, try to `sleep` or loop 10000000 times right after `.execute` and check your memory usage – Alex Feb 20 '15 at 18:31
  • OK, just checked - there is no LIMIT for Oracle, but there is `OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;` http://stackoverflow.com/questions/470542/how-do-i-limit-the-number-of-rows-returned-by-an-oracle-query-after-ordering – Alex Feb 20 '15 at 18:36
  • My earlier comment referred to the fact that in Oracle, if you fire multiple queries with rownum or offset kind of clauses, each of the queries fired on the db end has to run in conjunction with an ORDER BY clause. Repeated sorts over the same sort keys are better avoided on tables with millions of records. Hence the comment that "the DBA won't like it". :-) – myloginid Feb 22 '15 at 17:07
  • The LIMIT solution works for now, but I can't accept that as the answer as it's a memory issue. I am running stuff in Cygwin so I can't use the psutil module. – myloginid Feb 22 '15 at 17:08
  • I don't really understand your reasons, but I don't care - it is up to you whether or not to accept this as an answer. Good luck with your project anyway – Alex Feb 22 '15 at 17:26
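
A minimal version of the check suggested in the comments above, to see whether the memory is taken by `cur.execute` itself (with the default client-side cursor) rather than by `fetchmany`; the connection settings and table name are the placeholders from the question:

import time
import MySQLdb

db = MySQLdb.connect(host="127.0.0.1", user="root", passwd="root",
                     unix_socket='/var/mysql/mysql.sock', port=3306)
cur = db.cursor()
cur.execute('use xxx;')
cur.execute("select * from table1")   # with the default (buffered) cursor this pulls every row to the client
time.sleep(120)                       # no fetchmany yet -- watch the process RSS (e.g. in top) during this pause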