I am moving data from MySQL to Postgres, and my code is below:
from StringIO import StringIO
import gc

import MySQLdb
import psycopg2
# (table name, number of columns)
tables = (['table1', 27],)

conn = psycopg2.connect("dbname='xxx' user='xxx' host='localhost' password='xxx'")
curpost = conn.cursor()

db = MySQLdb.connect(host="127.0.0.1", user="root", passwd="root",
                     unix_socket='/var/mysql/mysql.sock', port=3306)
cur = db.cursor()
cur.execute('use xxx;')

for t in tables:
    print t
    curpost.execute("truncate table " + t[0])
    cur.execute("select * from " + t[0])
    a = ','.join('%s' for i in range(t[1]))
    qry = "insert into " + t[0] + " values (" + a + ")"
    print qry
    i = 0
    while True:
        rows = cur.fetchmany(5000)
        if not rows:
            break
        string = ''
        for row in rows:
            string = string + ('|'.join([str(x) for x in row])) + "\n"
        # bulk-load the batch into Postgres; str(None) serializes as "None",
        # which the null="None" argument maps back to NULL
        curpost.copy_from(StringIO(string), t[0], sep="|", null="None")
        i += curpost.rowcount
        print i, " loaded"
        curpost.connection.commit()
        del string, row, rows
        gc.collect()
curpost.close()
cur.close()
For small tables the code runs fine. On larger ones (3.6 million records), however, memory utilization on the machine shoots up the moment the MySQL execute (cur.execute("select * from " + t[0])) runs. This is even though I have used fetchmany, so records should only come over in batches of 5000. I have tried batches of 500 as well and it is the same. For large tables, fetchmany does not seem to work as documented.
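My current suspicion is that MySQLdb's default cursor buffers the entire result set on the client as soon as execute() runs, so fetchmany only slices rows that are already in memory. If that is the case, a server-side cursor (MySQLdb.cursors.SSCursor) should stream rows from the server instead. A minimal sketch of what I am considering trying, assuming SSCursor behaves as its documentation describes:

import MySQLdb
import MySQLdb.cursors

# SSCursor uses mysql_use_result(), so rows stay on the server until
# fetched instead of being copied into client memory at execute() time
db = MySQLdb.connect(host="127.0.0.1", user="root", passwd="root",
                     unix_socket='/var/mysql/mysql.sock', port=3306,
                     cursorclass=MySQLdb.cursors.SSCursor)
cur = db.cursor()
cur.execute("select * from table1")
while True:
    rows = cur.fetchmany(5000)  # should now genuinely pull 5000 rows at a time
    if not rows:
        break
    # ... build the COPY buffer and load into Postgres as above ...
cur.close()

One caveat I have read about: with SSCursor the result set has to be fully consumed (or the cursor closed) before another query can be issued on the same connection.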
Edit: I added garbage collection and del statements. The memory still keeps bloating until all the records are processed.
Any ideas?