
I have a Python script which runs at the beginning of each day via Task Scheduler, reads a constantly growing log file (a text file) and inserts the data into a PostgreSQL DB. A new log file is generated each day, and each log is roughly 1 GB.

  • Platform: Windows 7 (not my usual preference, but I have to use it this time)
  • RAM: 32GB

I searched for advice on tuning PostgreSQL for heavy I/O, and here is what I modified:

shared_buffers = 8GB
work_mem = 100MB
maintenance_work_mem = 512MB
checkpoint_segments = 100
checkpoint_timeout = 1h
synchronous_commit = off
full_page_writes = off
fsync = off
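
For what it's worth, a quick way to confirm these settings actually took effect after a server restart is to query them over the same psycopg2 connection the script uses. A minimal sketch, reusing the connection string from the script below:

import psycopg2

# Print the live values of the settings tuned above.
conn = psycopg2.connect("dbname='postgis20' user='postgres' password='' host='localhost'")
cur = conn.cursor()
for name in ('shared_buffers', 'work_mem', 'maintenance_work_mem',
             'checkpoint_segments', 'checkpoint_timeout',
             'synchronous_commit', 'full_page_writes', 'fsync'):
    cur.execute('SHOW ' + name)
    print name, '=', cur.fetchone()[0]
conn.close()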

The script is written to read the log file line by line and insert the data into the DB:

import logging
import psycopg2 as psycopg

try:
    connectStr = "dbname='postgis20' user='postgres' password='' host='localhost'"
    cx = psycopg.connect(connectStr)
    cu = cx.cursor()
    logging.info("connected to DB")
except psycopg.Error:
    logging.error("could not connect to the database")


import time

# Tail the log file: sleep and retry when no new line is available yet.
logfile = open('textfile.log', 'r')
while True:
    where = logfile.tell()
    line = logfile.readline()
    if not line:
        time.sleep(1)
        logfile.seek(where)
    else:
        print line,      # line already ends with a newline
        dodecode(line)

def dodecode(fields):
    global cx
    from time import strftime, gmtime
    from calendar import timegm
    import os

    msg = fields.split(',')
    part = eval(msg[2])
    msgnum = int(msg[3:6])
    print "message#:", msgnum
    print fields

    if part == 1:
        if msgnum == 1:
            msg1 = msg_1.decode(bv)
            #print "message1 :", msg1
            Insert(msgnum, time, msg1)
        elif msgnum == 2:
            msg2 = msg_2.decode(bv)
            #print "message2 :", msg2
            Insert(msgnum, time, msg2)
        elif msgnum == 3:
            ....
            ....
            ....

def Insert(msgnum, time, msg):
    global cx

    try:
        if msgnum in [1, 2, 3]:
            point = 'POINT(%s %s)' % (float(msg['longitude']), float(msg['latitude']))
            if msg['type'] == 0:
                cu.execute("INSERT INTO table1 (messageid, timestamp, userid, position, text) "
                           "SELECT %s, %s, %s, ST_GeomFromText(%s), %s "
                           "WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp=%s AND text=%s)",
                           (msgnum, time, msg['UserID'], point, text, time, text))
                cu.execute("INSERT INTO table2 (field1, field2, field3, time_stamp, pos) "
                           "SELECT %s, %s, %s, %s, ST_GeomFromText(%s) "
                           "WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1=%s)",
                           (msg['UserID'], int(msg['UserName']), int(msg['UserIO']), time, point, msg['UserID']))
                cu.execute("UPDATE table2 SET field3=%s, time_stamp=%s, pos=ST_GeomFromText(%s) "
                           "WHERE field1=%s AND time_stamp < %s",
                           (int(msg['UserIO']), time, point, msg['UserID'], time))
            elif msg['type'] == 1:
                cu.execute("INSERT INTO table1 (messageid, timestamp, userid, position, text) "
                           "SELECT %s, %s, %s, ST_GeomFromText(%s), %s "
                           "WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp=%s AND text=%s)",
                           (msgnum, time, msg['UserID'], point, text, time, text))
                cu.execute("INSERT INTO table2 (field1, field2, field3, time_stamp, pos) "
                           "SELECT %s, %s, %s, %s, ST_GeomFromText(%s) "
                           "WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1=%s)",
                           (msg['UserID'], int(msg['UserName']), int(msg['UserIO']), time, point, msg['UserID']))
                cu.execute("UPDATE table2 SET field3=%s, time_stamp=%s, pos=ST_GeomFromText(%s) "
                           "WHERE field1=%s AND time_stamp < %s",
                           (int(msg['UserIO']), time, point, msg['UserID'], time))
            elif msg['type'] == 2:
                ....
                ....

The problem is, after 6 hours of running, the script has only inserted 5 minutes' worth of the file's data! I suspect the data is being written to the log file in chunks rather than line by line, but I can't figure out how to fix this so the DB stays close to real time.

Shad
  • Why are you using `time.sleep(1)`? What does it do? – Nebril Dec 28 '14 at 10:47
  • @Nebril In case there is no line inserted into log file, sleep until the next line – Shad Dec 28 '14 at 11:18
  • show the code where you insert the line into the table. – Clodoaldo Neto Dec 28 '14 at 11:23
  • Do you understand that **`fsync = off`** is basically saying "it's OK if you eat my data, don't bother being crash-safe"? (It's also entirely pointless to set `synchronous_commit = off` if you have `fsync=off`). Anyway, use `psycopg2` and `copy_from`. – Craig Ringer Dec 28 '14 at 11:43
  • See http://stackoverflow.com/q/12206600/398670 – Craig Ringer Dec 28 '14 at 11:46
  • @ClodoaldoNeto I updated my post. Thanks – Shad Dec 28 '14 at 12:19
  • @CraigRinger I already use `psycopg2` to connect, but what do you mean by `copy_from` ? Thanks – Shad Dec 28 '14 at 12:20
  • @Shad http://initd.org/psycopg/docs/cursor.html#cursor.copy_from ; per a google search for "psycopg2 copy_from". In other words, use the PostgreSQL `COPY` API to bulk-load the input. – Craig Ringer Dec 28 '14 at 12:31
  • @Shad Also show your table/index definitions. If you have numerous or large indexes that could be a big part of the speed issues. Those `UPDATE`s you're doing will be a killer too. If you're committing each line individually, that'll be even worse. BTW, instead of the sleep loop, use a proper file change waiting option like select or poll. – Craig Ringer Dec 28 '14 at 12:33
  • Do you commit it? `cx.commit()` – Clodoaldo Neto Dec 28 '14 at 13:00
  • @CraigRinger Actually, yes, I have large indexes (needed to speed up queries) and I also commit each line! So I understand from you that copy_from will let me handle and commit bulks instead of lines? Thanks a lot! And about file change waiting, can you provide me an example of using poll or select? – Shad Dec 28 '14 at 16:58
  • @ClodoaldoNeto Yes, I commit each inserted line, which I just realized is a mistake. Any idea on how to avoid this? – Shad Dec 28 '14 at 16:58
  • Apparent follow-up at http://stackoverflow.com/q/27688242/398670 – Craig Ringer Dec 29 '14 at 12:05
  • the follow-up question: http://stackoverflow.com/questions/27700213/load-bulk-into-postgresqlfrom-log-files-using-python – Shad Dec 30 '14 at 11:44
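
Following the `copy_from` / `COPY` suggestion in Craig Ringer's comments above, here is a minimal sketch of what bulk-loading a batch of already-parsed lines could look like. The staging table name `table1_staging`, its column list, and the tab-separated batch format are illustrative assumptions, not part of the original code:

import StringIO

def copy_batch(cx, rows):
    # rows: list of tuples already parsed from the log, e.g.
    # (messageid, timestamp, userid, position_wkt, text)
    buf = StringIO.StringIO()
    for row in rows:
        # COPY's default text format: tab-separated columns, one row per line
        # (assumes the values themselves contain no tabs or backslashes)
        buf.write('\t'.join(str(col) for col in row) + '\n')
    buf.seek(0)

    cur = cx.cursor()
    cur.copy_from(buf, 'table1_staging',
                  columns=('messageid', 'timestamp', 'userid', 'position', 'text'))
    cx.commit()   # one commit per batch instead of one per line

Loading a few thousand lines per call this way keeps round trips, index maintenance, and commits per batch rather than per row; the conditional inserts and UPDATEs can then be done set-wise from the staging table.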

1 Answer


Have you considered using psycopg2's executemany? A simple example from this SO answer:

namedict = ({"first_name":"Joshua", "last_name":"Drake"},
            {"first_name":"Steven", "last_name":"Foo"},
            {"first_name":"David", "last_name":"Bar"})

cur = conn.cursor()
cur.executemany("""INSERT INTO bar(first_name,last_name) VALUES (%(first_name)s, %(last_name)s)""", namedict)
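
Applied to the script in the question, that would mean accumulating parsed rows and flushing them with one executemany call and one commit per batch, instead of committing every line. A rough sketch; BATCH_SIZE, parse() and the simplified single-table INSERT are illustrative stand-ins, not the original code:

BATCH_SIZE = 1000          # illustrative; tune to the actual line rate
batch = []

def insert_batch(rows):
    # one executemany + one commit per batch instead of per line
    cu.executemany(
        "INSERT INTO table1 (messageid, timestamp, userid, position, text) "
        "VALUES (%(msgnum)s, %(time)s, %(userid)s, ST_GeomFromText(%(point)s), %(text)s)",
        rows)
    cx.commit()

while True:
    where = logfile.tell()
    line = logfile.readline()
    if not line:
        if batch:                      # flush whatever is pending before waiting
            insert_batch(batch)
            batch = []
        time.sleep(1)
        logfile.seek(where)
    else:
        batch.append(parse(line))      # parse() stands in for the decoding step in the question
        if len(batch) >= BATCH_SIZE:
            insert_batch(batch)
            batch = []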
Justin