I'm working on a research project that requires me to process large CSV files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to work through the files row by row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks whether the named entity is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and insert the information into the appropriate tables. The list of entities numbers over 100,000.

Here are the basic functions (part of a class) that try to match each record with any existing entities:

private function _getOrg($data)
{
    // if name of organization is null, skip it
    if($data[44] == '') return null;

    // use each of the possible names to check if organization exists
    $names = array($data[44],$data[45],$data[46],$data[47]);

    // cycle through the names
    foreach($names as $name) {
        // check to see if there is actually an entry here
        if($name != '') {
            if(($org_id = $this->_parseOrg($name)) != null) {
                $this->update_org_meta($org_id,$data); // updates some information of existing entity based on record
                return $org_id;
            }
        }
    }

    return $this->_addOrg($data);
}

private function _parseOrg($name)
{
    // check to see if it matches any org names
    // db class function, performs simple "like" match
    $this->db->where('org_name',$name,'like');

    $result = $this->db->get('orgs');

    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    // check to see if matches any org aliases
    $this->db->where('org_alias_name',$name,'like');

    $result = $this->db->get('orgs_aliases');

    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }
    return null; // no matches, have to add new entity
 }

The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.

Here's the problem: I can only get these scripts to parse about 10,000 records per hour, which, given the file sizes, means a few solid days for each file. The way my db is structured requires several different tables to be updated for each record because I'm compiling multiple external datasets. So, each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between the MySQL server and my script.

Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or PHP/Python wrapper) to speed up the processing?

I'm running this on Mac OS X 10.6 with a local MySQL server.

tchaymore

2 Answers

Load the file into a temporary/staging table using LOAD DATA INFILE and then use a stored procedure to process the data - it shouldn't take more than 1-2 minutes at most to completely load and process the data.

You might also find some of my other answers of interest:

Optimal MySQL settings for queries that deliver large amounts of data?

MySQL and NoSQL: Help me to choose the right one

How to avoid "Using temporary" in many-to-many queries?

60 million entries, select entries from a certain month. How to optimize database?

Interesting presentation:

http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/

Example code (may be of use to you):

truncate table staging;

start transaction;

load data infile 'your_data.dat' 
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');

commit;

drop procedure if exists process_staging_data;

delimiter #

create procedure process_staging_data()
begin

    insert ignore into organisations (org_name) select distinct org_name from staging;

    update...

    etc.. 

    -- or use a cursor if you have to ??

end#

delimiter ;

call  process_staging_data();
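
Since you asked about a PHP/Python wrapper: here's a minimal sketch of driving the same steps from Python with MySQLdb. The connection parameters are placeholders, the table and procedure names follow the example above, and LOAD DATA LOCAL INFILE has to be enabled on both the client and the server.

import MySQLdb

# local_infile=1 allows the client to send a local file to the server
conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='contracts', local_infile=1)
cur = conn.cursor()

cur.execute("truncate table staging")
cur.execute("""
    load data local infile 'your_data.dat'
    into table staging
    fields terminated by ',' optionally enclosed by '"'
    lines terminated by '\\n'
""")
cur.execute("call process_staging_data()")
conn.commit()
conn.close()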

Hope this helps

Jon Black
  • Fantastic! This is really helpful, thank you. I'm still having some trouble with the IF / THEN statements, but I think I'll get it to work. – tchaymore May 12 '11 at 19:26
  • Well, I think I followed your advice. You can see my [stored procedure here](http://pastebin.com/GpfqS48S). The speed improvement is immense, about 1500 records / minute, but I would like to get it even faster. Any suggestions? – tchaymore May 13 '11 at 02:57

It sounds like you'd benefit the most from tuning your SQL queries, which is probably where your script spends the most time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast. Doing naive benchmark tests, I can easily sustain 10k/sec insert/select queries on one of my older quad-cores. Instead of doing one SELECT after another to test whether the organization exists, using a REGEXP to check for them all at once might be more efficient (discussed here: MySQL LIKE IN()?). MySQLdb also lets you use executemany() to do multiple inserts in a single call, which you could almost certainly leverage to your advantage; perhaps your PHP client lets you do the same thing?
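
For illustration, a rough sketch of both ideas with MySQLdb; the table and column names (orgs, org_id, org_name) are taken from your code, and the connection details and sample values are placeholders:

import re
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='contracts')
cur = conn.cursor()

def find_org(names):
    # Check every candidate name for a record in one round trip
    # instead of issuing one SELECT per name.
    names = [n for n in names if n]
    if not names:
        return None
    # Anchored alternation, e.g. ^(ACME CORP|ACME CORPORATION)$
    # (re.escape is only a rough guard; MySQL's POSIX regex syntax
    # is not identical to Python's)
    pattern = '^(%s)$' % '|'.join(re.escape(n) for n in names)
    cur.execute("select org_id from orgs where org_name regexp %s", (pattern,))
    row = cur.fetchone()
    return row[0] if row else None

# Batch the inserts for new organisations instead of one INSERT per row
new_orgs = [('ACME CORP',), ('WIDGETS INC',)]
cur.executemany("insert ignore into orgs (org_name) values (%s)", new_orgs)
conn.commit()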

Another thing to consider: with Python you can use multiprocessing to try to parallelize as much as possible. PyMOTW has a good article about multiprocessing.
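
A minimal sketch of that idea, assuming each record can be processed independently; the file name, chunk size, and worker count are arbitrary, and note that every worker has to open its own MySQL connection:

import csv
from multiprocessing import Pool

import MySQLdb

def process_chunk(rows):
    # Each worker opens its own connection; MySQL connections
    # cannot be shared across processes.
    conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='contracts')
    cur = conn.cursor()
    for row in rows:
        pass  # parse the record and do the entity lookup / inserts here
    conn.commit()
    conn.close()

def chunks(path, size=10000):
    # Yield the CSV in blocks of `size` rows so the whole
    # multi-GB file never has to sit in memory at once.
    with open(path, 'rb') as f:
        block = []
        for row in csv.reader(f):
            block.append(row)
            if len(block) == size:
                yield block
                block = []
        if block:
            yield block

if __name__ == '__main__':
    pool = Pool(processes=4)
    for _ in pool.imap_unordered(process_chunk, chunks('contracts.csv')):
        pass
    pool.close()
    pool.join()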

zeekay