
I have a task to practice: generate 5 million records (string (15), string (15), int (min: 1, max: 99)) and store them in a PostgreSQL database. My first version of the script had an estimated run time of 16120 seconds for the 5 million records.

EDIT (Dec 12 '16 at 11:50): after several optimizations it is currently down to 80 seconds.

What should I change in order to achieve the desired result?

Below I present the most optimized version of the code.

// In Source Code -> function insertDataFive

$countRecords = 5000000;

for ($i = 0; $i < ($countRecords/100000); $i++) // $i < 50
{
    echo "Loop one \$i: " . $i . "\n";
    try
    {
        echo "Memory used (before) real: " . (memory_get_peak_usage(true) / 1024 / 1024) . " MiB\n";

        $sql = "INSERT INTO users
    (first_name, last_name, user_age)
    VALUES ('" . randomString(15) . "','" . randomString(15) . "'," . randomAge() . ")";

        for ($j = 1; $j < 100000; $j++)
        {
            $sql .= ",('" . randomString(15) . "','" . randomString(15) . "'," . randomAge() . ")";
        }
        $dbh->exec($sql);

        echo "Memory used (after) real: " . (memory_get_peak_usage(true) / 1024 / 1024) . " MiB\n\n";
        unset($sql);
    } catch (PDOException $e)
    {
        echo 'PDO error: ' . $e->getMessage();
    }
}

Maybe I could use something other than INSERT?

EDIT (Dec 19 '16): problem solved! Script release 1.0 source below.

IMPORTANT! Before running this script, execute the following commands once in a shell:

mkdir /tmp/ram

mount -t tmpfs -o size=512m tmpfs /tmp/ram
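
For reference, a users table roughly like the following is assumed; the column names come from the INSERT and COPY statements in this post, while the exact types are an assumption based on the task description (two 15-character strings and an integer between 1 and 99):

CREATE TABLE users (
    user_id    integer,      -- filled explicitly by the CSV generator below
    first_name varchar(15),
    last_name  varchar(15),
    user_age   integer       -- 1..99 per the task description
);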

Generate data:

// Write one CSV chunk of $rows records to tmpfs; $task selects which slice
// of the sequential user_id range this file covers (task 1 -> ids 1..$rows, etc.).
function createDataFile($task, $rows = 1250000)
{
    try
    {
        global $fileName, $fileExtension, $timeFileCreate;
        $startTime = microtime(true);
        $fileCSV = $fileName . $task . $fileExtension;
        $fileHandler = fopen('/tmp/ram/' . $fileCSV, 'w');
        if ($fileHandler != false)
        {
            if (DEBUG)
            {
                echo "Memory used (before) fwrite: " . getMemoryUsage() . " MiB\n";
            }
            for ($i = ($task - 1) * $rows; $i < $task * $rows; $i++)
            {
                fwrite($fileHandler, (( $i + 1 . ","
                        . generateRangomString(15) . ","
                        . generateRangomString(15) . ","
                        . generateRangomAge()) . "\n"));
            }
            fclose($fileHandler);
            if (DEBUG)
            {
                echo "Memory used (after) fwrite: " . getMemoryUsage() . " MiB\n";
            }
        }
        else
        {
            echo "File open error";
        }
        $timeFileCreate += (microtime(true) - $startTime);
    }
    catch (Exception $ex)
    {
        echo "File Error: " . $ex->getMessage();
    }
}
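
The generateRangomString() and generateRangomAge() helpers, the globals ($fileName, $fileExtension, the timers, DEBUG) and getMemoryUsage() are not shown above; a minimal sketch of what they could look like, assuming the task's spec of 15-character strings and ages between 1 and 99 (the file name prefix and extension are placeholders):

// Assumed globals used by createDataFile() and insertSingleDataFile()
$fileName       = 'users_';   // placeholder: actual prefix may differ
$fileExtension  = '.csv';
$timeFileCreate = 0;
$timeSqlBulk    = 0;
define('DEBUG', true);

// Random lowercase string of the given length
function generateRangomString($length = 15)
{
    $chars  = 'abcdefghijklmnopqrstuvwxyz';
    $result = '';
    for ($i = 0; $i < $length; $i++) {
        $result .= $chars[mt_rand(0, strlen($chars) - 1)];
    }
    return $result;
}

// Random age between 1 and 99, per the task description
function generateRangomAge()
{
    return mt_rand(1, 99);
}

// Peak memory usage in MiB, as printed in the DEBUG output
function getMemoryUsage()
{
    return memory_get_peak_usage(true) / 1024 / 1024;
}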

Insert rows:

// Bulk-load one generated CSV chunk with COPY; note that COPY ... FROM '<file>'
// reads the file on the database server, so the server process must be able to read /tmp/ram.
function insertSingleDataFile($dbh, $task)
{
    global $fileName, $fileExtension, $timeSqlBulk;
    $startTime = microtime(true);
    $fileCSV = $fileName . $task . $fileExtension;
    $sqlBulk = "COPY users (user_id, first_name, last_name, user_age)
    FROM '/tmp/ram/$fileCSV'
    DELIMITER ','";

    try
    {
        $dbh->query($sqlBulk);
    }
    catch (PDOException $e)
    {
        echo 'PDO error: ' . $e->getMessage() . "\n\n";
    }
    $timeSqlBulk += (microtime(true) - $startTime);
}
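
The driver that ties the two functions together is not shown either; assuming the globals sketched above, it could look roughly like this (4 tasks of 1,250,000 rows each gives the 5 million records; the DSN and credentials are placeholders):

// Hypothetical driver: write 4 CSV chunks to tmpfs, then COPY each one in.
$dbh = new PDO('pgsql:host=localhost;dbname=test', 'user', 'password'); // placeholder DSN/credentials
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$tasks = 4; // 4 * 1,250,000 = 5,000,000 rows
for ($task = 1; $task <= $tasks; $task++) {
    createDataFile($task);             // writes /tmp/ram/users_<task>.csv
    insertSingleDataFile($dbh, $task); // bulk-loads it with COPY
}

echo "CSV generation: $timeFileCreate s, COPY: $timeSqlBulk s\n";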
  • Load data infile and adding indexes after inserts are the points of improvement – Your Common Sense Dec 09 '16 at 13:07
  • 5 million records?? ⊙▃⊙ – Blueblazer172 Dec 09 '16 at 13:08
  • I think this question is better suited for the Code Review Stack Exchange, because we're not dealing with an error in your code, but with optimization. – KhorneHoly Dec 09 '16 at 13:09
  • 377 seconds total script run time: 362 seconds for the inserts, 12 seconds for randomString(), 1.5 seconds for randomAge(). – Szczepan Zaskalski Dec 09 '16 at 13:19
  • You could also take a look at doing everything directly in the database using a stored procedure. – jeroen Dec 09 '16 at 13:26
  • According to the task description, the code has to be written in PHP. – Szczepan Zaskalski Dec 09 '16 at 13:29
  • Here are a couple of links that might be useful: http://stackoverflow.com/questions/758945/whats-the-fastest-way-to-do-a-bulk-insert-into-postgres and http://stackoverflow.com/questions/12206600/how-to-speed-up-insertion-performance-in-postgresql – mlinth Dec 09 '16 at 14:08
  • Use PostgreSQL's own functions to fill the users table: [link](http://stackoverflow.com/questions/8674718/best-way-to-select-random-rows-postgresql) or [link](http://stackoverflow.com/questions/3970795/how-do-you-create-a-random-string-in-postgresql) – GrApDev Dec 09 '16 at 14:11
  • http://php.net/manual/en/function.pg-copy-from.php – mlinth Dec 09 '16 at 14:16

1 Answer


You could try moving everything into one loop. There's no need to first collect the records, e.g.

$dbh->beginTransaction();

$sql = 'INSERT INTO users
    (first_name, last_name, user_age)
    VALUES (?, ?, ?)';

$sth = $dbh->prepare($sql);

for ($i = 0; $i < $countRecords; $i++) {
    $sth->execute(array(
        randomString(15),
        randomString(15),
        randomAge(),
    ));
}

$dbh->commit();

This way you save the additional memory for a large array, and avoid packing and unpacking between an associative and a simple array.
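
If holding all 5 million inserts in one transaction is a concern (a single failed row rolls back everything), a variation of the same idea is to commit in batches; the batch size of 10,000 below is an arbitrary assumption, trading commit overhead against transaction size:

$batchSize = 10000; // arbitrary choice

$sth = $dbh->prepare('INSERT INTO users (first_name, last_name, user_age) VALUES (?, ?, ?)');

$dbh->beginTransaction();
for ($i = 0; $i < $countRecords; $i++) {
    $sth->execute(array(randomString(15), randomString(15), randomAge()));
    if (($i + 1) % $batchSize === 0) {
        $dbh->commit();           // flush this batch
        $dbh->beginTransaction(); // start the next one
    }
}
$dbh->commit(); // commit the final (possibly empty) batch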

Olaf Dietsche