
Good morning to the community. I have a question: I need to import 14 million records containing a company's client information.

The flat file (.txt) weighs 2.8 GB. I have developed a Java program that reads the flat file line by line, processes the data, and puts it into an object that is in turn inserted into a table in a PostgreSQL database. The problem is that, by my measurement, 100,000 records are inserted in about 112 minutes, so I have to insert the file in parts.

public static void main(String[] args) {  

  // PROCESSING 100,000 records in 112 minutes 
  // PROCESSING 1,000,000 records in 770 minutes = 18.66 hours

  loadData(0L, 0L, 100000L);
}

/**
 * Loads a number of records depending on the input parameters.
 * @param counterInitial - Initial counter, type long.
 * @param loadInitial - Initial load, type long.
 * @param loadLimit - Load limit, type long.
 */
private static void loadData(long counterInitial, long loadInitial, long loadLimit){
  Session session = HibernateUtil.getSessionFactory().openSession(); 
  try{
      FileInputStream fstream = new FileInputStream("C:\\sppadron.txt");
      DataInputStream entrada = new DataInputStream(fstream);
      BufferedReader buffer = new BufferedReader(new InputStreamReader(entrada));
      String strLinea;
      while ((strLinea = buffer.readLine()) != null){
          if(counterInitial > loadInitial){
              if(counterInitial > loadLimit){
                  break;
              }
              Sppadron spadron= new Sppadron();
              spadron.setSpId(counterInitial);                
              spadron.setSpNle(strLinea.substring(0, 9).trim());
              spadron.setSpLib(strLinea.substring(9, 16).trim());
              spadron.setSpDep(strLinea.substring(16, 19).trim());
              spadron.setSpPrv(strLinea.substring(19, 22).trim());
              spadron.setSpDst(strLinea.substring(22, 25).trim());
              spadron.setSpApp(strLinea.substring(25, 66).trim());
              spadron.setSpApm(strLinea.substring(66, 107).trim());
              spadron.setSpNom(strLinea.substring(107, 143).trim());                   
              String cadenaGriSecDoc = strLinea.substring(143, strLinea.length()).trim();                    
              String[] tokensVal = cadenaGriSecDoc.split("\\s+");
              if(tokensVal.length == 5){
                  spadron.setSpNac(tokensVal[0]);
                  spadron.setSpSex(tokensVal[1]);
                  spadron.setSpGri(tokensVal[2]);
                  spadron.setSpSec(tokensVal[3]);
                  spadron.setSpDoc(tokensVal[4]);
              } else {
                  spadron.setSpNac(tokensVal[0]);
                  spadron.setSpSex(tokensVal[1]);
                  spadron.setSpGri(tokensVal[2]);
                  spadron.setSpSec(null);
                  spadron.setSpDoc(tokensVal[3]);
              }
              try{
                  session.getTransaction().begin();
                  session.save(spadron); // Insert
                  session.getTransaction().commit(); 
              } catch (Exception e) {
                  session.getTransaction().rollback();
                  e.printStackTrace();
              }
          }
          counterInitial++;
      }
      buffer.close(); // closes the underlying streams as well
  } catch (Exception e) {
      e.printStackTrace();
  }finally{
      session.close();
  }
}

The main issue is this: if you check my code, when I insert the first million records the parameters are as follows: loadData(0L, 0L, 1000000L);

The issue is that when I insert the following records, in this case the next million, the call would be: loadData(0L, 1000000L, 2000000L); this makes the program scroll through the first million records again, and only when the counter reaches 1000001 does it actually begin inserting the following records. Can someone give me a more optimal suggestion for inserting the records, knowing that the information still has to be processed as shown in the code above?

  • You are committing each and every insert. That *is* going to be terribly slow. You should only commit at the end. Or stop using Hibernate for this; it is not designed for bulk loading. The most efficient solution is to use `copy` through the `CopyManager` API. Or at least use plain JDBC and statement batching. – a_horse_with_no_name May 14 '14 at 17:01
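
To illustrate the first suggestion in that comment, here is a minimal sketch of committing once per batch instead of once per row, reusing the classes from the question. The 1,000-row batch size and the parseLine helper (which would wrap the substring parsing shown above) are assumptions, not part of the original code.

// Minimal sketch, not the original code: commit per batch instead of per row.
// parseLine() is a hypothetical helper wrapping the substring parsing shown above.
private static void loadDataBatched(String path, int batchSize) throws Exception {
    Session session = HibernateUtil.getSessionFactory().openSession();
    try (BufferedReader buffer = new BufferedReader(new FileReader(path))) {
        session.getTransaction().begin();
        String strLinea;
        long id = 0;
        int count = 0;
        while ((strLinea = buffer.readLine()) != null) {
            Sppadron spadron = parseLine(strLinea, id++);  // hypothetical helper
            session.save(spadron);
            if (++count % batchSize == 0) {
                session.flush();                    // push pending inserts to the database
                session.clear();                    // detach entities so the session stays small
                session.getTransaction().commit();  // end this batch's transaction
                session.getTransaction().begin();   // start the next batch
            }
        }
        session.getTransaction().commit();          // commit the final partial batch
    } finally {
        session.close();
    }
}

Setting hibernate.jdbc.batch_size in the Hibernate configuration additionally lets Hibernate group those inserts into JDBC batches.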

1 Answer


See *How to speed up insertion performance in PostgreSQL*.

The first thing you should do is bypass Hibernate. ORMs are convenient, but you pay a price in speed for that convenience, especially with bulk operations.

You could group your inserts into reasonably sized transactions and use multi-valued inserts with a JDBC PreparedStatement; a sketch follows.
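
For instance, a rough sketch of plain JDBC with statement batching. The table and column names, JDBC URL, and credentials are assumptions, and only the first few fields are bound for brevity; a multi-valued INSERT would bundle several rows' placeholders into a single statement in the same spirit.

// A rough sketch of plain JDBC with statement batching (not the original code).
// The table and column names, JDBC URL, and credentials are assumptions.
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JdbcBatchLoader {
    public static void main(String[] args) throws Exception {
        // Only three columns shown; bind the remaining fixed-width fields the same way.
        String sql = "INSERT INTO sppadron (sp_id, sp_nle, sp_lib) VALUES (?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql);
             BufferedReader buffer = new BufferedReader(new FileReader("C:\\sppadron.txt"))) {
            conn.setAutoCommit(false);               // one transaction for the whole run
            String strLinea;
            long id = 0;
            int pending = 0;
            while ((strLinea = buffer.readLine()) != null) {
                ps.setLong(1, id++);
                ps.setString(2, strLinea.substring(0, 9).trim());
                ps.setString(3, strLinea.substring(9, 16).trim());
                ps.addBatch();
                if (++pending % 1000 == 0) {
                    ps.executeBatch();               // send 1,000 inserts in one round trip
                }
            }
            ps.executeBatch();                       // flush the last partial batch
            conn.commit();
        }
    }
}

Running the whole load in a single transaction (autocommit off, one final commit) avoids the per-row commit cost that dominates the original code.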

Personally, though, I'd use PgJDBC's support for the COPY protocol to do the inserts more directly. Unwrap your Hibernate Session object to get the underlying java.sql.Connection, get the PGConnection interface for it, call getCopyAPI() to get the CopyManager, and use copyIn to feed your data into the DB.
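
A rough sketch of that path, using Session.doWork to reach the JDBC Connection. It assumes the rows have already been converted to CSV text and that the connection can be unwrapped to PGConnection (a pooled connection may need an extra unwrap step); the column list is carried over from the entity above.

// A rough sketch of feeding COPY via PgJDBC's CopyManager from a Hibernate Session.
// Whether the JDBC Connection can be unwrapped to PGConnection depends on your
// connection pool; the COPY column list and csvData contents are assumptions.
import java.io.IOException;
import java.io.StringReader;
import java.sql.SQLException;
import org.hibernate.Session;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyInExample {
    static void copyRows(Session session, final String csvData) {
        session.doWork(connection -> {
            try {
                PGConnection pgConn = connection.unwrap(PGConnection.class);
                CopyManager copyManager = pgConn.getCopyAPI();
                // csvData holds rows already converted to CSV; streaming the whole
                // file is covered in the next paragraphs.
                copyManager.copyIn(
                    "COPY sppadron (sp_id, sp_nle, sp_lib) FROM STDIN WITH (FORMAT csv)",
                    new StringReader(csvData));
            } catch (IOException e) {
                throw new SQLException("COPY failed", e);
            }
        });
    }
}

COPY skips per-statement parsing and planning, which is why it is usually the fastest way to bulk-load PostgreSQL.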

Since it looks like your data isn't in CSV form but fixed-width field form, what you'll need to do is start a thread that reads your data from the file, converts each datum into CSV form suitable for PostgreSQL input, and writes it to a buffer that copyIn can consume with the passed Reader. This sounds more complicated than it is, and there are lots of examples of Java producer/consumer threading implementations using java.io.Reader and java.io.Writer interfaces out there.
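
For example, a sketch of that arrangement using PipedWriter/PipedReader. The field offsets, column list, and file path are assumptions carried over from the question, and real code would also need to CSV-quote values that might contain commas or quotes.

// A rough sketch of the producer/consumer approach with PipedWriter/PipedReader.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PipedReader;
import java.io.PipedWriter;
import org.postgresql.copy.CopyManager;

public class PipedCopyLoader {
    static void load(CopyManager copyManager) throws Exception {
        PipedWriter producerEnd = new PipedWriter();
        PipedReader consumerEnd = new PipedReader(producerEnd);

        // Producer: read fixed-width lines, convert them to CSV, write into the pipe.
        Thread producer = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("C:\\sppadron.txt"));
                 PipedWriter out = producerEnd) {        // closing the pipe ends the COPY
                String line;
                long id = 0;
                while ((line = in.readLine()) != null) {
                    out.write(id++ + "," + line.substring(0, 9).trim() + ","
                            + line.substring(9, 16).trim() + "\n");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // Consumer: copyIn reads from the other end of the pipe until it is closed.
        copyManager.copyIn(
                "COPY sppadron (sp_id, sp_nle, sp_lib) FROM STDIN WITH (FORMAT csv)",
                consumerEnd);
        producer.join();
    }
}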

It's possible you may instead be able to write a filter for the Reader that wraps the underlying file reader and transforms each line. This would be much simpler than producer/consumer threading. Research it as the preferred option first.
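
A sketch of such a filter: a Reader that converts each fixed-width line to a CSV row on demand, so copyIn can stream the file directly without a second thread. Offsets and the CSV layout are again assumptions based on the question's code.

// A rough sketch of a Reader "filter" that converts fixed-width lines to CSV rows.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class FixedWidthToCsvReader extends Reader {
    private final BufferedReader source;
    private String pending = "";     // CSV text produced but not yet handed out
    private int pos = 0;
    private long id = 0;

    public FixedWidthToCsvReader(BufferedReader source) {
        this.source = source;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (pos >= pending.length()) {                 // refill from the next source line
            String line = source.readLine();
            if (line == null) {
                return -1;                             // end of input
            }
            pending = id++ + "," + line.substring(0, 9).trim() + ","
                    + line.substring(9, 16).trim() + "\n";
            pos = 0;
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, cbuf, off);     // copy into the caller's buffer
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        source.close();
    }
}

Passing new FixedWidthToCsvReader(new BufferedReader(new FileReader("C:\\sppadron.txt"))) to copyIn then streams the whole file through that conversion.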

– Craig Ringer