0

I'm new to Java, and I have to write a code that processes a huge CSV files, more specifically to pivot the file and output it in a different way.

When I test the code with couple hundreds of rows it works perfectly! but when I make the file around 700k rows, which it's the case, it outputs a file without any values!

Here is the file schema I'm using:

row_number,Time_float,V1_float,V2_float,V3_float,V4_float,V5_float,V6_float,V7_float,V8_float,V9_float,V10_float,V11_float,V12_float,V13_float,V14_float,V15_float,V16_float,V17_float,V18_float,V19_float,V20_float,V21_float,V22_float,V23_float,V24_float,V25_float,V26_float,V27_float,V28_float,Amount_float,Class_float
0,-1.996583023457193,-0.6942423209592997,-0.04407492457044802,1.6727734992241514,0.973365514375461,-0.24511658354252203,0.34706794516190337,0.19367893831762997,0.0826372794108694,0.33112778320993075,0.08338554524255039,-0.540407035731003,-0.6182957177945313,-0.9960989219768981,-0.32461018632700356,1.6040138389062168,-0.5368328685192513,0.24486345402090126,0.030769932602201018,0.4962820266510568,0.3261180160164485,-0.024923364961491876,0.38285443833968436,-0.17691133433749112,0.1105069205607409,0.2465854429694212,-0.3921704315485548,0.3308916226487169,-0.06378115069750527,0.24496426337017338,-0.04159897836869265
1,-1.996583023457193,0.608496327692869,0.16117591988327584,0.10979710210602639,0.31652292674812704,0.043483352047263765,-0.06181996595114916,-0.06370020974401087,0.07125348305960227,-0.23249418894056792,-0.15334962861450452,1.5800028495181635,1.0660885709726662,0.49141820388511015,-0.14998248182742244,0.6943604186631746,0.5294337538197206,-0.13516996909847068,-0.21876258231230966,-0.17908604930753447,-0.08961086263099204,-0.3073768045456866,-0.880076754448717,0.16220118362512442,-0.5611305498033158,0.3206939009070385,0.2610694754212664,-0.02225567818703569,0.04460751768012477,-0.3424745411051305,-0.04159897836869265
2,-1.9965619655334634,-0.6935004627168448,-0.8115778263099092,1.1694684928233277,0.2682312938426663,-0.3645717858377013,1.3514535859514658,0.6397756379029846,0.20737272949225669,-1.3786753514283783,0.19069961380161685,0.6118297100478469,0.06613661868011089,0.720699852326139,-0.173113888845362,2.5629061849805295,-3.298235372241396,1.3068678794417397,-0.14478999149835625,-2.7785608505208725,0.6809749715218957,0.33763169617058825,1.0633582711230263,1.456319745720107,-1.1380921384645095,-0.6285367205194723,-0.2884467520153837,-0.1371368556920797,-0.18102082710565728,1.1606859252297228,-0.04159897836869265
3,-1.9965619655334634,-0.4933248981469483,-0.11216942463929835,1.1825164508596404,-0.6097266412236745,-0.007468880343610824,0.9361498321757143,0.19207063819763437,0.3160175995113439,-1.2625031722059106,-0.050467953146436886,-0.22189161429591525,0.17837098770941834,0.5101687012441243,-0.30036049360793543,-0.6898374093803747,-1.2092959931757417,-0.8054446421603216,2.3453045221536306,-1.5142049233314772,-0.26985522543866364,-0.1474432966746349,0.007266907401368897,-0.30477654737727394,-1.941027139612268,1.2419037126390184,-0.46021734156146177,0.15539620725948264,0.1861885865396321,0.14053425198451397,-0.04159897836869265
4,-1.9965409076097336,-0.5913297637052182,0.5315410497431424,1.0214116755556233,0.28465540421364943,-0.29501543610441505,0.07199858317395955,0.4793022833609244,-0.22651023121331143,0.7443262870821311,0.6916250322008002,-0.8061465859382944,0.5386266478753359,1.3522443515849518,-1.1680335140679103,0.1913234721415302,-0.5152051215648076,-0.2790807862169683,-0.04556900286007717,0.987037297022314,0.5299387935696488,-0.012839217648641888,1.1000112712256314,-0.2201233960065819,0.2332500879381676,-0.39520164174981287,1.041611299621563,0.5436197963831321,0.6518159160992683,-0.07340334025310606,-0.04159897836869265
5,-1.9965409076097336,-0.21747462336443635,0.5816748868716723,0.7525853776136525,-0.11883332826086528,0.30500895924490323,-0.02231347965260122,0.3849359380956985,0.21795466774987732,-0.5176185728033812,-0.34110109646946285,1.3140464540936228,0.360182114067614,-0.3597915734090633,-0.1430571341620137,0.5655071130706558,0.45845966002880245,-0.06844505898436722,0.08190792006276612,-0.04077665181354113,0.1102154145899529,-0.2835222214491732,-0.7714270019345486,-0.04227284666287563,-0.6132733979473761,-0.4465835771366699,0.21963714625507125,0.6289004813773338,0.24563620173852233,-0.33855641663507124,-0.04159897836869265
6,-1.996498791762274,0.6277951843506749,0.08538910070753938,0.029922970138274736,0.8493831061038468,0.139019583041878,0.20469452783029024,-0.004170268076019058,0.0679975594125977,0.4232179413095267,-0.0911553627796218,-1.3881568003685008,-0.15394904162357412,-0.7546302429818224,0.174601525806113,0.05478291827131152,-0.5062323708749048,0.0033208458038780472,-0.7301429071537743,-0.05598631103904616,-0.28489532870127926,-0.22833366503144684,-0.373032409163239,-0.24677959261431537,-1.2879725270568725,1.4390366584060272,-0.5334361005147672,0.08549234948827854,0.015655983019391424,-0.33327894285907317,-0.04159897836869265
7,-1.9964356179910847,-0.3289283471917919,0.8586923440261,0.708576234767151,-0.34763105227477603,0.6875116870747386,0.32134542115371234,0.9058597727667212,-3.1882293041144427,0.5601291069522123,1.1474295272213637,-0.6068981847284727,0.2917078257188488,1.7663144960411647,-1.381049034972912,0.7496140133583983,-0.08687803964000758,-1.4389216521167638,-0.42738292340326745,0.39863532675456365,-0.20331695379069067,2.6458886485220905,-1.3992756084115,0.0920853274818958,-1.0727537219105157,-0.7966329656594401,-0.1070748551085608,-2.9901537037868864,-3.2880827705218882,-0.19010747625415497,-0.04159897836869265
8,-1.9964356179910847,-0.45657301681652984,0.17329147453401414,-0.07465262179597631,-0.19177387759003722,1.9341493665366207,2.793594054309211,0.29920595987227944,0.7125916765850463,-0.35685128708200337,-0.3769401127987114,-0.690809076979848,-0.11054073462883784,-0.2876133294193531,0.07756710959787069,-0.3592023119205175,-0.23974544333000244,-0.5884222737443606,0.14169463193440693,0.7006152251433341,0.06840582416474023,-0.09996300675392492,-0.36942472981431596,-0.327055266660248,1.6702690915973604,0.7159429034669824,-0.7966330619330197,0.02910414108912445,0.43142017601489174,0.019392240308193916,-0.04159897836869265
9,-1.9963935021436252,-0.17269774387883696,0.6780049203798014,0.6887814922128964,-0.15692675912027115,0.36179160642264196,-0.1852187109108031,0.5267057814810946,0.05822291589340413,-0.6705871938144513,-0.33691175301635523,0.996966068530707,0.837059523154456,1.0116259929700517,-0.46268060306897046,0.1641174886473323,0.8438820412040243,-0.6369448535966505,0.568708693313514,0.5549770030986336,0.2642433515365529,-0.336155613099156,-0.8732980464093,-0.19343791270246957,-0.6357673003859946,-0.1337734549071511,0.195341606203293,0.6100096992904638,0.25168133078356686,-0.33851643577313184,-0.04159897836869265
10,-1.9963724442198953,0.7398016180995844,-0.7123689085952064,0.6027095930866821,-0.9716078839795883,-1.4282856664438242,-0.4722411589208851,-1.150469214075524,0.04057090066824084,-1.5659577306254295,1.493926862783695,1.1753019928647082,-0.6719776055792052,-0.5163883874999073,-0.09915047715944188,0.2522962688421534,0.03648207425930615,0.29836818807428966,1.0192908179449345,-0.27193462986307226,-0.5022889927521785,-0.012663864860947654,0.43254001227028344,0.0444226909424958,0.826410614996712,0.4822144153958534,-0.26850047275423083,0.10616079403669354,0.04923997135326903,-0.32204432065410754,-0.04159897836869265

and Here is the code:

import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;
import java.io.FileWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class test {

    public static void main(String[] args) {
        try {
            //csv file containing data
            CSVReader reader = new CSVReader(new FileReader("/Users/csv-parser/sentiment.csv"));
            CSVWriter writer = new CSVWriter(new FileWriter("/Users/csv-parser/Output.csv"));
            String [] nextLine;
            String [] Target_Columns = new String[] {"row_number", "column_name", "column_value_string", "column_value_float"};
            List<String> Columns = null;
            long start = System.currentTimeMillis();

            int lineNumber = 0;

            while (true) {
                if ((nextLine = reader.readNext()) == null) break;
                lineNumber++;
                if (lineNumber == 1) {
                    writer.writeNext( Target_Columns );
                    Columns = Arrays.asList( nextLine );
//                    System.out.println( Arrays.toString( nextLine ) );
                    continue;
                }
                System.out.println( Arrays.toString( nextLine ) );
                int indexOfRow = Columns.indexOf( "row_number" );
                String row_value = "{ROW_NUMBER},{COl_NAME},{COL_STR},{COL_FL}";
                int i = 1;
                for(String column: Columns){
                    if (column.equals("row_number")) continue;
                    row_value = row_value.replace( "{ROW_NUMBER}", nextLine[0]);

                    if (column.contains( "_string" )) {
//                        System.out.print( "col_str: " + finalNextLines[i] + " " );
                        row_value = row_value.replace( "{COL_STR}", nextLine[i] );
                        row_value = row_value.replace( "{COL_FL}", " " );
                        row_value = row_value.replace( "{COl_NAME}", column.replace( "_string", "" ) );
                    }
                    if (column.contains( "_float" )) {
//                        System.out.print( "col_fl: " + finalNextLines[i] + " " );
                        row_value = row_value.replace( "{COL_STR}", " " );
                        row_value = row_value.replace( "{COL_FL}",  nextLine[i] );
                        row_value = row_value.replace( "{COl_NAME}", column.replace( "_float", "" ) );
                    }


                    i++;
//                    System.out.println( "ROW: " + row_value + " " );
                    writer.writeNext( row_value.split( "," ) );
                    row_value = "{ROW_NUMBER},{COl_NAME},{COL_STR},{COL_FL}";
                }

//                System.out.println("\n");
            }
            writer.close();
            long finish = System.currentTimeMillis();
            System.out.println( "Time elapsed: " + (finish - start));

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

and also, is there a way to make the code read/processes the file concurrently or using multithreads, I mean make it more efficient?

Update:

Here how the output looks like when I process the original file output

basel.ai
  • 127
  • 8
  • 2
    Do you get any exceptions? – Hovercraft Full Of Eels Aug 25 '19 at 13:15
  • 1
    ... and if you're planning on manipulating a very large amount of data, have you considered converting the CSV file into a database? – Hovercraft Full Of Eels Aug 25 '19 at 13:19
  • @HovercraftFullOfEels No I'm not getting any exceptions at all! and for the DataSet I didn't think about it, as I mentioned I'm new to Java – basel.ai Aug 25 '19 at 13:20
  • How big is the file? (in term of memory size) – jhamon Aug 25 '19 at 13:25
  • @jhamon 185 mega bytes – basel.ai Aug 25 '19 at 13:27
  • 2
    looks to me like the issue is you are only reading the nextLine into your Columns array `Columns = Arrays.asList( nextLine );` only once in the loop when `lineNumber ==1` – diarmuid Aug 25 '19 at 13:31
  • @diarmuid looks to me that you didn't go all the way to the end of the code, this is only for the first line to populate the columns which it's the first row. – basel.ai Aug 25 '19 at 13:32
  • 1
    This is quite convoluted and inefficient code. You want an array of 4 strings. And instead of just using an array and to populate each element of the array, you replace several times parts of a string, and then split the string. But anyway, you should start your program with your debugger, and step through its code line by line, to see where the problem is. – JB Nizet Aug 25 '19 at 13:36
  • Also, if you want performance, wrap the FileWriter into a BufferedWriter. Same for the reader. – JB Nizet Aug 25 '19 at 13:37
  • I'd also include (some of) the output of how it works, when it works correctly. That kind of (what I expect / what I see) difference sometimes helps in keeping things clear. – Edwin Buck Aug 25 '19 at 15:02

4 Answers4

1

In these very large files, occasionally you get bad input. The best time to check for nonsense input is in the read, at a record / row level (for CSV files, row is the natural choice).

So put an "if" statement in the iterator loop, and have it crash the program with a good error statement on whatever you deem to be bad input. Who knows, you just might find 400 blank rows between a large number of fully populated rows. Also, you might find special control characters, etc.

Once you feel your input is correct, you can start debugging into the logic of the pivot. Who knows, maybe you found a bug in the library.

Personally, I think it's a problem with the input data, as you indicate the logic works well when using smaller data sets. Odds are there's something weird in the larger data set that you haven't noticed, due to its size.

Edwin Buck
  • 64,804
  • 7
  • 90
  • 127
1

It's really hard to say what is wrong without a full stack trace because your code has lots of potential problems with performance, JVM memory usage etc. I think it's simpler to give you one of a correct solution; at leats, it works with 700k text file.

import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ReadJSONExample {

    public static void main(String[] args) throws IOException {
        // Move time calculation out of main logic
        long start = System.currentTimeMillis();

        new CsvDataModifier(Paths.get("e:/data.csv"), Paths.get("e:/res.csv")).apply();

        long finish = System.currentTimeMillis();
        System.out.printf("Time elapsed: %d ms", finish - start);
    }
}

final class CsvDataModifier {

    private final Path src;
    private final Path dest;

    public CsvDataModifier(Path src, Path dest) {
        this.src = src;
        this.dest = dest;
    }

    private static final int ROW_NUMBER = 0;
    private static final int COLUMN_NAME = 1;
    private static final int COLUMN_VALUE_STRING = 2;
    private static final int COLUMN_VALUE_FLOAT = 3;
    private static final String[] COLUMN_NAMES = { "row_number", "column_name", "column_value_string", "column_value_float" };

    public void apply() throws IOException {
        // Use `try with recources` to close streams correctly
        try (CSVReader reader = new CSVReader(new FileReader(src.toFile())); 
             CSVWriter writer = new CSVWriter(new FileWriter(dest.toFile()), ',', CSVWriter.NO_QUOTE_CHARACTER)) {
            // key - ordered list of columns in source file
            Map<String, Marker> columnNameFloatMarker = getSourceColumnNamesWithFloatMarker(reader.readNext());
            int posRowNumber = getRowNumberPosition(columnNameFloatMarker.keySet());

            if (columnNameFloatMarker.isEmpty())
                return;

            writer.writeNext(COLUMN_NAMES);

            // Create buffer only once for all lines; do not use string concatenation or replacing
            String[] buf = new String[COLUMN_NAMES.length];

            reader.forEach(values -> {
                buf[ROW_NUMBER] = values[posRowNumber];

                int col = 0;
                // this is just reference to buf[]; if `null` then no output
                String[] resultLine;

                for (Map.Entry<String, Marker> entry : columnNameFloatMarker.entrySet()) {
                    String columnName = entry.getKey();
                    Marker marker = entry.getValue();

                    if ((resultLine = marker.createResultLine(columnName, values[col], buf)) != null)
                        writer.writeNext(resultLine);

                    col++;
                }
            });
        }
    }

    private static final String FLOAT = "_float";
    private static final String STRING = "_string";

    private enum Marker {
        NONE {
            @Override
            public String[] createResultLine(String columnName, String value, String[] buf) {
                return null;
            }
        },
        STRING {
            @Override
            public String[] createResultLine(String columnName, String value, String[] buf) {
                buf[COLUMN_VALUE_STRING] = value;
                buf[COLUMN_VALUE_FLOAT] = " ";
                buf[COLUMN_NAME] = columnName;
                return buf;
            }
        },
        FLOAT {
            @Override
            public String[] createResultLine(String columnName, String value, String[] buf) {
                buf[COLUMN_VALUE_STRING] = " ";
                buf[COLUMN_VALUE_FLOAT] = value;
                buf[COLUMN_NAME] = columnName;
                return buf;
            }
        };

        public abstract String[] createResultLine(String columnName, String value, String[] buf);
    }

    // Source column preprocessing to avoid string comparision; do it only once
    private static Map<String, Marker> getSourceColumnNamesWithFloatMarker(String... columns) {
        if (columns == null || columns.length == 0)
            return Collections.emptyMap();

        Map<String, Marker> map = new LinkedHashMap<>();

        for (int i = 0; i < columns.length; i++) {
            String columnName = columns[i];
            Marker marker = Marker.NONE;

            if (columnName.endsWith(FLOAT)) {
                columnName = columnName.substring(0, columnName.length() - FLOAT.length());
                marker = Marker.FLOAT;
            } else if (columnName.endsWith(STRING)) {
                columnName = columnName.substring(0, columnName.length() - STRING.length());
                marker = Marker.STRING;
            }

            if (map.put(columnName, marker) != null)
                throw new IllegalArgumentException("Column duplication in the source file");
        }

        return map;
    }

    private static int getRowNumberPosition(Set<String> columnNames) {
        int i = 0;

        for (String columnName : columnNames) {
            if ("row_number".equals(columnName))
                return i;
            i++;
        }

        throw new IllegalArgumentException("Source file does not contain 'row_number' column");
    }
}

  • Input file: data.csv ~432 Mb
  • Output file: res.csv ~868 Mb
  • Console outputs: Time elapsed: 21058 ms
oleg.cherednik
  • 12,764
  • 2
  • 17
  • 25
  • thanks a lot for the detailed response! one more thing, it outputs the float / integer number surrounded with " " (i.e. "0","Time","-1.996583023457193"," ") How can I output the numbers (either for the column _float or row_number) without the double quotes? because I need to push the output to Elastic-search and I want to keep the index/mapping and datatypes detectable. – basel.ai Aug 26 '19 at 02:38
  • 1
    @basel.ai https://stackoverflow.com/questions/13969254/unwanted-double-quotes-in-generated-csv-file – oleg.cherednik Aug 26 '19 at 05:05
  • one more thing, how can I output the result into a json format not into a csv file? – basel.ai Aug 28 '19 at 15:14
0

Your output file is bigger than input file. It's possible that when the code is run against 700k rows it is producing a huge file which is difficult for editors to open.

What's the size of /Users/csv-parser/Output.csv when you run it for 700k rows?

Did you try printing the content of /Users/csv-parser/Output.csv using code rather than opening it using editor?

kiDDevil
  • 25
  • 1
  • 12
-1

Have you tried to put it in another thread/task and run it? Maybe it is blocking the main thread...

Check this out: https://winterbe.com/posts/2015/04/07/java8-concurrency-tutorial-thread-executor-examples/

smithnblack
  • 472
  • 6
  • 16