46

My csv is getting read into the System.out, but I've noticed that any text with a space gets moved into the next line (as a return \n)

Here's how my csv starts:

first,last,email,address 1, address 2
john,smith,blah@blah.com,123 St. Street,
Jane,Smith,blech@blech.com,4455 Roger Cir,apt 2

After running my app, any cell with a space (address 1), gets thrown onto the next line.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class main {

    public static void main(String[] args) {
        // -define .csv file in app
        String fileNameDefined = "uploadedcsv/employees.csv";
        // -File class needed to turn stringName to actual file
        File file = new File(fileNameDefined);

        try{
            // -read from filePooped with Scanner class
            Scanner inputStream = new Scanner(file);
            // hashNext() loops line-by-line
            while(inputStream.hasNext()){
                //read single line, put in string
                String data = inputStream.next();
                System.out.println(data + "***");

            }
            // after loop, close scanner
            inputStream.close();


        }catch (FileNotFoundException e){

            e.printStackTrace();
        }

    }
}

So here's the result in the console:

first,last,email,address 
1,address 
2
john,smith,blah@blah.com,123 
St. 
Street,
Jane,Smith,blech@blech.com,4455 
Roger 
Cir,apt 
2

Am I using Scanner incorrectly?

coffeemonitor
  • 11,838
  • 33
  • 93
  • 146
  • 4
    Use a proper CSV library http://stackoverflow.com/questions/200609/can-you-recommend-a-java-library-for-reading-and-possibly-writing-csv-files – artbristol Jan 11 '13 at 08:31
  • Scanner default delimiter is whitespace, may be that is the problem. – Pankaj Jan 11 '13 at 08:32

8 Answers8

158

Please stop writing faulty CSV parsers!

I've seen hundreds of CSV parsers and so called tutorials for them online.

Nearly every one of them gets it wrong!

This wouldn't be such a bad thing as it doesn't affect me but people who try to write CSV readers and get it wrong tend to write CSV writers, too. And get them wrong as well. And these ones I have to write parsers for.

Please keep in mind that CSV (in order of increasing not so obviousness):

  1. can have quoting characters around values
  2. can have other quoting characters than "
  3. can even have other quoting characters than " and '
  4. can have no quoting characters at all
  5. can even have quoting characters on some values and none on others
  6. can have other separators than , and ;
  7. can have whitespace between seperators and (quoted) values
  8. can have other charsets than ascii
  9. should have the same number of values in each row, but doesn't always
  10. can contain empty fields, either quoted: "foo","","bar" or not: "foo",,"bar"
  11. can contain newlines in values
  12. can not contain newlines in values if they are not delimited
  13. can not contain newlines between values
  14. can have the delimiting character within the value if properly escaped
  15. does not use backslash to escape delimiters but...
  16. uses the quoting character itself to escape it, e.g. Frodo's Ring will be 'Frodo''s Ring'
  17. can have the quoting character at beginning or end of value, or even as only character ("foo""", """bar", """")
  18. can even have the quoted character within the not quoted value; this one is not escaped

If you think this is obvious not a problem, then think again. I've seen every single one of these items implemented wrongly. Even in major software packages. (e.g. Office-Suites, CRM Systems)

There are good and correctly working out-of-the-box CSV readers and writers out there:

If you insist on writing your own at least read the (very short) RFC for CSV.

Basil Bourque
  • 218,480
  • 72
  • 657
  • 915
Scheintod
  • 7,352
  • 7
  • 35
  • 58
  • So Right ! This should be the accepted answer : Don't ! use a real parser that support quotes, newline in values etc. Thank you Scheintod for preventing some people to do it. – baraber Feb 18 '16 at 18:44
  • 1
    We can argue if we are talking about CSV files in many of these cases. – Jan Bodnar Aug 10 '16 at 18:44
  • I agree with @JanBodnar here... points 2, 3, 4, and 5 are false. CSV by definition can only have 1 delimiter, a comma. Otherwise it's just messy data in rows and columns, which may or may not have the file extension .csv. 9 is irrelevant. Just use a dynamic data structure. Anything talking about " and ' are irrelevant to data capturing, although could be removed in a data cleaning step. You're really just talking about a .csv file that was treated like a .txt file. No software package will interpret that correctly, because you don't know what a CSV file is. Good bottom link though. – TurnipEntropy Mar 10 '17 at 21:20
  • 1
    You are partly right and I have poor wording in in those points. "Quoting characters" would be the correct word instead of "delimiter". I might change this after a coffee. But while you are right that 100% correct csv would have a comma as field separator in reality you will encounter all sorts of them. I see semicolons way more often than commas and have encountered event tabs and spaces. All good csv libraries that I'm aware of can parse these because that is their job: parse the csv some other program has genenerated. (But I have even encountered ".csv" files which no parser could parse) – Scheintod Mar 16 '17 at 08:37
  • @TurnipEntropy "CSV" is used, commonly but imprecisely, as a general term for files where delimiter characters in values are handled via quoting. The alternative approach of escaping delimiter characters in values is commonly referred to as "TEXT" in contrast. Postgres, for instance, uses these terms for input formats for its COPY command for bulk import/export. So saying that the file should must use "," as a delimiter to be a csv is in line with the RFC and the usage in the desktop computing world but too narrow for the database and data transfer world. – Bell Jul 14 '17 at 15:39
  • My problem with CSV-Parsers is that they are too complex. I need half an hour to find, choose and to configure a parser. I just want to read some text from a file. So there is always a temptation to write something quick and dirty. Is there a parser out there which can guess all the config from the file? – Alex Feb 19 '18 at 16:55
  • @Alex: I'm not aware of any. But if you've any of them once the next time it is more a matter of seconds to use them again. And normally the only thing you have to configure is the separating character. If your input is too broken (because someone else decided to write a broken csv writer ... again) then they normally can't help you. – Scheintod Feb 19 '18 at 19:36
  • @Alex: Have a look at my two links. Using both of these is mainly the same code as using a normal Reader but now its a csv-reader which doesn't read strings bug arrays of strings. There is really not much more to it. – Scheintod Feb 19 '18 at 19:37
  • I'm using opencsv for my projects. It's ok and it works most of the time (there was a problem parsing AdWords reports a month ago). Yesterday I wanted to parse a csv-file (one-time adhoc). I have to provide escape-char, quote-char, encoding, decide whether the file is zipped or not, catch IOExceptions, call readNext() and check if it is null or not in a loop. Better would be: Exceptional> myStream = CsvReader.read("/path/to/file.csv"); and the reader should take care of everything else. Just saying.. – Alex Feb 20 '18 at 13:19
  • @Alex: That's what my whole point is all about: I have written to many parsers for broken csv in my life. You shouldn't need to specify escape or quote chars because csv-spec already has. But I don't think there is a *reliable* way of autodetecting this shit. (e.g.: a,b;c as "a,b" "c" vs "a" "b;c") But perhaps you could try parsing the whole file with different settings and see what settings produces the same line length for all lines. On the other hand: I've seen csv files where each line has it's own lenght ... – Scheintod Feb 21 '18 at 10:50
  • Adding a comment - all is good, Mr. Scheintod. But my CSV files usually are simpler. I use the comma for delimiter, and ignore commas inside of double quotes. That is all. No newlines in values, no escape chars, and so on. If a parser does this accurately, I consider the parser as good. . . I voted your answer up, anyway. – Baruch Atta Jul 03 '18 at 16:43
45
scanner.useDelimiter(",");

This should work.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;


public class TestScanner {

    public static void main(String[] args) throws FileNotFoundException {
        Scanner scanner = new Scanner(new File("/Users/pankaj/abc.csv"));
        scanner.useDelimiter(",");
        while(scanner.hasNext()){
            System.out.print(scanner.next()+"|");
        }
        scanner.close();
    }

}

For CSV File:

a,b,c d,e
1,2,3 4,5
X,Y,Z A,B

Output is:

a|b|c d|e
1|2|3 4|5
X|Y|Z A|B|
Pankaj
  • 4,738
  • 3
  • 25
  • 36
  • Tried that and it appears to put each cell on a new line. – coffeemonitor Jan 11 '13 at 08:41
  • added a test program, System.out.println() will print in new line, you need to use print(). – Pankaj Jan 11 '13 at 08:48
  • 17
    Am I correct in thinking it's not delimiting newlines? i.e. it's reading "e\n1" and "5\nX" as single tokens? Presumably a more complex regex would be needed to delimit on both commas and newlines. (Of course, even with this it would still fail to keep quoted strings together so wouldn't be ideal for CSV in general.) – mwfearnley May 08 '13 at 22:01
  • 4
    we can use Scanner.hasNextLine() and nextLine() methods to get single line data and then process it with above program. – Pankaj Oct 16 '13 at 23:05
  • 4
    This solution will not work if the delimiter (,) is part of the value. This answer is wrong. – Chirag Agrawal Jun 30 '16 at 10:15
  • 1
    support for delimiters as part of data was never mentioned. The term "Comma Separated Values" is quite self-explanatory. No mention was made of a need for 'encapsulators' as per RFC-4180 or Excel CSV variants. – robert Nov 22 '16 at 08:49
  • 1
    That'll work well and good, until you get a value such as, `"Commas can go in values, too!"` – DDPWNAGE Nov 13 '18 at 05:33
  • This works, unless there are quotes or commas inside your values, then it breaks. Should code as if you don't know the format of the CSV. – Andrew Mar 12 '20 at 02:02
10

Scanner.next() does not read a newline but reads the next token, delimited by whitespace (by default, if useDelimiter() was not used to change the delimiter pattern). To read a line use Scanner.nextLine().

Once you read a single line you can use String.split(",") to separate the line into fields. This enables identification of lines that do not consist of the required number of fields. Using useDelimiter(","); would ignore the line-based structure of the file (each line consists of a list of fields separated by a comma). For example:

while (inputStream.hasNextLine())
{
    String line = inputStream.nextLine();
    String[] fields = line.split(",");
    if (fields.length >= 4) // At least one address specified.
    {
        for (String field: fields) System.out.print(field + "|");
        System.out.println();
    }
    else
    {
        System.err.println("Invalid record: " + line);
    }
}

As already mentioned, using a CSV library is recommended. For one, this (and useDelimiter(",") solution) will not correctly handle quoted identifiers containing , characters.

hmjd
  • 113,589
  • 17
  • 194
  • 245
1

Split nextLine() by this delimiter: (?=([^\"]*\"[^\"]*\")*[^\"]*$)").

DDPWNAGE
  • 1,423
  • 8
  • 35
1

I agree with Scheintod that using an existing CSV library is a good idea to have RFC-4180-compliance from the start. Besides the mentioned OpenCSV and Oster Miller, there are a series of other CSV libraries out there. If you're interested in performance, you can take a look at the uniVocity/csv-parsers-comparison. It shows that

are consistently the fastest using either JDK 6, 7, 8, or 9. The study did not find any RFC 4180 compatibility issues in any of those three. Both OpenCSV and Oster Miller are found to be about twice as slow as those.

I'm not in any way associated with the author(s), but concerning the uniVocity CSV parser, the study might be biased due to its author being the same as of that parser.

To note, the author of SimpleFlatMapper has also published a performance comparison comparing only those three.

Yushin Washio
  • 475
  • 6
  • 10
0

I have seen many production problems caused by code not handling quotes ("), newline characters within quotes, and quotes within the quotes; e.g.: "he said ""this""" should be parsed into: he said "this"

Like it was mentioned earlier, many CSV parsing examples out there just read a line, and then break up the line by the separator character. This is rather incomplete and problematic.

For me and probably those who prefer build verses buy (or use somebody else's code and deal with their dependencies), I got down to classic text parsing programming and that worked for me:

/**
 * Parse CSV data into an array of String arrays. It handles double quoted values.
 * @param is input stream
 * @param separator
 * @param trimValues
 * @param skipEmptyLines
 * @return an array of String arrays
 * @throws IOException
 */
public static String[][] parseCsvData(InputStream is, char separator, boolean trimValues, boolean skipEmptyLines)
    throws IOException
{
    ArrayList<String[]> data = new ArrayList<String[]>();
    ArrayList<String> row = new ArrayList<String>();
    StringBuffer value = new StringBuffer();
    int ch = -1;
    int prevCh = -1;
    boolean inQuotedValue = false;
    boolean quoteAtStart = false;
    boolean rowIsEmpty = true;
    boolean isEOF = false;

    while (true)
    {
        prevCh = ch;
        ch = (isEOF) ? -1 : is.read();

        // Handle carriage return line feed
        if (prevCh == '\r' && ch == '\n')
        {
            continue;
        }
        if (inQuotedValue)
        {
            if (ch == -1)
            {
                inQuotedValue = false;
                isEOF = true;
            }
            else
            {
                value.append((char)ch);

                if (ch == '"')
                {
                    inQuotedValue = false;
                }
            }
        }
        else if (ch == separator || ch == '\r' || ch == '\n' || ch == -1)
        {
            // Add the value to the row
            String s = value.toString();

            if (quoteAtStart && s.endsWith("\""))
            {
                s = s.substring(1, s.length() - 1);
            }
            if (trimValues)
            {
                s = s.trim();
            }
            rowIsEmpty = (s.length() > 0) ? false : rowIsEmpty;
            row.add(s);
            value.setLength(0);

            if (ch == '\r' || ch == '\n' || ch == -1)
            {
                // Add the row to the result
                if (!skipEmptyLines || !rowIsEmpty)
                {
                    data.add(row.toArray(new String[0]));
                }
                row.clear();
                rowIsEmpty = true;

                if (ch == -1)
                {
                    break;
                }
            }
        }
        else if (prevCh == '"')
        {
            inQuotedValue = true;
        }
        else
        {
            if (ch == '"')
            {
                inQuotedValue = true;
                quoteAtStart = (value.length() == 0) ? true : false;
            }
            value.append((char)ch);
        }
    }
    return data.toArray(new String[0][]);
}

Unit Test:

String[][] data = parseCsvData(new ByteArrayInputStream("foo,\"\",,\"bar\",\"\"\"music\"\"\",\"carriage\r\nreturn\",\"new\nline\"\r\nnext,line".getBytes()), ',', true, true);
for (int rowIdx = 0; rowIdx < data.length; rowIdx++)
{
    System.out.println(Arrays.asList(data[rowIdx]));
}

generates the output:

[foo, , , bar, "music", carriage
return, new
line]
[next, line]
Andrew Tang
  • 91
  • 1
  • 4
-1

If you absolutely must use Scanner, then you must set its delimiter via its useDelimiter(...) method. Else it will default to using all white space as its delimiter. Better though as has already been stated -- use a CSV library since this is what they do best.

For example, this delimiter will split on commas with or without surrounding whitespace:

scanner.useDelimiter("\\s*,\\s*");

Please check out the java.util.Scanner API for more on this.

Hovercraft Full Of Eels
  • 276,051
  • 23
  • 238
  • 346
-3

Well, I do my coding in NetBeans 8.1:

First: Create a new project, select Java application and name your project.

Then modify your code after public class to look like the following:

/**
 * @param args the command line arguments
 * @throws java.io.FileNotFoundException
 */
public static void main(String[] args) throws FileNotFoundException {
    try (Scanner scanner = new Scanner(new File("C:\\Users\\YourName\\Folder\\file.csv"))) {
         scanner.useDelimiter(",");
         while(scanner.hasNext()){
             System.out.print(scanner.next()+"|");
         }}
    }
}
Tunaki
  • 116,530
  • 39
  • 281
  • 370