1

I've a CSV file whose content is as follows:

1,"hello, there",I have a csv in which,"only when ""double quote"" or comma are there in the content",it will be wrapped in the double quotes,otherwise not,something like 1/2" will not be wrapped up in double quotes.

I used OpenCSV and other CSV libraries for parsing but it didn't work.

I used the Regular Expression quoted in the StackOverflow question but it also didn't work.

However, when I open it in Excel it is working fine. Can someone give me a hint regarding how to parse this CSV file.

Note that when the content contains a comma, then only it is wrapped up in the text qualifier. When such content is wrapped up in double quotes, and the double quote is part of the content, then it is escaped with the double quote. In other words, it changes to double double quote. But if the content has a double quote, then it is not wrapped up in the text qualifiers.

Please advise regarding this.

The output of the above content when parsed should be as below:

The output should be as follows:

1
hello, there
I have a csv in which
only whn "double quote" or comma are there in the content
it will be wrapped in the double quotes
otherwise not
something like 1/2" will not be wrapped up in double quotes.

I tried using open csv and also tried split using the regular expression:

",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

But of no use.

My Data is like below:

PRODUCT,,1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE,P,2510906459,,DEWALT TOOLS,,,<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED  - EQUAL TO 115-DCD960KL,

Want this to be parsed as below (I used to represent an empty cell when we see it in excel)

PRODUCT
<BLANK>
1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE
P
2510906459
<BLANK>
DEWALT TOOLS
<BLANK>
<BLANK>
<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED  - EQUAL TO 115-DCD960KL
Community
  • 1
  • 1
  • Your question as it stands right now is very hard to read and understand. Could you please format it properly, add examples of what your content might be and how you want to have it split and also what you have tried so far? – Sebastian Proske Mar 26 '16 at 12:31
  • Thanks Sebastian for reading this. Edited the content. Please let me know if it is more readable now. – Shaik Allabakash Mar 26 '16 at 12:47
  • 1
    As stated, I'd say this problem is *impossible* to solve unless you make a stricter definition for "something like 1/2" " – Darius X. Mar 26 '16 at 14:07
  • Hi Darius, Thanks for your comments. Updated the question with more details. Please refer and let me know. – Shaik Allabakash Mar 26 '16 at 16:53

2 Answers2

2

I had no problems parsing your input with uniVocity-parsers:

    String input = "PRODUCT,,1/2\" 18V CORDLESS XRP LI-LON DRILL/DRIVE,P,2510906459,,DEWALT TOOLS,,,<br><img src=\"http://example.com/image.png\"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2\" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED  - EQUAL TO 115-DCD960KL,";
    Reader reader = new StringReader(input);

    CsvParserSettings settings = new CsvParserSettings(); //many options here, check the tutorial.
    settings.setNullValue("<BLANK>"); //use that to obtain <BLANK> to represent nulls

    String[] row = new CsvParser(settings).parseAll(reader).get(0);
    for(String element : row){
        System.out.println(element);
    }

Output:

PRODUCT
<BLANK>
1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE
P
2510906459
<BLANK>
DEWALT TOOLS
<BLANK>
<BLANK>
<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED  - EQUAL TO 115-DCD960KL
<BLANK>

Disclaimer: I'm the author of this library, it's open-source and free (Apache 2.0 license)

Jeronimo Backes
  • 5,701
  • 2
  • 20
  • 28
  • 1
    Hi @JeronimoBackes This was my dream API for parsing my csv. Thank you very much. – Shaik Allabakash Mar 28 '16 at 16:35
  • 1
    Thank you. Saved immense amount of time. none of the other parsers were able to detect the multi line scenario where this worked like charm – ashoka Feb 15 '17 at 12:26
1

Try following regex:

Stream<String> lines = Files.lines(Paths.get("path to csv file"));

Pattern regex = Pattern.compile("\"(.*?)\"(?=,|$)|(?<=(?:,|^))(.*?)(?=,|$)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

lines.forEach( line -> {
    Matcher matcher = regex.matcher(line);
    while (matcher.find()) {
        String content = matcher.group(1) == null ? matcher.group() : matcher.group(1);
        System.out.println(content);
    }
});

Based on sample input text

1,"hello, there",I have a csv in which,
"only when ""double quote"" or comma are there in the content",
it will be wrapped in the double quotes,otherwise not,
something like 1/2" will not be wrapped up in double quotes.

It will emit.

1
hello, there
I have a csv in which
only when ""double quote"" or comma are there in the content
it will be wrapped in the double quotes
otherwise not
something like 1/2" will not be wrapped up in double quotes.
Saleem
  • 7,965
  • 2
  • 16
  • 31
  • @saleem-mirza Thanks for the answer. It is working like charm. The only thing that is missing is that it is not returning the last item. Can you please suggest some changes? – Shaik Allabakash Mar 27 '16 at 06:47
  • Also, the doubled-up double-quotes are coming through as-is: not being reduced to non-escaped double-quotes – Darius X. Mar 27 '16 at 17:51
  • Guys, check out updated code. hopefully it will work for you. – Saleem Mar 27 '16 at 18:32