
I have a problem splitting a string in Java.

Input string:

"retinol,\"3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid\",C034534,81485-25-8,\"Carcinoma, Hepatocellular\",MESH:D006528,Cancer|Digestive system disease,,17270033,therapeutic";

I want to split it and get the following terms:

  1. retinol
  2. 3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid
  3. C034534
  4. 81485-25-8
  5. Carcinoma, Hepatocellular
  6. MESH:D006528
  7. Cancer|Digestive system disease
  8. (nothing)
  9. 17270033
  10. therapeutic

I tried a few approaches, such as Pattern/Matcher and split(","), but I couldn't find a solution.

  • Could you post those attempts with pattern/matcher and split in your question? – Jerry Dec 31 '13 at 17:58
  • Per @Jerry's comment: you must post your attempted solutions and ask questions about those issues. You cannot ask users of this site to do your work for you. – maerics Dec 31 '13 at 18:01
  • What are the elements that occur in every instance of your data? MESH? What else? – Kovács Imre Dec 31 '13 at 18:01
  • String pattern = "(.*?),\\\"(.*?)\\\",(.*?),(.*?),\\\"(.*?)\\\",(.*?),(.*?),"; I used this pattern, but it is useless as soon as a line with a different layout comes up. – Brandon Rubinsky Dec 31 '13 at 18:02
  • Looks like standard CSV, no? – Dave Newton Dec 31 '13 at 18:02
  • Yeah, that's true. I'm extracting information from a CSV file. – Brandon Rubinsky Dec 31 '13 at 18:03
  • Then the best thing for you to do is to use a library to parse the CSV for you. [See this question.](http://stackoverflow.com/questions/4751539/parsing-a-csv-file-in-java) – Rick Hanlon II Dec 31 '13 at 18:04
  • Not sure if you've seen it yet, but you might also consider looking at [this Stack Overflow](http://stackoverflow.com/questions/3481828/how-to-split-a-string-in-java) question. – acrognale Dec 31 '13 at 18:07
  • Thank you all very much, I think this might help me. I'm trying your hints now. – Brandon Rubinsky Dec 31 '13 at 18:07
  • Your best bet (especially in the long run) is to try a library specifically written for parsing CSV, such as [OpenCSV](http://opencsv.sourceforge.net/). You could use a 'quick and dirty' way, perhaps of [this](http://ideone.com/XdhH9g) form, but it might not work for all your data. – Jerry Dec 31 '13 at 18:12
  • Thank you again, all of you. Each of your comments was an answer in itself. God bless you, and happy new year. – Brandon Rubinsky Jan 01 '14 at 06:12
  • I used CSVlibrary and it is very nice and easy. – Brandon Rubinsky Jan 01 '14 at 06:29

1 Answer

As discussed in the comments, since you're parsing a CSV file, you'll want to use a library specifically written to parse CSV. Otherwise you'll keep running into problems where what you write is "useless when a different pattern comes out" (as you said).

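However, to solve the question at hand, you just have to split on commas while ignoring commas inside quotes. The lookahead in the regex below matches a comma only when it is followed by an even number of quote characters, which means the comma sits outside any quoted field. So you can do this (from this answer):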

String input = "retinol,\"3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid\",C034534,81485-25-8,\"Carcinoma, Hepatocellular\",MESH:D006528,Cancer|Digestive system disease,,17270033,therapeutic";
String[] output = input.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

for(String s : output){
    System.out.println(s);
}

This will give you this output (note the quotes and empty line):

retinol
"3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid"
C034534
81485-25-8
"Carcinoma, Hepatocellular"
MESH:D006528
Cancer|Digestive system disease

17270033
therapeutic

You can strip the quotes and skip the empty field as you wish. The loop below does both. Note that because it drops the empty field, the numbering ends at 9 rather than the 10 items listed in the question:

int i=1;
for(String s : output){
    if(!s.isEmpty()){
        System.out.println(i++ + ". " + s.replace("\"", ""));
    }
}

Output:

  1. retinol
  2. 3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid
  3. C034534
  4. 81485-25-8
  5. Carcinoma, Hepatocellular
  6. MESH:D006528
  7. Cancer|Digestive system disease
  8. 17270033
  9. therapeutic
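
If the empty field should be kept so that the numbering matches the ten items listed in the question, a variation along these lines might work. This is only a sketch: the class name `NumberedFields`, the method `numberedFields`, and the `(nothing)` placeholder text are made up for illustration, and it relies on the same regex, so it shares its limitations.

```java
import java.util.ArrayList;
import java.util.List;

public class NumberedFields {
    // Same lookahead regex as above: a comma followed by an even number of quotes.
    private static final String SPLIT_REGEX = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: splits one line and numbers every field, empty ones included.
    static List<String> numberedFields(String line) {
        // A limit of -1 keeps trailing empty fields, which split() drops by default.
        String[] fields = line.split(SPLIT_REGEX, -1);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            String value = fields[i].replace("\"", "");
            out.add((i + 1) + ". " + (value.isEmpty() ? "(nothing)" : value));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "retinol,\"3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid\",C034534,81485-25-8,\"Carcinoma, Hepatocellular\",MESH:D006528,Cancer|Digestive system disease,,17270033,therapeutic";
        for (String s : numberedFields(input)) {
            System.out.println(s);
        }
    }
}
```

The `-1` limit only makes a difference when a line ends with one or more empty fields. Without it, Java's `split` silently drops trailing empty strings, which would shift the field count for such lines.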

But, please, use a library like OpenCSV.
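
To give a rough idea of what a CSV library takes care of beyond the regex, here is a minimal hand-written sketch of a quote-aware parser for a single line. The class `SimpleCsvLine` and its `parse` method are hypothetical names, not part of OpenCSV or any other library. It assumes comma separators, double-quote quoting, and a doubled quote (`""`) as an escaped literal quote inside a quoted field. It does not handle fields that span multiple lines, so treat it as an illustration rather than a replacement for a library.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleCsvLine {
    // Hypothetical helper: parses one CSV line into its fields, removing the surrounding quotes.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are ordinary characters
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // separator outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        for (String field : parse("retinol,\"Carcinoma, Hepatocellular\",,\"say \"\"hi\"\"\"")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Compared with the regex plus `replace("\"", "")` approach, this keeps literal quote characters that appear escaped inside a quoted field, whereas stripping every quote after splitting would lose them.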

Rick Hanlon II