5

Is there an easy way to parse quoted text into strings in Java? I have lines like these to parse:

author="Tolkien, J.R.R." title="The Lord of the Rings"
publisher="George Allen & Unwin" year=1954 

and all I want is Tolkien, J.R.R., The Lord of the Rings, George Allen & Unwin, and 1954 as separate strings.

Ryan Amos
david
  • You could try using a Javascript [regular expression](http://www.regular-expressions.info/javascript.html). – Wylie Aug 27 '11 at 03:00
  • @Ryan: My mistake. But a regular expression would certainly be one way to do it. – Wylie Aug 27 '11 at 03:17
  • @David Can you please tell us which answer helped you most by selecting it as the best answer? – Ryan Amos Aug 30 '11 at 22:33

3 Answers

5

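You could use a regex like

"([^"]*)"

It matches a quote, then any run of characters that are not quotes, then the closing quote. (A greedy "(.+)" would run from the first quote on the line to the last one, which goes wrong as soon as a line holds more than one quoted value.) In Java that would be:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher("author=\"Tolkien, J.R.R.\"");
while (m.find()) {
  System.out.println(m.group(1)); // text between the quotes
}

Note that group(1) is used: it is the first capture group, i.e. the text matched inside the parentheses. group(0) is the whole match, including the quotes.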
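For the lines in the question, which hold several quoted values plus an unquoted year, the same idea can be stretched a little. The following is only a sketch under assumptions: the class name QuotedValueExtractor, the method extractValues and the exact pattern are made up for illustration, keys are assumed to be word characters, and escaped quotes inside values are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name and the pattern are illustrative assumptions.
final class QuotedValueExtractor {

    // key="quoted value" (group 1) or key=bareValue (group 2)
    private static final Pattern FIELD =
            Pattern.compile("\\w+=(?:\"([^\"]*)\"|(\\S+))");

    static List<String> extractValues(String line) {
        List<String> values = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Only one of the two groups takes part in any given match.
            values.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "author=\"Tolkien, J.R.R.\" title=\"The Lord of the Rings\" "
                + "publisher=\"George Allen & Unwin\" year=1954";
        for (String value : extractValues(input)) {
            System.out.println(value);
        }
    }
}
```

The alternation is there so that year=1954 is picked up even though it has no quotes; if unquoted values never occur, the simpler quote-only pattern is enough.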

Of course, if a string is nothing but the quoted value, you could also use substring to drop the first and last characters:

String quoted = "\"Tolkien, J.R.R.\"";
String unquoted;
// Only strip when the string both starts and ends with a quote.
if (quoted.length() >= 2 && quoted.startsWith("\"") && quoted.endsWith("\"")) {
    unquoted = quoted.substring(1, quoted.length() - 1);
} else {
    unquoted = quoted;
}
Benjamin Udink ten Cate
  • I agree regex is a pain in the ass, but if you work with strings often, it is definitely worth looking into. A good place to start is the Perl regex docs: http://perldoc.perl.org/perlre.html – Benjamin Udink ten Cate Aug 27 '11 at 04:37
  • You can greatly simplify your code by *not* capturing the surrounding double quotes: use lookbehind and lookahead – Bohemian Aug 27 '11 at 04:39
3

There are some fancy pattern regex nonsense things that fancy people and fancy programmers like to use.

I like to use String.split(). It's a simple function and does what you need it to do.

So if I have a String word: "hello" and I want to take out "hello", I can simply do this:

myStr = string.split("\"")[1];

This will cut the string into bits based on the quote marks.

If I want to be more specific, I can do

myStr = string.split("word: \"")[1].split("\"")[0];

That way I cut it with word: " and "

Of course, you run into problems if word: " appears more than once, which is what patterns are for. I don't think you'll have to deal with that problem for your specific question.

Also, be cautious around characters like . that have a special meaning in regular expressions. split() treats its argument as a regex, so those characters will trigger funny behavior. Escaping them with a backslash, written "\\" inside a Java string literal, makes split treat them literally.
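To illustrate the regex caveat, here is a small sketch. The class LiteralSplit and its method splitLiteral are hypothetical names; Pattern.quote is the standard library call that turns an arbitrary string into a regex matching it literally, which avoids escaping each character by hand.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing split() with a delimiter that is taken literally.
final class LiteralSplit {

    static String[] splitLiteral(String text, String delimiter) {
        // Pattern.quote wraps the delimiter so regex metacharacters lose their meaning.
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // Unescaped: "." matches every character, so nothing useful is left.
        System.out.println("a.b.c".split(".").length);
        // Quoted: the dot is an ordinary delimiter.
        System.out.println(splitLiteral("a.b.c", ".").length);
    }
}
```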

Best of luck!

Ryan Amos
1

Can you presume your document is well-formed and does not contain syntax errors? If so, you are simply interested in every other token after using String.split().
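A rough sketch of the "every other token" idea follows. It assumes well-formed input with balanced, unescaped double quotes; the class SplitOnQuotes and the method quotedTokens are names invented for this example. Note that unquoted values such as year=1954 are skipped by this approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps the tokens that sat between pairs of quotes.
final class SplitOnQuotes {

    static List<String> quotedTokens(String line) {
        // The -1 limit keeps trailing empty strings, so an empty "" value at the end survives.
        String[] parts = line.split("\"", -1);
        List<String> values = new ArrayList<>();
        // With balanced quotes, odd indices hold the text inside the quotes.
        for (int i = 1; i < parts.length; i += 2) {
            values.add(parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "author=\"Tolkien, J.R.R.\" title=\"The Lord of the Rings\"";
        for (String value : quotedTokens(line)) {
            System.out.println(value);
        }
    }
}
```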

If you need something more robust, you may need to use the Scanner class (or a StringBuffer and a for loop ;-)) to pick out the valid tokens, taking into account additional criteria beyond "I saw a quotation mark somewhere".

For example, here are some reasons you might need something more robust than splitting the string blindly on quotation marks. Perhaps it's only a valid token if the opening quotation mark comes immediately after an equals sign. Perhaps you need to handle unquoted values as well as quoted ones. Will \" need to be handled as an escaped quotation mark, or does it count as the end of the string? Can values use either single or double quotes (e.g. HTML), or will they always be correctly formatted with double quotes?

One robust way would be to think like a compiler and use a Java based Lexer (such as JFlex), but that might be overkill for what you need.

If you prefer a low-level approach, you could iterate through your input character by character using a while loop. When you see an =", start copying the characters into a StringBuffer until you find another non-escaped ", then either concatenate the parsed value to your output or add it to a List of some sort (depending on what you plan to do with your data). Then continue reading until you encounter your start token (e.g. =") again, and repeat.
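The loop described above might look roughly like this. It is a sketch under assumptions: the class QuoteScanner and the method scan are hypothetical names, a backslash is assumed to escape the character after it, values with no closing quote are dropped, and StringBuilder is swapped in for StringBuffer since no thread safety is needed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical character-by-character scanner for values that follow =".
final class QuoteScanner {

    static List<String> scan(String input) {
        List<String> values = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            boolean startToken = input.charAt(i) == '='
                    && i + 1 < input.length()
                    && input.charAt(i + 1) == '"';
            if (!startToken) {
                i++;
                continue;
            }
            i += 2; // skip past ="
            StringBuilder value = new StringBuilder();
            boolean closed = false;
            while (i < input.length() && !closed) {
                char c = input.charAt(i);
                if (c == '\\' && i + 1 < input.length()) {
                    // Assumed escape rule: keep the character after the backslash as-is.
                    value.append(input.charAt(i + 1));
                    i += 2;
                } else if (c == '"') {
                    closed = true; // non-escaped quote ends the value
                    i++;
                } else {
                    value.append(c);
                    i++;
                }
            }
            if (closed) {
                values.add(value.toString());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        for (String value : scan("title=\"The \\\"One\\\" Ring\" year=1954")) {
            System.out.println(value);
        }
    }
}
```

Collecting into a List keeps the values separate; joining them into one string afterwards is a one-liner with String.join if that is what you need.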

Community
Jessica Brown