
I want to construct a regex that matches either ' or ", then matches other characters, and ends when the same kind of quote that opened the match is encountered again. This appears simple enough to solve with a backreference at the end; here is the regex (it's in Java, so mind the extra escape characters such as the \ before the "):

private static String seekerTwo = "(['\"])([a-zA-Z])([a-zA-Z0-9():;/`\\=\\.\\,\\- ]+)(\\1)";

This code will successfully deal with things such as:

"hello my name is bob"
'i live in bethnal green'

The trouble comes when I have a String like this:

"hello this seat 'may be taken' already"

Using the above regex on it fails at the opening " because the attempt stops at the first ', which is neither in the character set nor equal to the opening quote. The search then continues and successfully matches 'may be taken'. This is obviously insufficient; I need the whole String to be matched.
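The behaviour described above can be reproduced with a small harness. This is only a sketch: the class name OriginalPatternDemo and the helper findAll are made up for illustration, and the pattern is the one from the question, unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginalPatternDemo {
    // The pattern from the question, copied as-is.
    static final Pattern SEEKER_TWO = Pattern.compile(
        "(['\"])([a-zA-Z])([a-zA-Z0-9():;/`\\=\\.\\,\\- ]+)(\\1)");

    // Hypothetical helper: collects every substring the pattern finds in the input.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = SEEKER_TWO.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // A string with only one kind of quote is matched as a whole.
        System.out.println(findAll("\"hello my name is bob\""));
        // With a nested quote, only the inner part is found.
        System.out.println(findAll("\"hello this seat 'may be taken' already\""));
    }
}
```

For the second input, the only match found is 'may be taken', which is the problem in a nutshell.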

What I think I need is a way to allow the type of quotation mark that was NOT matched in the very first group, by including it as a character in the character set of the 3rd group. However, I know of no way to do this. Is there some sort of sneaky NOT backreference function or something, which I can use to reference the character that the 1st group did NOT match? Or otherwise some other solution to my predicament?

flamming_python
  • Hi and welcome to StackOverflow. I have taken the liberty to reformat your post a little. You can click on the edit link to see how I did this. Very important to know if you need to post code... – Tim Pietzcker Mar 15 '12 at 11:15

2 Answers


This can be done using negative lookahead assertions. The following solution even takes into account that you could escape a quote inside a string:

(["'])(?:\\.|(?!\1).)*\1

Explanation:

(["'])    # Match and remember a quote.
(?:       # Either match...
 \\.      # an escaped character
|         # or
 (?!\1)   # (unless that character is identical to the quote character in \1)
 .        # any character
)*        # any number of times.
\1        # Match the corresponding quote.

This correctly matches "hello this seat 'may be taken' already" or "hello this seat \"may be taken\" already".

In Java, with all the backslashes:

import java.util.regex.Pattern;

Pattern regex = Pattern.compile(
    "([\"'])   # Match and remember a quote.\n" +
    "(?:       # Either match...\n" +
    " \\\\.    # an escaped character\n" +
    "|         # or\n" +
    " (?!\\1)  # (unless that character is identical to the matched quote char)\n" +
    " .        # any character\n" +
    ")*        # any number of times.\n" +
    "\\1       # Match the corresponding quote",
    Pattern.COMMENTS);
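To try the pattern out, something like the following could be used. It is a minimal sketch: the class name QuoteMatcherDemo and the helper firstMatch are illustrative names, and the regex is the same one written on a single line without comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteMatcherDemo {
    // Same regex as above in compact form: (["'])(?:\\.|(?!\1).)*\1
    static final Pattern QUOTED = Pattern.compile("([\"'])(?:\\\\.|(?!\\1).)*\\1");

    // Hypothetical helper: returns the first quoted string in the input, or null if none.
    static String firstMatch(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The other kind of quote nested inside the string.
        String nested = "\"hello this seat 'may be taken' already\"";
        // The same kind of quote, escaped with a backslash.
        String escaped = "\"hello this seat \\\"may be taken\\\" already\"";
        System.out.println(firstMatch(nested));
        System.out.println(firstMatch(escaped));
    }
}
```

Both inputs should come back unchanged, i.e. the whole string is the match in each case.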
Tim Pietzcker
  • Excellent work there Tim, and thank you for editing my post. Thanks to your suggestion, with a bit of work I modified my code thus: "(['\"])([a-zA-Z])((?!\\1)[a-zA-Z0-9():;/`'\"\\=\\.\\,\\- ])+(\\1)" so your solution was actually simple enough and perfectly effective: add the equivalent of a regex if statement before the main character set, which will skip right to the last group, and add both types of quotes to the main character set. This way, if the quote char found at the start is found again at any time, the regex will terminate and return. Nice. – flamming_python Mar 15 '12 at 11:36

Tim's solution works fairly well if you can use lookaround (which Java does support). But if you should find yourself using a language or tool that does not support lookaround, you could simply match both cases (double quoted strings and single quoted strings) separately:

"(\\"|[^"])*"|'(\\'|[^'])*'

This matches each case separately, but returns whichever case matched as the whole match.
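In Java the alternation would need its quotes and backslashes escaped. A rough sketch, with the class name AlternationDemo and the helper firstMatch invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // The lookaround-free alternation from above, written as a Java string literal.
    static final Pattern QUOTED =
        Pattern.compile("\"(\\\\\"|[^\"])*\"|'(\\\\'|[^'])*'");

    // Hypothetical helper: returns the first quoted string in the input, or null if none.
    static String firstMatch(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Double-quoted string containing single quotes.
        System.out.println(firstMatch("\"hello this seat 'may be taken' already\""));
        // Single-quoted string containing double quotes.
        System.out.println(firstMatch("'i live in \"bethnal\" green'"));
    }
}
```

Note that the whole match is in m.group() (group 0); the numbered groups here only hold the last character matched inside the quotes, so they are not useful for extracting the content.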


HOWEVER

Both approaches (Tim's and mine) can fall prey to at least one eventuality: apostrophes in ordinary prose. If you don't look closely, you may think there should be two matches in this excerpt:

He turned to get on his bike. "I'll see you later, when I'm done with all this" he said, looking back for a moment before starting his journey. As he entered the street, one of the city's trolleys collided with Mike's bicycle. "Oh my!" exclaimed an onlooker.

...but there are three matches, not two:

"I'll see you later, when I'm done with all this"
's trolleys collided with Mike'
"Oh my!"

and this excerpt contains only ONE match:

The fight wasn't over yet, though. "Hey!" yelled Bob. "What do you want?" I retorted. "I hate your guts!" "Why would I care?" "Because I love you!" "You do?" Bob paused for a moment before whispering "No, I couldn't love you!"

can you find that one? :D

't over yet, though. "Hey!" yelled Bob. "What do you want?" I retorted. "I hate your guts!" "Why would I care?" "Because I love you!" "You do?" Bob paused for a moment before whispering "No, I couldn'

I would recommend (if you are up for using lookaround) that you consider doing some extra checking, such as a positive lookbehind for whitespace or similar before the first quote, to make sure you don't match things like 's trolleys collided with Mike'. That said, I wouldn't put much money on any solution without a lot of testing first. Adding (?<=\s|^) to the beginning of either expression will avoid the above cases, i.e.:

(?<=\s|^)(["'])(?:\\.|(?!\1).)*\1                    #based on Tim's

or

(?<=\s|^)("(\\"|[^"])*"|'(\\'|[^'])*')               #based on my alternative
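The effect of the lookbehind can be checked on the first excerpt. The sketch below compares Tim's pattern with and without the (?<=\s|^) prefix; the class name LookbehindDemo, the constants, and the helper findAll are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Tim's pattern, and the same pattern guarded by (?<=\s|^).
    static final Pattern PLAIN =
        Pattern.compile("([\"'])(?:\\\\.|(?!\\1).)*\\1");
    static final Pattern GUARDED =
        Pattern.compile("(?<=\\s|^)([\"'])(?:\\\\.|(?!\\1).)*\\1");

    // The first excerpt from this answer.
    static final String BIKE =
        "He turned to get on his bike. \"I'll see you later, when I'm done with all this\" "
        + "he said, looking back for a moment before starting his journey. "
        + "As he entered the street, one of the city's trolleys collided with Mike's bicycle. "
        + "\"Oh my!\" exclaimed an onlooker.";

    // Hypothetical helper: collects every match of the pattern in the input.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String s : findAll(PLAIN, BIKE)) {
            System.out.println("plain:   " + s);
        }
        for (String s : findAll(GUARDED, BIKE)) {
            System.out.println("guarded: " + s);
        }
    }
}
```

The plain pattern reports three matches, including 's trolleys collided with Mike', while the guarded one reports only the two spoken phrases. This only covers apostrophes that follow a letter; a quote preceded by punctuation such as an opening parenthesis would also be skipped by the guarded version, so the lookbehind may need widening for real input.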

I'm not sure how efficient lookaround is compared to non-lookaround, so the two above may be equivalent, or one may be more efficient than the other (?)

Code Jockey
  • Some good points here Code Jockey, and indeed parsing English text this way would not be wise. However, I am actually attempting to parse Russian text in MySQL code (I changed the а-яА-ЯёЁ to a-zA-Z in my code above, so that people here would be able to grasp the meaning), and when parsing Strings in code they are of course always guaranteed to be enclosed by one type of quotation mark or the other. – flamming_python Mar 15 '12 at 15:41