The short and immediate version of the question is: why are these two regexes different? i.e.,

href=(['"]).+?\1

vs

href=(['"]).+?['"] or href=(['"]).+?(['"])


I am practicing regex on this site and I am trying to solve this level

http://play.inginf.units.it/#/level/6

I am posting the entire content here in case the site goes down in the future.

           <tr>
                          <a href="javascript:openurl('/Xplore/accessinfo.jsp')" class="topUnderlineLinks">
                                            <A href="/iel5/4235/4079606/04079617.pdf?tp=&arnumber=4079617&isnumber=4079606" class="bodyCopy">PDF</A>(3141 KB)&nbsp;
                        <A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
                <td width="33%" ><div align="right"> <a href="/xplorehelp/Help_start.html#Help_searchresults.html" class="subNavLinks" target="blank">Help</a>&nbsp;&nbsp;&nbsp;<a href="/xpl/contactus.jsp" class="subNavLinks">Contact
Kimya ile ilgili çeşitli temel referans
<a href="http://search.epnet.com/login.asp?profile=web&amp;defaultdb=geh"
<a href="http://iimpft.chadwyck.com/" target="_parent">International
<a href="standartlar.html#tse" target="_parent">NFPA Standartları</a>
<a href="http://www.gutenberg.org/" target="_parent">Project Gutenberg</a>
<a href="http://proquestcombo.safaribooksonline.com/?portal=proquestcombo&amp;uicode=istanbultek"
<a href="http://www.scitation.org" target="_parent">Scitation</a>
dergilerin listesini görmek için <a href="/online/aip.html">bu yolu</a>
<a href="http://www3.interscience.wiley.com/journalfinder.html"
               <td width="46%"><a href="/xpl/periodicals.jsp" class="dropDownNav" accesskey="j">Journals &amp; Magazines
               <td><a href="http://www.ieee.org/products/onlinepubs/resources/XploreTutorial.pdf" class="dropDownNav">IEEE Xplore Demo</a></td>
                          &nbsp;|&nbsp;&nbsp; <a href="/xpl/tocalerts_signup.jsp" class="topUnderlineLinks">Alerts</a>
                        <A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
                                    <a href="/search/srchabstract.jsp?arnumber=1554748&isnumber=33079&punumber=10417&k2dockey=1554748@ieeecnfs&query=%28+grammatical+evolution%3Cin%3Eti+%29&pos=9" class="bodyCopy">Abstract</a>
                                          <td><a href="history.jsp">View Session History</a></td>
                                          <td><a href="advsearch.jsp">New Search</a></td>
<a href="http://web5s.silverplatter.com/webspirs/start.ws?customer=kaynak"
<a href="standartlar.html#tse">Türk Standartları</a>
<a href="http://isiknowledge.com" target="_parent">Web of Science</a>
<a href='deneme.html#bg'>Butler Group </a>veritabanına 31 Mart 2007 tarihine kadar deneme erişimi alınmıştır. &nbsp;<span class="tarih">(19.03.2007)</span> 
<a href='deneme.html#ps'>Productscan</a> veritabanına 31 Mart 2007 tarihine kadar deneme erişimi alınmıştır. &nbsp;<span class="tarih">(19.03.2007)</span> 

I am supposed to match text like this

href="history.jsp"

That is, I need to match every href attribute in the text above.

Now according to Solutions, it seems like the answer for this is href=(['"]).+?\1

But if I drop that final backreference and repeat the group instead (I hope the parenthesized part is called a group; correct me if I am wrong), why do I get different results? That is, I get wrong results if I use href=(['"]).+?['"] or href=(['"]).+?(['"])

theprogrammer
  • @Ivar Thanks, Ivar. Actually I just checked that question. In fact I posted a comment there before asking this question since I wasn't able to figure this out just by looking at the definition of backreference. – theprogrammer Jun 06 '20 at 23:36
  • I was about to add an additional comment to explain it for your use case, but since you already received an answer it wouldn't add much value. – Ivar Jun 06 '20 at 23:39

2 Answers

The backreference has to match the same thing that the capture group matched. So the first regexp will match

"abcd"

or

'abcd'

The second version doesn't link the two ends of the match, so it will match the following as well:

"abcd'

or

'abcd"

So the version with the back-reference only matches a string surrounded by the same type of quote on both ends.
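
To check the four cases above yourself, here is a minimal sketch in Python (my choice of language; the question doesn't name one). The names same_quote, any_quote and samples are made up for the illustration, and fullmatch is used so that each sample has to be consumed from its first quote to its last.

```python
import re

# Closing quote must be the same character that group 1 captured.
same_quote = re.compile(r"""(['"]).+?\1""")
# Closing quote may be either kind, regardless of the opening one.
any_quote = re.compile(r"""(['"]).+?['"]""")

samples = ['"abcd"', "'abcd'", '"abcd\'', '\'abcd"']

# Map each sample to (matches with back-reference, matches without).
results = {
    s: (bool(same_quote.fullmatch(s)), bool(any_quote.fullmatch(s)))
    for s in samples
}

for s, (with_backref, without_backref) in results.items():
    print(f"{s:8} backref={with_backref} no_backref={without_backref}")
```

Only the two mixed-quote samples should differ between the two patterns.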

This difference is important if you have embedded quotes in a string, e.g.

some text "<div id='foo'>" more text

The version with the back-reference will match "<div id='foo'>", but the version without the back-reference will match "<div id='.
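
The embedded-quote case can be reproduced with a short sketch, again assuming Python's re module; the variable names are only for illustration.

```python
import re

text = """some text "<div id='foo'>" more text"""

# With the back-reference, the lazy .+? keeps growing until it
# reaches the same quote character that opened the match.
with_backref = re.search(r"""(['"]).+?\1""", text).group(0)

# Without it, the match stops at the first quote of either kind.
without_backref = re.search(r"""(['"]).+?['"]""", text).group(0)

print(with_backref)     # "<div id='foo'>"
print(without_backref)  # "<div id='
```

The lazy quantifier is the same in both patterns; what changes is which character is allowed to end the match.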

Barmar
  • Thanks a lot, this makes perfect sense. I can now see the advantage of backreference here since it remembers(or has memory) of what was captured before that is either a single quote or a double quote. – theprogrammer Jun 06 '20 at 23:37

The regex snippet (['"]).+?\1 captures the opening quote with (...) and then requires that same character again through the back-reference \1. That means 'xyzzy' or "plugh" will match, but 'twisty" will not.

That's probably the correct form, since (['"]).+?['"] lets the match open with one kind of quote and close with the other.


As an aside, there's little point capturing the groups in your latter expression, unless you're going to use them in the code somehow. If you capture both, you could check to ensure they're identical but that's probably best handled by the use of the back-reference version.

In other words, if you wanted to allow something like 'twisty", all you need is ['"].+?['"].
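
As a rough illustration, here are all three forms run against the first link from the question's sample text. This is a sketch assuming Python's re module; the dictionary keys are just labels I picked.

```python
import re

# First link from the sample text in the question.
line = '''<a href="javascript:openurl('/Xplore/accessinfo.jsp')" class="topUnderlineLinks">'''

patterns = {
    "backref": r"""href=(['"]).+?\1""",
    "two_groups": r"""href=(['"]).+?(['"])""",
    "no_groups": r"""href=['"].+?['"]""",
}

# group(0) is the whole match, whether or not the pattern has groups.
matches = {name: re.search(p, line).group(0) for name, p in patterns.items()}

for name, matched in matches.items():
    print(f"{name:10} {matched}")
```

The two forms without the back-reference match exactly the same text, cut off at the first embedded single quote, so the extra parentheses buy nothing. In Python specifically, re.findall returns the captured groups rather than the whole match when the pattern contains groups, which is another reason to drop groups you don't use.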

paxdiablo