Regex ignore matches between – George Reith Sep 21 '12 at 14:59

  • @dan1111 2) The plugin author said he would update the plugin [here](http://wordpress.org/support/topic/plugin-relevanssi-a-better-search-highlights-happening-inside-script-tags) provided the correct reg-ex. 3) I updated the post with the function it originates from. – George Reith Sep 21 '12 at 15:01
  • Try to combine //iu with subpattern negation ?! to get a single working regex. I suspect the resulting pattern will be very ugly. – Martin Sep 21 '12 at 15:03
  • @Martin doesn't have to be a single line of reg-ex, if it is possible to remove the ` – George Reith Sep 21 '12 at 15:08
  • @GeorgeReith - put your script in a separate js file - there is a quick fix – Scott Selby Sep 21 '12 at 15:10
  • @ScottSelby It is not that simple and I do use separate javascript files. The inline javascript comes from a video-player plugin which inserts a video player into the post. If you arrive on a page from a certain search term the video doesn't play because the url in the inline javascript gets skewed. Here's the page http://rcnhca.org.uk/sites/first_steps/quality/accountability-and-delegation/accountability/ search for "accountability" and select that page from the list and then try and play the video (relies on HTTP referrer). – George Reith Sep 21 '12 at 15:12
4 Answers

    2

    The most accurate approach is to:

    • Parse the HTML with a proper HTML parser
    • Ignore the strings that are within the <script> tags.

    You don't want to try parsing HTML with regular expressions. Here's an explanation of why: http://htmlparsing.com/regexes.html

    It will make you sad in the long run. Please take a look at the rest of http://htmlparsing.com/ for some pointers that could get you started.
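
    A minimal sketch of the parser-based approach, assuming PHP's bundled DOM extension is available. The helper name `highlight_outside_scripts` and the wrapper id `hl-root` are made up for illustration and are not part of the plugin. It highlights the term only in text nodes that have no `<script>` or `<style>` ancestor:

```php
<?php
// Hypothetical helper (not part of the plugin): wraps $term in <em> tags,
// but only in text nodes that are not inside <script> or <style>.
function highlight_outside_scripts(string $html, string $term): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    // The XML prefix hints UTF-8 to libxml; the wrapper div marks our fragment.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $t = preg_quote($term, '/'); // escape regex metacharacters in the term
    $pattern = '/(\b' . $t . '|' . $t . '\b)/iu';
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';

    // Copy the node list first, because the tree is modified while looping.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // term not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) { // odd indexes hold the captured term
                $em = $doc->createElement('em');
                $em->appendChild($doc->createTextNode($part));
                $fragment->appendChild($em);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper div.
    $out = '';
    foreach ($xpath->query('//div[@id="hl-root"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlight_outside_scripts('<p>bob says hi</p><script>var bob = 5;</script>', 'bob'), "\n";
```

    The serializer may normalize whitespace or entities, so the output is not guaranteed to be byte-identical to the input outside the highlighted terms.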

    Andy Lester
    • This isn't possible in the context of my question. This is part of an already built plug-in for the WordPress CMS. I am not rewriting the plug-in; I am attempting to modify a small part of its behaviour. The regex just looks for a certain word in the page to highlight; it already ignores HTML but it doesn't ignore the contents of ` – George Reith Sep 21 '12 at 14:45
    1

    Since lookbehind assertions need to be fixed in length, you cannot use them to look for a preceding <script> tag somewhere before the searched term.

    So, after you replace all the occurrences of the desired term, you need a second pass that reverts the replacements that ended up inside a <script> tag.

    # provide some sample data
    $excerpt = 'My name is bob!
    
    And bob is cool.
    
    <script type="text/javascript">
    var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
    alert(bobby);
    
    var bob = 5;
    </script>
    
    Yeah, the word "bob" works fine.';
    
    $start_emp_token = '<em>';
    $end_emp_token = '</em>';
    $pr_term = 'bob';
    
    # replace everything (not in a tag)
    $excerpt = preg_replace("/(\b$pr_term|$pr_term\b)(?!([^<]+)?>)/iu", $start_emp_token . '$1' . $end_emp_token, $excerpt);
    
    # undo some of the replacements
    $excerpt = preg_replace_callback('#(<script(?:[^>]*)>)(.*?)(</script>)#is',
                           # a closure replaces create_function(), which was removed in PHP 8.0
                           function ($matches) use ($start_emp_token, $end_emp_token, $pr_term) {
                             # strip the highlight tokens from the script body only
                             return $matches[1].str_replace("$start_emp_token$pr_term$end_emp_token", "$pr_term", $matches[2]).$matches[3];
                           },
                           $excerpt);
    
    var_dump($excerpt);
    

    The code above produces the following output:

    string(271) "My name is <em>bob</em>!
    
    And <em>bob</em> is cool.
    
    <script type="text/javascript">
    var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
    alert(bobby);
    
    var bob = 5;
    </script>
    
    Yeah, the word "<em>bob</em>" works fine."
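
    One caveat: the first pass is case-insensitive (`/i`), while the `str_replace` in the second pass is case-sensitive, so a highlighted `Bob` inside a script block would not be reverted. A hedged variant of the second pass that ignores case is sketched below; the helper name `unhighlight_in_scripts` and its default token arguments are assumptions for illustration:

```php
<?php
// Hypothetical helper: removes highlight tokens around $term inside <script>
// blocks, regardless of the term's letter case.
function unhighlight_in_scripts(string $html, string $term, string $open = '<em>', string $close = '</em>'): string
{
    // Matches e.g. <em>Bob</em> and captures the term as it was written.
    $tagged = '#' . preg_quote($open, '#') . '(' . preg_quote($term, '#') . ')' . preg_quote($close, '#') . '#iu';

    return preg_replace_callback(
        '#(<script\b[^>]*>)(.*?)(</script>)#is',
        function (array $m) use ($tagged) {
            // Only the script body ($m[2]) is rewritten.
            return $m[1] . preg_replace($tagged, '$1', $m[2]) . $m[3];
        },
        $html
    );
}

echo unhighlight_in_scripts('<em>Bob</em> <script>var <em>Bob</em> = 1;</script>', 'bob'), "\n";
```

    Like the original second pass, this also strips a highlight that was already present in the script before the first pass ran, unless the first pass wrapped it a second time.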
    
    Kouber Saparev
    • Thanks, but it doesn't appear to do anything. – George Reith Sep 25 '12 at 12:39
    • Your regex is successfully matching the `script` tags but it isn't removing the tokens. – George Reith Sep 25 '12 at 12:46
    • For me it works just fine. I've just edited my reply by including some sample data. Can you try it and see whether it works for you too? – Kouber Saparev Sep 27 '12 at 15:52
    • The code works in that scenario but seems to be failing to find the `$start_emp_token` and `$end_emp_token` when used in a string with more variation such as my HTML source code. See http://rcnhca.org.uk/sites/first_steps/test.php where I use your sample code but simply replaced the string with my HTML source code. See line 115 for where it fails to extract the tokens. – George Reith Sep 28 '12 at 08:59
    • Ignore my last comment as the `em` tags were already in the string. It appears to work in your scenario, but not when added to the plugin. I think this is to do with the video being embedded by `SWFobject` and some sort of timing issue. I think my only option is to disable the highlighting feature. Shame. – George Reith Sep 28 '12 at 09:31
    • You just need to debug it in detail with print_r/var_dump in order to see why in your script it does not work. If used properly, the code above should work in any situation, at least it does what you were asking for - to ignore matches between the script tags. :) – Kouber Saparev Sep 28 '12 at 09:46
    • It is much more than this as I believe it is to do with the way each plugin hooks into the CMS, far too many files and methods are involved to debug it sanely and I'm only going to end up editing files I don't want to. Especially since my host has forced PHP errors to be suppressed. Thanks for your help :) – George Reith Sep 28 '12 at 13:31
    0

    You mentioned in a comment that it would be acceptable to remove script tags before performing the search.

    # the s modifier lets "." match newlines, so multi-line script blocks are removed too
    $data = preg_replace('/<\s*script.*?\/script\s*>/isu', '', $data);
    

    This code may help with that.
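
    A quick check of this approach on a script block that spans several lines, wrapped in a hypothetical helper named `strip_script_blocks`. Note that the pattern needs the `s` modifier here, because without it `.` stops at newlines and a multi-line block is left in place:

```php
<?php
// Hypothetical helper: removes whole <script>...</script> blocks from $html.
function strip_script_blocks(string $html): string
{
    // i: case-insensitive, s: "." also matches newlines, u: UTF-8 mode.
    return preg_replace('/<\s*script.*?\/script\s*>/isu', '', $html);
}

$data = "before\n<script type=\"text/javascript\">\nvar bob = 5;\n</script>\nafter";
var_dump(strip_script_blocks($data));
```

    Since the blocks are deleted rather than skipped, this only suits cases where the script content is not needed in the text that gets searched.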

    Martin
    0

    George, resurrecting this ancient question because it had a simple solution that wasn't mentioned. This situation is straight out of my pet question of the moment, Match (or replace) a pattern except in situations s1, s2, s3 etc

    You want to modify the following regex to exclude anything between <script> and </script>:

    (\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)
    

    Please forgive me for switching out $term with SOMETERM; it is for clarity, because $ has a special meaning in regex.

    With all the disclaimers about matching html in regex, to exclude anything between <script> and </script>, you can simply add this to the beginning of your regex:

    <script>.*?</script>(*SKIP)(*F)|
    

    so the regex becomes:

    <script>.*?</script>(*SKIP)(*F)|(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)
    

    How does this work?

    The left side of the OR (i.e., |) matches complete <script...</script> tags, then deliberately fails. The right side matches what you were matching before, and we know it is the right stuff because if it was between script tags, it would have failed.
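
    A small PHP sketch of this technique. Two things differ from the pattern above and are my adaptations, not part of the original regex: `<script\b[^>]*>` allows attributes on the opening tag, and the `s` modifier lets `.` span newlines. The helper name `highlight_skipping_scripts` is hypothetical.

```php
<?php
// Hypothetical helper: highlights $term everywhere except inside <script> blocks,
// using the (*SKIP)(*F) verbs supported by PCRE.
function highlight_skipping_scripts(string $html, string $term): string
{
    $t = preg_quote($term, '~'); // escape regex metacharacters in the term
    // Left of "|": consume a whole script block, then force a failure and skip it.
    // Right of "|": the original term pattern, captured as group 1.
    $pattern = '~<script\b[^>]*>.*?</script>(*SKIP)(*F)|(\b' . $t . '|' . $t . '\b)(?!([^<]+)?>)~isu';

    return preg_replace($pattern, '<em>$1</em>', $html);
}

$html = "My name is bob!\n<script type=\"text/javascript\">\nvar bob = 5;\n</script>\nAnd bob is cool.";
echo highlight_skipping_scripts($html, 'bob'), "\n";
```

    Because the script block is skipped rather than removed, the inline JavaScript reaches the output unchanged and no second pass is needed.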

    Reference

    How to match (or replace) a pattern except in situations s1, s2, s3...

    zx81