272

I wrote a regex to fetch a string from HTML, but it seems the multiline flag doesn't work.

This is my pattern; I want to get the text in the h1 tag.

var pattern = /<div class="box-content-5">.*<h1>([^<]+?)<\/h1>/mi;
var m = html.match(pattern); // match() returns the capture groups; search() only returns an index
return m[1];

I created a string to test it. When the string contains "\n", the result is always null. If I remove all the "\n"s, it gives me the right result, with or without the /m flag.

What's wrong with my regex?

Peter Mortensen
  • 14
    Don't use regular expressions to parse HTML; HTML is NOT a regular language. Use an HTML parser or the DOM instead. That is also much simpler. – Svante Jul 01 '09 at 11:24
  • You're looking for DOTALL, not multiline. – Vanuan Oct 17 '14 at 21:25
  • Note that JavaScript [will soon have](https://developers.google.com/web/updates/2017/07/upcoming-regexp-features) the `dotAll` modifier so you can do `/.../s` and your dots will also match new lines. As of July 2017 it's behind a flag in Chrome. –  Jul 17 '17 at 13:48
  • @Svante "Don't use regular expressions to parse HTML": this is not parsing. Let us not learn Chinese just to find a '鱼'. – Dávid Horváth Dec 15 '20 at 01:27
  • You can call it what you want, but the _actual_ question seems to be “how do I find the h1 headline in the div box-content-5?” while this regex (when it works) seems rather like “give me the last h1 headline that appears after the start tag of the div box-content-5” and even fails at that, e.g. when there are matching parts that are commented out. With a parser, you just parse, then query, which, depending on the language, might be e.g. just a CSS selector ".box-content-5 h1". This is simpler, much more correct, and obviously so. – Svante Dec 15 '20 at 13:49

5 Answers

618

You are looking for the /.../s modifier, also known as the dotall modifier. It forces the dot . to also match newlines, which it does not do by default.

The bad news is that it does not exist in JavaScript (it does as of ES2018, see below). The good news is that you can work around it by using a character class (e.g. \s) and its negation (\S) together, like this:

[\s\S]

So in your case the regex would become:

/<div class="box-content-5">[\s\S]*<h1>([^<]+?)<\/h1>/i

As of ES2018, JavaScript supports the s (dotAll) flag, so in a modern environment your regular expression could be as you wrote it, but with an s flag at the end (rather than m; m changes how ^ and $ work, not .):

/<div class="box-content-5">.*<h1>([^<]+?)<\/h1>/is
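
To compare the three variants side by side, here is a minimal sketch. The sample `html` string and the variable names are made up for illustration, and the last pattern assumes an engine that supports ES2018.

```javascript
// Hypothetical sample input: the <h1> sits on a different line than the <div>.
const html = '<div class="box-content-5">\n  <p>intro</p>\n  <h1>Title</h1>\n</div>';

// m only changes ^ and $, so . still stops at the first newline: no match.
const withM = html.match(/<div class="box-content-5">.*<h1>([^<]+?)<\/h1>/mi);

// [\s\S] matches any character, newlines included.
const withClass = html.match(/<div class="box-content-5">[\s\S]*<h1>([^<]+?)<\/h1>/i);

// ES2018 s (dotAll) flag: . now matches newlines too.
const withS = html.match(/<div class="box-content-5">.*<h1>([^<]+?)<\/h1>/is);

console.log(withM);        // null
console.log(withClass[1]); // Title
console.log(withS[1]);     // Title
```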
T.J. Crowder
molf
  • Damn! And what's the logic behind `[\s\S]*` ?!?! – simo Nov 10 '12 at 11:05
  • 5
    @simo Match any whitespace or non whitespace character, effectively matching any character. It's like `.`, but matching whitespace too (`\s`) means it matches `\n` (which `.` doesn't do in JavaScript, or can be made to do with the `s` flag). – alex Nov 29 '12 at 03:05
  • 1
    This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Modifiers". – aliteralmind Apr 10 '14 at 00:39
  • 43
    According to MDN, `[^]` also works to match any character, including newlines, in JavaScript. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#character-classes – Dan Allen Aug 04 '14 at 09:20
  • 6
    For performance reasons, it is highly recommended to use the `*?` quantifier instead of `*` in order to avoid greediness. This will avoid catching the **last** `<h1>` of the document: that's probably not what you want, and it's not efficient, as the regexp will continue to look for `<h1>` until the end of the string even if it has already found it before. – KrisWebDev Aug 22 '14 at 09:03
  • 9
    The [^] version is way easier on the regexp compiler, and also more terse. – Erik Corry Feb 06 '15 at 15:21
  • 2
    `[^]` works only in JavaScript (and other ECMAScript implementations), and can give wildly unexpected results if you try to use it in other flavors. – Alan Moore Jun 27 '16 at 11:47
  • I am using ([\\S\\s]*?) regex in Nodejs but it's not working with $1 string. – Anil Yadav Jul 31 '18 at 11:50
  • If you are on webpack: take a look at https://babeljs.io/docs/en/babel-plugin-transform-dotall-regex – keul Sep 20 '19 at 13:19
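
The greedy-versus-lazy point raised in the comments above can be checked with a small sketch. The `page` string is a made-up example with two headings; only the quantifier differs between the two patterns.

```javascript
// Hypothetical document with two headings after the opening div.
const page = '<div class="box-content-5">\n<h1>First</h1>\n<h1>Last</h1>\n</div>';

// Greedy [\s\S]* runs to the end of the string, then backtracks to the last <h1>.
const greedy = page.match(/<div class="box-content-5">[\s\S]*<h1>([^<]+?)<\/h1>/i);

// Lazy [\s\S]*? grows one character at a time and stops at the first <h1>.
const lazy = page.match(/<div class="box-content-5">[\s\S]*?<h1>([^<]+?)<\/h1>/i);

console.log(greedy[1]); // Last
console.log(lazy[1]);   // First
```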
21

You want the s (dotall) modifier, which apparently doesn't exist in JavaScript - you can replace . with [\s\S] as suggested by @molf. The m (multiline) modifier makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole string.
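
As a rough illustration of what m does and does not change, here is a short sketch; the `text` sample and the variable names are invented for the example.

```javascript
const text = 'first line\nsecond line';

// Without m, ^ matches only at the very start of the string.
const noM = /^second/.test(text);

// With m, ^ also matches right after each line terminator.
const withMFlag = /^second/m.test(text);

// m does not touch the dot: . still refuses to cross the newline.
const dotWithM = /first.*second/m.test(text);

console.log(noM);       // false
console.log(withMFlag); // true
console.log(dotWithM);  // false
```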

Greg
  • 4
    You might add that the /s modifier sets singleline mode as opposed to multiline mode. +1 – Cerebrus Jul 01 '09 at 10:02
  • Nine years later, JavaScript now has the `s` flag (ES2018). :-) – T.J. Crowder Nov 20 '18 at 16:51
  • Can I use: https://caniuse.com/#feat=mdn-javascript_builtins_regexp_dotall MDN: https://developer.mozilla.org/ru/docs/Web/JavaScript/Reference/Global_Objects/RegExp/dotAll – Filyus Aug 19 '20 at 09:44
13

[\s\S] did not work for me in Node.js 6.11.3. The RegExp documentation suggests using [^] instead, which does work for me.

(The dot, the decimal point) matches any single character except line terminators: \n, \r, \u2028 or \u2029.

Inside a character set, the dot loses its special meaning and matches a literal dot.

Note that the m multiline flag doesn't change the dot behavior. So to match a pattern across multiple lines, the character set [^] can be used (if you don't mean an old version of IE, of course), it will match any character including newlines.

For example:

/This is on line 1[^]*?This is on line 3/m

where the *? is the non-greedy grab of 0 or more occurrences of [^].
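
A quick way to try this out, assuming a three-line sample string (the `lines` value and variable names are only for illustration). Note that `[^]` is specific to ECMAScript regex engines.

```javascript
const lines = 'This is on line 1\nThis is on line 2\nThis is on line 3';

// The dot cannot cross the newlines, even with the m flag set.
const dotMatch = lines.match(/This is on line 1.*?This is on line 3/m);

// [^] is an empty negated class: "any character not in the empty set", so it matches newlines too.
const anyMatch = lines.match(/This is on line 1[^]*?This is on line 3/);

console.log(dotMatch);              // null
console.log(anyMatch[0] === lines); // true
```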

Michael Grant
  • 1
    For those who wonder what `[^]` means: it is like a double negation: *"match any character that is **not** in this **empty** list"* and so it comes down to saying *"match any character"*. – trincot Sep 05 '18 at 07:21
8

The dotall modifier actually made it into JavaScript in June 2018, as part of ECMAScript 2018.
https://github.com/tc39/proposal-regexp-dotall-flag

const re = /foo.bar/s; // Or, `const re = new RegExp('foo.bar', 's');`.
re.test('foo\nbar');
// → true
re.dotAll
// → true
re.flags
// → 's'
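
If the code may also run on engines older than ES2018, one possible approach is to detect support at runtime and fall back to [\s\S]. This is only a sketch: `supportsDotAll`, `anyChar`, `flags`, and `fooBar` are hypothetical names, not part of any API.

```javascript
// Returns true when the engine accepts the s (dotAll) flag.
function supportsDotAll() {
  try {
    new RegExp('.', 's');
    return true;
  } catch (e) {
    return false; // older engines throw a SyntaxError for unknown flags
  }
}

// Pick a "match any character" token that works on the current engine.
const anyChar = supportsDotAll() ? '.' : '[\\s\\S]';
const flags = supportsDotAll() ? 's' : '';
const fooBar = new RegExp('foo' + anyChar + 'bar', flags);

console.log(fooBar.test('foo\nbar')); // true
```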
Forivin
0

My suggestion is to split the multi-line string on "\n" and concatenate the pieces, so that the original string becomes a single line that is easy to manipulate.

<textarea class="form-control" name="Body" rows="12" data-rule="required" 
                  title='@("Your feedback ".Label())'
                  placeholder='@("Your Feedback here!".Label())' data-val-required='@("Feedback is required".Label())'
                  pattern="^[0-9a-zA-Z ,;/?.\s_-]{3,600}$" data-val="true" required></textarea>


$( document ).ready( function() {
  var errorMessage = "Please match the requested format.";
  var firstVisit = false;

  $( this ).find( "textarea" ).on( "input change propertychange", function() {

    var pattern = $(this).attr( "pattern" );
    var element = $( this );

    if(typeof pattern !== typeof undefined && pattern !== false)
    {
      // Strip any leading ^ and trailing $ so the anchors are re-added exactly once.
      var ptr = pattern.replace(/^\^|\$$/g, '');
      var patternRegex = new RegExp('^' + ptr + '$', 'gm');

      // Join the textarea's lines into a single space-separated string.
      var ks = "";
      $.each($( this ).val().split("\n"), function( index, value ){
        ks += " " + value;
      });

      // Declared with var so it does not leak into the global scope.
      var hasError = !ks.match( patternRegex );

      if ( typeof this.setCustomValidity === "function") 
      {
        this.setCustomValidity( hasError ? errorMessage : "" );
      } 
      else 
      {
        $( this ).toggleClass( "invalid", !!hasError );
        $( this ).toggleClass( "valid", !hasError );

        if ( hasError ) 
        {
          $( this ).attr( "title", errorMessage );
        } 
        else
        {
          $( this ).removeAttr( "title" );
        }
      }
    }

  });
});
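
Stripped of the jQuery and form handling, the split-and-join idea could be sketched roughly as follows. The helper name `matchesWhenFlattened`, the sample `feedback` text, and the simplified pattern are assumptions made for this example.

```javascript
// Hypothetical helper: flatten a multi-line value, then test it against a single-line pattern.
function matchesWhenFlattened(value, pattern) {
  const flattened = value.split('\n').join(' ');
  return pattern.test(flattened);
}

const feedback = 'Great site\nVery helpful';
const singleLine = /^[0-9a-zA-Z ]{3,600}$/;

// The raw value fails because \n is not in the character class.
console.log(singleLine.test(feedback));                  // false
// After joining the lines with spaces, the same pattern matches.
console.log(matchesWhenFlattened(feedback, singleLine)); // true
```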
Martijn Pieters
Ghebrehiywet