Using regex to get text between multiple HTML tags

Question

Using regex, I want to be able to get the text between multiple DIV tags. For instance, the following:

<div>first html tag</div>
<div>another tag</div>

Would output:

first html tag
another tag

The regex pattern I am using only matches my last div tag and misses the first one. Code:

    static void Main(string[] args)
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string pattern = "(<div.*>)(.*)(<\\/div>)";

        MatchCollection matches = Regex.Matches(input, pattern);
        Console.WriteLine("Matches found: {0}", matches.Count);

        if (matches.Count > 0)
            foreach (Match m in matches)
                Console.WriteLine("Inner DIV: {0}", m.Groups[2]);

        Console.ReadLine();
    }

Output:

Matches found: 1

Inner DIV: This is ANOTHER test

Is it imperative for this task that you use a regular expression? HTML is a context-free grammar, which cannot be parsed with regular expressions. Oftentimes you can get close, but you would be better off using an HTML parser. See http://stackoverflow.com/a/1732454/2022565 – Tom Jacques, Apr 14 '13 at 23:20

score 17 · Answer 1 · answered Apr 14 '13 at 23:19

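Replace your pattern with a non-greedy match. The greedy .* in <div.*> runs on to the last > that still lets the pattern match, so both divs are swallowed into a single match; the lazy .*? stops at the first opportunity instead.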

static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = "<div.*?>(.*?)<\\/div>";

    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);

    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);

    Console.ReadLine();
}
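
To make the difference concrete, here is a minimal sketch that runs the question's greedy pattern and the lazy pattern side by side on the same input. The class name GreedyVsLazy and the helper Capture are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    // Hypothetical helper: returns the given capture group of every match.
    public static List<string> Capture(string input, string pattern, int group)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            results.Add(m.Groups[group].Value);
        return results;
    }

    static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        // Greedy: "<div.*>" extends to the last '>' that still allows a match,
        // so both divs collapse into one match.
        List<string> greedy = Capture(input, "(<div.*>)(.*)(</div>)", 2);
        Console.WriteLine("Greedy: {0} match(es)", greedy.Count);

        // Lazy: ".*?" stops at the first '>' and at the first "</div>".
        List<string> lazy = Capture(input, "<div.*?>(.*?)</div>", 1);
        Console.WriteLine("Lazy: {0} match(es)", lazy.Count);
        foreach (string text in lazy)
            Console.WriteLine("Inner DIV: {0}", text);
    }
}
```

The greedy pattern reports one match and the lazy pattern reports two, which mirrors the output shown in the question.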

coolmine

It found both of the matches but displays empty value(s) on my program – Ben Apr 14 '13 at 23:51
The above code should work. Note that it's m.Groups[1] and not m.Groups[2]: I changed the pattern a bit, since there is no reason to capture the tag itself. http://www.rubular.com/r/XQrcobmfAK – coolmine Apr 15 '13 at 00:00

score 10 · Answer 2 · answered Oct 01 '16 at 11:58

Since the other answers didn't mention HTML tags with attributes, here is my solution for that case:

// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World

score 1 · Answer 3 · answered Apr 14 '13 at 23:20

First of all, remember that a real HTML file will contain newline characters ("\n"), which are missing from the string you are using to test your regex.

Second, taking your regex:

((<div.*>)(.*)(<\\/div>))+ // This regex looks for any number of div tags, but it must see at least one.

((<div.*>)(.*)(<\\/div>))* // This regex looks for any number of div tags, and it will not complain if there are none at all.
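
On the newline point above: in .NET, '.' does not match "\n" unless RegexOptions.Singleline is set. The following is a small sketch of that effect; the sample input, the class name NewlineDemo and the helper CountDivs are assumptions made up for this illustration, and the pattern is the attribute-tolerant one used elsewhere in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

static class NewlineDemo
{
    // Hypothetical helper: counts div matches under the given options.
    public static int CountDivs(string html, RegexOptions options)
    {
        return Regex.Matches(html, "<div[^>]*>(.*?)</div>", options).Count;
    }

    static void Main()
    {
        // Sample input with a line break inside the div (assumed for illustration).
        string html = "<div>line one\nline two</div>";

        // Default options: '.' stops at '\n', so the pattern cannot span the line break.
        Console.WriteLine("Default: {0}", CountDivs(html, RegexOptions.None));

        // Singleline: '.' matches '\n' as well, so the div is found.
        Console.WriteLine("Singleline: {0}", CountDivs(html, RegexOptions.Singleline));
    }
}
```

With default options the div that spans two lines is not matched at all; with Singleline it is.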

Also a good place to look for this sort of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman

score 1 · Answer 4 · edited May 23 '17 at 11:47

The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression will fail to extract the information you want.

The reason is that HTML is described by a context-free grammar, which is a more expressive class than the regular languages that regular expressions can match.

Here's an example: what if you have nested divs?

<div><div>stuff</div><div>stuff2</div></div>

Depending on whether their quantifiers are greedy or lazy, the regexes listed in the other answers can end up grabbing any of:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

because that's what regular expressions do when they try to parse HTML.

You can't write a regular expression that understands how to interpret all of the cases, because regular expressions are incapable of doing so. If you are dealing with a very specific constrained set of HTML, it may be possible, but you should keep this fact in mind.
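
As a concrete check of the nested case, this sketch runs the lazy pattern from the accepted answer against the nested input above. The class name NestedDivDemo and the helper InnerTexts are hypothetical names used only here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NestedDivDemo
{
    // Hypothetical helper: applies the lazy pattern and returns group 1 of each match.
    public static List<string> InnerTexts(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, "<div.*?>(.*?)</div>"))
            results.Add(m.Groups[1].Value);
        return results;
    }

    static void Main()
    {
        string html = "<div><div>stuff</div><div>stuff2</div></div>";

        // The first match starts at the OUTER <div> but ends at the first </div>,
        // so its captured "text" still contains the inner opening tag.
        foreach (string text in InnerTexts(html))
            Console.WriteLine("Captured: {0}", text);
    }
}
```

The first captured value is "<div>stuff" rather than "stuff", because the pattern has no way of pairing each opening tag with its own closing tag.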

More information: https://stackoverflow.com/a/1732454/2022565

score 1 · Answer 5 · edited May 23 '17 at 12:26

Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks pretty useful (basically use CSS selector-style syntax to get the elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically meant to be "jQuery for C#," which is pretty much the exact search criteria I used to find it.

If you could do this in a web browser, you could easily use jQuery, using syntax similar to $("div").each(function(idx){ alert( idx + ": " + $(this).text()); }); (only you would obviously output the result to the log, or the screen, or make a web service call with it, or whatever you need to do with it).
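
To illustrate the parser-based approach without a third-party package, here is a rough sketch that uses .NET's built-in System.Xml.Linq as a stand-in. This is not Html Agility Pack or CsQuery, and it is an assumption that the markup is well-formed, XML-compatible text; real-world HTML often is not, which is where a dedicated HTML parser earns its keep. The class name ParserSketch and the helper LeafDivTexts are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class ParserSketch
{
    // Hypothetical helper: text of every div that has no child elements.
    public static List<string> LeafDivTexts(string markup)
    {
        // Wrap the fragment in a dummy root so several top-level divs
        // form a single well-formed XML document.
        XElement root = XElement.Parse("<root>" + markup + "</root>");
        return root.Descendants("div")
                   .Where(d => !d.HasElements)
                   .Select(d => d.Value)
                   .ToList();
    }

    static void Main()
    {
        string flat = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string nested = "<div><div>stuff</div><div>stuff2</div></div>";

        foreach (string text in LeafDivTexts(flat))
            Console.WriteLine("Inner DIV: {0}", text);

        // Nesting is handled by the parser, so only the innermost text comes back.
        foreach (string text in LeafDivTexts(nested))
            Console.WriteLine("Inner DIV: {0}", text);
    }
}
```

Because the parser tracks which closing tag belongs to which opening tag, nested divs and attributes need no special handling in the extraction code.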

answered Apr 15 '13 at 01:55

Craig

A downvote without any explanation or comment. Thanks! The fact is that HTML/XML are notoriously a pain in the neck to deal with using Regex. Not that you can't do it, and I certainly have on numerous occasions, but CSS selector syntax is a much cleaner proposition. – Craig Oct 13 '16 at 14:38

score 1 · Answer 6 · answered Jul 15 '14 at 03:12

I think this code should work:

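// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
string htmlSource = "<div>first html tag</div><div>another tag</div>";
// [^>]*? skips any attributes on the opening tag; (.*?) lazily captures the inner text.
string pattern = @"<div[^>]*?>(.*?)</div>";
// Singleline lets '.' match newlines, so div content that spans lines is captured too.
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
List<string> results = new List<string>();
foreach (Match match in matches)
{
    results.Add(match.Groups[1].Value);
}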

score 1 · Answer 7 · answered Feb 06 '20 at 06:30

I hope the regex below will work:

<div.*?>(.*?)</div>

You will get your desired output

This is a test This is ANOTHER test

Partha Mondal