Using regex, I want to be able to get the text between multiple DIV tags. For instance, the following:

<div>first html tag</div>
<div>another tag</div>

Would output:

first html tag
another tag

The regex pattern I am using only matches my last div tag and misses the first one. Code:

    static void Main(string[] args)
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string pattern = "(<div.*>)(.*)(<\\/div>)";

        MatchCollection matches = Regex.Matches(input, pattern);
        Console.WriteLine("Matches found: {0}", matches.Count);

        if (matches.Count > 0)
            foreach (Match m in matches)
                Console.WriteLine("Inner DIV: {0}", m.Groups[2]);

        Console.ReadLine();
    }

Output:

Matches found: 1

Inner DIV: This is ANOTHER test

Ben
  • Is it essential to this task that you use a regular expression? HTML has nested structure, which regular expressions cannot parse in general. Oftentimes you can get close, but you would be better off using an HTML parser. See http://stackoverflow.com/a/1732454/2022565 – Tom Jacques Apr 14 '13 at 23:20

7 Answers

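Replace your pattern with a non-greedy match. In your pattern, the greedy .* inside <div.*> runs on to the last > that still allows a match, so the first group swallows the entire first div. The lazy .*? stops at the first possible > instead: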

static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = "<div.*?>(.*?)<\\/div>";

    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);

    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);

    Console.ReadLine();
}
coolmine
  • It found both of the matches, but my program displays empty values. – Ben Apr 14 '13 at 23:51
  • The above code should work. Note that it's m.Groups[1] and not m.Groups[2]: I changed the pattern a bit, since there is no reason to capture the tag itself. http://www.rubular.com/r/XQrcobmfAK – coolmine Apr 15 '13 at 00:00

Since the other answers don't discuss HTML tags with attributes, here is my solution for that case:

// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World
Mehdi Dehghani

First of all, remember that a real HTML file contains newline characters ("\n"), which the string you are using to test your regex does not include.
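To illustrate the newline point, here is a minimal sketch. The input string and the class name NewlineDemo are made up for this example. By default `.` does not match "\n" in .NET, so a div whose text spans two lines is skipped unless RegexOptions.Singleline is passed:

```csharp
using System;
using System.Text.RegularExpressions;

class NewlineDemo
{
    static void Main()
    {
        // Hypothetical input: like the question's, but with a line break inside the first div.
        string input = "<div>first\nhtml tag</div><div>another tag</div>";
        string pattern = @"<div[^>]*>(.*?)</div>";

        // Default options: '.' stops at '\n', so only the second div matches.
        MatchCollection plain = Regex.Matches(input, pattern);
        Console.WriteLine("Default: {0}", plain.Count);       // Default: 1

        // Singleline: '.' also matches '\n', so both divs match.
        MatchCollection single = Regex.Matches(input, pattern, RegexOptions.Singleline);
        Console.WriteLine("Singleline: {0}", single.Count);   // Singleline: 2
    }
}
```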

Second, taking your regex:

((<div.*>)(.*)(<\\/div>))+ // Matches one or more div elements; at least one must be present.

((<div.*>)(.*)(<\\/div>))* // Matches zero or more div elements; it still succeeds when there are none.

Also a good place to look for this sort of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman


The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression will fail to extract the information you want.

The reason is that HTML elements can be nested to any depth, which requires at least a context-free grammar, a strictly more powerful class than the regular languages that classic regular expressions describe.

Here's an example -- what if you have nested divs?

<div><div>stuff</div><div>stuff2</div></div>

Depending on where they start and how greedy they are, the regexes in the other answers can grab any of:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

because that's what regular expressions do when they try to parse HTML.

You can't write a regular expression that understands how to interpret all of the cases, because regular expressions are incapable of doing so. If you are dealing with a very specific constrained set of HTML, it may be possible, but you should keep this fact in mind.
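As a concrete check, here is a small sketch (the class name NestedDivDemo is invented for this example) that runs the non-greedy pattern from the accepted answer against the nested input above. It finds two matches, and the first captured group contains a stray opening tag, because the pattern pairs the outer <div> with the first </div> it encounters:

```csharp
using System;
using System.Text.RegularExpressions;

class NestedDivDemo
{
    static void Main()
    {
        string input = "<div><div>stuff</div><div>stuff2</div></div>";
        string pattern = @"<div.*?>(.*?)</div>";

        // Regex.Matches returns non-overlapping matches, scanning left to right.
        foreach (Match m in Regex.Matches(input, pattern))
            Console.WriteLine("[{0}]", m.Groups[1].Value);

        // Prints:
        // [<div>stuff]
        // [stuff2]
    }
}
```

Neither result corresponds to the content of the outer div, and the trailing </div> is left unmatched.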

More information: https://stackoverflow.com/a/1732454/2022565

Tom Jacques

Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?
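For reference, a minimal sketch of the Html Agility Pack approach applied to the question's input. It assumes the HtmlAgilityPack NuGet package is installed; the class name AgilityPackDemo is made up for this example:

```csharp
using System;
using HtmlAgilityPack; // NuGet package, not part of the .NET base class library

class AgilityPackDemo
{
    static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(input);

        // SelectNodes takes an XPath expression and returns null when nothing matches.
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//div");
        if (divs != null)
            foreach (HtmlNode div in divs)
                Console.WriteLine("Inner DIV: {0}", div.InnerText);
    }
}
```

Note that "//div" selects nested divs as well as top-level ones, and InnerText on an outer div includes the text of its children, so nested input may need a narrower XPath.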

CsQuery also looks pretty useful (basically use CSS selector-style syntax to get the elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically meant to be "jQuery for C#," which is pretty much the exact search phrase I used to find it.

If you could do this in a web browser, you could easily use jQuery, with syntax similar to $("div").each(function(idx){ alert( idx + ": " + $(this).text()); }); (only you would obviously output the result to the log, or the screen, or make a web service call with it, or whatever you need to do with it).

Craig
  • A downvote without any explanation or comment. Thanks! The fact is that HTML/XML are notoriously a pain in the neck to deal with using Regex. Not that you can't do it, and I certainly have on numerous occasions, but CSS selector syntax is a much cleaner proposition. – Craig Oct 13 '16 at 14:38

I think this code should work:

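// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
string htmlSource = "<div>first html tag</div><div>another tag</div>";
// [^>]*? skips any attributes on the tag; (.*?) lazily captures the inner text.
string pattern = @"<div[^>]*?>(.*?)</div>";
// Singleline lets '.' match newlines; IgnoreCase also accepts <DIV>.
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
List<string> results = new List<string>();
foreach (Match match in matches)
{
    results.Add(match.Groups[1].Value);
}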
Tri Nguyen Dung

The regex below should work:

<div.*?>(.*?)</div>

You will get your desired output:

This is a test This is ANOTHER test