I have an HTML page containing <tr> tags with classes, and I need to capture the text between those tags.

I tried this regex:

(?i)<tr[^>]*?>([^<]*)</tr> 

But it doesn't work.

This is my full C# code:

string patternPost = @"(?i)<tr[^>]*?>([^<]*)</tr>";
MatchCollection m1 = Regex.Matches(html, patternPost, RegexOptions.Multiline);
foreach (Match m in m1)
{
    MessageBox.Show(m.Groups[1].Value);
}

Here is an example of the HTML page: http://pastebin.com/ewN5NZis

It contains two blocks. For each block, I need to store three pieces of information in three separate lists:

List 1: Title1, Title2
List 2: John, Antony
List 3: 29/04/14, 28/04/14

My plan is to use a first regex to capture each block, skipping useless content such as tags other than <tr>, and then to run three further regexes on each block to capture the three pieces of information. Is this the right approach? I hope the question is clearer now.

user3579313
  • 10
    [NOOOOOOOOOOOOOOOoooooooooooooo](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Jonesopolis Apr 29 '14 at 22:53
  • Why not use the `XmlDocument` class? – Blue0500 Apr 29 '14 at 22:55
  • 1
    As per @Jonesy's link, this is a **BAD IDEA**! – Brandon Apr 29 '14 at 22:56
  • Please define "doesn't work". For all the naysayers, regex is more powerful than it seems, [read this](http://stackoverflow.com/questions/17003799/what-are-regular-expression-balancing-groups/17004406#17004406). OK, it might not be maintainable for upcoming newbie regex reviewers, but hey, just add comments with the `x` modifier :) – HamZa Apr 29 '14 at 22:58
  • 2
    So surprising that you are the very first person to write web scraping tool in C#... I seem to remember seeing similar question in the past - maybe you can try searching for it :) Note that *the question* contains most of the possible approaches to reading HTML (including some theory of RegEx and HtmlAgilityPack link), so please make sure to read it past the top answer. – Alexei Levenkov Apr 29 '14 at 22:59
  • "Doesn't work" = I get an empty string for the whole HTML page. – user3579313 Apr 29 '14 at 23:01
  • @user3579313 it would be really helpful to add a fiddle, try [regex101.com](http://regex101.com) out. It's not C# specific, but I think it might be enough for your use-case. – HamZa Apr 29 '14 at 23:06

1 Answer

0

EDIT: In your last comment, you described markup of the form <tr ....> <tag> ... </tag> <tag2>...</tag2> </tr>, which is quite an expansion of the original problem. At this stage, I concur with all the other advice: you are going to need a DOM parser.
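
As a rough illustration of the parser route, here is a minimal sketch using `XDocument` from System.Xml.Linq. It assumes the markup is well-formed XML, which real-world HTML often is not; a dedicated HTML parser such as HtmlAgilityPack (mentioned in the comments on the question) is the usual choice in that case. The sample rows, the `CellsByClass` helper, and the `title`, `author` and `date` class names are all hypothetical, since the actual page structure is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class ParserSketch
{
    // Hypothetical helper: text of every <td> whose class attribute equals cssClass,
    // in document order. XElement.Value concatenates the text of all child nodes,
    // so inner tags such as <b> are dropped automatically.
    public static List<string> CellsByClass(XDocument doc, string cssClass)
    {
        return doc.Descendants("td")
                  .Where(td => (string)td.Attribute("class") == cssClass)
                  .Select(td => td.Value.Trim())
                  .ToList();
    }

    static void Main()
    {
        // Made-up, well-formed stand-in for the real page (assumption).
        string html = @"<table>
  <tr class='row'><td class='title'><b>Title1</b></td><td class='author'>John</td><td class='date'>29/04/14</td></tr>
  <tr class='row'><td class='title'><b>Title2</b></td><td class='author'>Antony</td><td class='date'>28/04/14</td></tr>
</table>";

        XDocument doc = XDocument.Parse(html);

        // One list per piece of information, as asked for in the question.
        List<string> titles  = CellsByClass(doc, "title");
        List<string> authors = CellsByClass(doc, "author");
        List<string> dates   = CellsByClass(doc, "date");

        Console.WriteLine(string.Join(", ", titles));
        Console.WriteLine(string.Join(", ", authors));
        Console.WriteLine(string.Join(", ", dates));
    }
}
```

The point of the sketch is that the parser, not the pattern, deals with nesting: the `<b>` inside the title cell needs no special handling.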

Older Edit: Originally you asked to match contents of <tr> tags. Specs have changed, so this answer contains the evolving versions.

For a plain <tr> tag: extract Group 1 from

(?i)<tr>([^<]*)</tr>

or for a <tr with stuff>:

(?i)<tr[^>]*>([^<]*)</tr>

or for <tr stuff><td stuff>Grab Me</td> (the lazy .*? stops at the first </td>, so a second cell on the same line is not swallowed):

(?i)<tr[^>]*?>\s*<td[^>]*?>(.*?)</td>

Here is a code sample:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s1 = "<tr stuff><td stuff>Grab Me</td>";
        // Group 1 captures the contents of the first <td> that follows the <tr>.
        var r = new Regex(@"(?i)<tr[^>]*?>\s*<td[^>]*?>(.*?)</td>");
        string capture = r.Match(s1).Groups[1].Value;
        Console.WriteLine(capture);
        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    }
}

Output: Grab Me
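
For the later request in the comments (everything between <tr ...> and </tr>, inner tags included), the `[^<]*` in the first two patterns is what gets in the way: it cannot cross a `<`, so a row containing any child tag never matches. Note also that `RegexOptions.Multiline` only changes the meaning of `^` and `$`; it does not let `.` match line breaks. A minimal sketch of a looser pattern follows, assuming rows are never nested inside other rows. The sample string and the `ExtractRows` helper name are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class RowSketch
{
    // (?is): i = ignore case, s = Singleline, so '.' also matches '\n'.
    // \b stops "<tr" from matching longer tag names such as <track>.
    // The lazy .*? ends each match at the first </tr> that follows.
    static readonly Regex RowPattern = new Regex(@"(?is)<tr\b[^>]*>(.*?)</tr>");

    // Hypothetical helper: inner markup of every row, child tags included.
    public static List<string> ExtractRows(string html)
    {
        var rows = new List<string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            rows.Add(m.Groups[1].Value.Trim());
        }
        return rows;
    }

    static void Main()
    {
        // Made-up sample: two rows that span several lines and contain child tags.
        string html = "<table>\n<tr class=\"a\">\n<td><b>Title1</b></td><td>John</td>\n</tr>\n"
                    + "<tr class=\"b\">\n<td><b>Title2</b></td><td>Antony</td>\n</tr>\n</table>";

        foreach (string row in ExtractRows(html))
        {
            Console.WriteLine("--- row ---");
            Console.WriteLine(row);
        }
    }
}
```

A table nested inside a row would still defeat this pattern, which is why the parser advice in the edit at the top stands.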

zx81
  • 1
    No, I think the problem is that inside the <tr> tag there are a lot of other tags, so my code doesn't work. – user3579313 Apr 29 '14 at 23:29
  • @user3579313 Please see the second part of the solution that I just added. :) – zx81 Apr 29 '14 at 23:30
  • That is the same as the code in my first post :) – user3579313 Apr 29 '14 at 23:31
  • 1
    @user3579313 Please give me an example of a full tag that is not matching. – zx81 Apr 29 '14 at 23:32
  • @user3579313 Please see my "Grab Me" regex just added for your new specs. – zx81 Apr 29 '14 at 23:49
  • @user3579313 Also updated C# code sample to get "Grab Me" – zx81 Apr 29 '14 at 23:51
  • Thanks for your help, but the problem is that in my HTML the <tr> tag contains a lot of other tags. Must I specify all of them? Isn't it possible to capture all the text between <tr ...> and </tr>? I need to capture the tags too: all text, tags included, between the <tr> tag and </tr>. – user3579313 Apr 29 '14 at 23:55
  • @user3579313 Hmm... Please see my latest edit at the top. That is as far as I can go with it. – zx81 Apr 30 '14 at 00:36
  • Sorry, but you don't understand me... I edited my first post with a screenshot of the HTML page. I need ALL the rows between the <tr> and the </tr>, including all the other tags. Thanks for your patience. – user3579313 Apr 30 '14 at 01:00
  • @user3579313 If you give me a test string I can directly copy and paste, I can do a test. I also need you to tell me exactly what you are trying to match. Some people use websites where you can paste a code sample. – zx81 Apr 30 '14 at 03:55
  • I edited my first post (hopefully for the last time :) ). – user3579313 Apr 30 '14 at 19:26