Parse HTML with Regex and C#

Question

I have HTML code like this:

<tr class="discussion r0"><td class="topic starter"><a href="SITE?d=6638">Test di matematica</a></td>

I need to select only "Test di matematica", and I think I can do this with a regular expression. I tried:

 string pattern= "<tr class=\"discussion r0\"><td class=\"topic starter\"><a href=\"" + site + "=d{1,4}\"" + ">\\s*(.+?)\\s*</a></td>";

but it doesn't work. How can I select the text that comes after one expression and before another?

EDIT: Can you tell me how to parse this string with HtmlAgilityPack? Thanks.

http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c — user2864740, Apr 27 '14 at 21:52
Is it totally impossible? With this: string patternTitolo = "d=\\d{1,4}\">\\s*(.+?)\\s*"; it works a little... — user3579313, Apr 27 '14 at 21:54
@user3579313 It is "misguided" and "fragile". HTML should not be parsed with regular expressions - it's a gross hack with an inappropriate tool applied when there are existing solutions. — user2864740, Apr 27 '14 at 21:57
@user3579313 Sure you can. However, this doesn't fix the fundamentally broken design of trying to use a regular expression for this task. I would *reject* (and I have before) any such code that uses regular expressions and/or manual string manipulation to deal with HTML or XML (or JSON or ..). — user2864740, Apr 27 '14 at 21:59
Anyway, `=d{1,4}` is clearly wrong, and I don't feel like wading through the rest of it. — user2864740, Apr 27 '14 at 22:02
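
To make the last comment concrete: in the question's pattern, `=d{1,4}` matches one to four literal `d` characters, not digits; the digit class is `\d`. The sketch below repairs only that part and keeps the question's regex approach. It assumes the question's `site` variable holds `SITE?d` (its real value is not shown), and it wraps `site` in `Regex.Escape` because an unescaped `?` would be read as a quantifier. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class TitleExtractor
{
    // 'site' stands in for the variable used in the question; its real
    // value is not shown there, so "SITE?d" below is an assumption.
    public static string ExtractTitle(string html, string site)
    {
        string pattern =
            "<tr class=\"discussion r0\"><td class=\"topic starter\"><a href=\""
            + Regex.Escape(site)       // escape '?' and other metacharacters
            + @"=\d{1,4}"">"           // \d means "digit"; a bare d is a literal letter
            + @"\s*(.+?)\s*</a></td>"; // group 1: the link text, trimmed

        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null; // null when nothing matches
    }

    static void Main()
    {
        string html = "<tr class=\"discussion r0\"><td class=\"topic starter\">"
                    + "<a href=\"SITE?d=6638\">Test di matematica</a></td>";
        Console.WriteLine(ExtractTitle(html, "SITE?d")); // prints "Test di matematica"
    }
}
```

This still breaks as soon as the markup changes (extra whitespace, reordered attributes, a five-digit id), which is the fragility the comments above warn about.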

Pedro Lobito · Answer 1 · 2014-04-27T22:05:24.673

Try this:

string myString = "<tr class=\"discussion r0\"><td class=\"topic starter\"><a href=\"SITE?d=6638\">Test di matematica</a></td>";
Regex rx = new Regex(@"<a.*?>(.*?)</a>");
MatchCollection matches = rx.Matches(myString);
if (matches.Count > 0)
{
    Match match = matches[0]; // only one match in this case
    GroupCollection groupCollection = match.Groups;
    Console.WriteLine( groupCollection[1].ToString());
}

DEMO

http://ideone.com/nFY6aw

zx81 · Accepted Answer · 2014-04-28T19:43:22.003

This regex makes sure that the text we capture is inside an <a> tag, which is inside a <td> tag, which is inside a <tr> tag.

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s1 = "<tr class=\"discussion r0\"><td class=\"topic starter\"><a href=\"SITE?d=6638\">Test di matematica</a></td>";

        // Match <tr ...>, then <td ...>, then <a ...>; group 1 captures the link text up to the next '<'.
        var r = new Regex(@"(?i)<tr[^>]*?>\s*<td[^>]*?>\s*<a[^>]*?>([^<]*)<", RegexOptions.IgnoreCase);

        string capture = r.Match(s1).Groups[1].Value;
        Console.WriteLine(capture);

        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program

The Output: Test di matematica
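
The question's edit also asks about HtmlAgilityPack. A minimal sketch of that route follows, assuming the third-party HtmlAgilityPack NuGet package is referenced by the project; it is not part of the .NET base library. The class name, method name, and XPath expression are illustrative choices, not something taken from the question.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class TopicTitle
{
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: the unclosed <tr> is accepted

        // XPath: the <a> directly inside the cell whose class is "topic starter".
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//td[@class='topic starter']/a");

        // SelectSingleNode returns null when no node matches.
        return link == null ? null : HtmlEntity.DeEntitize(link.InnerText).Trim();
    }

    static void Main()
    {
        string html = "<tr class=\"discussion r0\"><td class=\"topic starter\">"
                    + "<a href=\"SITE?d=6638\">Test di matematica</a></td>";
        Console.WriteLine(FromHtml(html));
    }
}
```

Unlike the regex, the XPath query does not depend on attribute order or on whitespace between the tags.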

edited Apr 28 '14 at 19:43

answered Apr 28 '14 at 01:16

@user3579313 Fantastic. Thanks for letting me know. – zx81 Apr 29 '14 at 10:06
Please, can you tell me how I can edit your regex to capture only the text inside a tag? I'm trying with (?i)]*?>([^ but it doesn't work... – user3579313 Apr 29 '14 at 22:24
@user3579313 `(?i)([^` – zx81 Apr 29 '14 at 22:38
This is the new question: http://stackoverflow.com/questions/23376687/regex-for-html-tr-tag – user3579313 Apr 29 '14 at 23:16