0

Does anyone have a regex that can remove the attributes from a body tag?

for example:

<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">

to return:

<body>

It would also be interesting to see an example of removing just a specific attribute, like:

<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">

to return:

<body bgcolor="White">
bcm
  • What language are you doing this in? – t0mm13b Sep 28 '10 at 23:45
  • 5
    [Use a parser.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – jball Sep 28 '10 at 23:48
  • 4
    Obligatory post is obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – LittleBobbyTables - Au Revoir Sep 28 '10 at 23:49
  • I don't want a full-blown parser... I just want a regex specific for this replace. Also, read 2nd answer of that poetic post. – bcm Sep 28 '10 at 23:50
  • 3
    Linking that answer in response to every question containing the words Regex and HTML is now more of an epidemic than people trying to parse HTML with regex. – MooGoo Sep 28 '10 at 23:59
  • @Brandon how specifically for this replace? `html.Replace("", "")` is certainly a viable solution if you really don't need parsing. – jball Sep 29 '10 at 00:06
  • 2
    @MooGoo, we'll have to continue doing it too, until every programmer understands that HTML is neither regular nor context-free. – jball Sep 29 '10 at 00:07
  • @moogoo: you haven't read that linky by LittleBobbyTables have you? – t0mm13b Sep 29 '10 at 00:08
  • @jball, why don't you give an example of using htmlparser for this specific purpose? – bcm Sep 29 '10 at 00:13
  • 1
    @tommieb75: Of course I have. You cannot spend more than 5 minutes on SO without running into it. It is not relevant to this question. Removing attributes from an XML tag != parsing. Using a full-blown XML/HTML parser for such a simple task is a ridiculous amount of overkill. – MooGoo Sep 29 '10 at 00:25
  • I don't want to parse the entire page, just a section that was saved by a rich text editor; it may or may not contain the body tag, depending on where the user pasted the HTML from. So... how do I apply HtmlAgilityPack in this scenario, taking into account that I'm totally new to this component and I'm not a back-end developer? – bcm Sep 29 '10 at 00:26
  • Voting to close as this has been answered elsewhere and there's too much noise! – t0mm13b Sep 29 '10 at 00:31
  • All the 2k and 10k reps silent when it's more work than to just point to a snob article? – bcm Sep 29 '10 at 00:31
  • @Brandon: You **should** have checked elsewhere on stackoverflow before submitting a question... – t0mm13b Sep 29 '10 at 00:37
  • what makes you assume I haven't – bcm Sep 29 '10 at 00:40
  • 1
    Please post a link to another question about removing attributes from XML tags. – MooGoo Sep 29 '10 at 00:45
  • There seems to be a bit of an emotional component to these comments. I can't speak for anyone else, but my personal reason for recommending a parser and trying to dissuade Brandon from regex is to save them trouble down the line. @MooGoo, when you say *"for such a simple task"* it implies to me that you have not fully understood either the complexity of XHTML tags or the limitations of regex. Any regex devised for this task can be easily broken by real-world XHTML. A parser will save time and be more robust. – jball Sep 29 '10 at 16:42
  • An aside - I have to disagree with the assertions that this has been answered anywhere else on SO. – jball Sep 29 '10 at 16:43

7 Answers

3

You can't parse XHTML with regex. Have a look at the HTML Agility Pack instead.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{
    body.Attributes.Remove("style");
}
dtb
  • what if the html block i'm looking at does not contain the body tag/node, will this still work? I'm only filtering a certain section of a page. – bcm Sep 29 '10 at 00:42
  • @Brandon: SelectSingleNode returns `null` if no body element is present. – dtb Sep 29 '10 at 00:45
  • possible to get the doc return as a string again ? - same way it goes in as a string in (html) – bcm Sep 29 '10 at 01:07
  • @Brandon: Try `doc.Save(filename)` or `doc.DocumentNode.OuterXml`. – dtb Sep 29 '10 at 01:18
  • @Brandon: see my sharpquery answer to find out how to output the document again. it uses htmlagilitypack under the hood.. just makes finding tags easier. – mpen Sep 29 '10 at 01:24
  • also, i disagree with "don't even think about it". regexes aren't great in general for parsing html..but for stripping off a few attributes, i think a regex is fine. – mpen Sep 29 '10 at 01:25
  • thanks, led me to doc.DocumentNode.OuterHtml; which works. I'm looking for a way to remove meta and link tags also. – bcm Sep 29 '10 at 01:29
  • I tried to download the documentation but when I open the chm file.. the right pnl shows error 'navigation to webpage was canceled.' – bcm Sep 29 '10 at 01:34
  • yea... i haven't managed to find any documentation on htmlagilitypack either. you kind of just have to guess ;) meta and link tags can be removed in the exact same fashion, no? – mpen Sep 29 '10 at 01:37
  • @Mark, yes I can remove them in the same fashion... the ones I'm seeing are hiding in , I'm seeing if I can remove that as well.... – bcm Sep 29 '10 at 02:16
  • started new thread about removing comments here: http://stackoverflow.com/questions/3818404/how-to-select-node-types-which-are-htmlnodetype-comment-using-htmlagilitypack – bcm Sep 29 '10 at 02:48
2

If you're doing a quick-and-dirty shell script, and you don't plan on using this much...

s/<body [^>]*>/<body>/

but I'm going to have to agree with everyone else that a parser is a better idea. I understand that sometimes you must make do with limited resources, but if you rely on a regex here... it has a strong chance of coming back to bite you when you least expect it.

and to remove a specific attribute:

s/\(<body[^>]*\) style="[^>"]*"/\1/

That will grab "body" and any attributes up to "style", drop the "style" attribute, and spit out the rest.
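Since the asker turned out to be working in C#, the same capture-and-drop idea can be sketched with `Regex.Replace`. This is only a rough sketch under assumptions: the class and method names are made up for illustration, the attribute value is assumed to be double-quoted, and no `>` is assumed to appear inside any attribute value of the tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyAttributeRemover
{
    // Hypothetical helper: removes one double-quoted attribute, by name,
    // from <body ...> opening tags and keeps everything else in the tag.
    // Group 1 captures "<body" plus any attributes before the target one;
    // the target attribute (with its leading whitespace) is matched and dropped.
    public static string RemoveBodyAttribute(string html, string attributeName)
    {
        string pattern =
            @"(<body\b[^>]*?)\s+" + Regex.Escape(attributeName) + @"\s*=\s*""[^""]*""";
        return Regex.Replace(html, pattern, "$1", RegexOptions.IgnoreCase);
    }

    // Usage:
    //   RemoveBodyAttribute(
    //       "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">",
    //       "style")
    //   returns: <body bgcolor="White">
}
```

Because the attribute run in group 1 is matched lazily and may be empty, this also handles the case where the target attribute comes first in the tag.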

Tim
  • In what way could it come back to bite him if all he wants to do is remove unnestable attributes? – MooGoo Sep 29 '10 at 00:06
  • @Moogoo - see my comment above! – t0mm13b Sep 29 '10 at 00:08
  • Regardless, this is the only answer that actually bothered to *answer* the question and not just mindlessly spout "bad bad bad evil evil evil". So, +1 – MooGoo Sep 29 '10 at 00:44
  • @Moogoo totally agree. I'm surprised that there are so many HtmlAgilityPack or Anti-HTML-regex disciples. Mind you I'm not against HtmlAgilityPack... just want a more measured response. – bcm Sep 29 '10 at 00:48
  • It is simply a knee-jerk reaction, as many people *do* want to use regex to parse nested HTML tags, which as you may have heard will not work. However, in the limited case you are describing, it should do just fine. I'm fairly certain that many programmers here use regex to find and replace things in their code *all the time* without ripping a single hole in the fabric of spacetime. Programming languages are not regular either, so next time you want to change the name of a variable, it damn well better be done using an abstract syntax tree! – MooGoo Sep 29 '10 at 00:57
  • Mainly, it's pretty easy to mess up a regex. In testing, it may work for every case you try, but in production, you are very likely to encounter a new, unexpected case. This can cause problems with some regex casting too wide a net. And modifying xml or html this way can result in invalid xml or html. So, basically, there's a risk of bugs. But as long as you understand the risks, regex can still be very useful. And yes, the lack of any real answers (at the time) is why I posted this in the first place. Just because a tool may not be The Best Way (tm) doesn't mean it's not useful. – Tim Sep 29 '10 at 16:49
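
To make the "new, unexpected case" risk from the comment above concrete, here is a minimal C# sketch of one way the `[^>]*` approach can go wrong: an attribute value that itself contains a `>`. The class and method names and the `onload` input are made up for illustration; they are not from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyTagRegexPitfall
{
    // Same idea as the sed one-liner: replace everything from "<body"
    // up to the next '>' with a bare "<body>".
    public static string StripBodyAttributes(string html)
    {
        return Regex.Replace(html, @"<body\b[^>]*>", "<body>", RegexOptions.IgnoreCase);
    }

    // The case from the question works as intended:
    //   StripBodyAttributes("<body bgcolor=\"White\" style=\"font-size:10pt;\">")
    //   returns: <body>
    //
    // An attribute value containing '>' does not. The match stops at the
    // first '>', so the tail of the attribute is left behind as stray text:
    //   StripBodyAttributes("<body onload=\"if (a > b) go()\">")
    //   returns: <body> b) go()">
}
```

Whether that matters depends on the input. For markup saved by a known rich text editor it may never come up, which is the trade-off being argued in this thread.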
2

Three ways to do it with regexes...

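string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
// 1. Lookarounds: match only the text between "<body" and ">" and delete it.
string a1 = Regex.Replace(html, @"(?<=<body\b).*?(?=>)", "");
// 2. Capture the tag name, lazily match up to the first ">", and rebuild the tag.
string a2 = Regex.Replace(html, @"<(body)\b.*?>", "<$1>");
// 3. Same, but the attribute run is an optional group that cannot cross a ">".
string a3 = Regex.Replace(html, @"<(body)(\s[^>]*)?>", "<$1>");
Console.WriteLine(a1); // <body>
Console.WriteLine(a2); // <body>
Console.WriteLine(a3); // <body>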
mpen
0

LittleBobbyTables' comment above is correct!

Regex is not the right tool here. If you read the answer that LittleBobbyTables linked, it shows clearly what the answerer went through as a result of using the wrong tool for the job: undue strain and stress.

Regex is NOT duct tape for this kind of thing, nor is it the answer to everything (including 42). Use the right tool for the right job.

Check out HtmlAgilityPack instead. It will do the job for you and ultimately save you the stress, tears and blood of getting to grips with parsing HTML with regex.

t0mm13b
  • give an example of HtmlAgilityPack accomplishing what I want? – bcm Sep 29 '10 at 00:15
  • @Brandon: obviously you do not understand the repercussions of regex's and not bothering to read up examples found pertaining to Html Agility Pack, here's the link for an example http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home – t0mm13b Sep 29 '10 at 00:32
  • I've already read it, thank you very much, and don't find it helpful for my scenario. – bcm Sep 29 '10 at 00:34
  • I've also read http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack but don't see how to modify this to suit my requirements... – bcm Sep 29 '10 at 00:39
  • Just because you know a lot doesn't mean you have to be so presumptuous about others 'not bothering'. – bcm Sep 29 '10 at 00:40
  • @tommieb75: are you serious? if i said it *wasn't* a self promo, would you still flag it? it's perfectly related to your post, and it's not like i'm making money off the damn thing. i'm sharing it out of the goodness of my heart for pete's sake! – mpen Sep 29 '10 at 00:44
  • @Brandon: I think you're right actually. What you would do is use HTML agility pack to find the body tag, remove all the attributes, then re-render the HTML... which I'm not even sure is possible with htmlagilitypack.. never used it for generating html. – mpen Sep 29 '10 at 00:49
  • 1
    tommieb75... i think you are the worst SO user I've met so far. congrats. – bcm Sep 29 '10 at 00:49
0

Here's how you'd do it in SharpQuery

string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
var sq = SharpQuery.Load(html);
var body = sq.Find("body").Single();
foreach (var a in body.Attributes.ToArray())
    a.Remove();
StringWriter sw = new StringWriter();
body.OwnerDocument.Save(sw);
Console.WriteLine(sw.ToString());

SharpQuery depends on HtmlAgilityPack and is a beta product, but I wanted to show that it can be done this way.

mpen
0
string pattern = @"<body[^>]*>";
string test = @"<body bgcolor=""White"" style=""font-family:sans-serif;font-size:10pt;"">";
string result = Regex.Replace(test,pattern,"<body>",RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);
string pattern2 = @"(?<=<body[^>]*)\s*style=""[^""]*""(?=[^>]*>)";
result = Regex.Replace(test, pattern2, "", RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);

This is just in case your project requirements limit your third party options (and doesn't give you the time to reinvent a parser).

Les
0

Chunky code I've got working at the moment, will be looking at reducing this:

private static string SimpleHtmlCleanup(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            //foreach(HtmlNode nodebody in doc.DocumentNode.SelectNodes("//a[@href]"))

            var bodyNodes = doc.DocumentNode.SelectNodes("//body");
            if (bodyNodes != null)
            {
                foreach (HtmlNode nodeBody in bodyNodes)
                {
                    nodeBody.Attributes.Remove("style"); 
                }
            }

            var scriptNodes = doc.DocumentNode.SelectNodes("//script");
            if (scriptNodes != null)
            {
                foreach (HtmlNode nodeScript in scriptNodes)
                {
                    nodeScript.Remove();
                }
            }

            var linkNodes = doc.DocumentNode.SelectNodes("//link");
            if (linkNodes != null)
            {
                foreach (HtmlNode nodeLink in linkNodes)
                {
                    nodeLink.Remove();
                }
            }

            var xmlNodes = doc.DocumentNode.SelectNodes("//xml");
            if (xmlNodes != null)
            {
                foreach (HtmlNode nodeXml in xmlNodes)
                {
                    nodeXml.Remove();
                }
            }

            var styleNodes = doc.DocumentNode.SelectNodes("//style");
            if (styleNodes != null)
            {
                foreach (HtmlNode nodeStyle in styleNodes)
                {
                    nodeStyle.Remove();
                }
            }

            var metaNodes = doc.DocumentNode.SelectNodes("//meta");
            if (metaNodes != null)
            {
                foreach (HtmlNode nodeMeta in metaNodes)
                {
                    nodeMeta.Remove();
                }
            }

            var result = doc.DocumentNode.OuterHtml;

            return result;
        }
bcm
  • code very much improved/reduced, reference here: http://stackoverflow.com/questions/3818404/how-to-select-node-types-which-are-htmlnodetype-comment-using-htmlagilitypack/3828478#3828478 – bcm Sep 30 '10 at 07:03
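
For reference, one possible way to shrink the repetition in the method above is to loop over the tag names. This is only a sketch and not necessarily what the linked answer does. It assumes the HtmlAgilityPack package is referenced, and the class name and the `TagsToRemove` array are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class HtmlCleanupSketch
{
    // Elements removed entirely (the same set as in the longer version above).
    private static readonly string[] TagsToRemove = { "script", "link", "xml", "style", "meta" };

    public static string SimpleHtmlCleanup(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var bodyNodes = doc.DocumentNode.SelectNodes("//body");
        if (bodyNodes != null)
        {
            foreach (HtmlNode body in bodyNodes)
            {
                body.Attributes.Remove("style");
            }
        }

        foreach (string tag in TagsToRemove)
        {
            var nodes = doc.DocumentNode.SelectNodes("//" + tag);
            if (nodes == null)
            {
                continue;
            }
            foreach (HtmlNode node in nodes)
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The behaviour is meant to match the longer version: the `style` attribute is dropped from any body element, and the listed elements are removed together with their contents.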