Trying to ascertain the Regex to remove newline from start and end of content in <pre> tag, using Regex.Replace from .NET

Question

I have a "pre" tag that is getting newlines added before and after its content, i.e.:

<pre>

My Content
</pre>

The above seems to be equivalent to 2 newlines before and 1 after.

I would like to parse my HTML string for all "pre" tags and remove these leading and trailing newlines.

I would use ASP.NET code to do the replacing:

Regex.Replace(myHtmlString, @"Regex Pattern", String.Empty);

The result should be:

<pre>My Content</pre>

So what would the "Regex Pattern" look like please?

Thanks in advance.

EDIT

Answer so far:

strCleanXhtmlDoc = Regex.Replace(strCleanXhtmlDoc, @"<pre>[\r\n]*(.*?)[\r\n]*</pre>", "<pre>$1</pre>");

The replace bit is $1.

EDIT:

Struggling to get the Regex to work with:

<pre style="color: #a11f98;font-family: calibri;font-size: 14pt;font-style: normal;font-weight: normal;">

L1

L11

L111
</pre>

Which does need matching, to produce:

<pre style="color: #a11f98;font-family: calibri;font-size: 14pt;font-style: normal;font-weight: normal;">L1

L11

L111</pre>

Take a look at the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496) under "Escape Sequences", particularly at [`\n` and `\r`](http://stackoverflow.com/a/3451192), and then give it a try in one of the listed online regex testers (in the bottom section). — aliteralmind, Apr 10 '14 at 01:50
I don't recommend it, but [here's a spoiler](http://regex101.com/r/fX8fE2) to @aliteralmind's post if you can't figure it out. — Sam, Apr 10 '14 at 01:59
Ok, Thanks for this. How would you tweak this to search for `<pre>` contains /r/n. I have come across : `<pre[^>]*>(.*?)</pre>`. Sorry I am very new to regex. — SamJolly, Apr 10 '14 at 02:04
@aliteralmind, thanks for your comment. I have looked at the FAQ. — SamJolly, Apr 10 '14 at 02:05
@CAustin, Thanks for your comment. Very helpful. Forgive my ignorance, but what does \1 in `<pre>\1</pre>` mean? I did try this in a tester and got `<pre>\1</pre>` as the answer and not `<pre>My Content</pre>` — SamJolly, Apr 10 '14 at 02:26
Depending on the type of regex engine your tester is using, you may have to use `$1` instead. Anyway, `\1` (and `\2`, `\3`, etc) are references to capturing groups in the regex pattern. Anything matched within the first set of `()` can later be referenced by `\1`, either in the replacement, or later in the pattern itself. See this for a more complete explanation: http://stackoverflow.com/questions/21880127/have-trouble-understanding-capturing-groups-and-back-references — CAustin, Apr 10 '14 at 02:30
@CAustin, OK thanks for this. Learning a bit more. I am not totally sure what my Regex.Replace statement might look like. I have added it as an edit. The key question is how to get the \1 bit working, and this might be a .NET question. — SamJolly, Apr 10 '14 at 02:41
Sorry, didn't notice you were using .NET. In that case, yes, you'll need to use `$1` for the reference. — CAustin, Apr 10 '14 at 02:54
You need to use Multi-line regex http://msdn.microsoft.com/en-us/library/yd1hzczs — hazzik, Apr 10 '14 at 02:54
Still having problems with the testing. I have added a tag that does also need to be matched. I could not get it to work in the tester. Possibly need tweaking?? Thanks to all. See 2nd EDIT — SamJolly, Apr 10 '14 at 03:20
This `(<pre[^>]*>)\s*([^<]*?)\s*(</pre>)` with the replacement `$1$2$3` will work as long as you don't have sub tags. — David Ewen, Apr 10 '14 at 05:22
and this works even with sub tags `(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)`. replacement is `$1$2$3` — David Ewen, Apr 10 '14 at 05:28

score 1 · Accepted Answer · answered Apr 10 '14 at 05:34

The regex you need is this: `(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)`

To break it down:

`(<pre[^>]*>)` matches the opening pre tag, including any attributes. The `[^>]*` does most of the work here: it means any run of characters that aren't `>`.
`\s*` then matches as much whitespace as it can.
`([\w\W]*?)` grabs the content. `[\w\W]` means any character at all and is more inclusive than `.`, which by default does not match a newline. The `?` is a non-greedy modifier; it is there so that this group doesn't also grab the whitespace that the next part is meant to match.
`\s*` matches the whitespace at the end of the content, before the end tag.
`(</pre>)` matches the end tag; nothing special here.

The replacement is `$1$2$3`, which grabs the three parenthesized sections and puts them back together without the whitespace.
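
As a rough sketch of how the pattern and replacement above might be wired into `Regex.Replace`, assuming a small console program (the class name `PreTrimDemo` and the method name `TrimPreContent` are made up for illustration, and the sample markup is a shortened version of the one in the question):

```csharp
using System;
using System.Text.RegularExpressions;

public static class PreTrimDemo
{
    // Pattern from the answer: opening tag, leading whitespace,
    // lazily matched content, trailing whitespace, closing tag.
    private static readonly Regex PrePattern =
        new Regex(@"(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)");

    // Hypothetical helper: reassembles the three captured groups,
    // dropping the whitespace that sat between them.
    public static string TrimPreContent(string html)
    {
        return PrePattern.Replace(html, "$1$2$3");
    }

    public static void Main()
    {
        string html =
            "<pre style=\"color: #a11f98;\">\r\n\r\nL1\r\n\r\nL11\r\n\r\nL111\r\n</pre>";

        Console.WriteLine(TrimPreContent(html));
    }
}
```

With this input, the opening tag should end up immediately followed by `L1` and `L111` immediately followed by `</pre>`, while the blank lines between the three content lines are left alone.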

Hope that makes some sense and helps you write your next one.

answered Apr 10 '14 at 05:34

David Ewen


This was a very helpful reply, and wonderful explanation which certainly assists with my journey of learning Regex. So a big thank you. It all works wonderfully. Thank you for enlightening me on this multiple replacement mechanism. Also I feel a number of folks have helped me, so a big thank you. I have tried to show my appreciation by marking up the comments. – SamJolly Apr 10 '14 at 08:37
One last question on this. Some folks "do not recommend this approach", and I am curious to know why. It seems pretty powerful to me. Perhaps it is very CPU intensive and slows the application significantly, not that I have noticed. Some folks have suggested that one should use HtmlAgility Pack as a better solution? – SamJolly Apr 10 '14 at 08:39
Regex is notoriously bad at parsing things like HTML. Anything more complicated than this and it all starts becoming really hard (for instance, if you had a pre inside a pre, this stops working and the regex becomes way longer). If your intent is to parse a lot of HTML, then using something like the HtmlAgility Pack would end up a lot easier. – David Ewen Apr 10 '14 at 08:44
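
To illustrate the "pre inside a pre" case mentioned above, here is a small sketch using the same pattern as the answer (the class name `NestedPreDemo`, the method name `TrimPreContent`, and the sample input are assumptions made up for this example). Because the lazy group stops at the first `</pre>` it finds, the outer opening tag gets paired with the inner closing tag:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedPreDemo
{
    // Same pattern and replacement as in the answer.
    public static string TrimPreContent(string html)
    {
        return Regex.Replace(
            html, @"(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)", "$1$2$3");
    }

    public static void Main()
    {
        string nested = "<pre>\nouter\n<pre>\ninner\n</pre>\n</pre>";

        // Newlines are shown as \n so the result fits on one line.
        Console.WriteLine(TrimPreContent(nested).Replace("\n", "\\n"));
        // prints: <pre>outer\n<pre>\ninner</pre>\n</pre>
    }
}
```

The newline after the inner opening tag and the newline before the outer closing tag both survive, which is the kind of case where a real HTML parser tends to be the easier route.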
