I have a "pre" element that is getting newlines added before and after its content, i.e.:

<pre>

My Content
</pre>

The above seems to be equivalent to 2 newlines before and 1 after.

I would like to parse my HTML string for all "pre" tags and remove these leading and trailing newlines.

I would use ASP.NET code to do the replacing:

Regex.Replace(myHtmlString, @"Regex Pattern", String.Empty);

The result should be:

<pre>My Content</pre>

So what would the "Regex Pattern" look like please?

Thanks in advance.

EDIT

Answer so far:

strCleanXhtmlDoc = Regex.Replace(strCleanXhtmlDoc, @"<pre>[\r\n]*(.*?)[\r\n]*</pre>", "<pre>$1</pre>");

The replace bit is $1.

EDIT:

Struggling to get the Regex to work with:

<pre style="color: #a11f98;font-family: calibri;font-size: 14pt;font-style: normal;font-weight: normal;">

L1

L11

L111
</pre>

This also needs to be matched, to produce:

<pre style="color: #a11f98;font-family: calibri;font-size: 14pt;font-style: normal;font-weight: normal;">L1

L11

L111</pre>
SamJolly
  • Take a look at the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496) under "Escape Sequences", particularly at [`\n` and `\r`](http://stackoverflow.com/a/3451192), and then give it a try in one of the listed online regex testers (in the bottom section). – aliteralmind Apr 10 '14 at 01:50
  • I don't recommend it, but [here's a spoiler](http://regex101.com/r/fX8fE2) to @aliteralmind's post if you can't figure it out. – Sam Apr 10 '14 at 01:59
  • OK, thanks for this. How would you tweak this to search for a `<pre>` that contains `\r\n`? I have come across: `<pre[^>]*>(.*?)</pre>`. Sorry, I am very new to regex. – SamJolly Apr 10 '14 at 02:04
  • @aliteralmind, thanks for your comment. I have looked at the FAQ. – SamJolly Apr 10 '14 at 02:05
  • Search for `<pre>[\r\n]*(.*?)[\r\n]*</pre>`, replace with `<pre>\1</pre>` – CAustin Apr 10 '14 at 02:15
  • @CAustin, thanks for your comment. Very helpful. Forgive my ignorance, but what does `\1` in `<pre>\1</pre>` mean? I did try this in a tester and got `<pre>\1</pre>` as the answer and not `<pre>My Content</pre>`. – SamJolly Apr 10 '14 at 02:26
  • Depending on the type of regex engine your tester is using, you may have to use `$1` instead. Anyway, `\1` (and `\2`, `\3`, etc) are references to capturing groups in the regex pattern. Anything matched within the first set of `()` can later be referenced by `\1`, either in the replacement, or later in the pattern itself. See this for a more complete explanation: http://stackoverflow.com/questions/21880127/have-trouble-understanding-capturing-groups-and-back-references – CAustin Apr 10 '14 at 02:30
  • @CAustin, OK, thanks for this. Learning a bit more. I am not totally sure what my Regex.Replace statement might look like. I have added it as an edit. The key question is how to get the \1 bit working, and this might be a .NET question. – SamJolly Apr 10 '14 at 02:41
  • Sorry, didn't notice you were using .NET. In that case, yes, you'll need to use `$1` for the reference. – CAustin Apr 10 '14 at 02:54
  • You need to use Multi-line regex http://msdn.microsoft.com/en-us/library/yd1hzczs – hazzik Apr 10 '14 at 02:54
  • Still having problems with the testing. I have added a tag that does also need to be matched. I could not get it to work in the tester. Possibly need tweaking?? Thanks to all. See 2nd EDIT – SamJolly Apr 10 '14 at 03:20
  • This `(<pre[^>]*>)\s*([^<]*?)\s*(</pre>)` with the replacement `$1$2$3` will work as long as you don't have sub tags. – David Ewen Apr 10 '14 at 05:22
  • And this works even with sub tags: `(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)`. The replacement is `$1$2$3`. – David Ewen Apr 10 '14 at 05:28

1 Answer

The regex you need is this: `(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)`

To break it down:

  • `(<pre[^>]*>)` matches the opening pre tag, including any attributes. The `[^>]*` part does most of the work: it matches any run of characters that aren't `>`.
  • `\s*` then matches as much whitespace as it can.
  • `([\w\W]*?)` captures the content. `[\w\W]` matches any character, including newlines, which `.` does not match by default. The `?` is a non-greedy modifier; it is there so that this group doesn't also grab the whitespace that the next part is meant to match.
  • `\s*` matches the whitespace at the end of the content, before the end tag.
  • `(</pre>)` matches the end tag; nothing special here.

The replacement is `$1$2$3`, which takes the three parenthesized groups and puts them back together without the whitespace.
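
As a rough sketch of how this might be wired up in C#: the class and method names below (`PreTrimmer`, `TrimPreWhitespace`) are made up for illustration; only the pattern and the `$1$2$3` replacement come from the explanation above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original post.
public static class PreTrimmer
{
    // Group 1: opening <pre ...> tag, group 2: content (non-greedy),
    // group 3: closing </pre> tag. The two \s* parts sit outside the groups,
    // so the whitespace they match is dropped by the replacement.
    private static readonly Regex PrePattern =
        new Regex(@"(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)");

    public static string TrimPreWhitespace(string html)
    {
        return PrePattern.Replace(html, "$1$2$3");
    }

    public static void Main()
    {
        string input = "<pre>\n\nMy Content\n</pre>";
        Console.WriteLine(TrimPreWhitespace(input)); // prints <pre>My Content</pre>
    }
}
```

Blank lines inside the content are left alone, since only the whitespace directly after the opening tag and directly before the closing tag falls outside the captured groups.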

Hope that makes some sense and helps you write your next one.

David Ewen
  • This was a very helpful reply and a wonderful explanation, which certainly assists with my journey of learning Regex. So a big thank you. It all works wonderfully. Thank you for enlightening me on this multiple replacement mechanism. Also, I feel a number of folks have helped me, so a big thank you. I have tried to show my appreciation by marking up the comments. – SamJolly Apr 10 '14 at 08:37
  • One last question on this. Some folks "do not recommend this approach". I am curious to know why. It seems pretty powerful to me. Perhaps it is very CPU intensive and slows the application significantly, not that I have noticed. Some folks have suggested that one should use HtmlAgility Pack as a better solution? – SamJolly Apr 10 '14 at 08:39
  • Regex is notoriously bad at parsing things like HTML. With anything more complicated than this, it all starts becoming really hard (for instance, if you had a pre inside a pre, this stops working and the regex becomes way longer). If your intent is to parse a lot of HTML, then using something like the HtmlAgility Pack would end up a lot easier. – David Ewen Apr 10 '14 at 08:44
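
To illustrate the nested case from the last comment, here is a small sketch that runs the same pattern over a pre inside a pre. The class name `NestedPreDemo` and the sample input are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name and input are illustrative only.
public static class NestedPreDemo
{
    public static string Trim(string html)
    {
        // Same pattern and replacement as in the answer.
        return Regex.Replace(html, @"(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)", "$1$2$3");
    }

    public static void Main()
    {
        string nested = "<pre>\nouter <pre>\ninner\n</pre> tail\n</pre>";

        // The non-greedy group stops at the FIRST </pre>, so the outer opening
        // tag gets paired with the inner closing tag. The newline after the
        // inner <pre> and the newline before the outer </pre> both survive.
        string result = Trim(nested);
        Console.WriteLine(result == "<pre>outer <pre>\ninner</pre> tail\n</pre>"); // prints True
    }
}
```

Only one match is found here, because after the first `</pre>` there is no further opening tag for the pattern to start from.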