My regex experience is limited, and I've been tinkering with a problem that I haven't yet managed to solve. I suspect it will be relatively easy for someone with more regex experience, so any pointers would be appreciated.

Context: I need to validate a sentence that can consist of a-z (both cases), 0-9, spaces, standard punctuation, and the tags <br /> and <p></p>.

I wrote some tests in C# as follows.

[TestCase("123345acbcbbc ab")]
[TestCase("123 abc")]
[TestCase("aBcC 123 123! abc; 'k21HdD_-{};:")]
[TestCase("123!")]
[TestCase("aBcC<br />123 123!<br />abc; 'k21HdD_-{};:")]
public void WhenValidatingASentence_ThenStandardPunctuation_IsSupported(string sut)
{
    Assert.That(Regex.IsMatch(sut, @"^[a-zA-Z0-9]+[\sa-zA-Z0-9\p{P}]+?(<br\s/>)+?$"), Is.True);
}

The first four test cases work fine, but introducing the break into the pattern and the input causes the fifth case to fail.

Clearly I've misunderstood the use of a capture group or have spec'd it badly. Any guidance would be appreciated.

Needless to say, all parts of the string can repeat: paragraphs, breaks, letters, numbers and punctuation can appear many times throughout the sentence, although I expect the start has to be a-z or numeric.

Thanks Butters

Tim Butterfield
  • Side note: please *don't* show that you want to parse HTML with regular expressions. Such practice is generally frowned upon and is likely to earn you downvotes. Please keep the pain of getting a regular expression to reasonably parse HTML to yourself. – Alexei Levenkov Sep 22 '14 at 01:40
  • Thanks Alexei, I'm aware that it's generally a bad idea. I'm hoping to change my client's mind about what they want to do, and use standard line endings rather than HTML breaks, but the jury is still out. Thanks for the concern though. :) – Tim Butterfield Sep 23 '14 at 10:31

1 Answer

Here's a simple solution:

^(?:[0-9a-zA-Z \p{P}]+|<(?:br|/?p)[^>]*>)+$

This will not ensure the <p> tags are properly nested though, and it will allow attributes on the tags.
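As a quick check of the simple pattern, here is a minimal sketch written as a C# 9+ top-level program. The variable names and the last two sample strings are my own illustrations, not part of the question; the first two inputs are taken from the question's test cases.

```csharp
using System;
using System.Text.RegularExpressions;

// Simple pattern from the answer: runs of allowed text, or a <br>, <p> or </p> tag.
var sentence = new Regex(@"^(?:[0-9a-zA-Z \p{P}]+|<(?:br|/?p)[^>]*>)+$");

// The first two samples come from the question; the rest are illustrative.
string[] samples =
{
    "123 abc",
    "aBcC<br />123 123!<br />abc; 'k21HdD_-{};:",
    "<p>one</p><p>two",   // unbalanced <p>, still accepted by this pattern
    "a < b",              // a bare '<' is a symbol, not punctuation: rejected
};

foreach (var s in samples)
{
    // Prints the match result next to each input.
    Console.WriteLine($"{sentence.IsMatch(s),-5} {s}");
}
```

One thing to be aware of: the tag branch only checks how the tag name starts, so it would also accept something like <pre>. If that matters, a word boundary after the name, as in <(?:br|/?p)\b[^>]*>, is one way to tighten it.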

If you want to make sure the <p> tags are balanced, the regex gets more complicated:

^(?:
(?>[0-9a-zA-Z \p{P}]+)
|<br\s*/?>
|(?<para>)<p[^>]*>
|(?<-para>)</p\s*>
)+(?(para)(?!))$

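This uses balancing groups (I'd prefer .NET regexes to support recursion but that's a different topic). Because the pattern is spread over several lines, it has to be compiled with RegexOptions.IgnorePatternWhitespace. It will still allow attributes on the opening <p> tag.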
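To see the balancing behaviour in action, here is a rough sketch as a C# 9+ top-level program. The inline comments, variable name and sample strings are mine. Note that the br branch is written here as <br\s*/?> so that the self-closing <br /> from the question's test case is accepted.

```csharp
using System;
using System.Text.RegularExpressions;

// Balanced variant; IgnorePatternWhitespace allows the layout and # comments.
// Whitespace inside a character class is still literal, so the space in
// [0-9a-zA-Z \p{P}] keeps matching a real space.
var balanced = new Regex(@"
    ^(?:
        (?>[0-9a-zA-Z \p{P}]+)   # run of text; atomic, so it is never re-split on backtracking
      | <br\s*/?>                # line break, with or without the trailing slash
      | (?<para>) <p[^>]*>       # opening tag: push onto the para stack
      | (?<-para>) </p\s*>       # closing tag: pop, which fails if the stack is empty
    )+
    (?(para)(?!))                # fail if any opening tag is still on the stack
    $",
    RegexOptions.IgnorePatternWhitespace);

string[] samples =
{
    "<p>one</p><p>two</p>",                        // balanced
    "<p>one</p><p>two",                            // missing closing tag
    "one</p>",                                     // closing tag without an opener
    "aBcC<br />123 123!<br />abc; 'k21HdD_-{};:",  // fifth test case from the question
};

foreach (var s in samples)
{
    Console.WriteLine($"{balanced.IsMatch(s),-5} {s}");
}
```

The empty groups (?<para>) and (?<-para>) capture nothing; they only serve as push and pop operations on a counter, and the final conditional rejects the input when the counter is not back at zero.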

RegexHero demo

EDIT: I just noticed you want the start to be alphanumerical. If you want to enforce this, simply add [a-zA-Z0-9] just after the ^ anchor.
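A small sketch of that change, applied to the simple pattern (C# 9+ top-level program, names are mine). The first regex is the literal reading of the edit. The second is a variant of my own, not from the answer: it uses a lookahead so the first character is checked without being consumed, which matters if one-character input should be valid.

```csharp
using System;
using System.Text.RegularExpressions;

// Shared tail of the simple pattern (everything after the ^ anchor).
const string Body = @"(?:[0-9a-zA-Z \p{P}]+|<(?:br|/?p)[^>]*>)+$";

// Literal reading of the edit: consume one alphanumeric right after ^.
var consuming = new Regex("^[a-zA-Z0-9]" + Body);

// Assumed variant: only peek at the first character with a lookahead.
var lookahead = new Regex("^(?=[a-zA-Z0-9])" + Body);

Console.WriteLine(consuming.IsMatch("123!"));   // starts with a digit
Console.WriteLine(consuming.IsMatch("!abc"));   // starts with punctuation
Console.WriteLine(consuming.IsMatch("a"));      // single character
Console.WriteLine(lookahead.IsMatch("a"));      // single character, lookahead variant
```

With the consuming version the input needs at least two characters, because the group after the first character must still match once. Either way, input that begins with a tag such as <p> is rejected, which seems to be what the question asks for.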

Lucas Trzesniewski
  • Thanks Lucas. I haven't had the time to go through the answer fully, but I will give it a try later and mark the answer appropriately. The balancing groups are useful, as I knew it ought to be done but didn't know where to start. – Tim Butterfield Sep 22 '14 at 08:29