14

Let's say I have the following text and I want to extract the text between "Start of numbers" and "End of numbers". The number of lines in between is dynamic, and the only thing that changes from line to line is the number in it (first, second, etc.). Each file I'll be extracting data from has a different number of lines between the two markers. How can I write a regex that matches the content between "Start of numbers" and "End of numbers" without knowing how many lines the file will contain between them?

Regards!

This is the first line This is the second line

Start of numbers

This is the first line
This is the second line
This is the third line
This is the ...... line
This is the ninth line

End of numbers
Arya

4 Answers

31

You should use the Singleline option (RegexOptions.Singleline), which tells the C# regular expression engine that . matches any character, rather than any character except \n.

var regex = new Regex("Start of numbers(.*)End of numbers",
                  RegexOptions.IgnoreCase | RegexOptions.Singleline);
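
A minimal sketch of how the captured text could then be read out, assuming the whole file has already been loaded into one string. The names `NumberBlockExtractor` and `Extract` are hypothetical and only serve the illustration. The lazy `.*?` is a small variation on the pattern above: it stops at the first "End of numbers" rather than the last one, which may matter if a file contains more than one block.

```csharp
using System.Text.RegularExpressions;

public static class NumberBlockExtractor
{
    // Hypothetical helper: returns the text between the two markers,
    // or an empty string when the markers are not found.
    public static string Extract(string input)
    {
        // Singleline makes '.' match '\n' too, so the group can span lines.
        var regex = new Regex("Start of numbers(.*?)End of numbers",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Match match = regex.Match(input);

        // Groups[1] is the first parenthesized group; Trim drops the
        // blank lines that surround the block in the sample text.
        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }
}
```

It could be called as `NumberBlockExtractor.Extract(File.ReadAllText(path))`, where `path` stands for whichever file is being processed.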
Paul Oliver
  • I've never heard that. I'm not saying that you're wrong but the documentation [link](http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx) doesn't seem to say that nor does this regex [link](http://regexr.com?30oag): – Paul Oliver Apr 24 '12 at 05:46
  • 2
    @DavidZ.: Nope. SingleLine affects `.`, MultiLine affects `^` and `$`. Yes, there can be situations where it makes sense to specify *both* SingleLine and MultiLine. :-) – Heinzi Apr 24 '12 at 05:51
  • Yep, you're right. MultiLine affects ^ and $, I was under the impression that SingleLine does too but looking at the docs that is not the case. – David Z. Apr 24 '12 at 06:11
3

You should be able to match multi-line strings without issue. Just remember to account for the newline characters explicitly (\n for new lines), because by default . does not match \n.

string pattern = "Start of numbers(.|\n)*End of numbers";
Match m = Regex.Match(input, pattern);

This is easier if you picture your string with the hidden characters written out:

Start of numbers\n\nThis is the first line\nThis is the second line\n ...
David Z.
0

Something like this:

^(start)([\s\n\d\w]*)(end)$

Here you take the second group, which holds the text between the markers. You can even name the group if you like. The point is that you read the whole input into one string and then get the regex result from it.

Edit:

I have to amend this a bit. If your match can be somewhere in the middle of the input, drop the start (^) and end ($) anchors: (start)([\s\n\d\w]*)(end)

Note that this leaves you with only the lines you want; handle those lines afterwards.
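
As a rough sketch of the named-group idea applied to the markers from the question: the class `NamedGroupExample`, the method `ExtractLines`, and the group name `body` are all hypothetical. The character class is shortened to `[\s\w]`, which covers the same characters as `[\s\n\d\w]`. It assumes the lines contain only word characters and whitespace; a line with punctuation would stop the match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupExample
{
    // Hypothetical helper: returns the non-empty lines found between the markers.
    public static string[] ExtractLines(string input)
    {
        // (?<body>...) names the group so it can be read by name instead of by index.
        // The lazy *? stops at the first "End of numbers".
        Match match = Regex.Match(
            input, @"Start of numbers(?<body>[\s\w]*?)End of numbers");

        if (!match.Success)
        {
            return new string[0];
        }

        // Split on both CR and LF and drop the empty entries left by blank lines.
        return match.Groups["body"].Value.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```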

Lee Taylor
japesu
0
/(?<=Start of numbers).*(?=End of numbers)/s

You need to enable the dotall flag (s) so that . also matches newlines. In .NET this flag is called RegexOptions.Singleline.

http://regexr.com?30oaj
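
The `/.../s` notation above is not C# syntax, so here is a minimal sketch of the same lookaround pattern in .NET, where the inline `(?s)` switches on Singleline. The names `LookaroundExample` and `Extract` are hypothetical, and the lazy `.*?` is an added variation so the match ends at the first "End of numbers".

```csharp
using System.Text.RegularExpressions;

public static class LookaroundExample
{
    // Hypothetical helper: returns the text between the markers (markers excluded),
    // or an empty string when there is no match.
    public static string Extract(string input)
    {
        // (?s) is the inline form of RegexOptions.Singleline.
        // The lookbehind and lookahead assert the markers without consuming them,
        // so the whole match is already the wanted text and no group is needed.
        Match match = Regex.Match(
            input, @"(?s)(?<=Start of numbers).*?(?=End of numbers)");

        return match.Success ? match.Value : string.Empty;
    }
}
```

Because the markers are not part of the match, the result keeps the newlines that directly follow "Start of numbers" and precede "End of numbers"; calling `Trim()` on it removes them if they are unwanted.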

Jack