I have the HTML source of a page saved as a text file.

I need to read the file and find only those numeric strings that have 6 continuous digits, possibly with a space in between those 6 digits.

E.g.

  1. 209 016 - should come up in the search result as 209016 (space removed)

  2. 209016 - should also come up in the search, unaltered, as 209016

  3. any numeric string that is not exactly 6 digits long should not come up in the search, e.g. 20901677, 209016@223, 29016

I think this can be achieved with a regex, but I was not able to do it.

A regex solution is preferable, but anything else is also welcome.

Allan
  • Why shouldn't `209016@223` result in the match `209016`? What are the relevant boundaries that separate numbers (since spaces don't seem to apply)? Perhaps any non-digit, non-space character? Another question is, should `123 456 789` be matched and if so, what should the result be? `123456` or `456789` or none of them or both (actually then also `234567`, `345678` should match)? – a_guest Jul 24 '19 at 07:54

2 Answers

2

To match 6 digits with any number of spaces in between, you may use the following pattern:

\b(?:\d[ ]*?){6}\b

Or if you want to reject it when it's followed by an @, you may use:

\b(?:\d[ ]*?){6}\b(?!@)

Then, you can use the replace method to remove the space characters.

Python example:

import re

# Six digits, each optionally followed by spaces; reject a match followed by "@".
regex = r"\b(?:\d[ ]*?){6}\b(?!@)"

test_str = ("209016 \n"
    "209 016\n"
    "20901677','209016@223', '29016")

# Print each match with the spaces removed.
for match in re.finditer(regex, test_str):
    print(match.group().replace(" ", ""))

Output:

209016
209016

41686d6564
  • Your regex is really good. I would recommend changing it to `\b(?:\d\s*?){6}\b(?!@)`; otherwise you might end up with many spaces included at the end of the result string, for example for `123456 ` – Allan Jul 24 '19 at 08:04
  • @Allan That's a good idea. Although it wouldn't affect the final output since all space characters will be removed anyway. – 41686d6564 Jul 24 '19 at 08:08
  • Also you need to put the same logic with the `@` at the beginning I guess `\b(? – Allan Jul 24 '19 at 08:09
  • This matches `209016!223`, for example, and I hardly think this is what the OP wants. Also, why not just do `test_str.replace(' ', '')` and then use a regex without `\s`? – a_guest Jul 24 '19 at 08:11
  • Yeah I completely agree the question is not clear... – Allan Jul 24 '19 at 08:14
  • Thanks a lot, the suggestions seem to work for me :) – Rahul Singh Jul 24 '19 at 16:00
  • @RahulSingh: you are welcome! Could you please accept either Ahmed or my answer? Thank you – Allan Jul 26 '19 at 06:46
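
The edge cases raised in the comments can be checked directly. The following is a minimal sketch that runs the answer's pattern, unchanged, on the two inputs mentioned above; the helper name `extract` is an assumption made for illustration only.

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(r"\b(?:\d[ ]*?){6}\b(?!@)")

def extract(text):
    """Return every match with its spaces removed (hypothetical helper)."""
    return [m.group().replace(" ", "") for m in pattern.finditer(text)]

# Only "@" is rejected by the lookahead, so a trailing "!" still allows a match.
print(extract("209016!223"))   # ['209016']

# With nine space-separated digits, the first six are reported.
print(extract("123 456 789"))  # ['123456']
```

Whether either result is acceptable depends on how the question's boundaries are meant to work, which the thread leaves open.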
2

You can try the following regex:

\b(?<!@)\d(?:\s*\d){5}\b(?!@)

demo: https://regex101.com/r/ZCcDmF/2/

But note that you might have to modify your boundaries if you need to exclude more than the @. It will become something like:

\b(?<!@|other char I need to exclude|another one|...)\d(?:\s*\d){5}\b(?!@|other char I need to exclude|another one|...)

where you have to replace other char I need to exclude, another one, ... with the actual characters.
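
As a minimal sketch of that idea in Python: when every excluded item is a single character, a character class inside the lookarounds is one way to write the alternation, and it keeps the lookbehind fixed-width as Python's `re` module requires. The set `EXCLUDED` and the function name `find_six_digit_numbers` are assumptions chosen for illustration, not part of the original answer.

```python
import re

# Hypothetical set of characters that must not touch the number; adjust as needed.
EXCLUDED = "@!#"

# Same shape as the regex above, with the alternation written as a character class.
pattern = re.compile(
    r"\b(?<![{0}])\d(?:\s*\d){{5}}\b(?![{0}])".format(re.escape(EXCLUDED))
)

def find_six_digit_numbers(text):
    """Return each match with its internal whitespace removed."""
    return [re.sub(r"\s+", "", m.group()) for m in pattern.finditer(text)]

sample = "209016 \n209 016\n20901677','209016@223', '29016, 111111!5, @222222"
print(find_six_digit_numbers(sample))  # ['209016', '209016']
```

Note that `\s*` also matches newlines, so digits split across lines could be joined; replacing it with `[ ]*` restricts the match to plain spaces.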

Allan