I was using Python 3.7 and the re library to find values in a CSV file like:

23,"1,309",0
23,"134,799",2
,"1,549,089",0
8908.89,"27,989",3

The values I wanted to extract are the ones surrounded by double quotes, with commas as thousands separators. Before doing it, I tested the following regex in VS Code search:

"(\d+,)?\d+,\d+"

It highlighted the right matches. However, when I used the regex in Python:

import re

# text holds the CSV contents shown above
regex = r'"(\d+,)?\d+,\d+"'
re.findall(regex, text)

I got:

['', '', '1,', '']

Eventually, I was able to get the right matches by using this expression instead:

regex = r'"\d+,\d+,\d+|\d+,\d+"'

But I am curious to know why the first expression worked in VS Code but not in Python. Why would that be?

dduque

1 Answer

Your regex looks OK to me. More importantly, it works for me, with some limitations. Example:

>>> import re
>>> r = re.compile(r'"(\d+,)?\d+,\d+"')
>>> r.search(' blah "1,543" blah')
<re.Match object; span=(6, 13), match='"1,543"'>
>>> r.search('"31,999" blah')
<re.Match object; span=(0, 8), match='"31,999"'>

You should probably show your code, explaining how the actual output is different from your expected output.
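For what it's worth, the output shown in the question is consistent with how re.findall treats capturing groups: when the pattern contains exactly one group, findall returns what that group captured rather than the whole match, and an optional group that did not participate shows up as an empty string. The sketch below illustrates this with the sample rows from the question; the variable names (text, pattern, groups_only, and so on) are mine, and I am assuming the real input looks like those four rows.

```python
import re

# Sample rows copied from the question (assumed to represent the real file).
text = '''23,"1,309",0
23,"134,799",2
,"1,549,089",0
8908.89,"27,989",3'''

pattern = r'"(\d+,)?\d+,\d+"'

# With a single capturing group, findall returns the group's contents,
# not the whole match; an unmatched optional group yields ''.
groups_only = re.findall(pattern, text)

# finditer yields match objects, so group(0) is always the full match.
full_matches = [m.group(0) for m in re.finditer(pattern, text)]

# A non-capturing group (?:...) lets findall return the full matches.
non_capturing = re.findall(r'"(?:\d+,)?\d+,\d+"', text)

print(groups_only)
print(full_matches)
print(non_capturing)
```

A search tool that only highlights matches has no reason to care about groups, which would explain why the same pattern looked right there.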

As I said above, your regex has some limitations: it will not match numbers below 1,000 (no comma) or above 999,999,999 (too many commas), and it will accept invalid numbers like "1,00". But the version you used with Visual Studio Code has the exact same limitations, so I assume that is deliberate.

To fix those bugs, I think you need one sub-pattern to match numbers greater than 999, and another sub-pattern for 1..999:

>>> r = re.compile(r'"(\d{1,3}(,\d{3})+|\d{1,3})"')
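A quick way to sanity-check that pattern is to run it against a few hand-picked strings. This is only a sketch: the samples list and the results dict are made up for illustration, and fullmatch is used so that each string has to match in its entirety.

```python
import re

# The stricter pattern suggested above: groups of exactly three digits
# after each comma, or a bare number of one to three digits.
r = re.compile(r'"(\d{1,3}(,\d{3})+|\d{1,3})"')

# Hypothetical test strings, including two malformed ones.
samples = ['"1,309"', '"1,549,089"', '"999"', '"1,00"', '"1234,567"']

# Map each sample to True if the whole string matches, else False.
results = {s: bool(r.fullmatch(s)) for s in samples}

for sample, ok in results.items():
    print(sample, ok)
```

Note that this pattern also contains capturing groups, so re.findall would return tuples of group contents for it; re.finditer with group(0), or non-capturing groups, avoids that.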
Greg Ward
  • I was able to replicate the error; please check the question again. As for the flaws in my regex, I didn't have to worry about the cases you mentioned because they didn't appear in the text, but thanks. – dduque Apr 23 '20 at 02:46