Python regex for dollar amounts including commas and decimals

Question

I'm new to Python and I'm writing a bit of code that needs to take a block of text and remove anything that's not a dollar amount. For example, the number two thousand may be represented as 2000 2000.00 2,000 and 2k. I'm trying to accomplish this with a single regex replacement.

Right now I have:

f=re.sub([0-9]+?(,[0-9])*?[0-9]+?(.[0-9])*?[TtBbMmKk],"",f)

While I understand that this is completely incorrect and does not compile, I'm not proficient enough to know what to do about it. Can anyone give me some guidance? Thanks!

Well, you can start by wrapping the regex within single quotes. And add an `r` in front: `f=re.sub(r'[0-9]+?(,[0-9])*?[0-9]+?(.[0-9])*?[TtBbMmKk]',"",f)` (I cannot answer why there's the need for the `r` though). — Jerry, Aug 15 '13 at 16:16
Maybe you could describe in more detail what you are trying to do? What sort of text is the input? Can you give us an example input and desired output string? — Brionius, Aug 15 '13 at 16:20
@Jerry the `r` denotes a raw string literal, in which backslashes are not escape characters - it avoids the 'double backslash' phenomenon present in other languages. — roippi, Aug 15 '13 at 16:21

Joseph Dunn · Accepted Answer · 2013-08-15T16:38:22.863

3

Give this a shot:

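import re

blockOfText = 'two thousand may be represented as 2000 2000.00 2,000 and 2k'
# findall returns one (whole amount, optional separator part) tuple per match; x[0] is the whole amount
amountsOnly = ' '.join(x[0] for x in re.findall(r'(\$?\d+([,\.]\d+)?k?)', blockOfText))
print(amountsOnly)  # 2000 2000.00 2,000 2k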

That gets you a new text string that you could assign to blockOfText if you want, effectively removing anything that's not a dollar amount.
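If the text can contain amounts with several comma groups (such as 1,234,567.89) or the T/B/M/K suffixes from the question, a slightly wider pattern may help. The following is only a sketch and not part of the original answer: the name `AMOUNT`, the helper `keep_amounts`, and the sample sentence are made up for illustration, and the pattern assumes commas separate groups of exactly three digits.

```python
import re

# Hypothetical pattern (an assumption, not the answer's regex): optional "$",
# digits with optional thousands groups, optional decimals, optional suffix.
AMOUNT = re.compile(
    r'\$?'                          # optional dollar sign
    r'(?:\d{1,3}(?:,\d{3})+|\d+)'   # 1,234,567 style, or plain digits
    r'(?:\.\d+)?'                   # optional decimal part
    r'[TtBbMmKk]?'                  # optional magnitude suffix
)


def keep_amounts(text):
    """Return only the amount-like tokens of text, joined by single spaces."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ' '.join(AMOUNT.findall(text))


print(keep_amounts('paid $1,234,567.89 then 2k and 3.5M, not 2000 apples'))
# $1,234,567.89 2k 3.5M 2000
```

Note that the suffix is matched greedily, so a number glued to a word (for example `2000books`) would be kept as `2000b`; whether that matters depends on the input.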

edited Aug 15 '13 at 16:38

answered Aug 15 '13 at 16:29


score 2 · Answer 2 · answered Aug 15 '13 at 16:21

The regular expression needs to be put into a Python string:

f=re.sub(r"[0-9]+?(,[0-9])*?[0-9]+?(.[0-9])*?[TtBbMmKk]","",f)

The r prefix on the string makes this a raw string literal. This will cause all of the backslashes in the string to be interpreted literally, which means you do not need to escape the backslash when you want to use it in a regular expression (for example r'\w' to match a word character instead of '\\w').

So now you should at least be able to run this code and test the regular expression, although I am not sure whether its behavior is exactly what you want.
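To make the raw string point concrete, here is a small sketch (the sample text and variable names are only for illustration) comparing escaped and raw spellings of the same pattern:

```python
import re

# '\\d' and r'\d' are the same two-character string: a backslash and a "d".
assert '\\d' == r'\d'
assert len(r'\d') == 2

# The raw form matters most for escapes Python itself understands:
# '\b' is a single backspace character, while r'\b' is backslash + "b",
# which the regex engine reads as a word boundary.
assert len('\b') == 1
assert len(r'\b') == 2

text = 'pay 2000 now'
escaped = re.findall('\\b\\d+\\b', text)  # doubled backslashes, no r prefix
raw = re.findall(r'\b\d+\b', text)        # same pattern as a raw string
print(escaped, raw)  # ['2000'] ['2000']
```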

answered Aug 15 '13 at 16:21

Andrew Clark


You can really shorten this regex by replacing `[0-9]` with `\d`. – Aug 15 '13 at 16:29
@iCodez If you want to be on the safe side, you would keep it as `[0-9]`. :) – Jerry Aug 15 '13 at 16:34
@Jerry Why would that be safer? – Joseph Dunn Aug 15 '13 at 16:37
@Jerry That's a `c#` question. What does it have to do with Python? (Besides, since when does faster mean safer?) – Joseph Dunn Aug 15 '13 at 16:43
@Jerry That's not even the case in Python. – arshajii Aug 15 '13 at 16:46
@JosephDunn The question was asked for C#, but it is valid for _all_ regular expressions, across any language. If the OP has characters such as Persian digits, they will match as well. – Jerry Aug 15 '13 at 16:50
@arshajii Are you sure? Then I guess [this question](http://stackoverflow.com/q/6479423/1578604) is more relevant to you. – Jerry Aug 15 '13 at 16:50
On Python 3.x `\d` will match foreign language digits by default. On Python 2.x you would need to be operating on a unicode string with the `re.UNICODE` flag enabled. – Andrew Clark Aug 15 '13 at 16:59
@Jerry Sure, it's the case in Python 3 but not in Python 2. I guess I spoke too soon. – arshajii Aug 15 '13 at 17:00
@arshajii Well, you don't know the Python version the OP is using, so you'd be taking a risk! – Jerry Aug 15 '13 at 17:01
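The `\d` versus `[0-9]` difference discussed in these comments can be checked directly. This is a minimal sketch assuming Python 3 and `str` input; the sample text and variable names are made up for illustration:

```python
import re

# Assumes Python 3, where str patterns are Unicode-aware by default.
# The escapes below spell "2000" in Persian (Extended Arabic-Indic) digits.
text = 'ascii 2000 and persian \u06f2\u06f0\u06f0\u06f0'

unicode_digits = re.findall(r'\d+', text)                   # any Unicode decimal digit
ascii_only_flag = re.findall(r'\d+', text, flags=re.ASCII)  # \d limited to 0-9
ascii_only_class = re.findall(r'[0-9]+', text)              # explicit ASCII range

print(len(unicode_digits), ascii_only_flag, ascii_only_class)
# 2 ['2000'] ['2000']
```

So on Python 3 either `[0-9]` or the `re.ASCII` flag keeps matching restricted to ASCII digits.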
