I'm new to Python and I'm writing a bit of code that needs to take a block of text and remove anything that's not a dollar amount. For example, the number two thousand may be represented as `2000`, `2000.00`, `2,000`, or `2k`. I'm trying to accomplish this with a single regex replacement.

Right now I have:

f=re.sub([0-9]+?(,[0-9])*?[0-9]+?(.[0-9])*?[TtBbMmKk],"",f)

While I understand that this is completely incorrect and does not compile, I'm not proficient enough to know what to do about it. Can anyone give me some guidance? Thanks!

macklin
  • Well, you can start by wrapping the regex within single quotes. And add an `r` in front: `f=re.sub(r'[0-9]+?(,[0-9])*?[0-9]+?(.[0-9])*?[TtBbMmKk]',"",f)` (I cannot answer why there's the need for the `r` though). – Jerry Aug 15 '13 at 16:16
  • Maybe you could describe in more detail what you are trying to do? What sort of text is the input? Can you give us an example input and desired output string? – Brionius Aug 15 '13 at 16:20
  • @Jerry the `r` denotes a raw string literal, in which backslashes are not escape characters - it avoids the 'double backslash' phenomenon present in other languages. – roippi Aug 15 '13 at 16:21
  • @roippi Okay! Thank you very much :) – Jerry Aug 15 '13 at 16:25

2 Answers

Give this a shot:

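import re

blockOfText = 'two thousand may be represented as 2000 2000.00 2,000 and 2k'
# The pattern has two groups, so findall returns one tuple per match; x[0] is the full amount.
result = ' '.join(x[0] for x in re.findall(r'(\$?\d+([,.]\d+)?k?)', blockOfText))
print(result)  # 2000 2000.00 2,000 2k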

That gets you a new text string that you could assign to blockOfText if you want, effectively removing anything that's not a dollar amount.
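As a variation on the snippet above, here is a minimal sketch that also accepts several comma groups (such as `1,234,567.89`) and the magnitude suffixes listed in the question. The names `AMOUNT` and `keep_amounts` are hypothetical, and the pattern is only an assumption about what should count as an amount, not a complete currency parser:

```python
import re

# Hypothetical pattern (an assumption, not from the original answer):
# optional "$", digits, optional ",ddd" groups, optional decimal part,
# optional magnitude suffix using the letters from the question.
AMOUNT = re.compile(r'\$?\d+(?:,\d{3})*(?:\.\d+)?[KkMmBbTt]?')

def keep_amounts(text):
    """Return only the amount-like tokens of text, joined by single spaces."""
    # All groups are non-capturing, so findall returns the full matches as strings.
    return ' '.join(AMOUNT.findall(text))

sample = 'two thousand may be represented as 2000 2000.00 2,000 and 2k'
print(keep_amounts(sample))  # 2000 2000.00 2,000 2k
```

Because every group is non-capturing, `findall` yields plain strings and no tuple indexing is needed. Note that a digit directly followed by one of the suffix letters (for example `2bananas`) would still be picked up as `2b`, so the pattern may need tightening for real input.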

Joseph Dunn

The regular expression needs to be put into a Python string:

f=re.sub(r"[0-9]+?(,[0-9])*?[0-9]+?(.[0-9])*?[TtBbMmKk]","",f)

The r prefix on the string makes this a raw string literal. This will cause all of the backslashes in the string to be interpreted literally, which means you do not need to escape the backslash when you want to use it in a regular expression (for example r'\w' to match a word character instead of '\\w').
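A quick check, offered as an illustration rather than as part of the original answer, shows what the `r` prefix changes. The sample text `'worth 2k today'` is made up for the example:

```python
import re

# A raw literal keeps the backslash, so these two spellings are the same string.
assert r'\w' == '\\w'

# Without the r prefix, Python processes some escapes before re ever sees them.
assert len('\b') == 1    # a single backspace character
assert len(r'\b') == 2   # a backslash followed by "b"

# Only the raw form reaches the regex engine as a word-boundary assertion.
print(re.findall(r'\b2k\b', 'worth 2k today'))  # ['2k']
print(re.findall('\b2k\b', 'worth 2k today'))   # []
```

The second call finds nothing because its pattern contains literal backspace characters, which do not occur in the text.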

So now you should at least be able to run this code and test the regular expression, though I am not sure the behavior is exactly what you want.

Andrew Clark
  • You can really shorten this regex by replacing `[0-9]` with `\d`. –  Aug 15 '13 at 16:29
  • @iCodez If you want to be on the safe side, you would keep it as `[0-9]`. :) – Jerry Aug 15 '13 at 16:34
  • @Jerry Why would that be safer? – Joseph Dunn Aug 15 '13 at 16:37
  • @Jerry That's a `c#` question. What does it have to do with Python? (Besides, since when does faster mean safer?) – Joseph Dunn Aug 15 '13 at 16:43
  • @Jerry That's not even the case in Python. – arshajii Aug 15 '13 at 16:46
  • @JosephDunn The question was asked for C#, but it is valid for _all_ regular expressions, across any language. If the OP has characters such as persian digits, they will match as well. – Jerry Aug 15 '13 at 16:50
  • @arshajii Are you sure? Then I guess [this question](http://stackoverflow.com/q/6479423/1578604) is more relevant to you. – Jerry Aug 15 '13 at 16:50
  • On Python 3.x `\d` will match foreign language digits by default. On Python 2.x you would need to be operating on a unicode string with the `re.UNICODE` flag enabled. – Andrew Clark Aug 15 '13 at 16:59
  • @Jerry Sure, it's the case in Python 3 but not in Python 2. I guess I spoke too soon. – arshajii Aug 15 '13 at 17:00
  • @arshajii Well, you don't know the Python version the OP is using, so you'd be taking a risk! – Jerry Aug 15 '13 at 17:01
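
The Python 3 behavior discussed in these comments can be checked directly. This is a small sketch for Python 3 only; the sample string of Persian digits is made up for illustration:

```python
import re

# Persian digits for 1, 2, 3, written as escapes (illustrative sample).
persian = '\u06f1\u06f2\u06f3'

# On Python 3, \d in a str pattern matches any Unicode decimal digit.
print(re.fullmatch(r'\d+', persian) is not None)            # True
# [0-9] only ever matches the ten ASCII digits.
print(re.fullmatch(r'[0-9]+', persian) is not None)         # False
# The re.ASCII flag restricts \d to [0-9] as well.
print(re.fullmatch(r'\d+', persian, re.ASCII) is not None)  # False
```

So on Python 3, either `[0-9]` or `\d` combined with `re.ASCII` keeps the match limited to ASCII digits.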