Python and Regex to convert written numbers to numeric

Question

I am trying to convert written numbers to numeric values.

For example, to extract millions from this string:

text = 'I need $ 150000000, or 150 million,1 millions, 15 Million, 15million, 15Million, 15 m, 15 M, 15m, 15M, 15 MM, 15MM, 5 thousand'

To:

'I need $ 150000000, or 150000000,1000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 5 thousand'

I use this function to remove any separators in the numbers first:

import re

def foldNumbers(text):
    """Remove "," or "." separators that sit between digits."""
    text = re.sub(r'(?<=[0-9]),(?=[0-9])', "", text)  # remove commas
    text = re.sub(r'(?<=[0-9])\.(?=[0-9])', "", text)  # remove points
    return text

And I have written this regex to find all of the possible patterns for common million notations. It 1) finds digits, 2) looks ahead for a common notation for millions, and 3) uses "[a-z]?" to handle the optional "s" on million or millions, since I have already removed any "'".

re.findall(r'(?:[\d\.]+)(?= million[a-z]?|million[a-z]?| Million[a-z]?|Million[a-z]?|m| m|M| M|MM| MM)',text)

which correctly matches Million numbers and returns:

['150', '1', '15', '15', '15', '15', '15', '15', '15', '15', '15']

What I need to do now is to write a replacement pattern to insert "000000" after the digits, or to iterate through and multiply the digits by 1000000. I have tried this so far:

re.sub(r'(?:[\d\.]+)(?= million[a-z]?|million[a-z]?| Million[a-z]?|Million[a-z]?|m| m|M| M|MM| MM)', "000000 ", text)

which returns:

'I need $ 150,000,000, or 000000  million,000000  millions, 000000  Million, 000000 million, 000000 Million, 000000  m, 000000  M, 000000 m, 000000 M, 000000  MM, 000000 MM, 5 thousand'

I think I need to do a lookbehind (?<=), however I haven't worked with this before and after several attempts I can't seem to work it through.

FYI: My plan is to tackle "Millions" first and then to replicate the solution for Thousands (K), Billions (B), Trillions (T) and possibly for other units such as distances, currencies etc. I have searched SO and Google for solutions in NLP, text cleaning and text mining articles but did not find anything.

Having done a bit of text parsing, I'd be tempted to use regex to simply tokenize the input string and then work through the individual tokens. That might be easier than lookbehind regexes — user783836, Dec 08 '18 at 09:46

CertainPerformance · Accepted Answer · 2018-12-08T09:56:16.153

1

You can accomplish this with a relatively simple re.sub: match

(?i)\b(\d+) ?m(?:m|illions?)?\b

capturing the initial digits in a group, and replace with that group concatenated with 6 zeros:

r'\g<1>000000'

https://regex101.com/r/IedRP4/1

Code:

import re

text = 'I need $ 150000000, or 150 million,1 millions, 15 Million, 15million, 15Million, 15 m, 15 M, 15m, 15M, 15 MM, 15MM, 5 thousand'
output = re.sub(r'(?i)\b(\d+) ?m(?:m|illions?)?\b', r'\g<1>000000', text)

(because the group in the replacement is followed by digits, make sure to use \g<#> syntax rather than \# syntax)
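
To illustrate that last point, here is a minimal sketch comparing the two replacement syntaxes on a short sample string (the sample string and the variable names good and bad are only for illustration). With \1 the digits that follow are read as part of the escape itself, so depending on how the template is parsed the call either raises re.error or produces something other than the intended number:

```python
import re

pattern = r'(?i)\b(\d+) ?m(?:m|illions?)?\b'

# \g<1> ends the group reference explicitly, so the zeros stay literal.
good = re.sub(pattern, r'\g<1>000000', '15 million')
print(good)  # 15000000

# With \1 the following digits are absorbed into the escape itself,
# so this either raises re.error or yields an unintended string.
try:
    bad = re.sub(pattern, r'\1000000', '15 million')
except re.error as exc:
    bad = f're.error: {exc}'
print(bad)
```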

edited Dec 08 '18 at 09:56

answered Dec 08 '18 at 09:50


Thank you. Out of interest, what does the case insensitive (?i) add? – BenP Dec 08 '18 at 09:56

It allows the pattern to be much more concise when you don't have to alternate between all possible capitalization options – CertainPerformance Dec 08 '18 at 09:56
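
Following the same approach, the question's plan to cover thousands, billions and trillions could be sketched with a replacement function that multiplies the captured digits instead of appending zeros. The names MULTIPLIERS, UNIT_RE and expand_units are hypothetical, and the single-letter abbreviations (k, b, t) are assumptions about the input, so treat this as a sketch rather than a finished solution:

```python
import re

# Hypothetical suffix table; the abbreviations are assumptions about the input.
MULTIPLIERS = {
    'k': 10**3, 'thousand': 10**3,
    'm': 10**6, 'mm': 10**6, 'million': 10**6,
    'b': 10**9, 'billion': 10**9,
    't': 10**12, 'trillion': 10**12,
}

# Digits, an optional space, then either a spelled-out unit (optional plural "s")
# or a short abbreviation. (?i) makes the whole pattern case-insensitive.
UNIT_RE = re.compile(
    r'(?i)\b(\d+) ?(?:(thousand|million|billion|trillion)s?|(mm|[kmbt]))\b'
)

def expand_units(text):
    """Replace e.g. '15 million' or '2k' with the plain integer."""
    def repl(match):
        unit = (match.group(2) or match.group(3)).lower()
        return str(int(match.group(1)) * MULTIPLIERS[unit])
    return UNIT_RE.sub(repl, text)

print(expand_units('15 MM, 5 thousand, 2k, 3 Billion'))
# 15000000, 5000, 2000, 3000000000
```

Single-letter suffixes can collide with other units (for example "5 m" meaning metres), so the table may need narrowing for real text.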
