
I am trying to convert written numbers to numeric values.

For example, to extract millions from this string:

text = 'I need $ 150000000, or 150 million,1 millions, 15 Million, 15million, 15Million, 15 m, 15 M, 15m, 15M, 15 MM, 15MM, 5 thousand'

To:

'I need $ 150000000, or 150000000,1000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 5 thousand'

I use this function to remove any separators in the numbers first:

import re

def foldNumbers(text):
    """Remove "," and "." separators that sit between digits."""
    text = re.sub(r'(?<=[0-9]),(?=[0-9])', "", text)   # remove commas
    text = re.sub(r'(?<=[0-9])\.(?=[0-9])', "", text)  # remove points
    return text

And I have written this regex to find all of the possible patterns for common million notations. It 1) finds digits, 2) looks ahead for a common notation for millions, and 3) uses "[a-z]?" to handle the optional "s" on million or millions (I have already removed "'").

re.findall(r'(?:[\d\.]+)(?= million[a-z]?|million[a-z]?| Million[a-z]?|Million[a-z]?|m| m|M| M|MM| MM)',text)

which correctly matches Million numbers and returns:

['150', '1', '15', '15', '15', '15', '15', '15', '15', '15', '15']

What I need to do now is write a replacement pattern that inserts "000000" after the digits, or iterate through the matches and multiply the digits by 1000000. I have tried this so far:

re.sub(r'(?:[\d\.]+)(?= million[a-z]?|million[a-z]?| Million[a-z]?|Million[a-z]?|m| m|M| M|MM| MM)', "000000 ", text)

which returns:

'I need $ 150,000,000, or 000000  million,000000  millions, 000000  Million, 000000 million, 000000 Million, 000000  m, 000000  M, 000000 m, 000000 M, 000000  MM, 000000 MM, 5 thousand'

I think I need a lookbehind (?<=), but I haven't worked with these before, and after several attempts I can't seem to work it out.

FYI: My plan is to tackle "Millions" first and then replicate the solution for Thousands (K), Billions (B), Trillions (T), and possibly for other units such as distances, currencies, etc. I have searched SO and Google for solutions in NLP, text cleaning and text mining articles but did not find anything.

BenP
  • Having done a bit of text parsing, I'd be tempted to use regex to simply tokenize the input string and then work through the individual tokens. That might be easier than lookbehind regexes – user783836 Dec 08 '18 at 09:46

1 Answer


You can accomplish this with a relatively simple re.sub: match

(?i)\b(\d+) ?m(?:m|illions?)?\b

capturing the initial digits in a group, and replace with that group concatenated with 6 zeros:

r'\g<1>000000'

https://regex101.com/r/IedRP4/1

Code:

import re

text = 'I need $ 150000000, or 150 million,1 millions, 15 Million, 15million, 15Million, 15 m, 15 M, 15m, 15M, 15 MM, 15MM, 5 thousand'
output = re.sub(r'(?i)\b(\d+) ?m(?:m|illions?)?\b', r'\g<1>000000', text)

(Because the group reference in the replacement is immediately followed by digits, use the \g<1> syntax rather than \1; written as \1000000, it would not be read as group 1 followed by six zeros.)
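The question also mentions multiplying the digits and later covering thousands and billions. One way that could look is to pass a function to re.sub instead of a replacement string. This is only a sketch under assumptions: the MULTIPLIERS table, UNIT_PATTERN and expand_units are hypothetical names that do not appear in the question, the choice of suffixes is a guess at what the text contains, and the input is assumed to hold integers only (for example after foldNumbers has run).

```python
import re

# Hypothetical unit table (an assumption, not from the question):
# lower-cased suffix -> multiplier.
MULTIPLIERS = {
    "k": 10**3, "thousand": 10**3,
    "m": 10**6, "mm": 10**6, "million": 10**6,
    "b": 10**9, "billion": 10**9,
}

# Digits, an optional space, then one of the known suffixes.
# The plural "s" is only allowed on the spelled-out words.
UNIT_PATTERN = re.compile(
    r'\b(\d+) ?(thousands?|millions?|billions?|mm|k|m|b)\b',
    re.IGNORECASE,
)

def expand_units(text):
    """Replace '<digits> <unit>' with the multiplied-out integer."""
    def repl(match):
        digits = match.group(1)
        unit = match.group(2).lower().rstrip("s")  # 'millions' -> 'million'
        return str(int(digits) * MULTIPLIERS[unit])
    return UNIT_PATTERN.sub(repl, text)

text = ('I need $ 150000000, or 150 million,1 millions, 15 Million, '
        '15million, 15Million, 15 m, 15 M, 15m, 15M, 15 MM, 15MM, 5 thousand')
print(expand_units(text))
```

Note that a bare suffix is ambiguous: "15 m" could just as well mean metres, so the short forms may need to be dropped or restricted depending on the source text.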

CertainPerformance