1

I'm trying to extract the company names from press releases. As an example, below there is a snippet (in French) of a press release containing a list of seven companies ending in .inc.

En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc.

I'm trying to extract all the names using the following code:

aa = re.findall('inc\.,? (.*?inc\.)', text)

I do manage to capture quite a few, but for some reason I can't figure out, I can't extract them all. It seems trivial, but it has stumped me for a few hours.

Any help is appreciated!

ctwheels
TheTurp
  • 1
    Uh, do you mean ending in `'inc.'`? – juanpa.arrivillaga Nov 15 '17 at 19:57
  • Why are you trying to match something containing "inc." *twice*? – hobbs Nov 15 '17 at 19:57
  • 3
    Also, this doesn't look trivial to me. How do we match `"Asphalte Vrac Transport inc.,"` instead of `"Transport inc."` or even `"dont Asphalte Vrac Transport inc."`...? – juanpa.arrivillaga Nov 15 '17 at 19:58
  • Are all press releases of this general format? If so, we could use `str.split(',')` – Joe Iddon Nov 15 '17 at 20:00
  • Possibly a more suitable task for named entity recognition? https://en.wikipedia.org/wiki/Named-entity_recognition – omijn Nov 15 '17 at 20:03
  • There's also edge-cases like: `et Transport Vrac Globe International inc.`. You might think checking for title-case would help, but then there's also `Les entreprises Luc Clément inc.` (unless that's just a typo). – ekhumoro Nov 15 '17 at 20:05
  • Yes indeed, it is complicated, I was basically trying to just extract the text between the different "inc." blocks as a first step, I could clean later. But even this "simple" task doesn't work. – TheTurp Nov 15 '17 at 20:11
  • @TheTurp. What is the format of the press releases? Do they include any markup, or are they just plain text? – ekhumoro Nov 15 '17 at 20:19
  • @ekhumoro Here's the press release [link](http://www.revenuquebec.ca/fr/salle-de-presse/communiques/ev-fisc/2013/4nov.aspx) I extract them using beautiful soup, then extract the text, and try to find all the companies who have been convicted. – TheTurp Nov 15 '17 at 20:26

5 Answers

6

Brief

Using the regex module (instead of re) you can use this solution.


Code

Option 1

This is the original regex. It only matches names ending in inc. and does not allow company names that contain et. See Option 2 for a more comprehensive regular expression.

See regex in use here

[\p{Lu}\p{N}](?:(?!et)[^,])*inc\.

Option 2

For a more comprehensive regular expression that also checks for other company entities such as ltd. or sons, you can use the following regex.

See regex in use here

(?:et|,)[^,]*?([\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.))

Note: In some flavours of regex you can use the \K token. This token resets the starting point of the reported match (any previously consumed characters are no longer included in the final match). If your regex engine supports the \K token (and doesn't convert it to a literal K), you can use the following (effectively eliminating the need for capture groups).

See regex in use here

(?:et|,)[^,]*?\K[\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.)
              ^^

Results

Input

En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc.

Output

Asphalte Vrac Transport inc.
9163-6704 Québec inc.
Entreprise Denis Dupré inc.
Gestion Jean M. Machado inc.
Impact Technologie Environnementale inc.
Les entreprises Luc Clément inc.
Transport Vrac Globe International inc.

Explanation

Option 1

  • [\p{Lu}\p{N}] Match any character in the set: \p{Lu} (any uppercase letter in any language, which covers accented French capitals) or \p{N} (any Unicode number, for numbered companies)
  • (?:(?!et)[^,])* Match the following any number of times (tempered greedy token)
    • (?!et) Negative lookahead ensuring what follows does not match et literally
    • [^,] Match any character except comma , literally
  • inc\. Match inc. literally
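
A minimal sketch of running Option 1 from Python. It uses the third-party regex package when it is installed and otherwise falls back to the standard re module with a hand-written Latin-1 class standing in for [\p{Lu}\p{N}]. The fallback class and the variable names are assumptions made for illustration, not part of the original answer.

```python
import re

try:
    import regex as engine  # third-party package: pip install regex
    UPPER_OR_DIGIT = r"[\p{Lu}\p{N}]"
except ImportError:
    engine = re
    # Assumption: a Latin-1 stand-in for \p{Lu} and \p{N}, enough for French text.
    UPPER_OR_DIGIT = "[A-ZÀ-ÖØ-Þ0-9]"

text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., "
    "Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. "
    "et Transport Vrac Globe International inc."
)

# Uppercase letter or digit, then any run of non-comma characters that never
# starts the sequence "et", then a literal "inc.".
pattern = UPPER_OR_DIGIT + r"(?:(?!et)[^,])*inc\."

option1_names = engine.findall(pattern, text)
for name in option1_names:
    print(name)
```

The pattern has no capture groups, so findall returns the full matches.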

Option 2

  • (?:et|,) Match either et or comma , literally
  • [^,]*? Match any character not present in the set (any character except a comma ,) any number of times, but as few as possible
  • ([\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.)) Capture the following into capture group 1
    • [\p{Lu}\p{N}] Match any Unicode uppercase character or Unicode number (for number companies)
    • [^,]*? Match any character not present in the set (any character except a comma ,) any number of times, but as few as possible
    • \s Match a whitespace character
    • (?:inc\.|sons|ltd\.) Match either of the following
      • inc\. Match inc. literally
      • sons Match sons literally
      • ltd\. Match ltd. literally
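
Option 2 can be tried from Python in the same way. This is a sketch under the same assumptions as before: the regex package if available, otherwise re with a Latin-1 stand-in class; the shortened sample string and the variable names are illustrative only.

```python
import re

try:
    import regex as engine  # third-party package: pip install regex
    UPPER_OR_DIGIT = r"[\p{Lu}\p{N}]"
except ImportError:
    engine = re
    # Assumption: a Latin-1 stand-in for \p{Lu} and \p{N}.
    UPPER_OR_DIGIT = "[A-ZÀ-ÖØ-Þ0-9]"

# A comma or "et" introduces each name; group 1 holds the name itself.
pattern = engine.compile(
    r"(?:et|,)[^,]*?(" + UPPER_OR_DIGIT + r"[^,]*?\s(?:inc\.|sons|ltd\.))"
)

sample = (
    ", dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Les entreprises Luc Clément inc. et Transport Vrac Globe International inc."
)
special = ", Étoile Simpsons et sons, Étoile Simpsons inc., Étoile et Simpsons inc."

# With exactly one capture group, findall returns the group-1 strings.
sample_names = pattern.findall(sample)
special_names = pattern.findall(special)
print(sample_names)
print(special_names)
```

Because the name sits in a capture group, the leading comma or et that introduces it is left out of the result.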

Notes

Regex module vs re

Using the regex module allows us to use Unicode character classes such as \p{Lu}, so we also catch company names beginning with uppercase Unicode characters such as É.

Catching Special Cases

The regular expression links (under Code) include an additional string to test against:

, Étoile Simpsons et sons, Étoile Simpsons inc., Étoile et Simpsons inc.

With this additional line added only the following strings should be caught (valid company name according to the OP's specifications):

  • Étoile Simpsons et sons
  • Étoile Simpsons inc.
  • Étoile et Simpsons ltd.

This presents a few challenges including:

  • Company name begins with uppercase Unicode character É.
    • This means we must support Unicode uppercase letters, so something like [A-Z] cannot be used to ensure that a name begins with an uppercase character.
  • Company ends with sons, but also includes sons (cannot stop at first match for sons).
    • Take the case of Étoile Simpsons et sons for example.
      • This should not end at sons in Simpsons. A natural instinct (at least in regex) might be to use \b to assert a word boundary. As much as this might be the preferred method, it doesn't work in this case. Take the French word blésons as an example. Using \b will actually match in blésons since regex engines very seldom match \b correctly with Unicode characters even with u flag enabled (this is why I use \s instead).
  • The word sons appears after the company name ends (in the sentence Their sons et sons, les sons.). It must not extend past the company name's ending.
    • This is a great case for using lazy quantifiers i.e. .*?. Making it lazy will allow it to stop at the first match instead of matching the whole sentence incorrectly.
  • The string Their sons et sons, les sons. contains all the parts of a valid company name (a word starting with an uppercase character, followed by the word sons), but this should not match as it's not a company name.
    • Since the OP specified a , before each company name, I use this to determine what is and is not a company name.
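
How \b treats accented letters depends on the engine and its flags, so it is worth checking in the environment actually used. As a rough illustration in Python 3's built-in re module (an assumption about the reader's setup; the online tester's flavour may behave differently), \b is Unicode-aware for str patterns by default and only misfires on blésons in ASCII-only mode, while \s gives the same answer in both modes:

```python
import re

# No whitespace precedes "sons" inside "Simpsons", so \s rejects it.
inside_word = re.search(r"\ssons", "Étoile Simpsons")
# A standalone " sons" is found.
standalone = re.search(r"\ssons", "Étoile Simpsons et sons")

# Default str matching: é counts as a word character, so there is no
# word boundary between é and s.
unicode_hit = re.search(r"\bsons", "blésons")
# ASCII-only matching: é is treated as a non-word character, so \b
# matches between é and s.
ascii_hit = re.search(r"\bsons", "blésons", flags=re.ASCII)

print(inside_word, standalone, unicode_hit, ascii_hit)
```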
ctwheels
  • 2
    First of all thank you. Then, wow, you are like a regex jedi master, this is amazing, can't believe how fast you have put this up too. It will take me some time to even wrap my head around your "tempered greedy token", but this looks so promising for all the cases with companies ending in "ltd" or sometimes "sons" ! – TheTurp Nov 15 '17 at 20:23
  • @TheTurp I edited my answer to include more details and *slightly* changed the regex. – ctwheels Nov 15 '17 at 21:55
1

This pattern appears to do the trick:

   >>> import re
   >>> string = """En effet, Revenu Québec avait des motifs raisonnables de croire que ces entreprises avaient utilisé de fausses factures provenant de plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. et Transport Vrac Globe International inc."""
   >>> pattern = r'((?:[A-Z0-9\-]\.?\w*\s?(?:[a-z0-9\-]\w*\s?)?)+ inc\.)'
   >>> m = re.findall(pattern, string)
   >>> print('\n'.join(m))

   Asphalte Vrac Transport inc.
   9163-6704 Québec inc.
   Entreprise Denis Dupré inc.
   Gestion Jean M. Machado inc.
   Impact Technologie Environnementale inc.
   Les entreprises Luc Clément inc.
   Transport Vrac Globe International inc.

Explanation:

   [A-Z0-9\-] # match an uppercase letter or number or dash
   \.?        # match optional dot
   \w*        # match alpha-numeric chars 0 or more times
   \s?        # match optional white-space

   (?:[a-z0-9\-]\w*\s?)? # same again except with lowercase letters
                         # the ? means 0 or 1 times

    inc\.     # match ' inc.'
   (?: ... )  # non-capturing group
   ( ... )    # capturing group (whole thing)
   x?          # match x optional
   x*          # in this case match x 0 or more times
   x+          # match x 1 or more times
Totem
0

In this case, you can avoid using a regex, instead try:

text.split(",")

and then iterate through the resulting list and look for entries ending in "inc.".
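
A minimal sketch of that idea, with two cleanup steps that are assumptions on top of the answer: each chunk is also split on " et " (the last two companies are joined that way), and leading words that start with neither an uppercase letter nor a digit (such as "dont") are dropped. The function name split_companies is hypothetical.

```python
def split_companies(text):
    """Split on commas and ' et ', keep the pieces that end in 'inc.'."""
    names = []
    for chunk in text.split(","):
        for piece in chunk.split(" et "):
            piece = piece.strip()
            if not piece.endswith("inc."):
                continue
            words = piece.split()
            # Drop leading words such as "dont" that cannot start a name.
            while words and not (words[0][0].isupper() or words[0][0].isdigit()):
                words.pop(0)
            names.append(" ".join(words))
    return names


text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., "
    "Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. "
    "et Transport Vrac Globe International inc."
)

split_names = split_companies(text)
for name in split_names:
    print(name)
```

Splitting on " et " would cut a name that itself contains et, so this only suits lists shaped like the sample.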

Joe Iddon
Usernamenotfound
  • 1
    Unless the text is different than the example and the commas don't correlate well with company names. – wwii Nov 15 '17 at 20:01
  • Yes indeed, sometimes the companies end in "Sons" or "Ltd". And sometimes it's only one company that is mentioned, for that case, my regex works for now though. – TheTurp Nov 15 '17 at 20:10
0
aa = [s.strip() for s in text.split(',') if s.lower().endswith(' inc.')]
Gamaliel
0

Bit late to the party since an answer has already been accepted, but anyway, here's a solution that uses Python's built-in re module rather than the third-party regex module.
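
For context, here is a short sketch of why the pattern in the question returns only some of the names. re.findall returns non-overlapping matches, and each match consumes an inc. at its start and another at its end, so the inc. that ends one match is no longer available to start the next. The lookbehind variant is only an illustration of that point, not a recommended solution.

```python
import re

text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., "
    "Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. "
    "et Transport Vrac Globe International inc."
)

# The question's pattern: every match swallows two "inc." occurrences,
# so only every other name is captured.
attempt = re.findall(r"inc\.,? (.*?inc\.)", text)
print(attempt)

# Checking the preceding "inc." with a lookbehind does not consume it,
# so consecutive names can all be captured (the first name is still missed
# and the last one keeps its leading "et ").
with_lookbehind = re.findall(r"(?<=inc\.),? (.*?inc\.)", text)
print(with_lookbehind)
```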

Your attempt correctly anchors the end of the company name on inc. but you need some way to capture the start of the name. Let's define a company name as:

  1. A word starting with a capital letter or a number, followed by,
  2. Optionally one or more additional words, since a firm may have a one-word name. These need not start with an uppercase letter. Then, finally,
  3. inc.

Further, we'll define a word as a string of letters and/or numbers possibly containing one or more hyphens. Normally we would use \w to represent a word character, but that doesn't include hyphens, so we'll need to match that separately.

So:

  1. A word starting with a capital letter or a number: [A-Z0-9](?:\w|-)*
  2. Zero or more additional words, each denoted as: (?:\w|-)+
  3. inc\.

Words are separated by white space, which we will denote as \s+. So for #2's "optional one or more words" we must create a group that includes one or more word characters (including hyphen) followed by one or more space characters, and repeat that group zero or more times: (?:(?:\w|-)+\s+)*

So, putting it all together and adding \b at the start to make sure it starts with a whole word:

re.findall(r"\b[A-Z0-9](?:\w|-)*\s+(?:(?:\w|-)+\s+)*inc\.", text)
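
A quick check of the assembled pattern against the sample from the question, offered as a sketch rather than as part of the original answer. One thing it surfaces: the initial M. ends in a period, which is neither \w nor a hyphen, so the match for the fourth company starts at Machado. The second pattern is a hypothetical adjustment: it allows an optional period after each word and makes the repetition lazy so that it stops at the first inc.

```python
import re

text = (
    "En effet, Revenu Québec avait des motifs raisonnables de croire que ces "
    "entreprises avaient utilisé de fausses factures provenant de plusieurs "
    "sociétés, dont Asphalte Vrac Transport inc., 9163-6704 Québec inc., "
    "Entreprise Denis Dupré inc., Gestion Jean M. Machado inc., "
    "Impact Technologie Environnementale inc., Les entreprises Luc Clément inc. "
    "et Transport Vrac Globe International inc."
)

# The pattern as assembled above.
assembled = re.findall(r"\b[A-Z0-9](?:\w|-)*\s+(?:(?:\w|-)+\s+)*inc\.", text)
print(assembled[3])

# Hypothetical variant: words may end in a period, and the word group is
# repeated lazily so the match ends at the first "inc.".
variant = re.findall(r"\b[A-Z0-9][\w-]*\.?\s+(?:[\w-]+\.?\s+)*?inc\.", text)
for name in variant:
    print(name)
```

Both patterns still rely on [A-Z0-9] for the first character, so a name starting with an accented capital such as É would need that class extended.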

To extend this so you can also catch names ending with Ltd. or Sons and to also catch capitalized Inc. and make the period optional:

re.findall(r"\b[A-Z0-9](?:\w|-)*\s+(?:(?:\w|-)+\s+)*(?:[Ii]nc|[Ll]td|[Ss]ons)(?:\.|\b)", text)
kindall