Brief
Using the regex module (instead of re), you can use this solution.
Code
Option 1
This is the original regex and only matches inc. It also doesn't allow company names that contain et. See Option 2 for a more comprehensive regular expression.
See regex in use here
[\p{Lu}\p{N}](?:(?!et)[^,])*inc\.
Option 2
For a more comprehensive regular expression that also checks for other company entities such as ltd. or sons, you can use the following regex.
See regex in use here
(?:et|,)[^,]*?([\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.))
Note: In some flavours of regex you can use the \K token. This token resets the starting point of the reported match (any previously consumed characters are no longer included in the final match). If your regex engine supports the \K token (and doesn't convert it to a literal K), you can use the following (effectively eliminating the need for capture groups).
See regex in use here
(?:et|,)[^,]*?\K[\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.)
              ^^
Results
Input
En effet, Revenu Québec avait des motifs raisonnables de croire que
ces entreprises avaient utilisé de fausses factures provenant de
plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704
Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado
inc., Impact Technologie Environnementale inc., Les entreprises Luc
Clément inc. et Transport Vrac Globe International inc.
Output
Asphalte Vrac Transport inc.
9163-6704 Québec inc.
Entreprise Denis Dupré inc.
Gestion Jean M. Machado inc.
Impact Technologie Environnementale inc.
Les entreprises Luc Clément inc.
Transport Vrac Globe International inc.
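As a rough check, the output above can be reproduced with a short Python sketch of Option 1. It assumes the third-party regex module is installed. The fallback to re with a Latin-1-only character class is my own addition so the snippet still runs without it, and the function name find_companies is hypothetical. Whitespace is collapsed first because the input wraps in the middle of some company names.

```python
try:
    import regex as engine  # third-party module; supports \p{Lu} and \p{N}
    START = r"[\p{Lu}\p{N}]"
except ImportError:
    import re as engine  # assumption: fallback for illustration only
    START = r"[A-ZÀ-ÖØ-Þ0-9]"  # rough Latin-1 stand-in, NOT equivalent to [\p{Lu}\p{N}]

# Option 1: an uppercase letter or number, then any run of characters that
# are not a comma and do not begin "et", ending in a literal "inc."
OPTION_1 = engine.compile(START + r"(?:(?!et)[^,])*inc\.")

TEXT = """En effet, Revenu Québec avait des motifs raisonnables de croire que
ces entreprises avaient utilisé de fausses factures provenant de
plusieurs sociétés, dont Asphalte Vrac Transport inc., 9163-6704
Québec inc., Entreprise Denis Dupré inc., Gestion Jean M. Machado
inc., Impact Technologie Environnementale inc., Les entreprises Luc
Clément inc. et Transport Vrac Globe International inc."""


def find_companies(text):
    """Return every Option 1 match after collapsing runs of whitespace."""
    flat = " ".join(text.split())
    return OPTION_1.findall(flat)


companies = find_companies(TEXT)
for name in companies:
    print(name)
```

Collapsing whitespace is a design choice of this sketch, not part of the regex itself: without it, matches such as 9163-6704 Québec inc. would contain the line break from the input.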
Explanation
Option 1
[\p{Lu}\p{N}]
Match any character in the set: \p{Lu} (any uppercase letter in any language, which covers accented French capitals) or \p{N} (any Unicode number, for number companies)
(?:(?!et)[^,])*
Match the following any number of times (tempered greedy token)
  (?!et)
  Negative lookahead ensuring what follows does not match et literally
  [^,]
  Match any character except comma , literally
inc\.
Match inc. literally
Option 2
(?:et|,)
Match either et or comma , literally
[^,]*?
Match any character not present in the set (any character except comma ,) any number of times, but as few as possible
([\p{Lu}\p{N}][^,]*?\s(?:inc\.|sons|ltd\.))
Capture the following into capture group 1
  [\p{Lu}\p{N}]
  Match any Unicode uppercase character or Unicode number (for number companies)
  [^,]*?
  Match any character not present in the set (any character except comma ,) any number of times, but as few as possible
  \s
  Match a whitespace character
  (?:inc\.|sons|ltd\.)
  Match either of the following
    inc\.
    Match inc. literally
    sons
    Match sons literally
    ltd\.
    Match ltd. literally
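The breakdown above can be exercised with a minimal Python sketch. It assumes the third-party regex module is available. The fallback to re with a Latin-1-only class is my own addition for portability, the function name company_names is hypothetical, and the sample strings are modelled on the special cases discussed in the notes rather than taken from the linked demos.

```python
try:
    import regex as engine  # third-party module; supports \p{Lu} and \p{N}
    START = r"[\p{Lu}\p{N}]"
except ImportError:
    import re as engine  # assumption: fallback for illustration only
    START = r"[A-ZÀ-ÖØ-Þ0-9]"  # rough Latin-1 stand-in, NOT equivalent to [\p{Lu}\p{N}]

# Option 2: a leading "et" or comma, a lazy run of non-comma characters, then
# capture group 1 = uppercase/number start ... whitespace ... entity suffix.
OPTION_2 = engine.compile(
    r"(?:et|,)[^,]*?(" + START + r"[^,]*?\s(?:inc\.|sons|ltd\.))"
)


def company_names(text):
    """Return the group 1 captures (findall yields them when there is one group)."""
    return OPTION_2.findall(text)


sample = ", Étoile Simpsons et sons, Étoile Simpsons inc., Étoile et Simpsons ltd."
print(company_names(sample))
print(company_names("Their sons et sons, les sons."))
```

The second call returns an empty list: no uppercase character follows an et or a comma before the next comma, so the sentence is never treated as a company name.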
Notes
Regex module vs re
Using the regex module allows us to use Unicode character classes such as \p{Lu} to ensure we also catch the possibility of company names beginning with uppercase Unicode characters such as É.
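To see why [A-Z] falls short, here is a small standard-library sketch. It does not use the regex module at all. Instead it approximates what \p{Lu} and \p{N} test for by looking up each character's Unicode general category, and the helper name is_upper_or_number is hypothetical.

```python
import re
import unicodedata


def is_upper_or_number(ch):
    # Rough per-character equivalent of the class [\p{Lu}\p{N}]:
    # "Lu" is the uppercase-letter category; Nd, Nl and No are the number categories.
    category = unicodedata.category(ch)
    return category == "Lu" or category.startswith("N")


ascii_only = re.match(r"[A-Z]", "É")  # [A-Z] covers only the 26 ASCII capitals

print(ascii_only)               # None
print(is_upper_or_number("É"))  # True
print(is_upper_or_number("9"))  # True
print(is_upper_or_number("é"))  # False
```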
Catching Special Cases
The regular expression links (under Code) include an additional string to test against:
, Étoile Simpsons et sons, Étoile Simpsons inc., Étoile et Simpsons ltd.
With this additional line added, only the following strings should be caught (valid company names according to the OP's specifications):
Étoile Simpsons et sons
Étoile Simpsons inc.
Étoile et Simpsons ltd.
This presents a few challenges including:
- Company name begins with uppercase Unicode character É.
  - This means we must ensure Unicode uppercase letter compatibility, thus using something like [A-Z] is not possible for ensuring a name begins with an uppercase character.
- Company ends with sons, but also includes sons (cannot stop at first match for sons).
  - Take the case of Étoile Simpsons et sons for example.
  - This should not end at sons in Simpsons. A natural instinct (at least in regex) might be to use \b to assert a word boundary. As much as this might be the preferred method, it doesn't work in this case. Take the French word blésons as an example. Using \b will actually match in blésons since many regex engines treat \b as ASCII-only (JavaScript, for instance, even with the u flag enabled), so é counts as a non-word character and a boundary is found before sons (this is why I use \s instead).
- The word sons appears after the company name ends (in the sentence Their sons et sons, les sons.). It must not extend past the company name's ending.
  - This is a great case for using lazy quantifiers, i.e. .*?. Making it lazy will allow it to stop at the first match instead of matching the whole sentence incorrectly.
- The string Their sons et sons, les sons. contains all the parts of a valid company name (a word starting with an uppercase character, followed by the word sons), but this should not match as it's not a company name.
  - Since the OP specified a , before each company name, I use this to determine what is and is not a company name.
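The \b pitfall from the list above can be illustrated with Python's built-in re module. Python's own \b is Unicode-aware for str patterns, so the sketch passes re.ASCII to imitate an engine whose \b only knows ASCII word characters. That flag is my stand-in for such engines, and both function names are hypothetical.

```python
import re


def has_sons_by_boundary(text):
    # re.ASCII restricts \w (and therefore \b) to ASCII, imitating engines
    # whose word boundary is not Unicode-aware.
    return re.search(r"\bsons\b", text, flags=re.ASCII) is not None


def has_sons_by_whitespace(text):
    # The approach used in Option 2: require whitespace before the suffix.
    return re.search(r"\ssons", text) is not None


print(has_sons_by_boundary("blésons"))                    # True (false positive)
print(has_sons_by_whitespace("blésons"))                  # False
print(has_sons_by_whitespace("Étoile Simpsons et sons"))  # True
```

With an ASCII-only \b, é is a non-word character, so a boundary appears between é and s. Requiring \s avoids depending on how a given engine defines word characters.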