
Disclaimer: this question has been redone, so comments and answers may appear unrelated. I apologize, but I did it for the sake of a clearer and better structured question.

Suppose I have a string in which I want to find two different groups of names: group A satisfies condition 1 only, while group B satisfies both condition 1 and condition 2.

To put it in an example, say I have this mathematical function:

'[class.parameterA] * numpy.exp( [x]*module.constantA - constant_B/[x] ) + [parameter_B]'

Here I control the values of the parameters but not those of the constants. I want to get (using re.findall()) one group for the constants and one group for the parameters.

>>> group1
['numpy.exp', 'module.constantA', 'constant_B']
>>> group2
['class.parameterA', 'x', 'x', 'parameter_B']

I know that for this specific case I shouldn't match numpy.exp, but for the sake of the question's purpose, I allow it to be a match.

To clarify, this question asks how to express "ignore matching {sequence}" in regex, and whether the problem can be approached as "satisfy condition 1 ONLY" rather than "satisfy condition 1 and NOT condition 2", so that the solution can be extended to multiple conditions. Please provide a reasonably general answer (not one that is overly specific to this example).

After a while, of course, I was able to find a partial solution (see bonus) for only one of the groups, but any other clear ones are very welcome:

c1 = r'\w+\.?\w*' # forces alphanumeric variable structure
# c1 = r'[\w\.\(\)]*?' allows more freedom (can introduce function calls)
# at the cost of matching invalid names, like class..parameterA
c2 = r'(?<=\[)', r'(?=\])'

re_group2 = c2[0] + c1 + c2[1]

>>> re.findall(re_group2, func)
['class.parameterA', 'x', 'x', 'parameter_B']

The apparently intuitive bracket negation does not work for group1, but I may be introducing it incorrectly:

c1 = r'\w+\.?\w*'
nc2 = r'(?<!\[\w)', r'(?!\w\])' # condition 2 negation approach

re_group1 = nc2[0] + c1 + nc2[1]

>>> re.findall(re_group1, func)
['class.parameterA', 'numpy.exp', 'x', 'module.constantA',
'constant_B', 'x', 'parameter_B']

Bonus: if there was, say, module.submodule.constantA (more than 1 dot), how would the regex change? I supposed c1 = r'\w+(\.\w+)*', but it doesn't do what I expected. Edit: I need to use a non-capturing group since I'm using re.findall. So c1 = r'\w+(?:\.\w+)*'.
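
A minimal sketch of the difference mentioned in the edit above, assuming a short hypothetical test string `s` (not the function from the question). With a capturing group, `re.findall` returns the group's content instead of the whole match:

```python
import re

# Hypothetical test string, chosen only to show a name with more than one dot.
s = 'module.submodule.constantA + x'

# Capturing group: findall returns what the group captured
# (its last repetition, or '' if the group did not take part in the match).
with_group = re.findall(r'\w+(\.\w+)*', s)

# Non-capturing group: findall returns the whole match.
without_group = re.findall(r'\w+(?:\.\w+)*', s)

print(with_group)     # ['.constantA', '']
print(without_group)  # ['module.submodule.constantA', 'x']
```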

mariogarcc
  • Can you add a few more examples? – jrook Oct 22 '18 at 18:10
    Can you add a few more examples? – jrook Oct 22 '18 at 18:10
  • @jrook I have been messing around with the code and found a couple bugs; give me some time to rethink the question so it's worth the time to solve the problem properly. – mariogarcc Oct 22 '18 at 22:12
  • Can you try this? Use two findall calls: `[^-+* ]+(?= \*)` and `(?<=\[).*?(?=\])`, one for `g1` and another for `g2`. – KC. Oct 23 '18 at 06:57
  • @kcorlidy it works, but I think I'm not understanding or I forgot about multiple syntax in the way of `.*?`. Does this mean that it takes 0 or 1 "rigid" strings of any number of characters between brackets? What are the limits of these kinds of combinations? However, if I change `rho_1 * x` into `rho_1 / x`, g1 skips `rho_1` even after adding `/` into the first part (-+* exceptions). g2 seems to work perfectly in various different cases, which is what I was originally looking for. – mariogarcc Oct 23 '18 at 14:51
  • This is the difference between [.*? and .*](https://stackoverflow.com/questions/3075130/what-is-the-difference-between-and-regular-expressions). In my words, `.*?` means matching as few characters as it can. – KC. Oct 23 '18 at 15:07
  • @jrook, question has been redone. Sorry for the delay. – mariogarcc Dec 23 '18 at 12:45

2 Answers

0

I made two changes: I anchored the search at the start of a word and converted your first assertion to a lookbehind. I tried it in Notepad++ (no Python here) and it worked for the sample:

\b(?<!\[)[a-wzA-Z_0-9]+(?!\])

I hope your formulas have consistent spacing...
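
The same lookaround idea can be sketched in Python against the string from the edited question. This is only a sketch: widening both lookarounds to also reject word characters and dots (so the engine cannot start or stop in the middle of a bracketed name) is an assumption added here, not something tested in Notepad++.

```python
import re

func = ('[class.parameterA] * numpy.exp( [x]*module.constantA '
        '- constant_B/[x] ) + [parameter_B]')

# Dotted name, as in the question's bonus (non-capturing group).
dotted = r'\w+(?:\.\w+)*'

# Not preceded by '[', a word character or '.', and
# not followed by a word character, '.' or ']'.
outside = r'(?<![\[\w.])' + dotted + r'(?![\w.\]])'

constants = re.findall(outside, func)
print(constants)  # ['numpy.exp', 'module.constantA', 'constant_B']
```

Note that this only inspects the characters directly next to the name, so something like `[ x ]` (with spaces inside the brackets) would still be matched.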

Antoni Gual Via
  • My output is `['rho_1', 'R', 'p']`. I think I was trying to get the regex expression for "any alphanumerical (except letters x and y) string, but ignoring everything between square brackets". I will redo the question to try to provide a better insight on the problem. – mariogarcc Oct 22 '18 at 22:23
  • Question has been edited, you may want to give it another try? – mariogarcc Dec 23 '18 at 12:46
0

Using two separate findall calls works well.

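import re

a = "rho_1 * x + R * [np.R] + rho_1 / x + R * [np.R]"

# Group 1: names followed by " *" or " /"
print(re.findall(r"\w+(?= \*| \/)", a))
# Group 2: text between "[" and "]" (raw string, so \[ is not an invalid escape)
print(re.findall(r"(?<=\[).*?(?=\])", a))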
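  1. Group 1
    • \w+ matches one or more word characters (letters, digits and "_")
    • (?= \*| \/) requires the match to be followed by " *" or " /"
  2. Group 2
    • (?<=\[) requires the match to be preceded by [
    • .*? matches any characters, as few as possible
    • (?=\]) requires the match to be followed by ]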
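
As a rough sketch, the two-pass idea could be applied to the string from the edited question as follows. The second pass uses an alternation that first consumes whole bracketed spans and only captures names found elsewhere; that alternation is an assumption of this sketch rather than part of the patterns above.

```python
import re

func = ('[class.parameterA] * numpy.exp( [x]*module.constantA '
        '- constant_B/[x] ) + [parameter_B]')

# Dotted name, as in the question's bonus (non-capturing group).
dotted = r'\w+(?:\.\w+)*'

# Pass 1: names directly enclosed in square brackets.
group2 = re.findall(r'(?<=\[)' + dotted + r'(?=\])', func)

# Pass 2: the first alternative swallows any "[...]" span without capturing,
# the second captures a name outside brackets. findall yields '' for the
# swallowed spans, so those entries are filtered out.
group1 = [m for m in re.findall(r'\[[^\]]*\]|(' + dotted + r')', func) if m]

print(group1)  # ['numpy.exp', 'module.constantA', 'constant_B']
print(group2)  # ['class.parameterA', 'x', 'x', 'parameter_B']
```

Further "ignore this" conditions can be added as extra non-capturing alternatives placed before the captured one.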
KC.
  • Edited the question to make it more clear, sorry for the inconvenience. – mariogarcc Dec 23 '18 at 12:46
  • @mariogarcc `r'\w+(\.\w+)*'` does match what you want, but because it contains a capturing group, `re.findall` shows only what that group captured (its last repetition), not the whole match. Try `(\w+(\.\w+)*)` to see this; I think it will explain it better than I can. – KC. Dec 25 '18 at 01:40
  • @mariogarcc maybe you can read https://www.regular-expressions.info/lookaround.html – KC. Dec 25 '18 at 01:57