
I am very new to Python. I am trying to write a regex that finds every instance of a period, a space, and then a capital letter in a corpus. I have this:

print (re.findall(r'(\.|\!|\?) (A-Z\w+\b)',text))

I got it to print a match when there was only a single capital letter (e.g. "I went to the movie.") but not when the capital letter starts a capitalized word.

Thoughts?

anubhava
kmorgz
  • `re.findall(r'[.!?]\s+[A-Z]',text)`? Note that `A-Z` matches the literal string `A-Z`, whereas `[A-Z]` is a character class that matches uppercase ASCII letters. – Wiktor Stribiżew Oct 01 '19 at 19:53
  • Welcome to Stack Overflow! Please [edit] to add some example input and desired output. See [mre] for reference. By the way, please take the [tour]. – wjandrea Oct 01 '19 at 19:53
  • Read a regexp tutorial, like www.regular-expressions.info. – Barmar Oct 01 '19 at 19:54
  • You can also use the regex module (as explained in this question: https://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties) with `\p{Lu}` for Unicode uppercase (multiple scripts & languages) – ctwheels Oct 01 '19 at 20:04
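
To illustrate the point raised in the comments above, here is a minimal sketch using only the standard `re` module. The sample string `text` is made up for illustration and is not from the question. It contrasts the bare `A-Z`, which matches those three literal characters, with the character class `[A-Z]`.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "I went to the movie. It was long! Was it good? yes. A-Z is literal."

# Without brackets, A-Z matches only the three literal characters "A-Z".
literal = re.findall(r'[.!?]\s+A-Z', text)

# With brackets, [A-Z] matches any single uppercase ASCII letter.
char_class = re.findall(r'[.!?]\s+[A-Z]', text)

print(literal)     # ['. A-Z']
print(char_class)  # ['. I', '! W', '. A']
```

Note that `? y` is skipped by both patterns because `y` is lowercase.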

1 Answer


You could use `findall` with this pattern:

(\.|!|\?) ([A-Z]\w+)

The word boundary is not needed here, because the greedy `\w+` already stops at the end of the word.
The alternation can be replaced with the character class [.!?], but that is not necessary.
A-Z is a character-class range, so it must be enclosed in square brackets: [A-Z].

Because the pattern has two capture groups, `findall` returns a tuple of two elements per match: the punctuation and the alphanumeric string.
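
As a rough sketch of how this behaves, the snippet below runs the pattern from this answer on a made-up sample string (`text` is an assumption, not the asker's corpus). It also shows one consequence worth knowing: `\w+` requires at least one word character after the capital letter, so a one-letter word such as "I" is skipped. Switching to `\w*` is a suggested variation, not part of the answer above, for cases where single capitals should also match.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "I went to the movie. It was long! Was it good? I think so. Yes."

# Pattern from the answer: punctuation, one space, capitalized word.
matches = re.findall(r'(\.|!|\?) ([A-Z]\w+)', text)
print(matches)
# [('.', 'It'), ('!', 'Was'), ('.', 'Yes')]

# Variation (an assumption about the desired behavior): \w* also
# accepts a lone capital letter such as "I".
matches_star = re.findall(r'(\.|!|\?) ([A-Z]\w*)', text)
print(matches_star)
# [('.', 'It'), ('!', 'Was'), ('?', 'I'), ('.', 'Yes')]
```

Each list element is a `(punctuation, word)` tuple, one per match.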

  • You can simplify to `[.!?]` – ctwheels Oct 01 '19 at 20:31
  • `[.!?]` is slower than `\.|!|\?` –  Oct 01 '19 at 20:34
  • No it's not. It's much faster. 12 steps for [`\.|!|\?`](https://regex101.com/r/anrAx5/1) vs 7 steps for [`[.!?]`](https://regex101.com/r/anrAx5/2) - nearly half the number of steps. – ctwheels Oct 01 '19 at 20:38
  • `Regex1: \.|!|\? Elapsed Time: 0.15 s, 145.68 ms, 145681 µs Matches per sec: 1,029,646 Regex2: [.!?] Elapsed Time: 0.15 s, 149.08 ms, 149077 µs Matches per sec: 1,006,191` –  Oct 01 '19 at 20:45
  • How are you coming up with those tests? They're wrong. – ctwheels Oct 01 '19 at 20:46
  • https://imgur.com/9BSEPN4 –  Oct 01 '19 at 20:52
  • OK now run it 100 more times, it'll fluctuate. See [here](https://tio.run/##lZBBS8NAEIXP2V8x6WV3ISxKb4WSs6AI2puVsCWbOpjdDZPxUKy/Pa6makuLtO8yMDy@x3vdhl9imA4D@i4SA6N3yAWQE6KHOcjyUOZA@b6kELVrINKdS8xaBT0TGSUGyaXZ5ttlKUXWRAIEDEA2rN3oycjxG6WXMw2G2ratoqLXI67Fno@BTyYvny/FdYSB1WSRKgLbVxdgtYH7B/Df@Bm8f0xMAnrLapzBjEfJ307XV1oWIBuKHqrKWwxVBbvpfkxSa3066/bmcXFG2l7l//L@bF@Jw/AJ). I'm telling you character classes are faster. – ctwheels Oct 01 '19 at 21:09
  • See [this](https://stackoverflow.com/questions/22132450/why-is-a-character-class-faster-than-alternation) question and series of answers. Character classes don't cause backtracking whereas alternations do. – ctwheels Oct 01 '19 at 21:12
  • https://imgur.com/a/f0IcD1A –  Oct 01 '19 at 21:21
  • Forget about megabench. Write your own script or check out the one I wrote for you. I can tell you with absolute certainty that character classes are faster since they don't backtrack. If you want you can ask the regex god himself (Wiktor Stribizew - he wrote in the comments); he'll go into more details. – ctwheels Oct 01 '19 at 21:25
  • No overhead, and no benchmarking language involved. Class items are OR'd just like alternations, but are slower in the way they are handled. –  Oct 01 '19 at 21:29
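
For anyone who wants to check the timing dispute above on their own machine, here is a minimal sketch using the standard `timeit` module. The sample text and repeat counts are arbitrary assumptions. It does not settle the argument: timings fluctuate between runs, and the result may differ between regex engines (Python's `re`, PCRE, and so on).

```python
import re
import timeit

# Hypothetical sample text, repeated to give the engine some work.
text = "Hello there. How are you? Fine! Thanks. " * 1000

alternation = re.compile(r'\.|!|\?')
char_class = re.compile(r'[.!?]')

# Sanity check: both patterns should find exactly the same matches.
same = alternation.findall(text) == char_class.findall(text)
print("same matches:", same)

# Time each pattern over the same input; run this several times,
# since individual measurements are noisy.
t_alt = timeit.timeit(lambda: alternation.findall(text), number=50)
t_cls = timeit.timeit(lambda: char_class.findall(text), number=50)

print(f"alternation: {t_alt:.4f} s")
print(f"char class:  {t_cls:.4f} s")
```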