Python returns no matches on working regex

Question

I have a string (not raw) in Python similar to the following:

Plenary Papers (1)
Peer-reviewed Papers (113)
PLENARY MANUSCRIPTS (1)
First Author Index

Harrer
Plenary Papers

One Some title
John W. Doe
2018 Physics SOmething Proceedings
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
PEER REVIEWED MANUSCRIPTS (113)
First Author Index

Doe · Doe2 · Doe3 · Jonathan
Peer-reviewed Papers

Two some title
Alex White, Paul Klee, and Jacson Pollock
2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

Tree Some title
Suzanne Heck, Alex Someone, John I. Smith, and Andrew Bourgogne
2018 Physics Education Research Conference Proceedings, doi:10.2345/perc.2018.pr.Heck
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

..

I want to scrape the metadata of those three papers, i.e. the few lines that start at each title (e.g. "One Some title", "John W. Doe", and "2018 Physics SOmething Proceedings").

I thought of using two patterns for the beginning and end of the selection:

r"\n\n" and r"Show Abstract - Show Citation".

This (almost) works on https://regex101.com/ using this regular expression:

\n\n(.*?)Show Abstract - Show Citation

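A minor issue is that the match captures too much for the first two papers, because it starts at the first blank line rather than at the title.

In Python, however, the same pattern returns nothing: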

    import re

    pattern = r"\n\n(.*?)Show Abstract - Show Citation"
    re.findall(pattern, titles)  # titles is the text above
    # output is []

    pattern_only_one_line = r"\nShow Abstract - Show Citation"
    re.findall(pattern_only_one_line, titles)
    # output shows three lines

Could this be another problem with raw strings?

Your regex finds no matches - https://regex101.com/r/X9AUw9/1 — Wiktor Stribiżew, Nov 08 '19 at 11:51
probably a problem with the link. https://regex101.com/r/iN6pX6/193 works? — aless80, Nov 08 '19 at 11:52
Your regex is using the flag single line (dot matches newline) so you will need to do `re.findall(pattern, titles, re.DOTALL)` — Wolph, Nov 08 '19 at 11:54
@Wolph yes it is working! Now I am trying to figure out how to use it in .finditer . you can add an answer if you want — aless80, Nov 08 '19 at 11:57
Duplicate of [How do I match any character across multiple lines in a regular expression?](https://stackoverflow.com/questions/159118) then — Wiktor Stribiżew, Nov 08 '19 at 12:04
Also, [matching any character including newlines in a Python regex subexpression, not globally](https://stackoverflow.com/questions/33312175/) — Wiktor Stribiżew, Nov 08 '19 at 12:05

Wolph · Accepted Answer · 2019-11-14T12:22:12.657

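The re.DOTALL flag is missing. Without it, the dot (.) does not match newline characters, so .*? can never span the lines between the blank line and "Show Abstract - Show Citation". regex101 only found matches because the "single line" (dot matches newline) flag was enabled there.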
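As a minimal sketch of the difference, the snippet below runs the pattern from the question with and without the flag. The `titles` value here is a shortened stand-in for the real text (an assumption for illustration), and the variable names are made up for this example. It also suggests that raw strings are not the culprit: the regex engine interprets `\n` itself, so raw and non-raw patterns behave the same for this escape.

```python
import re

# Shortened stand-in for the question's text (assumed sample, not the full input).
titles = (
    "First Author Index\n"
    "\n"
    "Harrer\n"
    "Plenary Papers\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "2018 Physics SOmething Proceedings\n"
    "Full Text: Download PDF - PER-Central Record\n"
    "Show Abstract - Show Citation\n"
)

pattern = r"\n\n(.*?)Show Abstract - Show Citation"

# Without re.DOTALL, "." stops at every newline, so the lazy group
# can never reach the closing "Show Abstract" line.
no_flag = re.findall(pattern, titles)

# With re.DOTALL, "." also matches "\n" and the group can span lines.
with_flag = re.findall(pattern, titles, re.DOTALL)

# The inline (?s) modifier has the same effect as passing re.DOTALL.
inline_flag = re.findall("(?s)" + pattern, titles)

# Raw vs. non-raw makes no difference for "\n" in a pattern.
raw_same_as_plain = re.findall(r"\n\n", titles) == re.findall("\n\n", titles)

print(no_flag)
print(with_flag)
print(raw_same_as_plain)
```

Note that the single match with the flag starts at the first blank line, so it also swallows the "Harrer" and "Plenary Papers" lines. That is the over-capturing the question mentions.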

But we can do better, depending on what exactly you need: https://regex101.com/r/iN6pX6/199

import re
import pprint

titles = '''
[Omitted for brevity]
..
'''

pattern = r'''
(?P<title>[^\n]+)\n
(?P<subtitle>[^\n]+)\n
((?P<etc>[^\n].*?)\n\n|\n)
'''

# Make sure we don't have any extraneous whitespace but add the separator
titles = titles.strip() + '\n\n'

for match in re.finditer(pattern, titles, re.DOTALL | re.VERBOSE):
    title = match.group('title')
    subtitle = match.group('subtitle')
    etc = match.group('etc')
    print('## %r' % title)
    print('# %r' % subtitle)
    if etc:
        print(etc)
    print()
    # pprint.pprint(match.groupdict())

edited Nov 14 '19 at 12:22

answered Nov 08 '19 at 12:03

Wolph


I am still struggling with finditer. Using the simple pattern I can't get results. With your pattern (I like the idea) I only get the third paper right. – aless80 Nov 08 '19 at 12:44
I noticed I still had your pattern in my code block, that's probably what killed it. Try the new version. You can see the results here: https://repl.it/repls/GentleGoldenBlocks – Wolph Nov 08 '19 at 16:04
Almost there: notice that the second paper is not right: the title is 'Doe · Doe2 · Doe3 · Jonathan' instead of 'Two some title' because the algorithm is greedy. Any idea how to fix this? I tried poking around with '.*?' but still not working – aless80 Nov 10 '19 at 18:13
Sorry for the slow response. The problem is that it always requires the `etc` part. That seems optional so we’ll have to make it optional, take a look at the update :) – Wolph Nov 13 '19 at 12:31
Thanks, the updated code is not solving the problem I mentioned above but I accept your answer. I was able to extract what I need using re.finditer(..., re.DOTALL) as previously suggested – aless80 Nov 14 '19 at 10:00
@aless80 now I get it. That means the "Show Abstract..." part is optional too. I've made a new version: https://repl.it/repls/ExternalBigheartedHertz – Wolph Nov 14 '19 at 12:21
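For the over-capturing discussed in these comments, one possible alternative (not the version behind the repl.it links, which is not reproduced here) is to drop re.DOTALL and describe each record line by line. The sketch below assumes every paper is exactly five lines: title, authors, venue, a "Full Text:" line, and the "Show Abstract - Show Citation" line. Records that lack the last line, as the final comment suggests can happen, would not be matched. The names `record` and `papers` and the group names are hypothetical.

```python
import re

# Shortened stand-in for the question's text (assumed sample data).
titles = """Harrer
Plenary Papers

One Some title
John W. Doe
2018 Physics SOmething Proceedings
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
PEER REVIEWED MANUSCRIPTS (113)
First Author Index

Doe · Doe2 · Doe3 · Jonathan
Peer-reviewed Papers

Two some title
Alex White, Paul Klee, and Jacson Pollock
2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

Tree Some title
Suzanne Heck, Alex Someone, John I. Smith, and Andrew Bourgogne
2018 Physics Education Research Conference Proceedings, doi:10.2345/perc.2018.pr.Heck
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
"""

# No re.DOTALL here: "." stays on one line, so each group captures exactly
# one line and index lines separated by a blank line cannot be picked up.
record = re.compile(
    r"""
    ^(?P<title>.+)\n
    (?P<authors>.+)\n
    (?P<venue>.+)\n
    Full\ Text:.*\n
    Show\ Abstract\ -\ Show\ Citation$
    """,
    re.MULTILINE | re.VERBOSE,
)

papers = [m.groupdict() for m in record.finditer(titles)]
for paper in papers:
    print(paper["title"], "|", paper["authors"])
```

Anchoring on the fixed "Full Text:" and "Show Abstract" lines is what keeps index lines such as "Doe · Doe2 · Doe3 · Jonathan" from being taken as a title. The trade-off is that the pattern is tied to this exact record layout.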
