I have a string (not raw) in Python similar to the following:
Plenary Papers (1)
Peer-reviewed Papers (113)
PLENARY MANUSCRIPTS (1)
First Author Index
Harrer
Plenary Papers
One Some title
John W. Doe
2018 Physics SOmething Proceedings
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
PEER REVIEWED MANUSCRIPTS (113)
First Author Index
Doe · Doe2 · Doe3 · Jonathan
Peer-reviewed Papers
Two some title
Alex White, Paul Klee, and Jacson Pollock
2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
Tree Some title
Suzanne Heck, Alex Someone, John I. Smith, and Andrew Bourgogne
2018 Physics Education Research Conference Proceedings, doi:10.2345/perc.2018.pr.Heck
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
..
I want to scrape the metadata of those three papers, i.e. the title and the few lines after it (e.g. "One Some title", "John W. Doe", and "2018 Physics Something Proceedings").
I thought of using two patterns for the beginning and end of the selection:
r"\n\n" and r"Show Abstract - Show Citation".
This (almost) works on https://regex101.com/ using this regular expression:
\n\n(.*?)Show Abstract - Show Citation
A minor issue there is that it is greedy on the first two papers.
However, the same pattern does not work in Python:
import re
pattern = r"\n\n(.*?)Show Abstract - Show Citation"
re.findall(pattern, titles)  # titles is the text above
# output is []
pattern_only_one_line = r"\nShow Abstract - Show Citation"
re.findall(pattern_only_one_line, titles)
# output shows three lines
Could this be another problem with raw strings?
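For reference, here is a minimal sketch of what might be going on. It is not a confirmed diagnosis. The string below is a shortened, hypothetical stand-in for titles, and the blank line before each entry is an assumption (the real string may not contain any "\n\n" at all). The names no_flag, with_dotall, line_pattern, records and same_newline are only illustrative.

```python
import re

# Hypothetical, shortened stand-in for `titles`. The blank line before each
# entry is an assumption; the real string may not contain any.
titles = (
    "Plenary Papers\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "2018 Physics Something Proceedings\n"
    "Full Text: Download PDF - PER-Central Record\n"
    "Show Abstract - Show Citation\n"
    "\n"
    "Two some title\n"
    "Alex White, Paul Klee, and Jacson Pollock\n"
    "2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White\n"
    "Full Text: Download PDF - PER-Central Record\n"
    "Show Abstract - Show Citation\n"
)

pattern = r"\n\n(.*?)Show Abstract - Show Citation"

# Without flags, "." does not match "\n", so "(.*?)" cannot span several lines.
no_flag = re.findall(pattern, titles)

# With re.DOTALL, "." also matches "\n" and the lazy group can cross lines.
with_dotall = re.findall(pattern, titles, flags=re.DOTALL)

# Alternative that does not depend on blank lines: capture the three non-empty
# lines directly above the "Full Text:" line. re.MULTILINE lets "^" match at
# the start of every line.
line_pattern = r"^(.+)\n(.+)\n(.+)\nFull Text:.*\nShow Abstract - Show Citation"
records = re.findall(line_pattern, titles, flags=re.MULTILINE)

# Raw vs. non-raw makes no difference for "\n": the re module itself
# understands the two-character escape \n, so both patterns match a newline.
same_newline = re.findall(r"\n", "a\nb") == re.findall("\n", "a\nb")

print("\n\n" in titles)  # does the string contain a blank line at all?
print(no_flag)
print(with_dotall)
print(records)
print(same_newline)
```

If "\n\n" in titles turns out to be False for the real string, the first pattern cannot match with any flag, and a line-based pattern like line_pattern may be the safer choice. It assumes every entry has exactly three lines (title, authors, proceedings) directly above the "Full Text:" line.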