I have a string (not raw) in Python similar to the following:

Plenary Papers (1)
Peer-reviewed Papers (113)
PLENARY MANUSCRIPTS (1)
First Author Index

Harrer
Plenary Papers

One Some title
John W. Doe
2018 Physics SOmething Proceedings
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
PEER REVIEWED MANUSCRIPTS (113)
First Author Index

Doe · Doe2 · Doe3 · Jonathan
Peer-reviewed Papers

Two some title
Alex White, Paul Klee, and Jacson Pollock
2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

Tree Some title
Suzanne Heck, Alex Someone, John I. Smith, and Andrew Bourgogne
2018 Physics Education Research Conference Proceedings, doi:10.2345/perc.2018.pr.Heck
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

..

I want to scrape the metadata of those three papers, i.e. the title and the few lines after it (e.g. "One Some title", "John W. Doe", and "2018 Physics Something Proceedings").

I thought of using two patterns to mark the beginning and end of each selection:

r"\n\n" and r"Show Abstract - Show Citation".

This (almost) works on https://regex101.com/ using this regular expression:

\n\n(.*?)Show Abstract - Show Citation

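A minor issue is that it captures too much for the first two papers: the match starts at the first blank line, so it also includes the index lines that come before the title.

However, it does not work in Python: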

    import re

    pattern = r"\n\n(.*?)Show Abstract - Show Citation"
    re.findall(pattern, titles)  # titles is the text above
    # output is []

    pattern_only_one_line = r"\nShow Abstract - Show Citation"
    re.findall(pattern_only_one_line, titles)
    # output shows three lines

Could this be another problem with raw strings?

Max Voitko
aless80

1 Answer

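The re.DOTALL flag is missing. Without it, the dot (.) does not match newlines, so .*? cannot span the several lines between the blank line and "Show Abstract - Show Citation".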
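To make the difference concrete, here is a minimal sketch that runs the pattern from the question with and without the flag. The `sample` string is a shortened, hypothetical stand-in for the full `titles` text, not the real data.

```python
import re

# Shortened, hypothetical stand-in for the `titles` string in the question.
sample = (
    "Plenary Papers\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "2018 Physics SOmething Proceedings\n"
    "Full Text: Download PDF - PER-Central Record\n"
    "Show Abstract - Show Citation\n"
)

pattern = r"\n\n(.*?)Show Abstract - Show Citation"

# Without re.DOTALL the dot stops at the first newline, so nothing matches.
without_flag = re.findall(pattern, sample)

# With re.DOTALL the dot also matches newlines and the block is captured.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(without_flag)
print(with_flag)
```

The raw-string prefix is not the problem here: the regex engine interprets `\n` in a raw string as a newline escape itself.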

But we can do better (depending on what you need exactly of course): https://regex101.com/r/iN6pX6/199

import re
import pprint

titles = '''
[Omitted for brevity]
..
'''

pattern = r'''
(?P<title>[^\n]+)\n
(?P<subtitle>[^\n]+)\n
((?P<etc>[^\n].*?)\n\n|\n)
'''

# Make sure we don't have any extraneous whitespace but add the separator
titles = titles.strip() + '\n\n'

for match in re.finditer(pattern, titles, re.DOTALL | re.VERBOSE):
    title = match.group('title')
    subtitle = match.group('subtitle')
    etc = match.group('etc')
    print('## %r' % title)
    print('# %r' % subtitle)
    if etc:
        print(etc)
    print()
    # pprint.pprint(match.groupdict())
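
If the three metadata lines always sit directly above the "Full Text:" line, a different approach may be simpler: anchor on that trailing pair of lines instead of on blank lines. The following is only a sketch under that assumption, which is based on the sample in the question and not on the real page. The `sample` string, the `paper` pattern and the group names `authors` and `venue` are made up for illustration.

```python
import re

# Shortened, hypothetical stand-in for the `titles` string in the question.
sample = (
    "PLENARY MANUSCRIPTS (1)\n"
    "First Author Index\n"
    "\n"
    "Harrer\n"
    "Plenary Papers\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "2018 Physics SOmething Proceedings\n"
    "Full Text: Download PDF - PER-Central Record\n"
    "Show Abstract - Show Citation\n"
    "PEER REVIEWED MANUSCRIPTS (113)\n"
    "First Author Index\n"
    "\n"
    "Doe · Doe2 · Doe3 · Jonathan\n"
    "Peer-reviewed Papers\n"
    "\n"
    "Two some title\n"
    "Alex White, Paul Klee, and Jacson Pollock\n"
    "2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White\n"
    "Full Text: Download PDF - PER-Central Record\n"
    "Show Abstract - Show Citation\n"
)

# Three non-empty lines followed by the fixed "Full Text" / "Show Abstract" lines.
# No re.DOTALL here on purpose: each .+ must stay within a single line.
paper = re.compile(
    r"^(?P<title>.+)\n"
    r"(?P<authors>.+)\n"
    r"(?P<venue>.+)\n"
    r"Full Text:.*\n"
    r"Show Abstract - Show Citation$",
    re.MULTILINE,
)

papers = [m.groupdict() for m in paper.finditer(sample)]
for p in papers:
    print(p)
```

Because a blank line cannot satisfy `.+`, index lines such as "Doe · Doe2 · Doe3 · Jonathan" are skipped: they are not followed by two more non-empty lines and then "Full Text:".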
Wolph
  • I am still struggling with finditer. Using the simple pattern I can't get results. With your pattern (I like the idea) I only get the third paper right. – aless80 Nov 08 '19 at 12:44
  • I noticed I still had your pattern in my code block, that's probably what killed it. Try the new version. You can see the results here: https://repl.it/repls/GentleGoldenBlocks – Wolph Nov 08 '19 at 16:04
  • Almost there: notice that the second paper is not right: the title is 'Doe · Doe2 · Doe3 · Jonathan' instead of 'Two some title' because the algorithm is greedy. Any idea how to fix this? I tried poking around with '.*?' but still not working – aless80 Nov 10 '19 at 18:13
  • Sorry for the slow response. The problem is that it always requires the `etc` part. That seems optional so we’ll have to make it optional, take a look at the update :) – Wolph Nov 13 '19 at 12:31
  • Thanks, the updated code is not solving the problem I mentioned above but I accept your answer. I was able to extract what I need using re.finditer(..., re.DOTALL) as previously suggested – aless80 Nov 14 '19 at 10:00
  • @aless80 now I get it. That means the "Show Abstract..." part is optional too. I've made a new version: https://repl.it/repls/ExternalBigheartedHertz – Wolph Nov 14 '19 at 12:21