
Is there a regex pattern to match titles in the following reStructuredText-like text? The difficulty is that the number of equal signs must be equal to the length of the title.

Some basic text.


=========
One Title
=========

For titles the numbers of sign `=` must be equal to the length of the text title. 


============= 
Another title
============= 

And so on...
  • This can't be done purely with regex in Python. The only thing you can do is capture each of the three lines and check the lengths afterwards. – Casimir et Hippolyte Dec 01 '13 at 13:32
  • You'll most certainly need a callback. For example, match each title and then check the length. As a start `(?s)(={3,})\r?\n(.*?)\r?\n\1`, [demo](http://regex101.com/r/tP9eK0) -- compare the length of group 1 and group 2. Otherwise you'll end up with mega regexes like [“vertical” regex matching in an ASCII “image”](http://stackoverflow.com/questions/17039670/vertical-regex-matching-in-an-ascii-image) which shouldn't be possible in Python like @CasimiretHippolyte said. – HamZa Dec 01 '13 at 13:36
  • @HamZa Thanks. The link above is very interesting. Thanks twice! –  Dec 01 '13 at 13:43
  • @CasimiretHippolyte Can Perl regexes do that? –  Dec 01 '13 at 13:43
  • doable in [.NET](http://regexhero.net/tester/?id=a7600042-9c2d-477b-a921-fcb321995cf4) btw – OGHaza Dec 01 '13 at 13:44
  • @OGHaza explain explain \*excited\* :D – HamZa Dec 01 '13 at 13:52
  • [reStructuredText (a single word btw) titles have more complex syntax](http://docutils.sourceforge.net/docs/user/rst/quickref.html#section-structure). Do you want to support it or just the simple syntax implied by your examples is enough? – jfs Dec 01 '13 at 13:54
  • @J.F.Sebastian You're right. I know that the syntax is more complex, but on this forum I usually ask a simpler question than the real one I'm facing. –  Dec 01 '13 at 13:56
  • @HamZa, [explanation](http://regexhero.net/tester/?id=bf0a75b4-34c8-4756-9d3d-3930bbb98a85) - it's hardly clean ;) – OGHaza Dec 01 '13 at 14:05
  • @projetmbc: is there any reason not to use `docutils` package to extract section titles? – jfs Dec 01 '13 at 14:10
  • @J.F.Sebastian Yes, because I'm working on a tool to help me analyze code. So my question and all your answers help me see what can be done and what must be done. –  Dec 01 '13 at 14:14
  • (Per J.F.S. suggestion: edited the title, added tag and link to the official site.) – Jongware Dec 01 '13 at 14:50

2 Answers


Search for matches of `(?:^|\n)(=+)\r?\n(?!=)([^\n\r]+)\r?\n(=+)(?:\r?\n|$)`. If a match is found, check whether the first, second and third groups have the same length. If so, the title is the content of the second group.
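A minimal Python sketch of this match-then-check approach, assuming the pattern is used exactly as written above. The function name `find_titles` and the `sample` string are only illustrative. Note that, as written, the pattern does not tolerate trailing whitespace after a line of `=` signs, so the sample below has none.

```python
import re

# The pattern from this answer, compiled as a raw string.
TITLE_RE = re.compile(r"(?:^|\n)(=+)\r?\n(?!=)([^\n\r]+)\r?\n(=+)(?:\r?\n|$)")


def find_titles(text):
    """Return titles whose overline, text and underline have equal lengths."""
    titles = []
    for match in TITLE_RE.finditer(text):
        overline, title, underline = match.groups()
        # The regex alone cannot compare lengths, so check them here.
        if len(overline) == len(title) == len(underline):
            titles.append(title)
    return titles


sample = """Some basic text.


=========
One Title
=========

Some more text.


=============
Another title
=============

And so on...
"""

print(find_titles(sample))
```

A candidate whose `=` lines are shorter or longer than the title text is still matched by the regex, but it is dropped by the length comparison.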

Ωmega
  • Doing a callback is not a painful constraint, but I was hoping for something more direct. Thanks! –  Dec 01 '13 at 13:45
  • I know that analyzing code is not an easy task. Indeed, I'm working on a tool that will help define semantic and syntactic rules for DSLs. I know ant but I don't find it very easy to use. I know pygments, but it does not support context grammar and it is not very Pythonic from my point of view. But your solution is simple enough to be used in my context. –  Dec 01 '13 at 13:54
  • On checking the lengths: the specs say "*at least* as long as the title text", so you may test if G2.length <= G1.length AND G2.length <= G3.length. – Jongware Dec 01 '13 at 14:04
  • @Jongware - Where do you read that *"at least"* ..? I don't see it. – Ωmega Dec 01 '13 at 14:07
  • http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#sections: "... that begins in column 1 and forms a line extending at least as far as the right edge of the title text.." Admittedly, it depends on how exact the OP is planning to follow the specs. – Jongware Dec 01 '13 at 14:10
  • Afterthought: .. Those same specs suggest instead of `(=+)`, one would use `(([!"#$%=])\2*)` (with the full list of allowed characters), but again it depends on the strictness of the OP. – Jongware Dec 01 '13 at 14:27
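The relaxed rules discussed in these comments could be sketched as follows. This is not the answer's own pattern: it combines the `\1` backreference idea from the comments under the question with the `(([...])\2*)` suggestion above, and it assumes that any ASCII punctuation character may serve as the adornment. The names `ADORN`, `find_overlined_titles` and `relaxed_sample` are hypothetical, and underline-only titles are deliberately not handled.

```python
import re
import string

# Assumption: every ASCII punctuation character is an allowed adornment.
ADORN = "[" + re.escape(string.punctuation) + "]"

TITLE_RE = re.compile(
    rf"^(({ADORN})\2*)[ \t]*\r?\n"     # group 1: overline, group 2: its character
    rf"(?!\2)([^\r\n]+?)[ \t]*\r?\n"   # group 3: title text, trailing blanks excluded
    rf"\1[ \t]*\r?$",                  # underline must repeat the overline exactly
    re.MULTILINE,
)


def find_overlined_titles(text):
    """Return titles whose adornment is at least as long as the title text."""
    titles = []
    for match in TITLE_RE.finditer(text):
        adornment = match.group(1)
        title = match.group(3).strip()
        # "At least as long" instead of "exactly as long".
        if len(adornment) >= len(title):
            titles.append(title)
    return titles


relaxed_sample = (
    "=============\nOne Title\n=============\n\n"
    "~~~~~\nShort\n~~~~~\n\n"
    "===\nToo long\n===\n"
)

print(find_overlined_titles(relaxed_sample))
```

The length test still has to happen outside the regex; only the "same character, same adornment above and below" part is expressed in the pattern itself.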

To support the full syntax for section titles, you could use the `docutils` package:

#!/usr/bin/env python3
"""
some text

=====
Title
=====
Subtitle
--------

Titles are underlined (or over- and underlined) with a printing
nonalphanumeric 7-bit ASCII character. Recommended choices are "``= -
` : ' " ~ ^ _ * + # < >``".  The underline/overline must be at least
as long as the title text.

A lone top-level (sub)section is lifted up to be the document's (sub)title.
"""
from docutils.core import publish_doctree

def section_title(node):
    """Whether `node` is a section title.

    Note: it DOES NOT include document title!
    """
    try:
        return node.parent.tagname == "section" and node.tagname == "title"
    except AttributeError:
        return False  # node has no parent or tagname, so it is not a section title

# Build the document tree from this module's docstring
doctree = publish_doctree(__doc__)
# Collect every node accepted by the section_title() predicate
titles = doctree.traverse(condition=section_title)
print("\n".join([t.astext() for t in titles]))

Output:

Title
Subtitle
jfs
  • Interesting. Thanks for this tip. –  Dec 01 '13 at 14:58
  • @projetmbc: note: the Node interface is more complicated than it could be for many tasks. If you want to work with the doc tree more closely, I'd convert it to `xml.etree.ElementTree` first (serialize to xml/parse it as etree). – jfs Dec 01 '13 at 15:02