I am trying to analyze some documents by a grammar generated via Grako that should parse simple sentences for further analysis but face some difficulties with some special tokens.
The (Grako-style) EBNF looks like:
abbr::str = "etc." | "feat.";
word::str = /[^.]+/;
sentence::Sentence = content:{abbr | word} ".";
page::Page = content:{sentence};
I used the upper grammar on following content:
This is a sentence. This is a sentence feat. an abbrevation. I don't now feat. etc. feat. know English.
The result using a simple NodeWalker:
[
'This is a sentence.',
'This is a sentence feat.',
'an abbrevation.',
"I don't know feat.",
'etc. feat. know English.'
]
My expectation:
[
'This is a sentence.',
'This is a sentence feat. an abbrevation.',
"I don't know feat. etc. feat. know English."
]
I have no clue why this happens, especially in the last sentence where the abbreviations are part of the sentence while they are not in the prior sentences. To be clear, I want the abbr rule in the sentence definition to have a higher priority than the word rule, but I don't know how to achieve this. I played around with the negative and positive lookahead without success. I know how to achieve my expected results with regular expressions, but a context-free grammar is required for the further analysis, so I want to put everything in one grammar for the sake of readability. It has been a while since I last used grammars this way, but I don't remember running in that kind of problem. I searched a while via Google with no success, so maybe the community might share some insight.
Thanks in advance.
Code I used for testing, if required:
from grako.model import NodeWalker, ModelBuilderSemantics
from parser import MyParser
class MyWalker(NodeWalker):
def walk_Page(self, node):
content = [self.walk(c) for c in node.content]
print(content)
def walk_Sentence(self, node):
return ' '.join(node.content) + "."
def walk_str(self, node):
return node
def main(filename: str):
parser = MyParser(semantics=ModelBuilderSemantics())
with open(filename, 'r', encoding='utf-8') as src:
result = parser.parse(src.read(), 'page')
walker = HRBWalker()
walker.walk(result)
Packages used: Python 3.5.2 Grako 3.16.5