14

My objective is to parse strings the way Python does.

Question: How do I write a lexer that supports the following:

  1. "string..."
  2. 'string...'
  3. """multi line string \n \n end"""
  4. '''multi line string \n \n end'''

Some code:

states = (
        ('string', 'exclusive'),
        )

# Strings
def t_begin_string(self, t):
    r'(\'|(\'{3})|\"|(\"{3}))'
    t.lexer.push_state('string')

def t_string_end(self, t):
    r'(\'|(\'{3})|\"|(\"{3}))'
    t.lexer.pop_state()

def t_string_newline(self, t):
    r'\n'
    t.lexer.lineno += 1

def t_string_error(self, t):
    print("Illegal character in string '%s'" % t.value[0])
    t.lexer.skip(1)


My current idea is to create 4 unique states that will match the 4 different string cases, but I'm wondering if there's a better approach.

Thanks for your help!

treddy
Steve Peak
  • You have 4 distinct string types so I would expect you would need 4 different states. Presumably ``'string"`` is ill-formed? – nimish Dec 14 '13 at 17:06
  • You could use two unique states, one for single quotes and one for triple quotes, but you would need to store the quote character somewhere. It's debatable which method is better. – Thayne Dec 14 '13 at 17:59
  • I was afraid I would have to build 4 states... Can two work, though? The start/end rules do not match against the initial quote type. E.g. with `"string..'...string..."` the lexer will see `string..` as a string and then treat `...string..."` as a parse error. – Steve Peak Dec 15 '13 at 01:16
  • If you only use two states, you need to store the quotation mark you started with. When you then encounter a quotation mark, check whether it is the starting mark; if it is not, continue in the same state. – Thayne Dec 21 '13 at 15:33
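
To make the bookkeeping in the last comment concrete, here is a minimal PLY sketch. It is a variation, not Thayne's exact proposal: it uses a single exclusive state and stores the whole opening delimiter (one or three quote characters) on the lexer. The attribute names `string_delim` and `string_start` and the helper `string_values` are made up for this illustration. Escapes are kept raw rather than decoded, and a string left open at end of input simply yields no token.

```python
import ply.lex as lex  # third-party package: pip install ply

tokens = ('STRING',)

# One exclusive state serves all four literal forms.
states = (
    ('string', 'exclusive'),
)

t_ignore = ' \t'

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_begin_string(t):
    r'"""|\'\'\'|"|\''
    # Triple quotes come first so """ is not read as "" followed by ".
    t.lexer.string_delim = t.value           # remember how the string was opened
    t.lexer.string_start = t.lexer.lexpos    # index just past the opening delimiter
    t.lexer.push_state('string')

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

def t_string_escape(t):
    r'\\[\s\S]'
    # A backslash protects the next character, including a quote.
    t.lexer.lineno += t.value.count('\n')

def t_string_quote(t):
    r'["\']'
    lexer = t.lexer
    delim = lexer.string_delim
    if not lexer.lexdata.startswith(delim, t.lexpos):
        return None  # a different quote, or too few of them: ordinary content
    t.type = 'STRING'
    t.value = lexer.lexdata[lexer.string_start:t.lexpos]
    lexer.lexpos = t.lexpos + len(delim)     # skip the whole closing delimiter
    lexer.pop_state()
    return t

def t_string_newline(t):
    r'\n'
    t.lexer.lineno += 1
    # Assumption: as in Python, only triple-quoted strings may span lines.
    if len(t.lexer.string_delim) == 1:
        print("Unterminated string on line %d" % (t.lexer.lineno - 1))
        t.lexer.pop_state()

def t_string_content(t):
    r'[^"\'\\\n]+'
    pass  # ordinary characters; the text is sliced out when the string closes

t_string_ignore = ''

def t_string_error(t):
    # Only a lone backslash at the very end of the input reaches this rule.
    t.lexer.skip(1)

lexer = lex.lex()

def string_values(text):
    """Hypothetical helper: return the contents of every string literal in text."""
    lexer.begin('INITIAL')
    lexer.input(text)
    return [tok.value for tok in lexer]
```

With this layout the `"string..'...string..."` case from the comments stays inside one string, because the `'` does not match the stored `"` delimiter and is treated as content.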

2 Answers

1

Isolate what the string forms have in common so that a single state can handle them, and try to build an automaton with fewer states. If you don't mind using an external library, have a look at PLY (Python Lex-Yacc), which makes the job easier.

You will need the basics of lex and yacc, though. Sample code is shown below:

tokens = (
    'NAME','NUMBER',
    'PLUS','MINUS','TIMES','DIVIDE','EQUALS',
    'LPAREN','RPAREN',
    )

# Tokens

t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_EQUALS  = r'='
t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_NAME    = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print("Integer value too large %s" % t.value)
        t.value = 0
    return t

# Ignored characters
t_ignore = " \t"

def t_newline(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Parsing rules

precedence = (
    ('left','PLUS','MINUS'),
    ('left','TIMES','DIVIDE'),
    ('right','UMINUS'),
    )

# dictionary of names
names = { }

def p_statement_assign(t):
    'statement : NAME EQUALS expression'
    names[t[1]] = t[3]

def p_statement_expr(t):
    'statement : expression'
    print(t[1])

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    if t[2] == '+'  : t[0] = t[1] + t[3]
    elif t[2] == '-': t[0] = t[1] - t[3]
    elif t[2] == '*': t[0] = t[1] * t[3]
    elif t[2] == '/': t[0] = t[1] / t[3]

def p_expression_uminus(t):
    'expression : MINUS expression %prec UMINUS'
    t[0] = -t[2]

def p_expression_group(t):
    'expression : LPAREN expression RPAREN'
    t[0] = t[2]

def p_expression_number(t):
    'expression : NUMBER'
    t[0] = t[1]

def p_expression_name(t):
    'expression : NAME'
    try:
        t[0] = names[t[1]]
    except LookupError:
        print("Undefined name '%s'" % t[1])
        t[0] = 0

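def p_error(t):
    # PLY passes None when the error occurs at the end of the input
    if t is None:
        print("Syntax error at end of input")
    else:
        print("Syntax error at '%s'" % t.value)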

import ply.yacc as yacc
yacc.yacc()

while True:
    try:
        s = input('calc > ')   # Use raw_input on Python 2
    except EOFError:
        break
    yacc.parse(s)
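
The calculator sample above has no string token. As a rough sketch of how far the number of states can be reduced, the four literal forms can also be matched with ordinary regular expressions and no lexer states at all. The names `STRING_RE` and `find_strings` are made up for this illustration, and string prefixes such as `r` or `b` are not handled. PLY compiles its rule patterns with `re.VERBOSE`, so the patterns below avoid literal spaces and `#`; using them as PLY rule patterns is an assumption that has not been tested here.

```python
import re

# Order matters: the triple-quoted forms must be tried before the single-quoted
# ones, otherwise """ would be read as an empty string "" followed by a stray ".
TRIPLE_DQ = r'"""(?:\\[\s\S]|[^\\])*?"""'
TRIPLE_SQ = r"'''(?:\\[\s\S]|[^\\])*?'''"
# Single-quoted forms may not contain an unescaped newline.
SINGLE_DQ = r'"(?:\\[\s\S]|[^"\\\n])*"'
SINGLE_SQ = r"'(?:\\[\s\S]|[^'\\\n])*'"

STRING_RE = re.compile("|".join([TRIPLE_DQ, TRIPLE_SQ, SINGLE_DQ, SINGLE_SQ]))

def find_strings(text):
    """Return every string literal in text, delimiters included."""
    return [m.group(0) for m in STRING_RE.finditer(text)]

if __name__ == "__main__":
    for literal in find_strings('x = "one" + \'two\' + """three\nlines"""'):
        print(repr(literal))
```

The trade-off against a state-based lexer is error reporting: a regex either matches the whole literal or nothing, so an unterminated string cannot be diagnosed at the point where it went wrong.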
IamSeekingAns
0

Try using the pyparsing module. With this module you can easily parse strings with good style without using regular expressions.

The following example should help you parse expressions like "string..." as well as """string""".

from pyparsing import Word, OneOrMore, alphas

# The quotes must be part of the input text, so wrap the literal in single quotes.
string = '"""string"""'
w = OneOrMore('"') + Word(alphas + '.') + OneOrMore('"')
print(w.parseString(string))
PaulOverflow