
I am comparing two strings while ignoring the punctuation marks in both.

Here is my code snippet:

import re
punctuation = r"[.?!,;:-']"
string1 = re.sub(punctuation, r"", string1)
string2 = re.sub(punctuation, r"", string2)

After running this code I get the following exception:

bad character range :-' at position 6

How do I get rid of this exception? What does "bad character range" mean?

Tomerikoo
jdk

2 Answers


- has a special meaning inside [] in a regular expression pattern: for example, [A-Z] matches the ASCII uppercase letters (from A to Z). If you need a literal -, you have to escape it, i.e.

punctuation = r"[.?!,;:\-']"

I also want to point out regex101.com, which is useful for testing regular expression patterns.
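As a minimal sketch of the escaped pattern applied to the comparison described in the question: the helper name strip_punctuation and the two sample strings are assumptions made up for illustration, not part of the original code.

```python
import re

# Escaped hyphen: \- is a literal "-" inside the character class.
punctuation = r"[.?!,;:\-']"


def strip_punctuation(text):
    """Hypothetical helper: remove the listed punctuation marks from text."""
    return re.sub(punctuation, "", text)


# Sample inputs (assumed for illustration only).
string1 = "Well-known: it's here!"
string2 = "Wellknown its here?"

# Both strings reduce to the same text once punctuation is removed.
same = strip_punctuation(string1) == strip_punctuation(string2)
print(same)
```

Note that only the listed marks are removed; whitespace is left alone, so strings that differ in spacing would still compare as different.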

Daweo
  • @Mandy8055 Both are equally valid. – alani Jul 08 '20 at 07:39
  • @Mandy8055 True. – alani Jul 08 '20 at 07:40
  • @Mandy8055 For me this is a matter of preference: the solution with `-` as the last character inside `[]` is minimally shorter (no need for a backslash), while the solution with `\-` shows the intent explicitly, i.e. use a literal `-`. – Daweo Jul 08 '20 at 08:05
  • `-` is a special character; escaping it as `\-` solved my problem. Thanks @Daweo – jdk Jul 08 '20 at 08:39

A - inside a character class [...] is used to denote a range of characters, for example: [0-9] would be equivalent to [0123456789].

Here, the :-' would mean any character between : and '. However, if you look up the character numbers, you see that they are in the wrong order for that to be a valid range:

>>> ord(":")
58
>>> ord("'")
39

In the opposite order '-: (inside the []) it would be a valid character range.
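The following sketch illustrates both points: the original pattern is rejected when it is compiled, while the reversed range compiles but matches far more than the two punctuation marks. The probe string "a1(-:;" is an arbitrary choice for illustration.

```python
import re

# The original pattern fails at compile time: ":" (58) comes after "'" (39).
try:
    re.compile(r"[.?!,;:-']")
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Reversed, '-: is a valid range covering code points 39 through 58,
# which silently includes the digits and several other symbols.
reversed_range = re.compile(r"['-:]")
matched = [ch for ch in "a1(-:;" if reversed_range.fullmatch(ch)]

print(error_message)
print(matched)
```

Here "a" (97) and ";" (59) fall outside the range, whereas "1", "(", "-" and ":" all fall inside it.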

In any case, it is not what you want. You want the - to be interpreted as a literal - character.

There are two ways to achieve this. Either:

  • escape the - by writing \-

  • or put the - as the first or last character inside the [], e.g. r"[.?!,;:'-]"
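A quick check that these options behave the same, assuming a made-up sample sentence: each pattern below treats - as a literal character, so all of them should strip exactly the same marks.

```python
import re

escaped = r"[.?!,;:\-']"      # hyphen escaped with a backslash
hyphen_last = r"[.?!,;:'-]"   # hyphen as the last character in the class
hyphen_first = r"[-.?!,;:']"  # hyphen as the first character in the class

# Sample text (assumed for illustration only).
sample = "Wait - what?! It's a well-known fact: yes; really."

# Collect the distinct results; identical behavior means one unique string.
results = {re.sub(p, "", sample) for p in (escaped, hyphen_last, hyphen_first)}
print(results)
```

Which form to use is a matter of taste: the escaped form states the intent explicitly, while the positional forms avoid the backslash.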

alani