How does one add three digits within an alphanumeric string using regular expressions in Python?

For instance, I want to add three zeroes after the dash (-) but before the last digit in the string, so that A1-1 becomes A1-0001.

My guess was:

df['column'].str.replace('(^C3-\d{1)$)', ???)
accdias
Seunghoon Jung
  • If there has to be a last digit, you could try `^([A-Z]\d-(?=\d+$))` and replace with `\1000` – The fourth bird Jan 29 '20 at 22:00
  • Not quite sure I understand. It looks like you are looking for C3-#{1, where the # is a number, but your example A1-1 doesn't match that. Can you give a real example of your data? – AlwaysData Jan 29 '20 at 22:06

2 Answers

You may use

df['column'] = df['column'].str.replace(r'^(C3-)(\d)$', r'\g<1>000\2', regex=True)

See the regex demo. If C can be any uppercase ASCII letter, replace it with [A-Z].

Or, a bit more generic for 1-3 digit numbers:

df['column'] = df['column'].str.replace(r'^(C3-)(\d{1,3})$', lambda x: "{}{}".format(x.group(1), x.group(2).zfill(4)), regex=True)

Details

  • ^ - start
  • (C3-) - Group 1: C3-
  • (\d) - Group 2: a digit (\d{1,3} matches 1 to 3 digits)
  • $ - end of string
  • \g<1> - value of Group 1
  • 000 - three zeros
  • \2 - value of Group 2
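To see the replacement template from the list above in isolation, here is a minimal sketch that uses only the standard `re` module on single strings instead of a DataFrame column. The variable names are illustrative and not part of the answer. Writing `\1000` would not be read as "Group 1 followed by 000", which is why the explicit `\g<1>` form is used.

```python
import re

# Same pattern as in the answer: a literal "C3-" prefix and exactly one digit.
pattern = re.compile(r'^(C3-)(\d)$')

# \g<1> ends the group reference explicitly, so the zeros after it are literal text.
padded = pattern.sub(r'\g<1>000\2', 'C3-1')
print(padded)  # C3-0001

# A string that does not match the whole pattern is returned unchanged.
unchanged = pattern.sub(r'\g<1>000\2', 'C3-12')
print(unchanged)  # C3-12
```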

A Python test:

import pandas as pd
df = pd.DataFrame({'column': ['C3-1', 'C3-12', 'C3-123', 'C3-1234']})
df['column'] = df['column'].str.replace(r'^(C3-)(\d{1,3})$', lambda x: "{}{}".format(x.group(1), x.group(2).zfill(4)), regex=True)

Output:

>>> df
    column
0  C3-0001
1  C3-0012
2  C3-0123
3  C3-1234
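If the prefix is not always C3, the `[A-Z]` variant mentioned earlier can be sketched as follows. This assumes every code is one uppercase ASCII letter, one digit, a dash, and then 1 to 3 digits. The helper name `pad_code` and the sample values are hypothetical; the sketch uses plain `re` so it runs without pandas.

```python
import re

# Assumed shape of a code: uppercase letter, digit, dash, then 1 to 3 digits.
CODE = re.compile(r'^([A-Z]\d-)(\d{1,3})$')

def pad_code(value):
    """Left-pad the digits after the dash to width 4; leave non-matching strings as they are."""
    return CODE.sub(lambda m: m.group(1) + m.group(2).zfill(4), value)

samples = ['A1-1', 'B7-12', 'C3-123', 'C3-1234', 'c3-1']
result = [pad_code(s) for s in samples]
print(result)
# ['A1-0001', 'B7-0012', 'C3-0123', 'C3-1234', 'c3-1']
```

On a DataFrame column of strings, the same helper could be applied with `df['column'].map(pad_code)`.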
Wiktor Stribiżew
  • I have a question on r'\g<1>000\2' : how come there is only <> for 1 and not for 2 at the end? I am not used to using g<> regex. So can you please elaborate on this? – Seunghoon Jung Jan 30 '20 at 16:08
  • @SeunghoonJung `\g<>` is an unambiguous version of a `\1` backreference. It is only necessary if there is a number after a backreference. See [this thread](https://stackoverflow.com/questions/5984633/python-re-sub-group-number-after-number) for more details. – Wiktor Stribiżew Jan 30 '20 at 16:10

Here is an alternative without regular expressions:

import pandas as pd
df = pd.DataFrame({'C': ['A2-2', 'A3-001', 'C3-1', 'C3-12', 'C3-123', 'C3-1234']})
df

Output:

    C
0     A2-2
1   A3-001
2     C3-1
3    C3-12
4   C3-123
5  C3-1234
df.C = df.C.apply(lambda s: s[:s.index('-') + 1] + s[s.index('-') + 1:].zfill(4))
df

Output:

    C
0  A2-0002
1  A3-0001
2  C3-0001
3  C3-0012
4  C3-0123
5  C3-1234
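One caveat with the lambda above is that `str.index` raises `ValueError` for a value that contains no dash. A possible variant, sketched here with `str.partition`, leaves such values untouched. The helper name `pad_after_dash`, its `width` parameter, and the `'NODASH'` sample are illustrative assumptions, not part of the answer.

```python
def pad_after_dash(value, width=4):
    """Zero-pad the part after the first dash to `width` characters."""
    head, sep, tail = value.partition('-')
    if not sep:
        # No dash found: return the value unchanged instead of raising.
        return value
    return head + sep + tail.zfill(width)

values = ['A2-2', 'A3-001', 'C3-1234', 'NODASH']
padded_values = [pad_after_dash(v) for v in values]
print(padded_values)
# ['A2-0002', 'A3-0001', 'C3-1234', 'NODASH']
```

With the DataFrame from this answer, it could be used as `df.C = df.C.apply(pad_after_dash)`.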
accdias