How can I extract the information I want using this RegEx or better?

Question

So here's the Regular Expression I have so far.

r"(?s)(?<=([A-G][1-3])).*?(?=[A-G][1-3]|$)"

It looks behind for a letter A-G followed by a digit 1-3, and looks ahead for the same pattern (or the end of the string). I've tested it using Regex101.

This is the string I'm testing it against:

"A1 **ACBFEKJRQ0Z+-** F2 **.,12STLMGHD** F1 **9)(** D2 **!?56WXP** C1 **IONVU43\"\'** E1 **Y87><** A3 **-=.,\'\"!?><()@**"

(The string shouldn't have any spaces, but I needed to embolden the values after each letter+digit pair so it is easier to see what I want.)

What I want is to store each identifier (the group match, e.g. "A1") together with the characters that follow it (the "Full Match"), so I can use them later.

In the end I would like to end up with either a list of tuples or a dictionary for example:

dct = {"A1": "ACBFEKJRQ0Z+-", "F2": ".,12STLMGHD", "F1": "9)(", "next group match": "characters that follow"}

or

list_of_tuples = [("A1", "ACBFEKJRQ0Z+-"), ("F2", ".,12STLMGHD"), ("F1", "9)("), ("next group match", "characters that follow")]

The string being compared to the RegEx won't ever have something like "C1F2" btw

P.S. Excuse the terrible explanation, any help is greatly appreciated

Something like http://ideone.com/kvL59F with [`(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)`](https://regex101.com/r/xlC4tZ/1)? — Wiktor Stribiżew, Oct 18 '16 at 19:59

Wiktor Stribiżew · Accepted Answer · 2016-10-18T20:06:20.933


I suggest

(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)

See the regex demo

The (?s) enables . to match line breaks, ([A-G][1-3]) captures the uppercase letter+digit pair into Group 1, and ((?:(?![A-G][1-3]).)*) captures into Group 2 every following character up to (but not including) the next uppercase letter+digit pair.
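To make the three parts easier to see, here is a minimal sketch of the same pattern written out with re.VERBOSE. The names `pattern`, `sample` and `pairs`, and the short sample string, are made up for illustration and are not part of the question.

```python
import re

# Same pattern as above, split over several lines with re.VERBOSE.
# re.DOTALL plays the role of the inline (?s) flag.
pattern = re.compile(
    r"""
    ([A-G][1-3])            # Group 1: the key, e.g. A1 or F2
    (                       # Group 2: everything up to the next key
        (?:
            (?![A-G][1-3])  # stop if a key starts at this position
            .               # otherwise consume one character
        )*
    )
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sample, shorter than the string in the question.
sample = "A1xyB2q\nrC3"
pairs = pattern.findall(sample)
print(pairs)
```

Each tuple holds Group 1 and Group 2 for one match; a key with nothing after it simply gets an empty string as its value.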

The same regex can be unrolled as ([A-G][1-3])([^A-G]*(?:[A-G](?![1-3])[^A-G]*)*) for better performance (no re.DOTALL modifier or (?s) is necessary with it). See this demo.
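A quick way to convince yourself that the two forms agree is to run both on the same input. This is only a sketch: the variable names are mine, and the text is the start of the question's string with a newline added by me to show that the unrolled form copes with line breaks without DOTALL.

```python
import re

tempered = re.compile(r"(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)")
unrolled = re.compile(r"([A-G][1-3])([^A-G]*(?:[A-G](?![1-3])[^A-G]*)*)")

# Start of the question's string, with a newline inserted for illustration.
# [^A-G] is a negated class, so it matches "\n" without any flag.
text = "A1ACBFEKJRQ0Z+-F2.,12ST\nLMGHDF19)(D2!?56WXP"

same = tempered.findall(text) == unrolled.findall(text)
print(same)
```

The unrolled form consumes runs of non-A-G characters in one step and only checks the lookahead when it meets a letter A-G, which is where the performance gain on long strings would come from.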

Python demo:

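import re
# Group 1 is the key (A1, F2, ...); Group 2 is the text up to the next key.
regex = r"(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)"
test_str = """A1ACBFEKJRQ0Z+-F2.,12STLMGHDF19)(D2!?56WXPC1IONVU43"'E1Y87><A3-=.,'"!?><()@"""
# re.findall returns a list of (key, value) tuples; dict() turns it into a mapping.
dct = dict(re.findall(regex, test_str))
print(dct)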
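If you would rather have the list of tuples from the question, re.findall already returns one, so the dict() call can simply be dropped. The sketch below also covers a case the question does not mention: if a key could occur more than once, dict() keeps only the last value. The input string and the name `grouped` are hypothetical.

```python
import re
from collections import defaultdict

regex = r"(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)"

# Hypothetical input in which the key A1 occurs twice.
text = "A1abcF2xyzA1def"

# A list of (key, value) tuples, in order of appearance.
pairs = re.findall(regex, text)
print(pairs)

# dict(pairs) keeps only the last value seen for a repeated key.
print(dict(pairs))

# Collect every value per key instead.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
print(dict(grouped))
```

For the string in the question every key is unique, so the plain dict() is enough there.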

edited Oct 18 '16 at 20:06

answered Oct 18 '16 at 20:04

Wiktor Stribiżew



The `(?:(?![A-G][1-3]).)*` is called a [tempered greedy token](http://stackoverflow.com/a/37343088/3832970) and it is useful when you need to match up to a sequence of chars. I'd advise using the unrolled version if you have long strings, and the concise version if you have shorter strings. – Wiktor Stribiżew Oct 18 '16 at 20:09
If you noticed, all of the alphanumeric characters and some symbols are contained in that string plus the identifiers (A1, F2, C1, A3 etc.). I'm attempting to make something similar to what the ADFGVX Cipher does so I needed a way to split the data. So thank you for all the help! – Callum Oct 18 '16 at 20:16
