I'm new to coding. Can anyone help me convert the text below into a dictionary, using regex or any other technique?

The text "Bus Number: Departure" is common to all messages/blocks.

KPN_Sleeper: Bus Number: Departure 
Bus code: Kpn-866489 KA-01-7233 Bangalore 
AC Sleeper/56 Seats
24 Seats booked 

SRS: Bus Number: Departure 
Bus code: SRS-5858 KA-31-5985 Bangalore 


SAM: Bus Number: Departure 
Bus code: SAM-0077 TN-23-0777 Chennai 

Expected output:

{0:{
  "Bus_name": "KPN_Sleeper",
  "Bus code":"Kpn-866489",
  "Bus Number": "KA-01-7233",
  "Departure": "Bangalore",
  "others": "AC Sleeper/56 Seats 24 Seats booked "
},
1:{
  "Bus_name": "SRS",
  "Bus code":"SRS-5858",
  "Bus Number": "KA-31-5985",
  "Departure": "Bangalore",
  "others": ""
}}

Since I'm new to coding and regex, I'm finding it difficult to construct the pattern.

Rizwan M.Tuman
Dev DB
  • In the above text, "Bus Number: Departure " is common and repeats in every block. Can we build a regex based on this? – Dev DB Feb 11 '21 at 07:25
  • It looks like you are looking to create a regex, but do not know where to get started. Please check [Reference - What does this regex mean](https://stackoverflow.com/questions/22937618) resource, it has plenty of hints. Also, refer to [Learning Regular Expressions](https://stackoverflow.com/questions/4736) post for some basic regex info. Once you get some expression ready and still have issues with the solution, please edit the question with the latest details and we'll be glad to help you fix the problem. – Wiktor Stribiżew Feb 11 '21 at 08:27
  • It's a pity this question was closed so quickly; I'm not sure we should reduce this to a matter of finding the right regex. Solving the OP's problem with a single multiline regex seems quite daunting (I tried and couldn't do it); it's quite an advanced use of regexes. I was going to suggest pre-processing the data to split the input into blocks, then using a simpler regex to parse a single object, which seems like a more helpful strategy for an OP new to coding and regex. – joao Feb 11 '21 at 09:09
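
The pre-processing idea from the last comment could look roughly like the sketch below: split the text into blocks on blank lines, then parse each block with a much simpler regex. The helper name `parse_block` and the assumption that blocks are always separated by at least one blank line are made up for this illustration and are not stated in the question.

```python
import re

def parse_block(block):
    """Hypothetical helper: turn one block of lines into a dictionary."""
    lines = [line.strip() for line in block.strip().splitlines()]
    # First line looks like "<name>: Bus Number: Departure"
    name = lines[0].split(":", 1)[0]
    # Second line looks like "Bus code: <code> <number> <departure...>"
    match = re.match(r"Bus code:\s*(\S+)\s+(\S+)\s+(.+)", lines[1])
    code, number, departure = match.groups()
    return {
        "Bus_name": name,
        "Bus code": code,
        "Bus Number": number,
        "Departure": departure,
        "others": " ".join(lines[2:]),  # any remaining lines, joined by spaces
    }

text = (
    "KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore \n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n"
    "\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore \n"
    "\n"
    "\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
)

# Assumption: one or more blank (or whitespace-only) lines separate the blocks.
blocks = [b for b in re.split(r"\n\s*\n", text) if b.strip()]
result = {index: parse_block(block) for index, block in enumerate(blocks)}
print(result)
```

This version does no error handling: a block whose second line does not start with "Bus code:" would raise an exception, which may or may not be what you want.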

1 Answer

Given your comment, you can try this:

^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?

Sample code:

import re

regex = r"^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?"

test_str = ("KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore dfdf\n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore dfdf dfd\n\n\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
    "asdfadf ;kasdjlfads;f lkadsjf")

# re.MULTILINE makes ^ and $ match at the start and end of every line
matches = re.finditer(regex, test_str, re.MULTILINE)

for match in matches:
    print("Bus Name: " + match.group(1) + " Bus Code: " + match.group(2)
          + " Bus No: " + match.group(3) + " Departure: " + match.group(4))

# The "others" value is in match.group(5); it is None when a block
# has no extra lines, so check it before using it.
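
To go from the matches to the dictionary shape the question asks for, one possible sketch is shown below. It reuses the regex above unchanged. The helper name `parse_buses`, the stripping of the departure value, and the collapsing of the "others" lines into a single space-separated string are assumptions made for this example, not part of the original answer.

```python
import re

# Same pattern as in the answer above.
REGEX = r"^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?"

def parse_buses(text):
    """Hypothetical helper: return {index: {...}} for every block in text."""
    result = {}
    for index, match in enumerate(re.finditer(REGEX, text, re.MULTILINE)):
        others = match.group(5) or ""  # group 5 is None when there are no extra lines
        result[index] = {
            "Bus_name": match.group(1),
            "Bus code": match.group(2),
            "Bus Number": match.group(3),
            "Departure": match.group(4).strip(),  # drop trailing spaces
            "others": " ".join(others.split()),   # newlines become single spaces
        }
    return result

text = (
    "KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore \n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n"
    "\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore \n"
    "\n"
    "\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
)

buses = parse_buses(text)
print(buses)
```

Using the match index as the dictionary key mirrors the 0, 1, ... keys in the question; a plain list of dictionaries would work just as well if the keys carry no meaning.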

Explanation:

  1. ^(.*):\s* --> (.*) is the first capturing group and gets the bus name; :\s* matches the colon and any whitespace after it

  2. Bus Number: Departure\s*\n --> Bus Number: Departure followed by blank spaces and a newline

  3. Bus code:\s* --> the next line begins with Bus code, a colon, and optional whitespace

  4. ([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*

    a) ([^ ]+) --> bus code \s --> blank space

    b) ([^ ]+) --> bus number \s--> blank space

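    c) ([^\n]+) --> departure; it may contain multiple words

    d) [ \t]* --> meant to cover trailing spaces after the departure; because ([^\n]+) is greedy, those spaces actually end up inside group 4, so strip the captured value if that matters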

  5. (?:\n|$) --> It covers the newline or end of string

  6. ((?:[^\n]+(?:\n|$))+)?

    a) [^\n]+(?:\n|$) --> matches a non-empty line: one or more characters other than a newline, followed by a newline or the end of the string

    b) ?: makes the group non-capturing

    c) + means there can be multiple such lines

    d) the outer () collects all of those lines into one group (group 5)

    e) the trailing ? makes the whole "others" part optional
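
The numbered steps above can also be written directly into the pattern. The sketch below is the same regex, rewritten with `re.VERBOSE` and named groups so that each step sits on its own commented line. The group names (`name`, `code`, `number`, `departure`, `others`) and the sample string are chosen for this illustration only.

```python
import re

# re.VERBOSE ignores unescaped whitespace in the pattern, so literal spaces
# are written as "\ ". Spaces inside a character class such as [^ ] are kept.
PATTERN = re.compile(
    r"""
    ^(?P<name>.*):\s*                   # 1. bus name, colon, optional whitespace
    Bus\ Number:\ Departure\s*\n        # 2. fixed header text, then a newline
    Bus\ code:\s*                       # 3. start of the second line
    (?P<code>[^ ]+)\s                   # 4a. bus code
    (?P<number>[^ ]+)\s                 # 4b. bus number
    (?P<departure>[^\n]+)[ \t]*         # 4c, 4d. departure, i.e. the rest of the line
    (?:\n|$)                            # 5. newline or end of string
    (?P<others>(?:[^\n]+(?:\n|$))+)?    # 6. optional run of non-empty lines
    """,
    re.MULTILINE | re.VERBOSE,
)

sample = (
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
    "free text"
)

match = PATTERN.search(sample)
# groupdict() maps each group name to its text, or None for an unmatched group.
fields = {key: (value or "").strip() for key, value in match.groupdict().items()}
print(fields)
```

Named groups let you read values with `match.group("code")` instead of counting parentheses, which tends to be easier to maintain when the pattern changes.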

Rizwan M.Tuman