
I'm working on a project and, unfortunately, data extracted from another program needs further formatting. Take a look at this line:

Instructor :    95371    XXX XXX XXX     Associate Professor    Course Name EE 311   Microprocessors     lecture    834 1   32  3   3      1             08:00 AM - 08:50 AM       1             09:00 AM - 09:50 AM       3             10:00 AM - 10:50 AM        21  Total : 3   Section Position :  Serial  Campus  Hrs Weekly      Activity    Semester:   Time    Schedule Type : 411 Reg.    Regular First Semester 41/42    Rank :  Course  

Each line must start with Instructor, followed by : and an ID. The name may be missing. After that comes the teacher's rank, which is one of the following:

Associate Professor
Assistant Professor
Lecturer
Teacher
Teaching Assistant

After the word lecture, exercise, or practical there are six numbers. I need to extract the first one from the right, that is, the sixth. Could you please suggest a regular expression to start from? A solution using the Qt library is welcome.

CroCo
  • Looks like you are looking to create a regex, but do not know where to get started. Please check [Reference - What does this regex mean](https://stackoverflow.com/questions/22937618) resource, it has plenty of hints. Also, refer to [Learning Regular Expressions](https://stackoverflow.com/a/2759417/3832970) post for some basic regex info. Once you get some expression ready and still have issues with the solution, please edit the question with the latest details and we'll be glad to help you fix the problem. – Wiktor Stribiżew Sep 09 '19 at 12:29
  • What is the delimiter between the logical columns? I see a bunch of spaces, but then again the role and other fields themselves also contain spaces. – Tim Biegeleisen Sep 09 '19 at 12:30
  • What are the separators between the fields? Spaces? Tabs? Will there always be multiple spaces as separators? And while regular expressions can be powerful, they are also often very hard to get right. There's a saying going something like: I have one problem, I tried to solve it with a regex, now I have *two* problems. – Some programmer dude Sep 09 '19 at 12:31
  • @TimBiegeleisen no specific format. sometimes tab sometimes white space. – CroCo Sep 09 '19 at 12:31
  • You need to start with cleaning up your source data and getting proper delimiters. Until you do that, forget about regex. – Tim Biegeleisen Sep 09 '19 at 12:33
  • @TimBiegeleisen it is hard as I said. Data comes in this form. As you can see, a name could have several strings. I am totally sure regular expression is the right choice for this problem. – CroCo Sep 09 '19 at 12:35
  • Typically, if there's no specific format, text should start in the same column throughout the file. If tabs are actually used, try to "guess" the tab width so that the whole file looks aligned (most text editors have an option for this). Once you do that, you only need to determine the starting column and the width of each column, and then remove trailing spaces from each field. – xception Sep 09 '19 at 12:37
  • OK can we say for certain that data would only have at most one contiguous space, while two or more spaces, or maybe a tab, would constitute a field separator? – Tim Biegeleisen Sep 09 '19 at 12:41
  • @TimBiegeleisen yes of course. – CroCo Sep 09 '19 at 12:41
  • You can start with this https://regexr.com/4klef – Thomas Sablik Sep 09 '19 at 12:51
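
The separator convention agreed on in the comments above (a single space belongs to a field; two or more whitespace characters, or a tab, separate fields) can be sketched as follows. This is only a minimal illustration using the standard library's `std::regex` instead of Qt; the function name `splitFields` is hypothetical and not from the thread.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits a line into fields, assuming that a run of two or more whitespace
// characters, or a single tab, is a field separator (the convention from the
// comments). Single spaces stay inside a field.
std::vector<std::string> splitFields(const std::string& line) {
    static const std::regex separator(R"(\s{2,}|\t)");
    std::vector<std::string> fields;
    // Submatch index -1 iterates over the text between separator matches.
    std::sregex_token_iterator it(line.begin(), line.end(), separator, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() > 0) {  // skip empty tokens from leading separators
            fields.push_back(it->str());
        }
    }
    return fields;
}
```

Note that this relies entirely on that convention. In the sample line as shown in the question, `834 1` is separated by a single space, so those two numbers would end up in the same field.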

1 Answer


This regex matches your text and captures the value in group 1:

Instructor :\s*\d+\s+(?:\w+(?: \w+)*)\s+(?:Associate Professor|Assistant Professor|Lecturer|Teacher|Teaching Assistant)\s+Course Name\s+\w+ \d+\s+\w+(?: \w+)*\s+(?:lecture|exercise|practical)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\d+)\s+\d{2}:\d{2} (?:AM|PM) - \d{2}:\d{2} (?:AM|PM)\s+\d\s+\d{2}:\d{2} (?:AM|PM) - \d{2}:\d{2} (?:AM|PM)\s+\d\s+\d{2}:\d{2} (?:AM|PM) - \d{2}:\d{2} (?:AM|PM)\s+\d+\s+Total : \d\s+Section Position : \s+Serial\s+Campus\s+Hrs Weekly\s+Activity\s+Semester:\s+Time\s+Schedule Type : \d+ Reg\.\s+Regular First Semester \d{2}\/\d{2}\s+Rank :\s+Course\s+
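
If only the sixth number is needed, a much shorter pattern anchored on the activity keyword may be enough, and it does not depend on the name being present. The following is a minimal sketch under that assumption, using `std::regex` from the standard library rather than Qt; the function name `extractSixthNumber` is hypothetical. The same pattern should also be accepted by Qt's `QRegularExpression`, though that is untested here.

```cpp
#include <regex>
#include <string>

// Looks for "lecture", "exercise" or "practical", skips the next five
// numbers and stores the sixth one in `value`.
// Returns false (leaving `value` untouched) if the line has no such shape.
bool extractSixthNumber(const std::string& line, int& value) {
    static const std::regex pattern(
        R"(\b(?:lecture|exercise|practical)\s+(?:\d+\s+){5}(\d+)\b)");
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) {
        return false;
    }
    value = std::stoi(match[1].str());
    return true;
}
```

Unlike the full pattern above, this does not validate the rest of the line, so it is more tolerant of missing names or varying trailing columns but also less strict about malformed input.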
Thomas Sablik