I am trying to parse, in Python, a text file from the AER (Alberta Energy Regulator) that lists the well licenses issued each day in Alberta. I want to split each license into the fields named in the file header (well name, unique identifier, license number, etc.) and collect them in a list that can then be loaded into a database.
The problem is that the file's layout (see the excerpt below) is not friendly to parsing: there is no delimiter, and it is meant to be human-readable. My experience with string manipulation is limited and I do not know how to approach this.
Here is a snippet of the text file in question:
DATE: 02 July 2019
--------------------------------------------------------------------------------------------
WELL NAME LICENCE NUMBER MINERAL RIGHTS GROUND ELEVATION
UNIQUE IDENTIFIER SURFACE CO-ORDINATES BOARD FIELD CENTRE PROJECTED DEPTH
LAHEE CLASSIFICATION FIELD TERMINATING ZONE
DRILLING OPERATION WELL PURPOSE WELL TYPE SUBSTANCE
LICENSEE SURFACE LOCATION
--------------------------------------------------------------------------------------------
MEG K7N HARDY 4-7-77-5 0483923 ALBERTA CROWN 571.7M
106/04-07-077-05W4/02 S 572.4M W 278.3M BONNYVILLE 1600.0M
DEV (NC) HARDY MCMURRAY FM
HORIZONTAL RESUMPTIONPRODUCTION (SCHEME) CRUDE BITUMEN
MEG ENERGY CORP. 09-07-077-05W4
SPL 11-24 HZ MARTEN 14-25-76-6 0494994 ALBERTA CROWN 705.3M
100/14-25-076-06W5/00 S 566.0M E 800.6M ST. ALBERT 2700.0M
OUT (C) MARTEN CLEARWATER FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
SPUR PETROLEUM LTD. 11-24-076-06W5
SPL 10-24 HZ MARTEN 5-23-76-6 0494995 ALBERTA CROWN 705.5M
100/05-23-076-06W5/00 S 566.3M W 800.1M ST. ALBERT 2700.0M
OUT (C) MARTEN CLEARWATER FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
SPUR PETROLEUM LTD. 10-24-076-06W5
SURGE ENERGY HZ103 VALHALLA 6-7-75-8 0494996 ALBERTA CROWN 770.8M
103/06-07-075-08W6/00 S 372.0M E 324.5M GRANDE PRAIRIE 3350.0M
DEV (NC) VALHALLA DOIG FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
SURGE ENERGY INC. 13-06-075-08W6
CNRL ET AL HZ KARR 4-16-66-3 0494997 ALBERTA CROWN 770.7M
100/04-16-066-03W6/00 N 623.4M E 127.5M GRANDE PRAIRIE 5295.0M
DEV (NC) KARR DUNVEGAN FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
CANADIAN NATURAL RESOURCES LIMITED 05-14-066-03W6
I do not need the date or the header information between the dashed lines. For each five-line block, I need to extract the value of every field on every line, as laid out by the header. I have tried basic string manipulation in Python and regular expressions, but neither has come close and I am at a loss. Let me know if you need more detail; I understand that this is a big ask and a bit convoluted.
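To make the task concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes that every licence occupies exactly five non-blank lines after the second dashed line, that licence numbers are always seven digits, and that elevations, depths, and surface locations always look like the ones in the excerpt. The names `parse_licences`, `LINE1`, `LINE2`, `LINE5`, `line3_raw`, and `line4_raw` are made up for illustration. Lines 3 and 4 of each block have no unambiguous pattern once the spacing is collapsed, so the sketch keeps them as raw text. If the original file pads fields to fixed columns, slicing by column position would likely be more robust for those lines.

```python
import re

# Two blocks copied from the excerpt above (whitespace already collapsed).
SAMPLE = """\
DATE: 02 July 2019
--------------------------------------------------------------------------------------------
WELL NAME LICENCE NUMBER MINERAL RIGHTS GROUND ELEVATION
UNIQUE IDENTIFIER SURFACE CO-ORDINATES BOARD FIELD CENTRE PROJECTED DEPTH
LAHEE CLASSIFICATION FIELD TERMINATING ZONE
DRILLING OPERATION WELL PURPOSE WELL TYPE SUBSTANCE
LICENSEE SURFACE LOCATION
--------------------------------------------------------------------------------------------
MEG K7N HARDY 4-7-77-5 0483923 ALBERTA CROWN 571.7M
106/04-07-077-05W4/02 S 572.4M W 278.3M BONNYVILLE 1600.0M
DEV (NC) HARDY MCMURRAY FM
HORIZONTAL RESUMPTIONPRODUCTION (SCHEME) CRUDE BITUMEN
MEG ENERGY CORP. 09-07-077-05W4
SPL 11-24 HZ MARTEN 14-25-76-6 0494994 ALBERTA CROWN 705.3M
100/14-25-076-06W5/00 S 566.0M E 800.6M ST. ALBERT 2700.0M
OUT (C) MARTEN CLEARWATER FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
SPUR PETROLEUM LTD. 11-24-076-06W5
"""

# Line 1: well name, 7-digit licence number (assumed), mineral rights, elevation.
LINE1 = re.compile(
    r"^(?P<well_name>.+?)\s+(?P<licence_number>\d{7})\s+"
    r"(?P<mineral_rights>.+?)\s+(?P<ground_elevation>[\d.]+M)$"
)
# Line 2: unique identifier, N/S and E/W offsets, field centre, projected depth.
LINE2 = re.compile(
    r"^(?P<unique_identifier>\S+)\s+"
    r"(?P<surface_coordinates>[NS]\s+[\d.]+M\s+[EW]\s+[\d.]+M)\s+"
    r"(?P<field_centre>.+?)\s+(?P<projected_depth>[\d.]+M)$"
)
# Line 5: licensee name, then a surface location such as 09-07-077-05W4.
LINE5 = re.compile(
    r"^(?P<licensee>.+?)\s+(?P<surface_location>\d{2}-\d{2}-\d{3}-\d{2}W\d)$"
)


def parse_licences(text):
    """Return one dict per five-line licence block found after the header."""
    lines = [ln.strip() for ln in text.splitlines()]
    # The header sits between two lines made up only of dashes.
    rules = [i for i, ln in enumerate(lines) if ln and set(ln) == {"-"}]
    if len(rules) < 2:
        raise ValueError("expected two dashed lines around the header")
    body = [ln for ln in lines[rules[1] + 1:] if ln]

    records = []
    # Assumption: blocks are exactly five lines; trailing partial blocks are ignored.
    for start in range(0, len(body) - 4, 5):
        l1, l2, l3, l4, l5 = body[start:start + 5]
        m1, m2, m5 = LINE1.match(l1), LINE2.match(l2), LINE5.match(l5)
        if not (m1 and m2 and m5):
            raise ValueError(f"unexpected layout in block starting at: {l1!r}")
        records.append({
            **m1.groupdict(),
            **m2.groupdict(),
            "line3_raw": l3,  # Lahee classification, field, terminating zone
            "line4_raw": l4,  # drilling operation, purpose, type, substance
            **m5.groupdict(),
        })
    return records


for rec in parse_licences(SAMPLE):
    print(rec["licence_number"], rec["well_name"], rec["licensee"])
```

Each dict in the returned list maps field names to strings, so it can be passed to a database insert more or less directly. Raising on a block that does not match is a deliberate choice here: a silent skip would hide any layout the patterns did not anticipate.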