0

I am currently trying to parse a text file in Python from the AER which shows the daily issued well licenses in Alberta. Basically I want to separate out the data for each license based on the type (well name, unique identifier, license number, etc.) shown in the file header, and add each of those to a list which can then be moved into a database.

The problem is the formatting on the text file in question (see below for a section of it) is not particularly friendly for parsing. There is no delimiter and it is meant to be human-readable. My experience with string manipulation is limited and I do not know how to go about solving this problem.

Here is a snippet of the text file in question:





    DATE: 02 July 2019                                                                                  


    --------------------------------------------------------------------------------------------        
    WELL NAME               LICENCE NUMBER         MINERAL RIGHTS       GROUND ELEVATION                
    UNIQUE IDENTIFIER       SURFACE CO-ORDINATES   BOARD FIELD CENTRE   PROJECTED DEPTH                 
    LAHEE CLASSIFICATION    FIELD                                       TERMINATING ZONE                
    DRILLING OPERATION      WELL PURPOSE           WELL  TYPE           SUBSTANCE                       
    LICENSEE                                                            SURFACE LOCATION                
    --------------------------------------------------------------------------------------------        

    MEG K7N HARDY 4-7-77-5               0483923   ALBERTA CROWN        571.7M                          
    106/04-07-077-05W4/02  S  572.4M  W  278.3M    BONNYVILLE           1600.0M                         
    DEV (NC)                             HARDY                          MCMURRAY FM                     
    HORIZONTAL                           RESUMPTIONPRODUCTION (SCHEME)  CRUDE BITUMEN                   
    MEG ENERGY CORP.                                                    09-07-077-05W4                  

    SPL 11-24 HZ MARTEN 14-25-76-6       0494994   ALBERTA CROWN        705.3M                          
    100/14-25-076-06W5/00  S  566.0M  E  800.6M    ST. ALBERT           2700.0M                         
    OUT (C)                              MARTEN                         CLEARWATER FM                   
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SPUR PETROLEUM LTD.                                                 11-24-076-06W5                  

    SPL 10-24 HZ MARTEN 5-23-76-6        0494995   ALBERTA CROWN        705.5M                          
    100/05-23-076-06W5/00  S  566.3M  W  800.1M    ST. ALBERT           2700.0M                         
    OUT (C)                              MARTEN                         CLEARWATER FM                   
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SPUR PETROLEUM LTD.                                                 10-24-076-06W5                  

    SURGE ENERGY HZ103 VALHALLA 6-7-75-8 0494996   ALBERTA CROWN        770.8M                          
    103/06-07-075-08W6/00  S  372.0M  E  324.5M    GRANDE PRAIRIE       3350.0M                         
    DEV (NC)                             VALHALLA                       DOIG FM                         
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SURGE ENERGY INC.                                                   13-06-075-08W6                  

    CNRL ET AL HZ KARR 4-16-66-3         0494997   ALBERTA CROWN        770.7M                          
    100/04-16-066-03W6/00  N  623.4M  E  127.5M    GRANDE PRAIRIE       5295.0M                         
    DEV (NC)                             KARR                           DUNVEGAN FM                     
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    CANADIAN NATURAL RESOURCES LIMITED                                  05-14-066-03W6     

I do not need anything from the header info between the dotted lines, or the date. I need to extract only the text from each section of each line for each block, as laid out by the header. I have attempted some methods, including basic string manipulation in Python and RegEx, but none have come close and I am at a loss.. Let me know if you need more detail in explaining this task, I understand that this is a big ask and is a bit convoluted.

KPubs
  • 9
  • 1

1 Answers1

0

This expression or some derivatives of that is likely to extract the desired data:

[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}

yet maybe we'd be better off, if we would remove the header before passing it thru regular expressions.


In the right panel of this demo, the expression is further explained, if you might be interested.

Test

import re

regex = r"[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}"

test_str = (" DATE: 02 July 2019                                                                                  \n\n\n"
    "    --------------------------------------------------------------------------------------------        \n"
    "    WELL NAME               LICENCE NUMBER         MINERAL RIGHTS       GROUND ELEVATION                \n"
    "    UNIQUE IDENTIFIER       SURFACE CO-ORDINATES   BOARD FIELD CENTRE   PROJECTED DEPTH                 \n"
    "    LAHEE CLASSIFICATION    FIELD                                       TERMINATING ZONE                \n"
    "    DRILLING OPERATION      WELL PURPOSE           WELL  TYPE           SUBSTANCE                       \n"
    "    LICENSEE                                                            SURFACE LOCATION                \n"
    "    --------------------------------------------------------------------------------------------        \n\n"
    "    MEG K7N HARDY 4-7-77-5               0483923   ALBERTA CROWN        571.7M                          \n"
    "    106/04-07-077-05W4/02  S  572.4M  W  278.3M    BONNYVILLE           1600.0M                         \n"
    "    DEV (NC)                             HARDY                          MCMURRAY FM                     \n"
    "    HORIZONTAL                           RESUMPTIONPRODUCTION (SCHEME)  CRUDE BITUMEN                   \n"
    "    MEG ENERGY CORP.                                                    09-07-077-05W4                  \n\n"
    "    SPL 11-24 HZ MARTEN 14-25-76-6       0494994   ALBERTA CROWN        705.3M                          \n"
    "    100/14-25-076-06W5/00  S  566.0M  E  800.6M    ST. ALBERT           2700.0M                         \n"
    "    OUT (C)                              MARTEN                         CLEARWATER FM                   \n"
    "    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       \n"
    "    SPUR PETROLEUM LTD.                                                 11-24-076-06W5                  ")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

Advice

The fourth bird is advising that:

[The above expression] is not anchored and causes a lot of backtracking. Perhaps anchor it with ^[ \t]* could make it a bit more efficient.

^[ \t]*[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}

See a demo

Based on the current example data, this might also be an option

^[ \t]*[A-Z]+(?: [A-Z0-9-]+)+[ \t]+[0-9]{7}[ \t]+.*(?:\r?\n(?![ \t]*$).*)* 

See a demo

Emma
  • 1
  • 9
  • 28
  • 53
  • 2
    Hi Emma, your pattern is not anchored and causes a lot of backtracking. Perhaps anchor it with `^[ \t]*` could make it a bit more efficient. See a [demo](https://regex101.com/r/zLwJve/1/). Based on the current example data, this might also be an option `^[ \t]*[A-Z]+(?: [A-Z0-9-]+)+[ \t]+[0-9]{7}[ \t]+.*(?:\r?\n(?![ \t]*$).*)*` See a [demo](https://regex101.com/r/xKaGZz/1) – The fourth bird Jul 06 '19 at 17:04