How to extract numbers from a text file with appropriate labels in Python

Question

boundary
        layer 2
        datatype 0
        xy  15   525270 8663518   525400 8663518   525400 8664818   525660 8664818
                 525660 8663518   525790 8663518   525790 8664818   526050 8664818
                 526050 8663518   526180 8663518   526180 8665398   525980 8665598
                 525470 8665598   525270 8665398   525270 8663518
        endel

I have coordinates of polygons in this format shown above. Each polygon starts with "boundary" and ends with "endel". I am having trouble extracting the layer number, number of points, and the coordinates into either a numpy array or a pandas dataframe.

To be specific to this example, I need the layer number (2), number of points (15), and the x-y coordinate pairs.

with open('source1.txt', encoding="utf-8") as f:
    for line in f:
        line = f.readline()
        srs= line.split("\t")
        print(srs)

Doing this doesn't split the numbers, even though they are separated by tabs

['        layer 255\n']
['        xy   5   0 0   22800000 0   22800000 22800000   0 22800000\n']
['        endel\n']

This is the result I got with that

with open('source1.txt', encoding="utf-8") as f:
    for line in f:
        line = f.readline()
        srs= line.split(" ")
        print(srs)

This isn't what I wanted, but I tried that too and still got a bad split

['', '', '', '', '', '', '', '', 'layer', '255\n']
['', '', '', '', '', '', '', '', 'xy', '', '', '5', '', '', '0', '0', '', '', '22800000', '0', '', '', '22800000', '22800000', '', '', '0', '22800000\n']
['', '', '', '', '', '', '', '', 'endel\n']

I couldn't get to the numpy part, as I'm stuck on processing the strings from the file

Edited as per request

You say you've been stuck on this for hours. So *edit* the question to show us what you tried and explain why its results are unsatisfactory – Paul H Jan 06 '18 at 15:25
Should we assume that there's more than one polygon represented in the file? – Bill Bell Jan 06 '18 at 15:28
There is more than one polygon in the file; my edit clarifying that didn't save. Each polygon starts with boundary and ends with endel – jax Jan 06 '18 at 15:30
OK. The main thing is to show us your code. People here hate the idea of being asked to program for nothing. – Bill Bell Jan 06 '18 at 15:31
Thanks Bill, I'm just learning, so forgive me for being a newbie. I did add what I tried – jax Jan 06 '18 at 15:35
No worries! (Somebody has to repeat these suggestions dozens of times a day.) One more thing: is the layer number unique to a collection of data within a 'boundary'? – Bill Bell Jan 06 '18 at 15:39
Yes, layer is a unique number or an identifier for each polygon. – jax Jan 06 '18 at 15:41
You should also tell us what the output should be like, so that people like Maciek can answer properly, first time. Please put a sample of a few lines in your question. – Bill Bell Jan 06 '18 at 15:41
Datatype is a junk value; xy contains n coordinates, 15 in this case – jax Jan 06 '18 at 15:41
[['2', '15', '525270', '8663518','525400','8663518',... and so on for 15 points] – jax Jan 06 '18 at 15:43

Maciek · Accepted Answer · 2018-01-06T15:57:04.887

You could use some trivial code such as:

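res = []
coords = []
xy = False  # True while we are inside a multi-line xy block
with open('data.txt') as f:
    for line in f.readlines():
        if 'layer' in line:
            # e.g. "layer 2": the layer number is the last token
            arr = line.split()
            layer = int(arr[-1].strip())
        elif 'xy' in line:
            # e.g. "xy  15  525270 8663518 ...": point count, then the first coordinates
            arr = line.split()
            npoints = int(arr[1])
            coords = arr[2:]
            xy = True
        elif 'endel' in line:
            # end of the polygon: keep the first npoints tokens
            # (use 2 * npoints instead if each point is an x y pair)
            res.append([layer, npoints, coords[0:npoints]])
            xy = False
            coords = []
        elif xy:
            # continuation line holding more coordinates
            coords.extend(line.split())
print(res)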

Then, you can convert the resulting list to numpy array, or whatever you like, but note that coords are still strings in the code above.
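A minimal sketch of that conversion step, assuming each point is an x y pair, so that npoints points occupy 2 * npoints tokens (the comments below suggest this is the case). The helper name to_int_pairs and the five-point sample are made up for illustration. Note that split() with no argument splits on any run of whitespace, which is why it avoids the empty strings seen with split(" ").

```python
def to_int_pairs(tokens, npoints):
    """Turn a flat list of string tokens into npoints (x, y) integer pairs."""
    flat = [int(t) for t in tokens[:2 * npoints]]  # two values per point
    if len(flat) != 2 * npoints:
        raise ValueError("expected %d values, got %d" % (2 * npoints, len(flat)))
    # even positions are x values, odd positions are y values
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical tokens, as line.split() would return them for a 5-point square
tokens = "0 0 22800000 0 22800000 22800000 0 22800000 0 0".split()
pairs = to_int_pairs(tokens, 5)
print(pairs)
# [(0, 0), (22800000, 0), (22800000, 22800000), (0, 22800000), (0, 0)]
```

A list of integer pairs like this should be straightforward to hand to numpy or pandas afterwards, since every row has the same two columns.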

edited Jan 06 '18 at 15:57

answered Jan 06 '18 at 15:32

Thanks for the amazing code. There are multiple polygons, each in the same format I specified. The thing is, instead of reading arr[2] and arr[3] (a single coordinate), I want to append all 15 coordinates – jax Jan 06 '18 at 15:39
To make my point clear, I'm trying to produce an output like [['2', '15', '525270', '8663518','525400','8663518',... and so on for 15 points] – jax Jan 06 '18 at 15:43
That would be the first row for the first polygon. The second polygon needs to be stored in the second row :( – jax Jan 06 '18 at 15:47
I edited my reply accordingly. As far as I understood "15" is dynamic and can vary for different polygons, but in each case there will be sufficient (at least 15 in this case) coordinates. It works for many polygons, just paste your "polygon entry" several times to "data.txt" – Maciek Jan 06 '18 at 15:58
You are welcome - but actually it might not be working like a charm. It is because your explanation was not clear to me - note that now the code grabs 15 coordinates, whereas I just realized that you might be interested in grabbing 30 (15 for x and 15 for y). Still not sure, but if this is the case just multiply npoints by 2. – Maciek Jan 06 '18 at 17:02
I added a few more if conditions to read the other lines of points, but now I'm facing another weird issue where I'm running out of memory, because I'm dealing with 3.8 million of these polygons. Any suggestions on fixing that? :( – jax Jan 06 '18 at 17:20
It could be rather easy, you can try changing `with open('data.txt') as f: for line in f.readlines():` to `with open('data.txt') as f: for line in f:` (just remove `.readlines()`) as suggested here: https://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python I haven't tested, though. Pls, let us know if it helped. – Maciek Jan 07 '18 at 07:16
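One way to keep memory flat for millions of polygons is to yield each polygon as soon as its endel line is read, instead of collecting everything in one list. The sketch below is untested against the real file and rests on a few assumptions: the name iter_polygons is made up, each point is taken to be an x y pair (hence 2 * npoints tokens), and datatype is assumed to come before xy as in the sample.

```python
import io


def iter_polygons(lines):
    """Yield (layer, npoints, tokens) for each boundary ... endel block."""
    layer = npoints = None
    tokens = []
    in_xy = False
    for line in lines:
        parts = line.split()  # no argument: split on any whitespace run
        if not parts:
            continue
        key = parts[0]
        if key == 'boundary':
            layer, npoints, tokens, in_xy = None, None, [], False
        elif key == 'layer':
            layer = int(parts[1])
        elif key == 'xy':
            npoints = int(parts[1])
            tokens = parts[2:]
            in_xy = True
        elif key == 'endel':
            if layer is not None and npoints is not None:
                yield layer, npoints, tokens[:2 * npoints]
            in_xy = False
        elif in_xy:
            tokens.extend(parts)  # continuation line of coordinates


# Small in-memory stand-in for the real file
sample = io.StringIO("""boundary
        layer 2
        datatype 0
        xy  3   0 0   10 0
                 10 10
        endel
boundary
        layer 7
        datatype 0
        xy  2   1 2   3 4
        endel
""")
for layer, npoints, tokens in iter_polygons(sample):
    print(layer, npoints, tokens)
```

With a real file, the same function could be fed the open file object directly (for example `with open('data.txt') as f: for poly in iter_polygons(f): ...`), so only one polygon is held in memory at a time.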

dawg · Answer 2 · 2018-01-06T18:13:38.720

You can use a regex to parse that file into blocks of the relevant data then parse each block:

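import re

with open('source1.txt') as f:
    text = f.read()

# re.M lets ^ match at the start of every line, so every block is found,
# not only one at the very start of the file
for block in re.findall(r'^boundary([\s\S]+?)endel', text, re.M):
    m1 = re.search(r'^\s*layer\s+(\d+)', block, re.M)
    m2 = re.search(r'^\s*datatype\s+(\d+)', block, re.M)
    m3 = re.search(r'^\s*xy\s+(\d+)\s+([\s\d]+)', block, re.M)
    if m1 and m2 and m3:
        layer = int(m1.group(1))
        datatype = int(m2.group(1))
        xy = int(m3.group(1))
        # pair up consecutive tokens as (x, y)
        coordinates = [(int(x), int(y)) for x, y in zip(*[iter(m3.group(2).split())]*2)]
    else:
        print("can't parse {}".format(block))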

A variable number of coordinates are supported after the xy and it is trivial to test if the number of coordinates parsed is the number expected with len(coordinates)==xy.
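The pairing idiom and the length check can be seen in isolation in the short sketch below. The helper name pair_up is made up for illustration; zip(it, it) is the same trick as zip(*[iter(...)]*2), written out with a named iterator.

```python
def pair_up(tokens):
    """Group a flat token list into (x, y) integer tuples."""
    it = iter(tokens)
    # zip pulls from the same iterator twice per tuple,
    # so the tokens are consumed two at a time
    return [(int(x), int(y)) for x, y in zip(it, it)]


expected = 3  # the count that follows "xy" in the file
coordinates = pair_up("525270 8663518 525400 8663518 525400 8664818".split())
print(coordinates)
# [(525270, 8663518), (525400, 8663518), (525400, 8664818)]
print(len(coordinates) == expected)
# True
```

One caveat: zip stops at the shorter input, so a dangling odd token at the end is dropped silently. The length check against the declared count is what catches a truncated block.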

As written, this requires reading the entire file into memory. If size is an issue (and it usually is not for small to moderate size files), you can use mmap to make the file appear to be in memory.
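A rough sketch of the mmap variant, under a few assumptions: the function name parse_with_mmap and the temporary sample file are made up, the file is opened in binary mode so the patterns must be bytes patterns, and finditer is swapped in for findall so the blocks are visited one at a time. This demo still collects the parsed results in a list; for a very large file you would process each polygon inside the loop instead.

```python
import mmap
import os
import re
import tempfile

# Bytes patterns, because an mmap object exposes the file as bytes
BLOCK = re.compile(rb'^boundary([\s\S]+?)endel', re.M)
LAYER = re.compile(rb'^\s*layer\s+(\d+)', re.M)
XY = re.compile(rb'^\s*xy\s+(\d+)\s+([\s\d]+)', re.M)


def parse_with_mmap(path):
    """Return [(layer, npoints, [(x, y), ...]), ...] without calling f.read()."""
    results = []
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for block in BLOCK.finditer(mm):
                body = block.group(1)
                m_layer = LAYER.search(body)
                m_xy = XY.search(body)
                if m_layer and m_xy:
                    values = [int(v) for v in m_xy.group(2).split()]
                    points = list(zip(values[0::2], values[1::2]))
                    results.append((int(m_layer.group(1)), int(m_xy.group(1)), points))
    return results


# Write a tiny sample file so the sketch can be run as-is
sample = b"""boundary
        layer 2
        datatype 0
        xy  3   0 0   10 0
                 10 10
        endel
"""
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'polygons.txt')
    with open(path, 'wb') as out:
        out.write(sample)
    print(parse_with_mmap(path))
```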
