I have multiple text files named ParticleCoordW_10000.dat, ParticleCoordW_20000.dat, etc. The files all look like this:
ITEM: TIMESTEP
10000
ITEM: NUMBER OF ATOMS
1000
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 9.4000000000000004e+00
0.0000000000000000e+00 9.4000000000000004e+00
0.0000000000000000e+00 9.4000000000000004e+00
ITEM: ATOMS id x y z
673 1.03559 0.495714 0.575399
346 2.74458 1.30048 0.0566235
991 0.570383 0.589025 1.44128
793 0.654365 1.33452 1.91347
969 0.217201 0.6852 0.287291
...
I'd like to use Python to extract the coordinates of a single particle, let us say ATOM ID 673. The problem is that the line position of ATOM ID 673 changes in every text file. So I'd like to have Python locate atom #673 in every text file of the directory and save the associated x y z coordinates.
Previously I was using something like this to obtain all the coordinates:
```python
import glob
import numpy as np

filenames = glob.glob('*.dat')
for f in filenames:
    # Read all three coordinate columns in one pass instead of
    # parsing the same file three times
    coord = np.loadtxt(f, usecols=(1, 2, 3), skiprows=9)
```
Is there a way to modify this script in order to perform the task previously described?
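One idea along these lines (a sketch, not a tested solution; the 9-line header and the `id x y z` column order are taken from the sample file above, and the function name `atom_coords` is my own): load the ID column along with the coordinates and select the matching row with a boolean mask, so the line position of the atom no longer matters.

```python
import numpy as np

# Sketch: read the ID column too, then mask on it.
# Assumes the 9-line header and "id x y z" column order shown above.
def atom_coords(path, atom_id=673):
    data = np.loadtxt(path, skiprows=9)   # columns: id, x, y, z
    row = data[data[:, 0] == atom_id]     # boolean mask on the ID column
    return row[0, 1:4]                    # x, y, z for that atom
```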
EDIT: Based on the various comments I wrote the following:
```python
import glob
import natsort
import numpy as np

coord = []
filenames = natsort.natsorted(glob.glob('*.dat'))
for f in filenames:
    with open(f, 'r') as fh:
        for row in fh:
            fields = row.split()
            # Compare the whole ID field; startswith('673')
            # would also match IDs like 6730 or 6731
            if fields and fields[0] == '673':
                coord.append([float(v) for v in fields[1:4]])
np.savetxt("xyz.txt", coord, delimiter=' ')
```
This lets me collect the coordinates of a single particle across all the text files in the directory. However, I'd like to have this done for all particle IDs (1000 particles). What would be the most efficient way to do that?
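One direction that avoids a separate pass over every file for each ID (a sketch under the assumption that every file contains the same set of atom IDs; the function name `all_trajectories` and the output shapes are my own, not from any library): read each snapshot once, sort its rows by the ID column so row i of every snapshot belongs to the same atom, and stack the snapshots.

```python
import glob
import numpy as np

# Sketch: one read per file, rows sorted by atom ID, snapshots stacked
# into an array of shape (n_files, n_atoms, 3).
# Assumes every file lists the same atom IDs after a 9-line header.
def all_trajectories(pattern='*.dat'):
    snapshots = []
    ids = None
    for f in sorted(glob.glob(pattern)):
        data = np.loadtxt(f, skiprows=9)     # columns: id, x, y, z
        data = data[np.argsort(data[:, 0])]  # same row order in every snapshot
        if ids is None:
            ids = data[:, 0].astype(int)
        snapshots.append(data[:, 1:4])
    return ids, np.stack(snapshots)
```

The full trajectory of atom 673 is then `traj[:, list(ids).index(673)]`; `natsort.natsorted` could replace `sorted` here to keep the `_10000`, `_20000`, ... file order when the timestep numbers have different digit counts.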