Using Python regular expressions to find and edit the contents of a text file

Question

I have a text file full of amino acids (CA-Final.txt) as well as some other data. Here is a snippet of the text file

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C  
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C  
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C  
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C  
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C  
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C  
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C  
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C  
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C  
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C  
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C

Essentially, my problem is that a few of the amino acid abbreviations have an extra letter A in front of them. Amino acid abbreviations are supposed to be 3 letters long. I have tried to use regular expressions to remove that A wherever it appears in front of an abbreviation. Here is my code so far:

def Trimmer(txtFileName):
    i = open('CA-final.txt', 'w')
    j = open(txtFileName, 'r')
    for record in j:
        with open(txtFileName, 'r') as j:
            content= j.read()
            content_new = re.sub('^ATOM\s+\d+\s+CA\s+A[ADTSEPGCVMILYFHKRWQN]', r'^ATOM\s+\d+\s+CA\s+[ADTSEPGCVMILYFHKRWQN]', content, flags = re.M)

When I run the function, it returns an error

 File "C:\Users\UserName\AppData\Local\conda\conda\envs\biopython\lib\sre_parse.py", line 1024, in parse_template
    raise s.error('bad escape %s' % this, len(this)) 

error: bad escape \s

My idea is that this function will find every instance of an A in front of a string of 3 characters and replace it with just the 3 other characters. Why exactly am I getting this error?

Do not use a regex pattern in the replacement string. It is not supposed to work like this. — Wiktor Stribiżew, Nov 01 '18 at 19:30
Try `re.sub(r'^(ATOM\s+\d+\s+CA\s+)A', r'\1', content, flags = re.M)` — Wiktor Stribiżew, Nov 01 '18 at 19:32
Is the file tab delimited? Why not parse the file a bit instead of applying regex to each row? Also your "for record" and "with open" lines are redundant (they do the same thing). — Ghoti, Nov 02 '18 at 15:01
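
One way to "parse the file a bit", as the last comment suggests, is to work with fixed columns instead of a regex. The sketch below assumes the file follows the standard PDB fixed-width layout, in which column 17 holds the alternate-location indicator and columns 18-20 hold the residue name. The function name `blank_altloc` is hypothetical and the sample lines are taken from the question.

```python
def blank_altloc(line):
    """Blank column 17 (alternate-location indicator) of an ATOM record."""
    # Assumption: standard PDB fixed-width columns; index 16 is column 17.
    if line.startswith("ATOM") and len(line) > 16:
        return line[:16] + " " + line[17:]
    return line


sample = [
    "ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C",
    "ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C",
]
fixed = [blank_altloc(line) for line in sample]
for line in fixed:
    print(line)
```

Replacing the character with a space, rather than deleting it, keeps every later field in its original column, which matters if the output is meant to remain a fixed-width PDB file.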

score 1 · Answer 1 · answered Nov 02 '18 at 00:51

As far as I know, the easiest way to achieve your goal is to parse the file with Biopython, since it's a PDB file.

Let's analyze the following script:

#!/usr/bin/env python3
import Bio
print("Biopython v" + Bio.__version__)

from Bio.PDB import PDBParser
from Bio.PDB import PDBIO

# Parse and get basic information
parser=PDBParser()
protein_1p49 = parser.get_structure('STS', '1p49.pdb')
protein_1p49_resolution = protein_1p49.header["resolution"]
protein_1p49_keywords = protein_1p49.header["keywords"]

print("Sample name: " + str(protein_1p49))
print("Resolution: " + str(protein_1p49_resolution))
print("Keywords: " + str(protein_1p49_keywords))
print("Model: " + str(protein_1p49[0]))

#initialize IO 
io=PDBIO()

#custom select
class Select():
    def accept_model(self, model):
        return True
    def accept_chain(self, chain):
        return True
    def accept_residue(self, residue):
        # print("residue id:" + str(residue.get_id()))
        print("residue name:" + str(residue.get_resname()))
        if len(str(residue.get_resname()))>3:
            print("Alert! abbr longer than 3 letters: " + residue.get_resname())
            exit(1)
        return True       
    def accept_atom(self, atom):
        # print("atom id:" + atom.get_id())
        # print("atom name:" + atom.get_name())
        if atom.get_name() == 'CA':  
            return True
        else:
            return False

#write to output file
io.set_structure(protein_1p49)
io.save("1p49_out.pdb", Select())

exit(0)

It parses a PDB structure and uses the built-in Biopython class PDBIO to save custom parts of the protein structure. Notice that you can put custom logic in the Select class.

In this example, I used the accept_residue method to report abnormally named residues in my protein structure. You can easily extend this to perform simple string trimming inside that method.

score 0 · Answer 2 · answered Nov 01 '18 at 19:48

Your regex will fail if the first of the three letters is an 'A'. Try this instead:

(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)

It creates 2 groups: what comes before the extra 'A' and what comes after it.

Then replace with the 2 Groups:

\1\2
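
In Python, this pattern and replacement could be applied roughly as follows. This is a minimal sketch: the function name `strip_extra_a` is hypothetical and the sample lines come from the question. The original error arises because the replacement argument of `re.sub` is a template rather than a pattern, so `\s` is rejected there, whereas `\1` and `\2` insert the captured groups.

```python
import re

# Pattern from the answer: group 1 is the text before the extra 'A',
# group 2 is the three-letter residue name that follows it.
PATTERN = re.compile(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', flags=re.M)


def strip_extra_a(text):
    """Return text with the leading 'A' removed from 4-letter residue names."""
    # The replacement is a template: \1 and \2 insert the captured groups.
    return PATTERN.sub(r'\1\2', text)


sample = (
    "ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C\n"
    "ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C\n"
)
cleaned = strip_extra_a(sample)
print(cleaned)
```

The second sample line is left untouched because its residue name ALA is only three letters long. Note that deleting the character shifts the rest of the line one column to the left; using `r'\1 \2'` as the replacement would keep the original column widths, if that matters for downstream tools. To process a file, read its contents once, pass them through the function, and write the result once.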

Poul Bak

This generated a 16.7 MB text file – flannel_bioinformatician Nov 01 '18 at 19:59
