Finding the right rules to filter certain strings in python list

Question

I have a large plain text file that I have to 'clean' in Python 3.

For now I've read it into a list with the following strings:

['[chr10:43612033[C', '[chr10:61665880[G', 'C[chr20:3835205[', ']chr20:3870375]T', 'G]chr6:117650611]']

My goal is to turn it into a list containing only the middle part of each string, 'chrXX:number'. That means I need a way to remove either 1 or 2 characters from the beginning and from the end of the original string.

['chr10:43612033', 'chr10:61665880', 'chr20:3835205', 'chr20:3870375', 'chr6:117650611']

My problem is that I cannot slice by index, because the pattern is:

chr + any number from 1 to 22, or X, or Y. E.g. chr1 or chr22 or chrX or chrY

The part after the : can be any integer of up to 9 digits. Thus, I cannot just remove the first x characters or the last x characters.

This is because sometimes I have 2 characters before my relevant string and sometimes only one. As in:

<any_letter>]chr10:<the_number>

or

]chr10:<the_integer>

or the same story but with the opening square bracket [ .

The same goes for the final part of the string. After the integer, which can be anywhere between 1 and 9 digits long, I get either ]<any_letter> or just a single ], or the same pattern but with the opening square bracket.

Any elegant ideas?

Yes, `'chrX:43612033'` is acceptable; the other option would be `'chr10:43612033'`, so without the X after 10. – ilam engl, Apr 20 '21 at 15:19

baduker · Accepted Answer · 2021-04-20T15:47:39.967

As suggested in the comments, you could simply use regex by utilizing this pattern:

chr<digits>or<XY>:<digits>

Check out this if you want to learn more about regular expressions

Here's a working example:

import re

strings = [
    '[chr10:43612033[C',
    '[chr10:61665880[G',
    'C[chr20:3835205[',
    ']chr20:3870375]T',
    'G]chr6:117650611]',
    'G]chrX:117650611]',
    'G]chrY:117650611]',
]
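# Chromosome: 1-2 digits, or X or Y. Position: at least 1 and at most 9 digits.
pattern = re.compile(r"chr(?:\d{1,2}|[XY]):\d{1,9}")
print([pattern.search(s).group(0) for s in strings])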

Output:

['chr10:43612033', 'chr10:61665880', 'chr20:3835205', 'chr20:3870375', 'chr6:117650611', 'chrX:117650611', 'chrY:117650611']
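Note that the list comprehension above raises an AttributeError as soon as one string does not match, because re.search returns None in that case. If the real file may contain such lines, a more defensive variant could look like the sketch below. The names LOCUS and extract_locus are made up for illustration, and the pattern assumes the rules stated in the question (chr1 to chr22, chrX or chrY, then 1 to 9 digits).

```python
import re

# Hypothetical names; the pattern encodes the rules described in the question.
LOCUS = re.compile(
    r"""
    chr
    (?:1\d|2[0-2]|[1-9]|X|Y)   # chromosome: 1-22, X or Y
    :
    \d{1,9}                    # position: 1 to 9 digits
    (?!\d)                     # reject positions longer than 9 digits
    """,
    re.VERBOSE,
)


def extract_locus(s):
    """Return the 'chrN:position' part of s, or None if s has no such part."""
    match = LOCUS.search(s)
    return match.group(0) if match else None


strings = ['[chr10:43612033[C', 'G]chrX:117650611]', 'not a breakpoint']
# Keep only the strings that actually contain a locus.
cleaned = [locus for locus in map(extract_locus, strings) if locus is not None]
print(cleaned)
```

Strings without a valid locus are skipped here; depending on the data it may be better to log or raise on them instead.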

edited Apr 20 '21 at 15:47

answered Apr 20 '21 at 15:24

baduker


Oh wow this is an entire universe. I'd like to link some more regex references for other newbies like me: [python docs](https://docs.python.org/3/library/re.html) and this [page on stack overflow which shows many resources for python](https://stackoverflow.com/tags/regex/info/). For the last link scroll to heading "Further Reading"! – ilam engl Apr 21 '21 at 07:34
