I have a large plain text file that I have to 'clean' in Python 3.
For now I've read it into a list with the following strings:
['[chr10:43612033[C', '[chr10:61665880[G', 'C[chr20:3835205[', ']chr20:3870375]T', 'G]chr6:117650611]']
My goal is to turn it into a list containing only the middle part of each string, 'chr_XX:number'. That means I need to find a way to remove either 1 or 2 characters from the beginning and the end of the original string. The result should be:
['chr10:43612033', 'chr10:61665880', 'chr20:3835205', 'chr20:3870375', 'chr6:117650611']
My problem is that I cannot slice by index, because the pattern is:
<chr> + any number between 1 and 22, or X or Y, e.g. chr1, chr22, chrX or chrY.
The part after the : can be any integer of up to 9 digits.
Thus, I cannot just remove the first x characters or the last x characters.
This is because sometimes I have 2 characters before my relevant string and sometimes only one, as in:
<any_letter>]chr10:<the_integer>
or
]chr10:<the_integer>
or the same with the opening square bracket [.
The same goes for the final part of the string. After the integer of 1 to 9 digits I get either ]<any_letter> or just a single ], or the same pattern with the opening square bracket.
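One way the pattern described above might be expressed is as a regular expression that searches for the 'chr...:number' part directly, instead of stripping characters from either end. This is only a minimal sketch: the names LOCUS and extract_loci are made up for illustration, and it assumes that each string contains at most one such part and that strings without a match can simply be skipped.

```python
import re

# Assumed pattern: "chr", then 1-22 or X or Y, a colon, then 1 to 9 digits.
# The (?!\d) lookahead rejects numbers longer than 9 digits.
LOCUS = re.compile(r"chr(?:[1-9]|1[0-9]|2[0-2]|X|Y):\d{1,9}(?!\d)")


def extract_loci(items):
    """Return the 'chrN:position' part of each string, skipping non-matches."""
    result = []
    for item in items:
        match = LOCUS.search(item)
        if match:
            result.append(match.group())
    return result


raw = ['[chr10:43612033[C', '[chr10:61665880[G', 'C[chr20:3835205[',
       ']chr20:3870375]T', 'G]chr6:117650611]']
print(extract_loci(raw))
# ['chr10:43612033', 'chr10:61665880', 'chr20:3835205', 'chr20:3870375', 'chr6:117650611']
```

Searching for the part to keep means the number of leading and trailing characters no longer matters, so the one-versus-two character question goes away.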
Any elegant ideas?