I'm writing a Java program that scans a DNA sequence 15 characters at a time to find the sections with the most occurrences of C and G. I figured it would be fastest to first check the entire sequence for any 15-character substring made up entirely of C's and G's; if none exists, look for substrings with 14 C's and G's and 1 A or T; if that comes up empty too, 13 C/G and 2 A/T, and so on.
Finding a regex solution for this has proven difficult for me. I've put together a test case using the code below, but I can't get the regex to work. The syntax might be wrong, since I've never used regexes in Java. Sorry for that; I can probably figure out the syntax myself. What I need help with is the regular expression itself matching the correct thing.
    public class DNAChecker {
        public static void main(String[] args) {
            String checkThis = "ggccggccaggccgg";
            if (checkThis.matches("(?=.*[CcGg]{14})(?=.*[AaTt]{1})")) {
                System.out.println("This program works.");
            } else {
                System.out.println("This program doesn't work.");
            }
        }
    }
The way I understand it, and from what I've seen in related threads, if this can be done with a regex then I'm at least getting close with this. Now that I'm thinking about it, though, I don't think this pattern ensures that the total match is 15 characters long. That is, if checkThis were longer than 15 characters and contained 14 C/G and 1 A/T in total, not consecutively, I'd expect this to still be true, so xxxxggccggxxccaggccggxxxxxx would be true. Would using .contains instead of .matches enforce the length restriction?
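To make the length concern concrete, here is a minimal sketch of one way a regex could pin a window to exactly n characters. It rests on two facts about the standard library: `String.matches` and `Matcher.matches` only succeed when the pattern consumes the whole input, and `String.contains` takes literal text rather than a regex, so it cannot enforce anything here. The class `WindowRegexSketch` and the helpers `windowPattern` and `firstWindow` are hypothetical names made up for illustration, and the sketch assumes each candidate window is cut out with `substring` and tested on its own.

```java
import java.util.regex.Pattern;

public class WindowRegexSketch {
    // Builds a pattern for a whole window of exactly n characters that holds
    // exactly k A/T characters, with every other character being C or G.
    // The lookahead (?=.{n}$) fixes the total length; the rest counts the A/T's.
    static Pattern windowPattern(int n, int k) {
        String regex = "(?=.{" + n + "}$)[cg]*(?:[at][cg]*){" + k + "}";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Returns the start index of the first n-character window with exactly
    // k A/T characters, or -1 if there is no such window.
    static int firstWindow(String dna, int n, int k) {
        Pattern p = windowPattern(n, k);
        for (int start = 0; start + n <= dna.length(); start++) {
            // matches() must consume the entire substring, not just part of it.
            if (p.matcher(dna.substring(start, start + n)).matches()) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstWindow("ggccggccaggccgg", 15, 1));             // prints 0
        System.out.println(firstWindow("xxxxggccggxxccaggccggxxxxxx", 15, 1)); // prints -1
    }
}
```

Calling `firstWindow` with k = 0, then 1, then 2, and so on would reproduce the tiered search described earlier, at the cost of one full pass over the sequence per tier.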
Anyway, would a one-liner regex like this even be faster than counting the C's and G's in each substring? I haven't taken an algorithms class yet.
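For comparison, the counting alternative can be sketched as a sliding window: count the C's and G's in the first n characters, then update that count as the window moves one position at a time, so the whole sequence is examined in a single pass no matter how many tiers the regex approach would have needed. This is only a sketch under stated assumptions; `GcWindowSketch`, `isGC`, and `bestWindowStart` are hypothetical names, and ties are resolved in favour of the earliest window.

```java
public class GcWindowSketch {
    // True for C or G in either case.
    static boolean isGC(char c) {
        return c == 'C' || c == 'G' || c == 'c' || c == 'g';
    }

    // Returns the start index of the first length-n window containing the most
    // C/G characters, or -1 if n is not positive or dna is shorter than n.
    static int bestWindowStart(String dna, int n) {
        if (n <= 0 || dna.length() < n) {
            return -1;
        }
        // Count C/G in the first window.
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (isGC(dna.charAt(i))) count++;
        }
        int best = count;
        int bestStart = 0;
        // Slide right: drop the character leaving the window, add the one entering.
        for (int start = 1; start + n <= dna.length(); start++) {
            if (isGC(dna.charAt(start - 1))) count--;
            if (isGC(dna.charAt(start + n - 1))) count++;
            if (count > best) {
                best = count;
                bestStart = start;
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        String dna = "ttggccggccaggccgg";
        int start = bestWindowStart(dna, 15);
        System.out.println(start + " " + dna.substring(start, start + 15));
    }
}
```

Each character is looked at a constant number of times, so the running time grows linearly with the sequence length and does not depend on n.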
Please bear in mind that this program in its final form will accept a string of variable length and search substrings of length n, rather than 15 every time. (I know how to handle those requirements, so no need to tell me about Scanner or how arguments work!) I'm just a regex noob trying to use Jedi-level regex stuff. If you could also recommend a book that would help me become a regex wizard, that'd be radical. Thank you very much in advance for your responses!