
I'm writing a Java program searching a DNA sequence 15 characters at a time, finding the sections with the most occurrences of C and G. I figured it would be fastest to check the entire DNA sequence for any areas where a substring of 15 consists completely of C's and G's, and if those don't exist, looking for substrings with 14 C's and G's and 1 A or T. Then if that doesn't come up, 13 CG and 2 AT, etc...

Trying to find a regex solution for this has proven difficult for me. I've come up with a test case using this code, but I can't get the RegEx to work. I think the syntax might be wrong, I've never used RegExes in Java. Sorry for that, I can probably figure out the syntax, I just need help with the regular expression itself matching the correct thing.

public class DNAChecker{

     public static void main(String []args){
        String checkThis= "ggccggccaggccgg";

        if (checkThis.matches( "(?=.*[CcGg]{14})(?=.*[AaTt]{1})" ) ) {
            System.out.println("This program works.");
        } else {
            System.out.println("This program doesn't work.");
        }
     }
}

The way I understand it and from what I've seen in related threads, if this can be done with a regex, I'm getting at least close with this. Now that I'm thinking about it, I don't think this makes sure that the total match is 15 characters in length... i.e. if checkThis was more than 15 characters long and had 14 CG and 1 AT total in it, not consecutively, this would still be true. So xxxxggccggxxccaggccggxxxxxx would be true. Would using .contains instead of .matches assure length restrictions?

Anyway, would a one-liner RegEx like this even be faster than counting the C's and G's of each substring? I haven't taken an algorithms class yet.

Please bear in mind that this program in its final form will be accepting a string of variable length, and searching substrings of length n, rather than 15 every time. (I know how to handle those requirements, so no need to tell me about Scanner or how arguments work!) I'm just a RegEx noob trying to use Jedi-level RegEx stuff... if you could recommend a book for me to become a wizard of RegExes, too, that'd be radical. Thank you very much in advance for your responses!

ABvsPred
  • I'm afraid that Regex probably isn't going to be too useful to you for this problem. I'd recommend just counting the number of `c`s and `g`s in the string instead – Sam I am says Reinstate Monica Sep 09 '14 at 15:37
  • As for information/tutorials on regular expressions, try this website: [regular-expressions.info](http://regular-expressions.info) – Thomas Sep 09 '14 at 15:40
  • And the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496)! – aliteralmind Sep 09 '14 at 15:41
  • Could you provide some examples of input strings and what you want to get out? If I understand you correctly you have quite a big input string and look for the sequences of at most 15 characters consisting of only `c`'s and `g`'s with an `a` or `t` at the end. Is that correct? – Thomas Sep 09 '14 at 15:43
  • One thing to keep in mind is the difference between doing a `String.matches(regex)` and doing a `Pattern.compile(regex).matcher(String).find()`. The former looks for an **exact** match, and the latter just looks for the regex as a substring of the original input. So in your case, you probably want to use `Pattern.compile(regex).matcher(String).find()` to determine if the regex matches any substring of the input. – user3062946 Sep 09 '14 at 15:44
  • Thomas, I'm looking for strings of exactly length n (in the example case, 15) consisting of a particular number of c's and g's and a particular number of a's and t's, and they can be in any order. So if I was looking at strings of length n, wanting 2 c's or g's and 1 a or t, "cca" or "cac" or "acc" would all be correct. It's pretty complicated... I revised this question like 10 times and still feel dissatisfied with its clarity. Does this explanation help? – ABvsPred Sep 09 '14 at 17:41
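
To make the `matches` versus `find` distinction from the comments concrete, here is a minimal sketch. The class and method names (`MatchesVsFind`, `wholeMatch`, `substringMatch`) and the sample regex are made up for illustration; only `String.matches` and `Pattern`/`Matcher.find` are standard Java.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // True only if the ENTIRE input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return input.matches(regex);
    }

    // True if ANY substring of the input matches the regex.
    static boolean substringMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "[cg]{4}";   // illustrative: four consecutive c/g characters
        String input = "aaggccaa";  // contains "ggcc" but is longer than 4 characters

        System.out.println(wholeMatch(regex, input));     // false
        System.out.println(substringMatch(regex, input)); // true
    }
}
```

So `matches` only enforces a length restriction in the sense that the whole input has to fit the pattern; it does not search inside a longer string.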

2 Answers


Regexes are one of the most seductive features of any language. However, just because they're cool and sexy and look very powerful doesn't mean they're the correct tool. For something like this, a simple state machine suffices and is likely to be MUCH faster. The code below finds the longest substring containing only c and g, and can be easily adapted to keep multiple substrings by adding them to a collection.

    String data = "acgtcgcgagagagggggcccataatggg";
    int    longestPos = 0;
    int    longestLen = 0;
    int p=-1;
    for (int i=0; i<data.length(); i++)
    {
        char c = data.charAt(i);
        if (c == 'c' || c == 'g')  // Is this the droid you're looking for?
        {
            if (p==-1)  // Are we not yet in an interesting string?
                p = i;  // If so, save the position of this start of substring.
        }
        else  // Not a c or g
        {
            if (p != -1 && i-p > longestLen)  // Are we in an interesting string longer than the previous longest?
            {
                longestPos = p;     // Save the starting position
                longestLen = i-p;   // Save the length
            }
            p = -1;   // We're no longer inside an interesting string
        }
    }

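    // Handle the case where the last substring was 'interesting'.
    // The loop variable i is out of scope here, so use data.length() as the end position.
    if (p != -1 && data.length()-p > longestLen)
    {
        longestPos = p;                   // Save the starting position
        longestLen = data.length()-p;     // Save the length
    }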

    System.out.printf("Longest string is at position %d for length %d", longestPos, longestLen);

For the canonical response to "let's use a regex where it does not apply" see this post
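
For the fixed-width windows the question actually asks about (length n, most c's and g's), the same counting idea can probably be turned into a sliding window that updates one running count per step. This is only a sketch under that reading of the question, not part of the code above; `GcWindow`, `isGc`, and `bestWindowStart` are made-up names.

```java
class GcWindow {
    // Treat upper- and lower-case C and G as "interesting".
    static boolean isGc(char c) {
        return c == 'c' || c == 'g' || c == 'C' || c == 'G';
    }

    // Returns the start index of the first window of length n that contains
    // the most c/g characters, or -1 if n <= 0 or the input is shorter than n.
    static int bestWindowStart(String dna, int n) {
        if (n <= 0 || dna.length() < n) {
            return -1;
        }
        // Count c/g in the first window.
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (isGc(dna.charAt(i))) count++;
        }
        int best = count;
        int bestStart = 0;
        // Slide the window one character at a time: add the character that
        // enters on the right, drop the one that leaves on the left.
        for (int i = n; i < dna.length(); i++) {
            if (isGc(dna.charAt(i))) count++;
            if (isGc(dna.charAt(i - n))) count--;
            if (count > best) {
                best = count;
                bestStart = i - n + 1;
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        String data = "acgtcgcgagagagggggcccataatggg";
        int n = 8;
        int start = bestWindowStart(data, n);
        System.out.println(start + " " + data.substring(start, start + n)); // prints "13 gggggccc"
    }
}
```

Each character is examined at most twice, so the work grows linearly with the input length regardless of n, and there is no need to retry with "14 CG and 1 AT", then "13 CG and 2 AT", and so on.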

Jim Garrison

I'm not entirely sure whether I correctly understand your problem, so I'll assume you want to find the longest sequence of characters consisting of cs and gs followed by an a or t.

I further assume your input string only contains those characters.

Thus you might use Pattern.compile(regex).matcher(input) and call find() repeatedly to collect every match. Then sort the matches by length and you get the longest sequences.

To achieve that, you could use the following regex: (?i)([cg]+[at]) (the (?i) makes the expression case insensitive).

Example:

import java.util.*;
import java.util.regex.*;

String input = "ccgccgCggatccgCATccggcccgggggtatt";

List<String> sequences = new ArrayList<>();

//find the sequences
Matcher m = Pattern.compile("(?i)([cg]+[at])").matcher( input );
while( m.find() ) {
  sequences.add( m.group().toLowerCase() );
}

//sort by descending length
Collections.sort( sequences, new Comparator<String>() {
  public int compare( String lhs, String rhs ) {
    //switch arguments for descending sort
    return Integer.compare( rhs.length(), lhs.length());
  }
});

System.out.println( sequences );

Output would be: [ccggcccgggggt, ccgccgcgga, ccgca]

If you want to just allow a specific length of those sequences, you'd need to alter the regex:
(?i)(?<=^|[^cg])([cg]{10,15}[at])

Changes:

(?<=^|[^cg]) means that the sequence must be preceded by the start of the input or anything except a c or g. To match parts of longer sequences, e.g. gcga out of cccgcga, you just remove that from your regex.

[cg]{10,15} means that the sequence of cs and gs must be between 10 and 15 characters long, i.e. shorter sequences won't be matched while longer sequences might be matched if you don't use (?<=^|[^cg]). To use an exact length, e.g. 15 characters, use the condition above and change this condition to [cg]{15}.
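
As a rough illustration of the exact-length variant described above, the sketch below builds the regex for a run of exactly n c's and g's followed by an a or t. The class and method names (`ExactRunFinder`, `findExactRuns`) and the sample inputs are made up, and the inputs are assumed to contain only a, c, g and t.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExactRunFinder {
    // Collects every run of exactly n c/g characters followed by an a or t.
    // The lookbehind rejects runs that are really the tail of a longer c/g run.
    static List<String> findExactRuns(String input, int n) {
        Pattern p = Pattern.compile("(?i)(?<=^|[^cg])([cg]{" + n + "}[at])");
        Matcher m = p.matcher(input);
        List<String> runs = new ArrayList<>();
        while (m.find()) {
            runs.add(m.group().toLowerCase());
        }
        return runs;
    }

    public static void main(String[] args) {
        // 15 c/g characters followed by an a: one match expected.
        System.out.println(findExactRuns("atccggccggccggccgacgt", 15));
        // 16 c/g characters followed by an a: too long, so no match expected.
        System.out.println(findExactRuns("tccggccggccggccgga", 15));
    }
}
```

Note that this only finds unbroken runs of c and g; it does not handle windows that mix in a few a's or t's at arbitrary positions.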

Thomas