I am trying to plot how often each combination of isolation-year difference and nucleotide difference occurs among pairs of viral sequences. I am trying to find an elegant way to do it but am having trouble.
I have an alignment, and I compare each sequence against every other sequence to get an integer count of how many positions differ. I also compute the difference between their years of isolation. So for a pair of sequences that were isolated two years apart and have three differences, you get the coordinates (2,3). I want to count how many times (2,3) occurs, as well as every other combination, and plot it (and get the plot data). I have been trying to convert a list of frequencies to a dataframe to no avail, and I am wondering if there is a better way to do it.
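To make the comparison step concrete, this is roughly what I mean. It is only a minimal sketch: the `(year, aligned_sequence)` record layout, the function name `pairwise_differences`, and the toy sequences are made up for illustration and are not my real data.

```python
from itertools import combinations


def pairwise_differences(records):
    """Return one (year_diff, residue_diff) tuple per unordered pair.

    `records` is assumed to be a list of (isolation_year, aligned_sequence)
    tuples, with all sequences the same length (hypothetical layout).
    """
    pairs = []
    for (year_a, seq_a), (year_b, seq_b) in combinations(records, 2):
        year_diff = abs(year_a - year_b)
        # Count the aligned positions where the two sequences disagree.
        residue_diff = sum(a != b for a, b in zip(seq_a, seq_b))
        pairs.append((year_diff, residue_diff))
    return pairs


# Toy alignment, invented for this example.
records = [(2000, "ACGTAC"), (2001, "ACGTTT"), (2002, "ACCTTT")]
pairs = pairwise_differences(records)
print(pairs)
```

Using `itertools.combinations` compares each pair once, so (A, B) and (B, A) are not both counted.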
I can show some code but I am not sure this is the best way so I want to hear other ideas.
One problem is how to represent the frequencies in the beginning. I can create a list of all of the occurrences or create a dictionary of the occurrences and increment a counter.
Sample data: (year difference, sequence residue differences): (1,2), (2,5), (1,2), (5, 5), (4, 5)
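For the dictionary-with-a-counter option, one possibility I am considering is `collections.Counter` from the standard library, which does the incrementing for me. A minimal sketch on the sample data above (the variable names are just placeholders):

```python
from collections import Counter

# (year difference, sequence residue differences) pairs from the sample data.
sample = [(1, 2), (2, 5), (1, 2), (5, 5), (4, 5)]

# Counter maps each distinct tuple to the number of times it occurs.
counts = Counter(sample)

for (year_diff, residue_diff), n in sorted(counts.items()):
    print(year_diff, residue_diff, n)
```

Here (1,2) is counted twice and every other combination once.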
Output is shown in the picture but it does NOT have to be in a table structure. CSV is preferred.
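Something like the following long-format CSV would be fine for me. This is a sketch using only the standard library; the column names `year_diff`, `residue_diff`, and `count` and the helper `counts_to_csv` are my own invention, and it writes to a string here rather than a real file.

```python
import csv
import io
from collections import Counter


def counts_to_csv(pairs):
    """Return CSV text with one row per distinct (year_diff, residue_diff)."""
    counts = Counter(pairs)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year_diff", "residue_diff", "count"])
    # Sort so rows come out ordered by year difference, then residue difference.
    for (year_diff, residue_diff), n in sorted(counts.items()):
        writer.writerow([year_diff, residue_diff, n])
    return buffer.getvalue()


csv_text = counts_to_csv([(1, 2), (2, 5), (1, 2), (5, 5), (4, 5)])
print(csv_text)
```

Each row gives the two coordinates plus the count, which should be enough to plot from.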