-1

I am trying to plot how often each combination of isolation-year difference and nucleotide difference occurs among viral biological sequences. I am trying to find an elegant way to do it but am having trouble.

I have an alignment, and I compare each sequence against every other sequence to get an integer value of how different they are. I also check how different their years of isolation are. So a pair of sequences isolated two years apart with three differences gives the coordinates (2,3). I want to count how many times (2,3) occurs, as well as all other combinations, and plot it (and get the plot data). I have been trying to convert a list of frequencies to a dataframe to no avail, and I am wondering if there is a better way to do it.
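
The pairwise comparison described above could be sketched as follows. This is only a minimal illustration: the `records` list, its toy sequences, and the `hamming` helper are hypothetical and assume each sequence is an equal-length aligned string paired with its isolation year.

```python
from itertools import combinations

# Hypothetical toy records: (isolation year, aligned sequence).
records = [
    (2001, "ACGTACGT"),
    (2003, "ACGTTCGA"),
    (2003, "ACGAACGT"),
]

def hamming(a, b):
    """Count positions at which two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

# One (year difference, nucleotide difference) pair per unordered sequence pair.
pairs = [
    (abs(year_a - year_b), hamming(seq_a, seq_b))
    for (year_a, seq_a), (year_b, seq_b) in combinations(records, 2)
]
print(pairs)
```

The resulting `pairs` list has the same shape as the sample data given further down, so it can be counted in the same way.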

I can show some code but I am not sure this is the best way so I want to hear other ideas.

One problem is how to represent the frequencies in the beginning. I can create a list of all of the occurrences or create a dictionary of the occurrences and increment a counter.

Sample data: (year difference, sequence residue differences): (1,2), (2,5), (1,2), (5, 5), (4, 5)

Output is shown in the picture but it does NOT have to be in a table structure. CSV is preferred.

burkesquires
  • Can you post a sample of your data and what the solution should look like – Chris Jan 09 '16 at 21:01
  • Possible duplicate of [custom matplotlib plot : chess board like table with colored cells](http://stackoverflow.com/questions/10194482/custom-matplotlib-plot-chess-board-like-table-with-colored-cells) – Reti43 Jan 09 '16 at 21:43
  • Reti43 that does look similar but the input data is a step ahead and one of my problems. – burkesquires Jan 09 '16 at 21:45

2 Answers

1

Assuming your (year, discrepancy) tuples are in a list called samples, as in the example below:

import random
samples = [(random.randint(0,10), random.randint(0,10)) for i in range(100) ]

you can get the frequency of each pair with collections.Counter, as described in the Stack Overflow post "How to count the frequency of the elements in a list?":

import collections
counter=collections.Counter(samples)
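
As a small worked example, here is the same counting step applied to the sample data from the question. It is a sketch only; the variable names `year_diff` and `seq_diff` are chosen here for illustration.

```python
from collections import Counter

# Sample data from the question: (year difference, residue differences).
samples = [(1, 2), (2, 5), (1, 2), (5, 5), (4, 5)]

counter = Counter(samples)

# Each key is a (year difference, residue difference) pair; each value is its count.
for (year_diff, seq_diff), count in sorted(counter.items()):
    print("{},{},{}".format(year_diff, seq_diff, count))
```

The pair (1, 2) appears twice in the sample data, so its count is 2; every other pair has a count of 1.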

To visualize this frequency table, you can convert it to a numpy matrix and use matshow from matplotlib

import numpy as np
import matplotlib.pyplot as plt

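# Size the matrix from the largest value on each axis (+1 to include index 0).
x_max = max(x[0] for x in samples)
y_max = max(x[1] for x in samples)
freq = np.zeros((x_max+1, y_max+1))
# items() works on Python 2 and 3; iteritems() exists only on Python 2.
for coord, f in counter.items():
    freq[coord[0], coord[1]] = f
plt.matshow(freq, cmap=plt.cm.gray)
plt.show()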
Community
  • Thanks for the help. I have tried some versions of the counter. When I try the original version with counter.iteritems() I get the error: AttributeError: 'Counter' object has no attribute 'iteritems' – burkesquires Jan 10 '16 at 00:26
  • @burkesquires If you are on Python 3, the syntax has changed to `counter.items()`. The method `iteritems()` exists only in Python 2. – Reti43 Jan 10 '16 at 01:08
1

I'm heavily borrowing the table construction of this post.

The difference here is in constructing the array data. Initialise an array with zeros; then, for every coordinate (i, j), increment that array element by one to record one more occurrence of that pair.

zip(*coords) groups all the i values together in one tuple and all the j values in another. The maximum value in each tuple gives the size of the array. Note that each dimension must be larger by 1 than the maximum to account for index 0, i.e. going from 0 to x takes x+1 rows.
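
To make the transposition concrete, here is a short sketch using the sample coordinates from the question. The names `i_values`, `j_values` and `shape` are introduced only for this illustration.

```python
coords = ((1, 2), (2, 5), (1, 2), (5, 5), (4, 5))

# zip(*coords) transposes the pairs: first all i values, then all j values.
i_values, j_values = zip(*coords)

# The largest value on each axis determines the array size.
x, y = max(i_values), max(j_values)
shape = (x + 1, y + 1)  # +1 so that index 0 is included
print(i_values, j_values, shape)
```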

import matplotlib.pyplot as plt
import numpy as np

from matplotlib.table import Table

def table_plot(data):
    fig, ax = plt.subplots()
    ax.set_axis_off()

    tb = Table(ax, bbox=[0,0,1,1])

    nrows, ncols = data.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    for (i, j), val in np.ndenumerate(data):
        tb.add_cell(i, j, width, height, text=str(val) if val else '', loc='center')

    for i in range(data.shape[0]):
        tb.add_cell(i, -1, width, height, text=str(i), loc='right',
                    edgecolor='none', facecolor='none')
    for i in range(data.shape[1]):
        tb.add_cell(-1, i, width, height/2, text=str(i), loc='center',
                    edgecolor='none', facecolor='none')

    tb.set_fontsize(16)
    ax.add_table(tb)
    return fig

coords = ((1,2), (2,5), (1,2), (5, 5), (4, 5))

# get maximum value for both x and y to allocate the array
x, y = map(max, zip(*coords))
data = np.zeros((x+1, y+1), dtype=int)

for i, j in coords:
    data[i,j] += 1

table_plot(data)
plt.show()


Community
Reti43
  • Thanks for your answer. You are so good! I edited the question to say that I am looking for csv output but in rows and columns. I do not need the lines. So sorry for the confusion! – burkesquires Jan 10 '16 at 00:16
  • Thank you for your continued help. I want to be able to count the frequency of the tuples (1,2) and get out a csv file that shows frequencies in rows and columns. Is that clearer? – burkesquires Jan 10 '16 at 00:28
  • @burkesquires I think you want something like [this](http://stackoverflow.com/questions/6081008/dump-a-numpy-array-into-a-csv-file). The 2D array constructed from the coordinates should be the same as in my code example. You should also keep in mind that the content of the question shouldn't drastically change from that originally posed, because it'd invalidate any current answers. The questions and answers here should focus on displaying 2D coordinates in a plot. If you have a question about csv files that you can't resolve, you should create a new question. – Reti43 Jan 10 '16 at 00:33
  • The line: x, y = map(max, zip(*coords)) is very helpful. What is the purpose of the asterisk? – burkesquires Jan 10 '16 at 00:44
  • @burkesquires This [unpacks](https://docs.python.org/2/tutorial/controlflow.html#unpacking-argument-lists) the list elements as individual arguments. So `zip(*[(a, b), (c, d), (e, f)])` becomes `zip((a, b), (c, d), (e, f))`, which results in `[(a, c, e), (b, d, f)]`. – Reti43 Jan 10 '16 at 01:05
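
For the CSV output requested in the comments, one possible sketch is to build the count array as in the answer and write it with `np.savetxt`. This assumes a plain grid of integer counts (rows = year difference, columns = residue difference) is the desired layout; the in-memory buffer and the file name mentioned in the comment are illustrative choices, not part of the original answers.

```python
import io
import numpy as np

coords = ((1, 2), (2, 5), (1, 2), (5, 5), (4, 5))

# Build the frequency array exactly as in the answer above.
x, y = map(max, zip(*coords))
data = np.zeros((x + 1, y + 1), dtype=int)
for i, j in coords:
    data[i, j] += 1

# Write to an in-memory buffer here; pass a file name such as "freq.csv"
# (a hypothetical name) to write to disk instead.
buffer = io.StringIO()
np.savetxt(buffer, data, fmt="%d", delimiter=",")
csv_text = buffer.getvalue()
print(csv_text)
```

Each output row corresponds to one year difference and each column to one residue difference, so the cell in row 1, column 2 holds the count for (1, 2).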