How to deal with large amount of data in Python

Question

I have a text file with a large amount of data (3 GB). Each line contains the time, source IP, destination IP and size. In this file the digits after the last dot of each address are the port number. I want to plot those port numbers as a histogram. I did this for 10,000 lines, but as I expected, the Python code cannot handle the full file.

Briefly, the code below does the following. First I read the 10,000 lines, then I split them and put everything in a list named everything_list (please ignore the condition of the while loop). Then I put all the port numbers in a list and draw their histogram.

Now suppose I have a million lines: I cannot even read them in the first place, let alone categorize them. Some people told me to use arrays, and some told me to process one chunk of data and then another. I am confused by all this advice. Can anybody help me with this issue?

text_file = open("test.data", "r")
a = text_file.read()
text_file.close()

everything_list = a.split()
source_port_list = []
i=0
while 6+7*i<len(everything_list):

    source_element = everything_list[2+7*i]
    source_port_position = source_element.rfind('.')
    source_port_number = int(source_element[source_port_position + 1:])
    source_port_list.append(source_port_number)

    i=i+1


import matplotlib.pyplot as plt
import pylab


numBins = 20
plt.hist(source_port_list, numBins, color='red', alpha=0.8)
plt.show()

This is the line format:

15:42:42.719063 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460
15:42:42.719205 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460
15:42:42.719209 IP 129.63.57.175.45241 > 62.85.5.142.55455: tcp 0
15:42:42.719213 IP 24.34.41.8.1236 > 129.63.1.23.443: tcp 394
15:42:42.719217 IP 59.167.148.152.25918 > 129.63.57.40.36075: tcp 0
15:42:42.719260 IP 129.63.223.16.2823 > 80.67.87.25.80: tcp 682
15:42:42.719264 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0
15:42:42.719269 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0

how much ram you have would decide how much you can store in memory, also the last digits in an actual ip have nothing to do with a port – Padraic Cunningham Dec 27 '14 at 23:47
can you dump some lines of your source file, to get the exact structure? – Jivan Dec 27 '14 at 23:48
@PadraicCunningham I think he meant `127.0.0.1:8080` for instance – Jivan Dec 27 '14 at 23:51
@Jivan, but `source_port_position + 1` is not going to find it, for instance `10.10.10.10:22` – Padraic Cunningham Dec 27 '14 at 23:52
@PadraicCunningham I totally agree. That's why I'd like to see the source :) – Jivan Dec 27 '14 at 23:52
Anyway reading the whole file into memory is the wrong approach, iterate over the file object and parse line by line, that will save a couple of gig straight away – Padraic Cunningham Dec 27 '14 at 23:54
The lines are like this: 15:42:42.719063 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460 15:42:42.719205 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460 – Cristopher Van Paul Dec 27 '14 at 23:54
@PadraicCunningham You are definitely right about reading the whole data. – Cristopher Van Paul Dec 27 '14 at 23:58
Also consider that with that many data points, the resolution of the graph is most likely going to be restricted by the number of pixels in the graph. Each pixel being a point of resolution, or a little box that contains some sort of density/average, so it might not even be necessary to plot every point. See: http://stackoverflow.com/questions/5854515/large-plot-20-million-samples-gigabytes-of-data – jmunsch Dec 28 '14 at 00:02
I think if you just loop over the file object a lot of your problems will be gone. can you add a snippet of the file exactly as is to your question, it is hard to see the actual format in your comment – Padraic Cunningham Dec 28 '14 at 00:02
You actually store the whole file with read and then store all the contents again using split – Padraic Cunningham Dec 28 '14 at 00:04
what do you want from `15:42:42.719063 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460`? – Padraic Cunningham Dec 28 '14 at 00:08
@PadraicCunningham, I want to see the histogram of port addresses or sizes. – Cristopher Van Paul Dec 28 '14 at 00:12
I mean, just port `47843`, or what exact data are you trying to extract? – Padraic Cunningham Dec 28 '14 at 00:14
but just the first ip port or both? your code suggests a src, which I presume is the first – Padraic Cunningham Dec 28 '14 at 00:29
@PadraicCunningham, I am not sure I understand your question! – Cristopher Van Paul Dec 28 '14 at 00:32
you have two ip's `129.241.138.133.47843 > 129.63.27.12.2674` do you want both ports or just the first – Padraic Cunningham Dec 28 '14 at 00:32

score 3 · Answer 1 · edited May 23 '17 at 11:51

I don't know what the data looks like, but I think the issue is that you try to hold it all in memory at once. You need to do it little by little, read the lines one by one and build the histogram as you go.

histogram = {}
with open(...) as f:
    for line in f:
        ip = ...
        if ip in histogram:
            histogram[ip] += 1
        else:
            histogram[ip] = 1

You can now plot the histogram, but use plt.plot not plt.hist since you already have the frequencies in the histogram dictionary.
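
To make the `ip = ...` placeholder concrete, here is a minimal sketch under some assumptions: the lines look like the sample in the question, only the source port is wanted, and the function name `count_source_ports` is just an illustrative choice. It uses `collections.Counter` (mentioned in the comments) instead of a plain dict, and it accepts any iterable of lines, so an open file object can be passed directly without loading the file into memory.

```python
from collections import Counter


def count_source_ports(lines):
    """Count source ports from an iterable of log lines, one line at a time."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # fields[2] looks like "129.241.138.133.47843";
        # the port is whatever follows the last dot
        port = fields[2].rsplit(".", 1)[-1]
        if port.isdigit():
            counts[int(port)] += 1
    return counts


# Usage, assuming the file name from the question:
# with open("test.data") as f:
#     counts = count_source_ports(f)
```

Only one line and the dictionary of counts are held in memory at any time, so the size of the file should not matter much; the dictionary can have at most 65536 keys because that is how many port numbers exist.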

edited May 23 '17 at 11:51

Community

answered Dec 27 '14 at 23:56

spelufo

a defaultdict would be better – Padraic Cunningham Dec 27 '14 at 23:57
Interesting. There's also a Counter dictionary [in there](https://docs.python.org/2/library/collections.html). For me, sometimes simplicity trumps performance though. – spelufo Dec 28 '14 at 00:04
Would you mind rewriting the whole code? Since I am a new python user I do not know how to exert these changes! – Cristopher Van Paul Dec 28 '14 at 00:09
Oops, last line was wrong, fixed now. What it does is iterate through the lines, keeping a count on how many times it has seen a given ip. The histogram dictionary has for keys the ips and for values the corresponding counts. The last line simply starts the count for the ip at 1, and the other branch of the if statement increments the count for the ip it has found on that line. How to take the ip from the line is described in the other answers. I left it as `ip = ...` for you to fill in – spelufo Dec 28 '14 at 00:14
Don't mind the first two comments. They are just possible performance improvements using other data structures for the result – spelufo Dec 28 '14 at 00:16

Jivan · Answer 2 · 2014-12-28T01:17:41.747

3

You could use a regex, compiled once outside your loop, together with reading your file lazily, line by line.

import re
import matplotlib.pyplot as plt

# source port: the digits between the last dot and " >"
r = re.compile(r'(?<=\.)[0-9]{1,5}(?= >)')
ports = []

with open("test.data", "r") as f:
    for line in f:
        m = r.search(line)
        if m:  # skip lines that do not match the expected format
            ports.append(int(m.group(0)))

numBins = 20
plt.hist(ports, numBins, color='red', alpha=0.8)
plt.show()

The `6+7*i` condition in your loop was only needed because every line splits into seven whitespace-separated tokens. When you iterate over the file line by line, each line yields exactly one source port, so no index arithmetic and no truncation of the list is needed.

EDIT:

You can combine the steps above into more concise code:

import re
import matplotlib.pyplot as plt

r = re.compile(r'(?<=\.)[0-9]{1,5}(?= >)')

with open("test.data", "r") as f:
    matches = (r.search(line) for line in f)
    ports = [int(m.group(0)) for m in matches if m]

numBins = 20
plt.hist(ports, numBins, color='red', alpha=0.8)
plt.show()

EDIT:

If you think your list of ports will be too large to fit in RAM (which I find unlikely), my advice would be to use a dict of ports:

ports = {}
with open("test.data", "r") as f:
    for line in f:
        m = r.search(line)  # r is the compiled pattern from above
        if m:
            port = m.group(0)
            # start at 0 for a port seen for the first time
            ports[port] = ports.get(port, 0) + 1

Which will give you something like:

>>> ports
{
    "8394": 182938,
    "8192": 839288,
    "1283": 9839
}

Note that in such a case, your call to plt.hist will have to be modified.
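
One possible way to do that, as a sketch only: sum the per-port counts into equal-width bins yourself and draw the totals as bars. The helper name `bin_counts`, the 20 bins, and the 0 to 65535 range are assumptions chosen to mirror `numBins = 20` and the valid port range, not anything prescribed by matplotlib.

```python
def bin_counts(port_counts, num_bins=20, max_port=65535):
    """Sum per-port counts into num_bins equal-width bins covering 0..max_port."""
    width = (max_port + 1) / num_bins
    totals = [0] * num_bins
    for port, count in port_counts.items():
        # keys may be strings (as in the dict above), so convert before binning
        index = min(int(int(port) / width), num_bins - 1)
        totals[index] += count
    return totals


# The example dict shown above
ports = {"8394": 182938, "8192": 839288, "1283": 9839}
totals = bin_counts(ports)
```

The totals can then be drawn with something like `plt.bar(range(len(totals)), totals)`. Alternatively, `plt.hist` accepts a `weights` argument, so passing the port numbers as the data and their counts as `weights` should give the same picture without manual binning.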

edited Dec 28 '14 at 01:17

answered Dec 28 '14 at 00:10

Jivan

But the point is, I will have a long list of ports which I might not have enough RAM capacity to store all. – Cristopher Van Paul Dec 28 '14 at 00:30
@CristopherVanPaul, you would want a serious amount of data to use all 24 gigs of ram – Padraic Cunningham Dec 28 '14 at 00:31
@CristopherVanPaul you could put them in a dictionary where keys are port number and values are the number of ports found - which would save you tons of space potentially – Jivan Dec 28 '14 at 00:32
@Jivan, as far as I know if I want to do this, I should not do it by hist because hist takes all data and then plot it. I should do this by processing a chunk of data and after that another chunk and ... – Cristopher Van Paul Dec 28 '14 at 00:36
@Jivan about 2 or 3 million port addresses. – Cristopher Van Paul Dec 28 '14 at 00:37
@Jivan, `my_list[n:]` actually returns from n to the end not up to n – Padraic Cunningham Dec 28 '14 at 00:40
@Jivan regular expressions are cached in Python. Compiling it outside the loop shouldn't make much of a difference in speed (though it can be used to improve readability). See http://stackoverflow.com/a/452143/1935144. – IanH Dec 28 '14 at 00:41
@CristopherVanPaul an integer takes let's say 2 bytes. A string 4n bytes, n being the length of the string. So let's say 18 bytes per port. Which gives approx. 54Mb. Let's triple this for other stuff, and my guess is that 2 or 3 million ports will easily fit into 24Gb of RAM. – Jivan Dec 28 '14 at 00:41
@CristopherVanPaul ok. Knowing that, I advise you to keep with the list method, so that you can use it to call `plt.hist` as before – Jivan Dec 28 '14 at 00:55
@CristopherVanPaul any follow-up? – Jivan Jan 03 '15 at 01:46
@Jivan, Thank you, that was incredibly helpful. – Cristopher Van Paul Jan 04 '15 at 23:32

Padraic Cunningham · Answer 3 · 2014-12-28T01:49:14.063

1

You can use split and a defaultdict which will be more efficient:

from collections import defaultdict

d = defaultdict(int)  # missing keys start at 0
with open("a_file.txt") as f:
    for line in f:
        # the third field is "src_ip.port"; keep what follows the last dot
        d[line.split()[2].rsplit(".", 1)[-1]] += 1
print(d)

defaultdict(<type 'int'>, {'1236': 1, '2300': 1, '47843': 2, '45241': 1, '25918': 1, '2823': 1})
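
The same split approach should carry over to the packet sizes, which the question also asks about. This is a hedged sketch, assuming the size is always the last whitespace-separated field as in the sample lines; the function name `count_sizes` is made up for illustration.

```python
from collections import defaultdict


def count_sizes(lines):
    """Tally the packet size (last field) of each log line."""
    sizes = defaultdict(int)
    for line in lines:
        fields = line.split()
        # ignore blank lines and lines whose last field is not a number
        if fields and fields[-1].isdigit():
            sizes[int(fields[-1])] += 1
    return sizes


# Usage, with the same file as above:
# with open("a_file.txt") as f:
#     sizes = count_sizes(f)
```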

Might also be worth checking out different ways to plot, matplotlib is not the most efficient:

pyqtgraph, guiqwt, gnuplot.py

edited Dec 28 '14 at 01:49

answered Dec 28 '14 at 00:49

Padraic Cunningham

This `split` method is more efficient but I personally find it less readable and less maintainable than a regex. – Jivan Dec 28 '14 at 00:57
Do you have an idea of how much more efficient? Like, say for 3M lines? – Jivan Dec 28 '14 at 00:58
well parsing a string is 3-4 times faster using the split method, i could probably make it more efficient I have not looked into it too much, I would actually probably use pandas myself to do what the OP wants – Padraic Cunningham Dec 28 '14 at 01:00
I would too, would definitely be a lot faster and cleaner. – Jivan Dec 28 '14 at 01:11

score 0 · Answer 4 · answered Dec 27 '14 at 23:58

Sounds like you should be iterating by line and using regex to find the port. Try something like this:

import re

ports = []
with open("path/to/your/text/file.txt", 'r') as infile:
    for line in infile:
        # findall returns a list with every port on the line (source and
        # destination), so extend the flat list instead of appending a list
        ports.extend(re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\.(\d+)", line))
        # that regex explained:
        # # re.compile(r"""
        # #     \d{1,3}\.       # 1-3 digits followed by a literal .
        # #     \d{1,3}\.       # 1-3 digits followed by a literal .
        # #     \d{1,3}\.       # 1-3 digits followed by a literal .
        # #     \d{1,3}\.       # 1-3 digits followed by a literal .
        # #     (               # BEGIN CAPTURING GROUP
        # #       \d+           #   1 or more digits
        # #     )               # END CAPTURING GROUP""", re.X)

This assumes your IP/port is formatted as you explain in your comment:

IP.IP.IP.IP.PORT
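
Since each line holds two addresses, the same pattern can return the source and destination ports separately. A small sketch, assuming exactly two IP.IP.IP.IP.PORT addresses per line; `ADDRESS_WITH_PORT` and `extract_ports` are names made up for this illustration.

```python
import re

# Same pattern as above, compiled once: four dotted groups, then the port
ADDRESS_WITH_PORT = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\.(\d+)")


def extract_ports(line):
    """Return (source_port, destination_port) as ints, or None for other lines."""
    found = ADDRESS_WITH_PORT.findall(line)
    if len(found) != 2:
        return None  # not a line with exactly two addresses
    return int(found[0]), int(found[1])


line = "15:42:42.719063 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460"
pair = extract_ports(line)
```

Returning `None` for unexpected lines lets the caller decide whether to skip or report them, rather than failing halfway through a 3 GB file.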

answered Dec 27 '14 at 23:58

Adam Smith

`_ ,port = line.rsplit(".",1)` would be a lot better, I don't think this is really the issue though – Padraic Cunningham Dec 28 '14 at 00:00
@PadraicCunningham certainly the problem is `a = text_file.read()`, but I prefer using regex in instances like this: not the least bit because there are two ports in each line, and OP's code will only find one of them. – Adam Smith Dec 28 '14 at 00:02
Would you mind rewriting the whole code? Since I am a new python user I do not know how to exert these changes! BTW I do not know what does the last line of your code does! – Cristopher Van Paul Dec 28 '14 at 00:04
@CristopherVanPaul I edited with an extended definition of the regex. It basically finds anything that looks like the IP addresses in your logs (1-3 digits four times, separated by dots) then separates the port numbers and saves them in a list. I can't actually rewrite your code because I don't know anything about matplotlib, so I don't know what that is doing vs. what it might be able to do better !:) – Adam Smith Dec 28 '14 at 00:07

Kurt Schroeder · Answer 5 · 2014-12-28T07:53:53.510

I know this is not a direct answer to your question, but since you are new to Python: there is a nice Coursera course dealing with that very subject, "Programming for Everybody (Python)". It is free to take and won't use too much of your time. The course starts February 2, 2015. Also, the textbook "Python for Informatics: Exploring Information" is a free Creative Commons download (http://www.pythonlearn.com/book.php). I hope this helps.
