2

I have a text file with a large amount of data (3 GB). Each line contains a time, a source IP, a destination IP and a size. The digits after the last dot of each address are the port number. I want to plot a histogram of those port numbers. I did this for 10,000 lines of data, but, as I expected, the same Python code cannot be run on the full file.

Briefly, the code works as follows. First I read the 10,000 lines, then I split them and put everything in a list named everything_list. Please ignore the condition of the while loop. Then I put all the port numbers in a list and draw their histogram.

Now suppose I have a million lines: I cannot even read them all in the first place, let alone categorize them. Some people told me to use arrays, and others told me to process one chunk of data at a time. I am confused by the conflicting advice. Can anybody help me with this issue?

text_file = open("test.data", "r")
a = text_file.read()
text_file.close()

everything_list = a.split()
source_port_list = []
i=0
while 6+7*i<len(everything_list):

    source_element = everything_list[2+7*i]
    source_port_position = source_element.rfind('.')
    source_port_number = int(source_element[source_port_position + 1:])
    source_port_list.append(source_port_number)

    i=i+1


import matplotlib.pyplot as plt
import pylab


numBins = 20
plt.hist(source_port_list, numBins, color='red', alpha=0.8)
plt.show()

This is the line format:

15:42:42.719063 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460
15:42:42.719205 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460
15:42:42.719209 IP 129.63.57.175.45241 > 62.85.5.142.55455: tcp 0
15:42:42.719213 IP 24.34.41.8.1236 > 129.63.1.23.443: tcp 394
15:42:42.719217 IP 59.167.148.152.25918 > 129.63.57.40.36075: tcp 0
15:42:42.719260 IP 129.63.223.16.2823 > 80.67.87.25.80: tcp 682
15:42:42.719264 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0
15:42:42.719269 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0

5 Answers

3

I don't know what the data looks like, but I think the issue is that you are trying to hold it all in memory at once. Process it incrementally instead: read the file one line at a time and build the histogram as you go.

histogram = {}
with open(...) as f:
    for line in f:
        ip = ...
        if ip in histogram:
            histogram[ip] += 1
        else:
            histogram[ip] = 1

You can now plot the histogram, but use plt.plot (or plt.bar) rather than plt.hist, since the histogram dictionary already holds the frequencies.
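For readers who want the `ip = ...` step spelled out, here is a minimal sketch of the line-by-line counting. It assumes every line follows the format shown in the question, with the source address as the third whitespace-separated field. The function names `source_port` and `count_source_ports` are made up for this illustration, and `Counter` stands in for the plain dictionary above.

```python
from collections import Counter

def source_port(line):
    """Return the source port of one log line as an int, or None if malformed."""
    fields = line.split()
    if len(fields) < 3:
        return None
    # fields[2] looks like "129.241.138.133.47843": the port follows the last dot
    _, _, port = fields[2].rpartition(".")
    return int(port) if port.isdigit() else None

def count_source_ports(lines):
    """Count ports from any iterable of lines, holding one line at a time."""
    counts = Counter()
    for line in lines:
        port = source_port(line)
        if port is not None:
            counts[port] += 1
    return counts

# Sample lines taken from the question
sample = [
    "15:42:42.719063 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460",
    "15:42:42.719205 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460",
    "15:42:42.719213 IP 24.34.41.8.1236 > 129.63.1.23.443: tcp 394",
]
counts = count_source_ports(sample)
print(counts[47843], counts[1236])
```

An open file is itself an iterable of lines, so passing the file object from `with open("test.data") as f:` to `count_source_ports(f)` should keep memory use proportional to the number of distinct ports rather than to the file size.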

spelufo
  • a defaultdict would be better – Padraic Cunningham Dec 27 '14 at 23:57
  • Interesting. There's also a Counter dictionary [in there](https://docs.python.org/2/library/collections.html). For me, sometimes simplicity trumps performance though. – spelufo Dec 28 '14 at 00:04
  • Would you mind rewriting the whole code? Since I am a new python user I do not know how to exert these changes! – Cristopher Van Paul Dec 28 '14 at 00:09
  • Oops, last line was wrong, fixed now. What it does is iterate through the lines, keeping a count on how many times it has seen a given ip. The histogram dictionary has for keys the ips and for values the corresponding counts. The last line simply starts the count for the ip at 1, and the other branch of the if statement increments the count for the ip it has found on that line. How to take the ip from the line is described in the other answers. I left it as `ip = ...` for you to fill in – spelufo Dec 28 '14 at 00:14
  • Don't mind the first two comments. They are just possible performance improvements using other data structures for the result – spelufo Dec 28 '14 at 00:16
3

You could use a regex, compiled once outside your loop, together with reading your file lazily, line by line.

import re
import matplotlib.pyplot as plt

# the port is the run of digits after the last dot of the source address,
# immediately before " >"
r = re.compile(r'(?<=\.)[0-9]{1,5}(?= >)')
ports = []

with open("test.data", "r") as f:
    for line in f:
        m = r.search(line)
        if m:  # skip lines that do not have the expected format
            ports.append(int(m.group(0)))  # int, so plt.hist bins numerically

numBins = 20
plt.hist(ports, numBins, color='red', alpha=0.8)
plt.show()

There is no need to reproduce the 6+7*i condition of your while loop here. Each line splits into 7 whitespace-separated tokens, so that condition only makes the loop stop at the end of the data. Iterating over the file visits every line exactly once, without checking an index on each iteration.

EDIT:

You can combine several of the steps above to get more concise code:

import re
import matplotlib.pyplot as plt

r = re.compile(r'(?<=\.)[0-9]{1,5}(?= >)')

with open("test.data", "r") as f:
    matches = (r.search(line) for line in f)  # lazy: one line at a time
    ports = [int(m.group(0)) for m in matches if m]

numBins = 20
plt.hist(ports, numBins, color='red', alpha=0.8)
plt.show()

EDIT:

If you think your list of ports will be too large to fit in RAM (which I find unlikely), my advice would be to use a dict of ports:

ports = {}
with open("test.data", "r") as f:
    for line in f:
        m = r.search(line)
        if m:  # skip lines that do not have the expected format
            port = m.group(0)
            ports[port] = ports.get(port, 0) + 1  # unseen ports start at 0

Which will give you something like:

>>> ports
{
    "8394": 182938,
    "8192": 839288,
    "1283": 9839
}

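Note that in such a case plt.hist no longer applies directly, because it expects the raw values rather than their counts; you would have to plot the counts yourself, for example with plt.bar.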
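As a rough illustration of what that modification could look like, the sketch below collapses a dict of counts into 20 equal-width bins. The helper name `bin_counts` is made up for this example. It also assumes the bins span the full port range 0 to 65535, which differs from plt.hist, whose default bins span only the observed minimum and maximum.

```python
def bin_counts(port_counts, num_bins=20, max_port=65535):
    """Collapse {port: count} into num_bins equal-width bins over 0..max_port."""
    width = (max_port + 1) / num_bins
    totals = [0] * num_bins
    for port, count in port_counts.items():
        # keys may be strings (as in the dict above), so convert before binning
        index = min(int(int(port) / width), num_bins - 1)
        totals[index] += count
    left_edges = [i * width for i in range(num_bins)]
    return left_edges, totals, width

# The example dict shown above
ports = {"8394": 182938, "8192": 839288, "1283": 9839}
left_edges, totals, width = bin_counts(ports)
print(totals[0], totals[2])
```

The three return values are shaped so that they could be handed to something like `plt.bar(left_edges, totals, width=width, align='edge', color='red', alpha=0.8)`, which draws one bar per bin from the precomputed totals.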

Jivan
  • But the point is, I will have a long list of ports which I might not have enough RAM capacity to store all. – Cristopher Van Paul Dec 28 '14 at 00:30
  • @CristopherVanPaul, you would want a serious amount of data to use all 24 gigs of ram – Padraic Cunningham Dec 28 '14 at 00:31
  • @CristopherVanPaul you could put them in a dictionary where keys are port number and values are the number of ports found - which would save you tons of space potentially – Jivan Dec 28 '14 at 00:32
  • @Jivan, as far as I know if I want to do this, I should not do it by hist because hist takes all data and then plot it. I should do this by processing a chunk of data and after that another chunk and ... – Cristopher Van Paul Dec 28 '14 at 00:36
  • @Jivan about 2 or 3 million port addresses. – Cristopher Van Paul Dec 28 '14 at 00:37
  • @Jivan, `my_list[n:]` actually returns from n to the end not up to n – Padraic Cunningham Dec 28 '14 at 00:40
  • @Jivan regular expressions are cached in Python. Compiling it outside the loop shouldn't make much of a difference in speed (though it can be used to improve readability). See http://stackoverflow.com/a/452143/1935144. – IanH Dec 28 '14 at 00:41
  • @CristopherVanPaul an integer takes let's say 2 bytes. A string 4n bytes, n being the length of the string. So let's say 18 bytes per port. Which gives approx. 54Mb. Let's triple this for other stuff, and my guess is that 2 or 3 million ports will easily fit into 24Gb of RAM. – Jivan Dec 28 '14 at 00:41
  • @CristopherVanPaul ok. Knowing that, I advise you to keep with the list method, so that you can use it to call `plt.hist` as before – Jivan Dec 28 '14 at 00:55
  • @CristopherVanPaul any follow-up? – Jivan Jan 03 '15 at 01:46
  • @Jivan, Thank you, that was incredibly helpful. – Cristopher Van Paul Jan 04 '15 at 23:32
1

You can use split and a defaultdict which will be more efficient:

from collections import defaultdict

d = defaultdict(int)
with open("a_file.txt") as f:
    for line in f:
        # the third field is the source address; keep what follows its last dot
        d[line.split()[2].rsplit(".", 1)[-1]] += 1
print(d)

defaultdict(<type 'int'>, {'1236': 1, '2300': 1, '47843': 2, '45241': 1, '25918': 1, '2823': 1})

It might also be worth checking out other ways to plot, since matplotlib is not the most efficient:

pyqtgraph, guiqwt, gnuplot.py

Padraic Cunningham
  • This `split` method is more efficient but I personally find it less readable and less maintainable than a regex. – Jivan Dec 28 '14 at 00:57
  • Do you have an idea of how much more efficient? Like, say for 3M lines? – Jivan Dec 28 '14 at 00:58
  • Well, parsing a string is 3-4 times faster using the split method. I could probably make it more efficient; I have not looked into it too much. I would actually probably use pandas myself to do what the OP wants – Padraic Cunningham Dec 28 '14 at 01:00
  • I would too, would definitely be a lot faster and cleaner. – Jivan Dec 28 '14 at 01:11
0

Sounds like you should be iterating by line and using regex to find the port. Try something like this:

import re

ports = []
with open("path/to/your/text/file.txt", 'r') as infile:
    for line in infile:
        # findall returns one captured port per address on the line (source
        # and destination), so extend the flat list rather than append a list
        ports.extend(re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\.(\d+)", line))
        # that regex explained:
        # re.compile(r"""
        #     \d{1,3}\.       # 1-3 digits followed by a literal .
        #     \d{1,3}\.       # 1-3 digits followed by a literal .
        #     \d{1,3}\.       # 1-3 digits followed by a literal .
        #     \d{1,3}\.       # 1-3 digits followed by a literal .
        #     (               # BEGIN CAPTURING GROUP
        #       \d+           #   1 or more digits
        #     )               # END CAPTURING GROUP""", re.X)

This is assuming your IP/port is formatted as you explain in your comment

IP.IP.IP.IP.PORT
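
Since each line carries two addresses, the same regex can feed separate source and destination tallies. The following is only a sketch under the assumption that every valid line contains exactly two IP.IP.IP.IP.PORT fields; the names `ADDRESS_PORT` and `count_ports` are invented for the example.

```python
import re
from collections import Counter

# Same pattern as above, compiled once: four dotted groups, then the port
ADDRESS_PORT = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\.(\d+)")

def count_ports(lines):
    """Return (source_counts, destination_counts) for an iterable of lines."""
    src, dst = Counter(), Counter()
    for line in lines:
        found = ADDRESS_PORT.findall(line)
        if len(found) == 2:  # expect exactly source and destination
            src[int(found[0])] += 1
            dst[int(found[1])] += 1
    return src, dst

# Sample lines taken from the question
sample = [
    "15:42:42.719260 IP 129.63.223.16.2823 > 80.67.87.25.80: tcp 682",
    "15:42:42.719264 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0",
    "15:42:42.719269 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0",
]
src, dst = count_ports(sample)
print(src.most_common(1), dst.most_common(1))
```

The timestamp at the start of each line does not match, because it never has four dot-separated groups in a row. Lines with any other number of matches are skipped here, which may or may not be the right policy for real capture files.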
Adam Smith
  • `_ ,port = line.rsplit(".",1)` would be a lot better, I don't think this is really the issue though – Padraic Cunningham Dec 28 '14 at 00:00
  • @PadraicCunningham certainly the problem is `a = text_file.read()`, but I prefer using regex in instances like this: not the least bit because there are two ports in each line, and OP's code will only find one of them. – Adam Smith Dec 28 '14 at 00:02
  • Would you mind rewriting the whole code? Since I am a new python user I do not know how to exert these changes! BTW I do not know what does the last line of your code does! – Cristopher Van Paul Dec 28 '14 at 00:04
  • @CristopherVanPaul I edited with an extended definition of the regex. It basically finds anything that looks like the IP addresses in your logs (1-3 digits four times, separated by dots) then separates the port numbers and saves them in a list. I can't actually rewrite your code because I don't know anything about matplotlib, so I don't know what that is doing vs. what it might be able to do better !:) – Adam Smith Dec 28 '14 at 00:07
0

I know this is not a direct answer to your question, but since you are new to Python, there is a nice Coursera course dealing with that very subject: "Programming for Everybody (Python)". It is free to take and won't use too much of your time. The course starts February 2, 2015. Also, the textbook "Python for Informatics: Exploring Information" is a free Creative Commons download at http://www.pythonlearn.com/book.php. I hope this helps.