I have a text file with a large amount of data (3 GB). Each line contains a time, a source IP, a destination IP and a size. In each address, the digits after the last dot are the port number. I want to plot a histogram of those port numbers. I managed this for 10,000 lines of data, but as I expected, the Python code cannot cope with the full file.

Here is a brief explanation of the code I have written. First I read the 10,000 lines, then I split them and put all the tokens in a list named everything_list. Please ignore the details of the while-loop condition. After that I collect all the source port numbers in a list and draw their histogram.

Now suppose I have a million lines of data: I cannot even read them in the first place, let alone categorize them. Some people told me to use arrays, and others told me to process one chunk of data and then move on to the next chunk. I am confused by all the different advice. Can anybody help me with this issue?
with open("test.data", "r") as text_file:
    a = text_file.read()
everything_list = a.split()
source_port_list = []
i=0
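# Each line splits into 7 tokens; the source address is token 2 of each group.
while 6+7*i<len(everything_list):
    source_element = everything_list[2+7*i]
    # The port is whatever follows the last dot of the address.
    source_port_position = source_element.rfind('.')
    source_port_number = int(source_element[source_port_position + 1:])
    source_port_list.append(source_port_number)
    i=i+1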
import matplotlib.pyplot as plt
numBins = 20
plt.hist(source_port_list, numBins, color='red', alpha=0.8)
plt.show()
This is the format of the lines:
15:42:42.719063 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460
15:42:42.719205 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460
15:42:42.719209 IP 129.63.57.175.45241 > 62.85.5.142.55455: tcp 0
15:42:42.719213 IP 24.34.41.8.1236 > 129.63.1.23.443: tcp 394
15:42:42.719217 IP 59.167.148.152.25918 > 129.63.57.40.36075: tcp 0
15:42:42.719260 IP 129.63.223.16.2823 > 80.67.87.25.80: tcp 682
15:42:42.719264 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0
15:42:42.719269 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0
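
To make the "process one chunk at a time" suggestion concrete, here is a rough sketch of how it might look, with each line treated as its own chunk. It is only a sketch under assumptions and has not been run on the full 3 GB file. The helper names source_port, count_source_ports and plot_histogram are hypothetical, and it assumes every valid line has the seven whitespace-separated fields shown above. Instead of storing every port in a list, it keeps one counter per distinct port, so memory use stays small however long the file is.

```python
from collections import Counter


def source_port(line):
    """Return the source port of one line, or None if the line is malformed."""
    fields = line.split()
    # Assumed layout: time, "IP", source, ">", destination, protocol, size
    if len(fields) < 7 or fields[1] != "IP":
        return None
    try:
        # The port is the part after the last dot of the source address.
        return int(fields[2].rsplit(".", 1)[1])
    except (IndexError, ValueError):
        return None


def count_source_ports(lines):
    """Count source ports from any iterable of lines, one line at a time."""
    counts = Counter()
    for line in lines:
        port = source_port(line)
        if port is not None:
            counts[port] += 1
    return counts


def plot_histogram(counts, num_bins=20):
    """Draw a histogram from the per-port counts (requires matplotlib)."""
    import matplotlib.pyplot as plt

    ports = list(counts.keys())
    weights = list(counts.values())
    # Weighting each distinct port by its count gives the same histogram
    # as passing the full list of ports.
    plt.hist(ports, bins=num_bins, weights=weights, color="red", alpha=0.8)
    plt.show()
```

With the file from the question, this could be used by opening it with open("test.data", "r"), passing the file object directly to count_source_ports, and then passing the result to plot_histogram. Iterating over a file object yields one line at a time, so the whole file is never loaded into memory at once.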