How to count words from strings read from a file?

Question

I am trying to create a program that takes all text files in a given path and saves all their strings in one list:

import os
import collections

vocab = set()
path = 'a\\path\\'

listing = os.listdir(path)
unwanted_chars = ".,-_/()*"
for file in listing:
    #print('Current file : ', file)
    pos_review = open(path+file, "r", encoding ='utf8')
    words = pos_review.read().split()
    #print(type(words))
    vocab.update(words)
    pos_review.close()

print(vocab)
pos_dict = dict.fromkeys(vocab,0)
print(pos_dict)

Input

file1.txt: A quick brown fox.
file2.txt: a quick boy ran.
file3.txt: fox ran away.

Output

A : 2
quick : 2
brown : 1
fox : 2
boy : 1
ran : 2
away : 1

So far I am able to make a dictionary of those strings, but I am not sure how to make key-value pairs of the strings and their frequency across all text files combined.

Loop over the dictionary's keys and values with `for k, v in pos_dict.items():`, then just print out the information – Frontear, Oct 03 '19 at 17:29
Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [Minimal, complete, verifiable example](https://stackoverflow.com/help/minimal-reproducible-example) applies here. We cannot effectively help you until you post your MCVE code and accurately specify the problem. We should be able to paste your posted code into a text file and reproduce the problem you specified. — Prune, Oct 03 '19 at 17:37

Sudharsana Rajasekaran · Accepted Answer · 2019-10-03T18:08:24.197

0

Hope this helps,

import os

path = 'a\\path\\'

whole = []  # every word from every file, in reading order
for file in os.listdir(path):
    #print('Current file : ', file)
    # 'with' closes each file as soon as it has been read
    with open(os.path.join(path, file), "r", encoding='utf8') as pos_review:
        whole.extend(pos_review.read().split())

d = {}  # Creating an empty dictionary: word -> count
for item in whole:
    if item in d:
        d[item] += 1  # Update count
    else:
        d[item] = 1  # First occurrence
print(d)
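
With the sample input from the question, splitting on whitespace alone keeps `fox.` and `fox`, and `A` and `a`, as separate keys. Below is a minimal sketch of one way to merge them, assuming the question's `unwanted_chars` are meant to be stripped from the ends of each word and that case should be ignored. The helper name `count_words` and the inline sample strings are illustrative, not part of the original code:

```python
unwanted_chars = ".,-_/()*"  # taken from the question


def count_words(texts):
    """Count words across several strings, ignoring case and edge punctuation."""
    counts = {}
    for text in texts:
        for word in text.split():
            # Assumption: normalize by trimming unwanted characters and lowercasing
            word = word.strip(unwanted_chars).lower()
            if word:  # skip tokens that were only punctuation
                counts[word] = counts.get(word, 0) + 1
    return counts


# Stand-ins for the contents of file1.txt, file2.txt and file3.txt
sample = ["A quick brown fox.", "a quick boy ran.", "fox ran away."]
print(count_words(sample))
```

In the real program, `texts` would be the strings returned by reading each file rather than a hard-coded list.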

edited Oct 03 '19 at 18:08

answered Oct 03 '19 at 17:50


I am able to create a dictionary with count values for one file, but as I am reading all the files in a given directory, it creates a list of lists, for which the straightforward method doesn't work. – user3863292 Oct 03 '19 at 17:55
Modified the answer. Is it what you are looking for? – Sudharsana Rajasekaran Oct 03 '19 at 18:10
Thanks, this is exactly what I was trying. – user3863292 Oct 03 '19 at 18:21

Trenton McKinney · Answer 2 · 2019-10-03T21:46:22.007

Use `collections.Counter`:

A `Counter` is a dict subclass for counting hashable objects.
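
As a small illustration of what that means (the sample words are made up for this example), a `Counter` can be built from any iterable of hashable items and then updated with more:

```python
from collections import Counter

counts = Counter(["fox", "ran", "fox"])  # count items from a list
counts.update(["ran", "away"])           # add counts from another iterable

print(counts["fox"])  # 2
print(counts["boy"])  # 0, missing keys count as zero instead of raising KeyError
```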

Data

Given 3 files, named t1.txt, t2.txt & t3.txt
Each file contains the following 3 lines of text

file1 txt A quick brown fox.
file2 txt a quick boy ran.
file3 txt fox ran away.

Code:

Get the files with pathlib:

from pathlib import Path

files = list(Path('e:/PythonProjects/stack_overflow/t-files').glob('t*.txt'))
print(files)

# Output
[WindowsPath('e:/PythonProjects/stack_overflow/t-files/t1.txt'),
 WindowsPath('e:/PythonProjects/stack_overflow/t-files/t2.txt'),
 WindowsPath('e:/PythonProjects/stack_overflow/t-files/t3.txt')]

Collect and count the words:

Create a separate function, clean_string, for cleaning each line of text:
- str.lower for lowercase letters
- str.translate, str.maketrans & string.punctuation for highly optimized punctuation removal (from "Best way to strip punctuation from a string")

from collections import Counter
import string

def clean_string(value: str) -> list:
    value = value.lower()
    value = value.translate(str.maketrans('', '', string.punctuation))
    value = value.split()
    return value

words = Counter()
for file in files:
    # encoding matches the utf8 used in the question
    with file.open('r', encoding='utf8') as f:
        for line in f:  # iterate lazily instead of loading every line with readlines()
            words.update(clean_string(line))

print(words)

# Output
Counter({'file1': 3,
         'txt': 9,
         'a': 6,
         'quick': 6,
         'brown': 3,
         'fox': 6,
         'file2': 3,
         'boy': 3,
         'ran': 6,
         'file3': 3,
         'away': 3})

List of `words`:

list_words = list(words.keys())
print(list_words)

>>> ['file1', 'txt', 'a', 'quick', 'brown', 'fox', 'file2', 'boy', 'ran', 'file3', 'away']
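
To print the counts in the `word : count` layout shown in the question, one option is `Counter.most_common()`, which returns `(word, count)` pairs sorted from most to least frequent. A short sketch, using a small hand-built `Counter` as a stand-in for the `words` object above:

```python
from collections import Counter

# Stand-in for the `words` Counter built from the files
words = Counter({"quick": 2, "fox": 2, "brown": 1})

# One "word : count" line per entry, most frequent first
lines = [f"{word} : {count}" for word, count in words.most_common()]
print("\n".join(lines))
```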

score 0 · Answer 3 · answered Oct 03 '19 at 19:15

This also works:

import glob

import pandas as pd

files = glob.glob('test*.txt')
txts = []

for f in files:
    # read each file fully and collect its text
    with open(f, 'r') as t:
        txts.append(t.read())

texts=' '.join(txts)
df = pd.DataFrame({'words':texts.split()})
out = df.words.value_counts().to_dict()