Count the number of occurrences of each word in a file and load into pandas

Time:03-15

How do I count the number of occurrences of each word in a .txt file, load the result into a pandas dataframe with columns name and count, and sort the dataframe on the count column?

CodePudding user response:

Assuming test.txt contains this data:

stack monkey zimbra
flow zimbra zimbra help Edit Name
Name

You can do it like this:

import pandas as pd

# Open the file in read mode
text = open("test.txt", "r")
  
# Create an empty dictionary
dic = dict()
  
# Loop through each line of the file
for line in text:
    # Remove leading and trailing whitespace, including the newline
    line = line.strip()
  
    # Convert the characters in line to 
    # lowercase to avoid case mismatch
    line = line.lower()
  
    # Split the line into words on any run of whitespace
    # (split() with no argument also skips empty strings)
    words = line.split()
  
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in dic:
            # Increment count of word by 1
            dic[word] = dic[word] + 1
        else:
            # Add the word to dictionary with count 1
            dic[word] = 1

# Close the file now that every line has been read
text.close()

# Convert dict into a dataframe
# (named df so the pandas module alias pd is not overwritten)
df = pd.DataFrame(list(dic.items()), columns=['Name', 'Occurrence'])
print(df)

Output:

     Name  Occurrence
0   stack           1
1  monkey           1
2  zimbra           3
3    flow           1
4    help           1
5    edit           1
6    name           2
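
The question also asks for the dataframe to be sorted by count, which the output above does not do. A minimal sketch of one way to add that, assuming the same lowercased, whitespace-separated words: `count_words` is a hypothetical helper name, `collections.Counter` is swapped in for the manual dictionary loop, and the sample lines are inlined so the block runs without test.txt.

```python
from collections import Counter

import pandas as pd


def count_words(lines):
    """Count lowercased, whitespace-separated words (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # split() with no argument ignores blank lines and repeated spaces
        counts.update(line.lower().split())
    return counts


# Same sample data as test.txt, inlined so the sketch runs on its own
sample = ["stack monkey zimbra", "flow zimbra zimbra help Edit Name", "Name"]

df = pd.DataFrame(list(count_words(sample).items()), columns=["Name", "Occurrence"])

# Sort by count, highest first, and renumber the rows
df = df.sort_values("Occurrence", ascending=False).reset_index(drop=True)
print(df)
```

To count a real file instead, an open file object can be passed in place of the list, for example `with open("test.txt") as f: counts = count_words(f)`.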

CodePudding user response:

Use nltk:

# pip install nltk
from nltk.tokenize import RegexpTokenizer
from nltk import FreqDist
import pandas as pd

text = """How do I count the number of occurrences of each word in a .txt file and also load it into the pandas dataframe with columns name and count, also sort the dataframe on column count?"""

tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)

sr = pd.Series(FreqDist(words))

Output:

>>> sr
How            1
do             1
I              1
count          3
the            3
number         1
of             2
occurrences    1
each           1
word           1
in             1
a              1
txt            1
file           1
and            2
also           2
load           1
it             1
into           1
pandas         1
dataframe      2
with           1
columns        1
name           1
sort           1
on             1
column         1
dtype: int64
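
To get from this Series to the sorted two-column dataframe the question asks for, something like the following should work. It is only a sketch: `re.findall` and `collections.Counter` stand in for the nltk tokenizer and `FreqDist` so that the block runs without nltk installed, and the column names `name` and `count` are taken from the question.

```python
import re
from collections import Counter

import pandas as pd

text = (
    "How do I count the number of occurrences of each word in a .txt file "
    "and also load it into the pandas dataframe with columns name and count, "
    "also sort the dataframe on column count?"
)

# Stand-in for RegexpTokenizer(r'\w+') and FreqDist: same regex, stdlib only
words = re.findall(r"\w+", text)
sr = pd.Series(Counter(words))

# Name the index, turn it into a column, then sort by count, highest first
df = (
    sr.rename_axis("name")
    .reset_index(name="count")
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)
print(df.head())
```

The same chain of `rename_axis`, `reset_index` and `sort_values` can be applied directly to the `sr` built from `FreqDist(words)`.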