Count the number of occurrences of each word in a file and load into pandas

How do I count the number of occurrences of each word in a .txt file, load the result into a pandas DataFrame with the columns name and count, and sort the DataFrame by the count column?

CodePudding user response：

Assuming test.txt contains this data:

stack monkey zimbra
flow zimbra zimbra help Edit Name
Name

You can do it like this:

import pandas as pd

# Open the file in read mode
text = open("test.txt", "r")
  
# Create an empty dictionary
dic = dict()
  
# Loop through each line of the file
for line in text:
    # Remove leading and trailing whitespace, including the newline character
    line = line.strip()
  
    # Convert the characters in line to 
    # lowercase to avoid case mismatch
    line = line.lower()
  
    # Split the line into words on any run of whitespace
    # (this also skips blank lines instead of counting an empty string)
    words = line.split()
  
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in dic:
            # Increment count of word by 1
            dic[word] = dic[word] + 1
        else:
            # Add the word to dictionary with count 1
            dic[word] = 1

# Close the file once every line has been read
text.close()

# Convert the dict into a DataFrame
# (named df so that the pandas alias pd is not overwritten)
df = pd.DataFrame(dic.items(), columns=['Name', 'Occurrence'])
print(df)

Output:

     Name  Occurrence
0   stack           1
1  monkey           1
2  zimbra           3
3    flow           1
4    help           1
5    edit           1
6    name           2
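
The question also asks for columns called name and count and for sorting on count, which the snippet above leaves out. Here is a minimal sketch of one way to add that. It assumes collections.Counter in place of the hand-written dictionary, and a hard-coded list (sample_lines, a hypothetical name) standing in for the lines of test.txt:

```python
from collections import Counter

import pandas as pd


def count_words(lines):
    """Count lowercased, whitespace-separated words across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Hypothetical stand-in for the lines of test.txt; an open file object
# can be passed to count_words() the same way.
sample_lines = [
    "stack monkey zimbra",
    "flow zimbra zimbra help Edit Name",
    "Name",
]

counts = count_words(sample_lines)

# Build the two-column frame, then sort by count with the largest first
df = pd.DataFrame(list(counts.items()), columns=["name", "count"])
df = df.sort_values("count", ascending=False).reset_index(drop=True)
print(df)
```

To read from the real file, something like `with open("test.txt") as f: counts = count_words(f)` should work, and the with block closes the file automatically.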

CodePudding user response：

Use nltk:

# pip install nltk
from nltk.tokenize import RegexpTokenizer
from nltk import FreqDist
import pandas as pd

text = """How do I count the number of occurrences of each word in a .txt file and also load it into the pandas dataframe with columns name and count, also sort the dataframe on column count?"""

tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)

sr = pd.Series(FreqDist(words))

Output:

>>> sr
How            1
do             1
I              1
count          3
the            3
number         1
of             2
occurrences    1
each           1
word           1
in             1
a              1
txt            1
file           1
and            2
also           2
load           1
it             1
into           1
pandas         1
dataframe      2
with           1
columns        1
name           1
sort           1
on             1
column         1
dtype: int64
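
The Series above is neither sorted nor a two-column DataFrame yet. A rough sketch of how to get there with pandas alone follows. It assumes re.findall with the same \w+ pattern as a stand-in for the nltk tokenizer (so the example runs without nltk), and uses value_counts() for the tally:

```python
import re

import pandas as pd

text = (
    "How do I count the number of occurrences of each word in a .txt file "
    "and also load it into the pandas dataframe with columns name and count, "
    "also sort the dataframe on column count?"
)

# Assumption: re.findall stands in for RegexpTokenizer(r'\w+') here
words = re.findall(r"\w+", text)

# value_counts() tallies each token and returns the counts in descending order;
# rename_axis/reset_index turn the result into name and count columns
df = (
    pd.Series(words)
    .value_counts()
    .rename_axis("name")
    .reset_index(name="count")
)
print(df.head())
```

Starting from the nltk result instead, the same rename_axis and reset_index steps could be applied to `sr.sort_values(ascending=False)`.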