Home > other >  Creating Signature for Text
Creating Signature for Text

Time:02-10

I'm creating a program where I need to read all the lines and words in a txt file, count the number of times the words show up and do this for multiple txt files. Then I need to create a "signature" which is the 25 most common words and compare each txt file's signature to the other files. I worked all day today to get the program that counts the words in each text but I was stuck on the part on how to get the signature. Essentially when the program runs, this is what is shown: enter image description here

The program creates a column called word, which shows all the words that are in the text and the number of times they show up for each text file. I have 2 for now, but I'll have more later on. I need to sort this list of words out, so the top 25 words that show up the most will be part of the "signature" and stored in a list, one list for each text. I am not sure how to sort out this many words. I have been thinking about how I could do this, and I thought of creating a list of lists, but I don't think that will work. Could anyone suggest something and show some code with it? I could also show you the program in private and show the change in code that way. Any help would really be great considering how long this has taken me today. Thanks in advance!

CodePudding user response:

you have all the words with the repeating frequency just apply

from collections import Counter

count = Counter(your whole code text file to string)

this will give you the frequency of each word and then sort this out in descending order with this code

list(reversed(sorted(count.keys())))

and then you have the list of the most repeating numbers.

CodePudding user response:

You could try this

import pandas as pd

df = pd.DataFrame([['word1',1,222], ['word2',10,20],['word3',111,1],['word4',11,62]], columns =['word', 'file1','file2'])

#Convert the columns containing word count to numeric
df['file1'] = pd.to_numeric(df['file1'])
df['file2'] = pd.to_numeric(df['file2'])

wordlist =[]
for column in df.columns:
    if column != 'word':
        #sort datatable columnwise and pick the top words from word column. Replace the value 3 by the required number.
        #append it to a list of lists
        wordlist.append([df.sort_values(column, ascending=False)['word'].head(3)])

print(wordlist)
  • Related