Creating signatures for txt files

I am writing code that counts all the words in a text file and is supposed to find the 25 most frequently occurring words, which are known as the file's signature, and store them in a list. Next, it is supposed to compare the signatures using the Jaccard similarity measure. I have code for Jaccard similarity, but I need to adapt it to my program since I took it from a different example. The code that creates the signature gives me this error: Column 'Prophet' has dtype object, cannot use method 'nlargest' with this type

The code that I have for it is this:

# creating a signature for each txt file
signature_list = []
for column in df:
    if column != "Word":  # for all columns that aren't "Word"
        print(df.nlargest(25, column))

I was researching ways to sort the words so that I get the 25 most common ones, and this seemed to be the most efficient way, but it gives me this error. Is there another way I can sort this out? Also, how would I then add the 25 words to a new list for each text? Any feedback is greatly appreciated. Please show the change in code. Thanks in advance!

CodePudding user response:

Cause of error:

nlargest cannot act on an object column, such as a string column. Your df has at least one column other than "Word" that is an object column ('Prophet', according to the error message), so the error is raised at that column.

Fix of error:

Run print(df.dtypes) to inspect the data type of each column, and then either do not apply nlargest to the object columns or convert them to a numeric type (such as int or float).
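A minimal sketch of that check and conversion, assuming the counts were read in as strings. The sample data is hypothetical (only the column name 'Prophet' comes from the error message), and turning unparseable values into 0 is an assumption that may not suit your data:

```python
import pandas as pd

# Hypothetical data: the counts in "Prophet" were loaded as strings.
df = pd.DataFrame({
    "Word": ["the", "and", "of", "love"],
    "Prophet": ["120", "95", "80", "n/a"],
})

# Inspect the dtypes; "Prophet" is reported as a non-numeric column.
print(df.dtypes)

# Convert to numbers. Values that cannot be parsed become NaN,
# which is replaced with 0 here (an assumption for this sketch).
df["Prophet"] = (
    pd.to_numeric(df["Prophet"], errors="coerce").fillna(0).astype(int)
)

# nlargest now works because the column is numeric.
top = df.nlargest(2, "Prophet")
print(top)
```

If the column still contains text after this step, it probably does not hold counts at all and should be excluded rather than converted.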

You asked: "how I can sort all the words by the top 25 most common ones and this is the most efficient way ... Is there another way that I can sort this out?"

I think nlargest is the right choice; just make sure the column you apply it to holds the counts of the corresponding words. Also, if you have such a column, you don't need a loop over all columns, because you only need to call nlargest on that particular column.

You asked: "Also, how would I then add the 25 words into a new list for each text?"

Refer to this example:

import pandas as pd

df = pd.DataFrame({
    'Word': ['happy', 'hello', 'you', 'he', 'she', 'it'],
    'Count': [27, 32, 19, 6, 80, 5]
})

largest = df.set_index('Word')['Count'].nlargest(3)
print(largest)

Output:

Word
she      80
hello    32
happy    27
Name: Count, dtype: int64

You get a pd.Series holding the 3 largest counts, with the corresponding words as the index. You can then extract the index and convert it into a list with:

largest.index.tolist()
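If your dataframe instead has one count column per text, as the loop in the question suggests, the same idea can be applied to each column to build one signature per text. The sketch below assumes such a layout; the column names TextA and TextB, the sample counts, and the jaccard helper are all hypothetical and are not taken from your code:

```python
import pandas as pd

# Hypothetical wide table: one numeric count column per text file.
df = pd.DataFrame({
    "Word": ["the", "and", "love", "sea", "ship"],
    "TextA": [50, 40, 30, 2, 1],
    "TextB": [60, 35, 1, 25, 20],
})

counts = df.set_index("Word")

# One signature (list of top words) per text.
# Use 25 instead of 3 for the signatures described in the question.
signatures = {
    col: counts[col].nlargest(3).index.tolist() for col in counts.columns
}

def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

similarity = jaccard(signatures["TextA"], signatures["TextB"])
print(signatures)
print(similarity)
```

Storing the signatures in a dict keyed by column name keeps each list tied to its text, which makes the pairwise comparison easier than with a flat list.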

Lastly, if your dataframe has a different column layout from my example above, it would be easier to convert yours into that layout. If you are not sure how to convert it, please share a sample of your dataframe for us to look at. You can export the first 10 rows with print(df.head(10).to_dict('list')) and paste the output as text in your question.
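As a rough illustration of such a conversion, here is a sketch that assumes your dataframe is wide, with a "Word" column plus one numeric count column per text. The column names TextA and TextB and the counts are hypothetical, so the real conversion may differ once your data is known:

```python
import pandas as pd

# Hypothetical wide layout: one count column per text.
wide = pd.DataFrame({
    "Word": ["the", "and", "love"],
    "TextA": [50, 40, 30],
    "TextB": [60, 35, 1],
})

# Reshape to one row per (word, text) pair with a single "Count" column.
long = wide.melt(id_vars="Word", var_name="Text", value_name="Count")

# Select one text, then proceed exactly as in the example above.
text_a = long[long["Text"] == "TextA"]
largest = text_a.set_index("Word")["Count"].nlargest(2)
print(largest.index.tolist())
```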