I am writing code that counts all the words in a text file, finds the 25 most frequent words (known as a signature), and stores them in a list. Next, it is supposed to compare the signatures using the Jaccard similarity measure. I have code for Jaccard similarity, but I need to adapt it to my program since I took it from a different example. The code that creates the signature gives me this error: Column 'Prophet' has dtype object, cannot use method 'nlargest' with this type
The code that I have for it is this:
#creating a signature for each txt file
signature_list = []
for column in df:
    if column != "Word": #for all columns that aren't word
        print(df.nlargest(25, column))
I was researching how I can sort all the words by the top 25 most common ones, and this is the most efficient way I found, but it's giving me this error. Is there another way that I can sort this out? Also, how would I then add the 25 words into a new list for each text? Any feedback is greatly appreciated. Please show the change in code. Thanks in advance!
CodePudding user response:
Cause of error: nlargest cannot act on an object column, such as a string column. Your df has at least one column (other than "Word") that is an object column, so the error is raised at that column.
Fix of error: run print(df.dtypes) to inspect the data type of each column, and then either don't apply nlargest to object columns or convert those columns to a numeric type (such as int or float).
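A minimal sketch of that check and conversion, assuming a hypothetical frame in which the 'Prophet' counts were read in as strings (the data below is invented for illustration and is not the asker's real data):

```python
import pandas as pd

# Hypothetical data: the counts in 'Prophet' were loaded as strings,
# so pandas does not treat the column as numeric.
df = pd.DataFrame({
    'Word': ['the', 'and', 'of', 'love'],
    'Prophet': ['120', '95', '80', '42'],
})

print(df.dtypes)  # 'Prophet' is reported as object (or a string dtype in newer pandas)

# Convert to numbers; errors='coerce' turns values that cannot be parsed into NaN.
df['Prophet'] = pd.to_numeric(df['Prophet'], errors='coerce')

# nlargest now works because the column is numeric.
top2 = df.nlargest(2, 'Prophet')
print(top2)
```

If the conversion produces many NaN values, the column probably does not hold counts at all, and the layout of the dataframe is worth checking first.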
how I can sort all the words by the top 25 most common ones and this is the most efficient way ... Is there another way that I can sort this out?
I think nlargest should be your choice. Just make sure the column you apply nlargest to holds the count of each corresponding word. Also, if you have such a column, you don't need a loop over all columns, because you only need to call nlargest on that particular column.
Also, how would I then add the 25 words into a new list for each text?
Refer to this example:
import pandas as pd

df = pd.DataFrame({
    'Word': ['happy', 'hello', 'you', 'he', 'she', 'it'],
    'Count': [27, 32, 19, 6, 80, 5]
})
# Index by word so each count stays paired with its word
largest = df.set_index('Word')['Count'].nlargest(3)
print(largest)
Output:
Word
she      80
hello    32
happy    27
Name: Count, dtype: int64
You will get a pd.Series of the 3 largest counts, with the corresponding words as the index. Then you can extract the index and convert it into a list with
largest.index.tolist()
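Applied to the layout in the question, a possible sketch looks like this. It assumes the frame has one "Word" column plus one numeric count column per text; the column names, the counts, and the jaccard helper are all hypothetical and only illustrate the idea:

```python
import pandas as pd

# Hypothetical wide-format frame: one count column per text.
df = pd.DataFrame({
    'Word': ['the', 'and', 'love', 'sea', 'night', 'king'],
    'Prophet': [50, 40, 30, 5, 10, 0],
    'Odyssey': [60, 45, 2, 35, 8, 20],
})

def jaccard(a, b):
    """Jaccard similarity of two word lists: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    # Returning 0.0 for two empty signatures is a convention chosen here.
    return len(a & b) / len(union) if union else 0.0

n = 3  # use 25 for the real signatures
counts = df.set_index('Word')

# Map each text name to the list of its n most frequent words.
signatures = {col: counts[col].nlargest(n).index.tolist() for col in counts.columns}

print(signatures)
print(jaccard(signatures['Prophet'], signatures['Odyssey']))
```

A dict keyed by text name keeps each signature attached to its source; list(signatures.values()) gives a plain list of signatures if that is what the rest of the program expects.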
Lastly, if your dataframe has a different column layout from my example above, it would be better to convert yours into that layout. If you are not sure how to convert it, please share a sample of your dataframe for us to look at. You can export the first 10 rows with print(df.head(10).to_dict('list')) and paste the output as text in your question.