I have a pandas DataFrame:
text
["string1","string2"]
["string","string3"]
["string2"]
I am running this code to extract the strings in the df['text'] column:
strings_list = set(df['text'].str.extractall('"([^"]+)"')[0])
And I get a set of distinct strings: string1, string2, string3, string
What I need from here:
I need to count how many times each of these strings appears in the whole DataFrame.
My problem is that some rows, as you can see, have more than one string, and I also need to remove the [] and "" and split the strings up individually.
In the example above, the output should be:
string1 : 1,
string2 : 2,
string3 : 1,
string : 1
I am having a hard time with this task. Can someone help me?
CodePudding user response:
We can try explode, then value_counts:
out = df.text.explode().value_counts()
Out[49]:
string2 2
string1 1
string3 1
string 1
Name: text, dtype: int64
In case the cells are strings rather than lists, parse them with ast.literal_eval first:
import ast
df.text.map(ast.literal_eval).explode().value_counts()
Out[54]:
string2 2
string1 1
string3 1
string 1
Name: text, dtype: int64
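The two cases above can be folded into one helper. This is only a minimal sketch: the name count_strings is made up for illustration, and it assumes every cell is either a real list or a string that ast.literal_eval can parse into one.

```python
import ast

import pandas as pd


def count_strings(series):
    """Count each item across a column of lists or string-encoded lists."""
    # Parse cells that are still text such as '["a","b"]'; leave real lists alone.
    parsed = series.map(
        lambda cell: ast.literal_eval(cell) if isinstance(cell, str) else cell
    )
    # explode gives one row per list item; value_counts tallies them.
    return parsed.explode().value_counts()


as_lists = pd.DataFrame(
    {"text": [["string1", "string2"], ["string", "string3"], ["string2"]]}
)
as_text = pd.DataFrame(
    {"text": ['["string1","string2"]', '["string","string3"]', '["string2"]']}
)

counts_from_lists = count_strings(as_lists["text"])
counts_from_text = count_strings(as_text["text"])
print(counts_from_lists.to_dict())
print(counts_from_text.to_dict())
```

Both calls should give the same counts, so the helper can be used without checking the column type first.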
CodePudding user response:
You could try this.
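import re
import pandas as pd
df = pd.DataFrame([['"string1","string2"'],['"string","string3"'],['"string2"']], columns=['text'])
word_count = {}
# Collect the distinct quoted strings, then count each one across all rows.
strings_list = set(df['text'].str.extractall('"([^"]+)"')[0])
for word in strings_list:
    # str.count treats its argument as a regex, so escape the word first.
    word_count[word] = df.text.str.count('"' + re.escape(word) + '"').sum()
print(word_count)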
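If the loop feels heavy, the matches returned by extractall can probably be counted directly, since it yields one row per match. A rough sketch, assuming the same quoted-string cells as in the snippet above:

```python
import pandas as pd

df = pd.DataFrame(
    {"text": ['"string1","string2"', '"string","string3"', '"string2"']}
)

# extractall returns one row per match; column 0 holds the first capture group.
matches = df["text"].str.extractall(r'"([^"]+)"')[0]

# Count the matches themselves instead of searching the column once per word.
word_count = matches.value_counts().to_dict()
print(word_count)
```

This scans the column once and avoids building regex patterns from the data.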
CodePudding user response:
Can you try the following:
import collections
import pandas as pd
df = pd.DataFrame([[["string1","string2"]], [["string","string3"]], [["string2"]]], columns=['text'])
print(df)
mylist = [t for text in df['text'].values for t in text]
print(mylist)
print(collections.Counter(mylist))
Output:
text
0 [string1, string2]
1 [string, string3]
2 [string2]
['string1', 'string2', 'string', 'string3', 'string2']
Counter({'string2': 2, 'string1': 1, 'string': 1, 'string3': 1})
Example 2 (if the cells are strings rather than lists):
import ast
import collections
import pandas as pd
df = pd.DataFrame([['["string1","string2"]'],['["string","string3"]'],['["string2"]']], columns=['text'])
temp = [ast.literal_eval(text) for text in df['text'].values]
mylist = [t for text in temp for t in text]
print(collections.Counter(mylist))
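To print the counts in the "name : count" layout the question asks for, something along these lines might work. It is a sketch only, using itertools.chain.from_iterable in place of the nested comprehension and assuming the cells are real lists:

```python
import collections
import itertools

import pandas as pd

df = pd.DataFrame(
    {"text": [["string1", "string2"], ["string", "string3"], ["string2"]]}
)

# Flatten the lists from every row into one stream of strings and count them.
counts = collections.Counter(itertools.chain.from_iterable(df["text"]))

# most_common() yields (word, count) pairs, highest count first.
lines = [f"{word} : {n}" for word, n in counts.most_common()]
print(",\n".join(lines))
```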