Counting strings after preprocessing of a dataframe column

I have a pandas dataframe:

       text
["string1","string2"]
["string","string3"]
["string2"]

I am running this code to extract the strings in the df['text'] column:

strings_list = set(df['text'].str.extractall('\"([^"]+)\"')[0])

And I get a set of distinct strings: string1, string2, string3, string

What I need from here:

I need to count how many times each of these strings appears in the whole dataframe. My problem is that some rows, as you can see, have more than one string, and I also need to remove the [] and "" and split them individually.

In the example above, the output should be:

string1 : 1,
string2 : 2,
string3 : 1,
string  : 1

I am having a hard time with this task. Can someone help me?

CodePudding user response：

We can try explode, then value_counts. This works when each cell already holds a Python list:

out = df.text.explode().value_counts()
Out[49]: 
string2    2
string1    1
string3    1
string     1
Name: text, dtype: int64

In case the cells hold strings rather than lists, parse them with ast.literal_eval first:

import ast

df.text.map(ast.literal_eval).explode().value_counts()
Out[54]: 
string2    2
string1    1
string3    1
string     1
Name: text, dtype: int64
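
For reference, here is a minimal self-contained sketch of the explode approach that accepts either kind of cell. The sample data mirrors the question; the helper name to_list is hypothetical and only used for illustration.

```python
import ast

import pandas as pd

# Sample data mirroring the question, stored as strings that look like lists.
df = pd.DataFrame(
    {"text": ['["string1","string2"]', '["string","string3"]', '["string2"]']}
)


def to_list(value):
    """Hypothetical helper: parse a string cell into a list, pass lists through."""
    return ast.literal_eval(value) if isinstance(value, str) else value


# One row per string, then count how often each string occurs.
counts = df["text"].map(to_list).explode().value_counts()
print(counts.to_dict())
```

Converting with to_dict() is optional; it just gives a plain mapping from each string to its count.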

CodePudding user response：

You could try this.

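import re
import pandas as pd

df = pd.DataFrame([['"string1","string2"'],['"string","string3"'],['"string2"']], columns=['text'])
word_count = {}
# Collect the distinct quoted strings found anywhere in the column
strings_list = set(df['text'].str.extractall('\"([^"]+)\"')[0])
for word in strings_list:
    # str.count treats its argument as a regex, so escape the word before matching
    word_count[word] = df.text.str.count('"' + re.escape(word) + '"').sum()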

print(word_count)
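
As a possible simplification, the loop can be skipped: extractall already returns one row per quoted match, so counting those matches directly should give the same totals. This is a sketch under the same assumption as above, namely that every string of interest is wrapped in double quotes.

```python
import pandas as pd

df = pd.DataFrame(
    {"text": ['"string1","string2"', '"string","string3"', '"string2"']}
)

# extractall yields one row per match; column 0 holds the captured group.
matches = df["text"].str.extractall(r'"([^"]+)"')[0]

# Count the matches in a single pass instead of re-scanning the column per word.
word_count = matches.value_counts().to_dict()
print(word_count)
```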

CodePudding user response：

Can you try the following:

import collections
import pandas as pd

df = pd.DataFrame([[["string1","string2"]], [["string","string3"]], [["string2"]]], columns=['text'])
print(df)
mylist = [t for text in df['text'].values for t in text]
print(mylist)
print(collections.Counter(mylist))

Output:

                 text
0  [string1, string2]
1   [string, string3]
2           [string2]

['string1', 'string2', 'string', 'string3', 'string2']

Counter({'string2': 2, 'string1': 1, 'string': 1, 'string3': 1})

Example 2 (the column holds strings that look like lists):

import ast
import collections
import pandas as pd

df = pd.DataFrame([['["string1","string2"]'],['["string","string3"]'],['["string2"]']], columns=['text'])
temp = [ast.literal_eval(text) for text in df['text'].values]
mylist = [t for text in temp for t in text]
print(collections.Counter(mylist))
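
If the goal is the "name : count" layout from the question, the Counter can be formatted directly. The following is a small sketch, not the only way to do it; it uses itertools.chain in place of the nested list comprehension, and the variable names are illustrative.

```python
import ast
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame(
    {"text": ['["string1","string2"]', '["string","string3"]', '["string2"]']}
)

# Parse each cell into a list, then flatten all lists into one stream of strings.
parsed = (ast.literal_eval(text) for text in df["text"])
counter = Counter(chain.from_iterable(parsed))

# most_common() orders entries from highest to lowest count.
lines = [f"{name} : {count}" for name, count in counter.most_common()]
print(",\n".join(lines))
```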