I want to see which tags occur most frequently in my dataset. When I try to do this on my own, I get something like this:
df['tags'].value_counts()
['Startup'] 80
['Bitcoin'] 79
['The Daily Pick'] 78
['Addiction', 'Health', 'Body', 'Alcohol', 'Mental Health'] 62
Some articles have many tags, but I would like to count the occurrences of each tag separately.
CodePudding user response:
IIUC, you need to use ast.literal_eval, explode(), and then value_counts().
from ast import literal_eval
import pandas as pd

# Parse each string into a real list, explode to one row per tag, then count occurrences.
res = df['tags'].apply(literal_eval).explode().value_counts()
print(res)
Output:
Startup 4
Bitcoin 3
Addiction 2
Health 2
Name: tags, dtype: int64
Sample input DataFrame:
df = pd.DataFrame({
    "tags": [
        "['Startup']", "['Startup']", "['Startup']", "['Startup']",
        "['Bitcoin']", "['Bitcoin']", "['Bitcoin']",
        "['Addiction', 'Health']", "['Addiction', 'Health']"
    ]
})
Thanks to @ljmc: NB, ast.literal_eval is not always safe. From the docs:
This function had been documented as “safe” in the past without defining what that meant. That was misleading. This is specifically designed not to execute Python code, unlike the more general eval(). [...] But it is not free from attack: A relatively small input can lead to memory exhaustion or to C stack exhaustion, crashing the process. There is also the possibility for excessive CPU consumption denial of service on some inputs. Calling it on untrusted data is thus not recommended.
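If some rows might hold malformed strings, a minimal defensive sketch (not from the original answer; parse_tags is a hypothetical helper name) is to wrap literal_eval so bad rows become empty lists. This only guards against parse errors and does not address the resource-exhaustion concern quoted above:
from ast import literal_eval

def parse_tags(value):
    # Return the parsed list, or an empty list if the value is not a valid Python literal.
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError):
        return []

res = df['tags'].apply(parse_tags).explode().value_counts()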
CodePudding user response:
You can use a collections.Counter and apply or agg to your series.
import pandas as pd
from collections import Counter
df = pd.DataFrame({
    "tags": [["Startup"], ["Bitcoin"], ["Startup", "Ethereum"]]
})

c = Counter()
# apply calls c.update once per row, adding that row's tags to the counter.
df["tags"].apply(c.update)
c contains
Counter({'Startup': 2, 'Bitcoin': 1, 'Ethereum': 1})
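As a small follow-up (not part of the original answer), the counter can be turned back into a pandas Series to get output shaped like value_counts():
import pandas as pd

# A Counter is a dict subclass, so it converts directly to a Series; sort to mimic value_counts().
tag_counts = pd.Series(c).sort_values(ascending=False)
print(tag_counts)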