I want to see which tags occur most frequently in my dataset. When I try to do this on my own, I get something like this:
df['tags'].value_counts()
['Startup'] 80
['Bitcoin'] 79
['The Daily Pick'] 78
['Addiction', 'Health', 'Body', 'Alcohol', 'Mental Health'] 62
Some articles have many tags, but I would like to count the occurrences of each tag separately.
CodePudding user response:
IIUC, you need to use ast.literal_eval, explode(), and then value_counts().
from ast import literal_eval
import pandas as pd

# Parse each string into a real list, explode to one row per tag, then count occurrences.
res = df['tags'].apply(literal_eval).explode().value_counts()
print(res)
Output:
Startup 4
Bitcoin 3
Addiction 2
Health 2
Name: tags, dtype: int64
Sample input DataFrame:
df = pd.DataFrame({
    "tags": [
        "['Startup']", "['Startup']", "['Startup']", "['Startup']",
        "['Bitcoin']", "['Bitcoin']", "['Bitcoin']",
        "['Addiction', 'Health']", "['Addiction', 'Health']"
    ]
})
Thanks to @ljmc: NB, ast.literal_eval is not always safe. From the docs:
This function had been documented as “safe” in the past without defining what that meant. That was misleading. This is specifically designed not to execute Python code, unlike the more general eval(). [...] But it is not free from attack: A relatively small input can lead to memory exhaustion or to C stack exhaustion, crashing the process. There is also the possibility for excessive CPU consumption denial of service on some inputs. Calling it on untrusted data is thus not recommended.
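If some rows might hold malformed strings, a minimal defensive sketch (not from the original answer; parse_tags is a hypothetical helper name) is to wrap literal_eval so bad rows become empty lists. This only guards against parse errors and does not address the resource-exhaustion concern quoted above:
from ast import literal_eval

def parse_tags(value):
    # Return the parsed list, or an empty list if the value is not a valid Python literal.
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError):
        return []

res = df['tags'].apply(parse_tags).explode().value_counts()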
CodePudding user response:
You can use a collections.Counter and apply or agg to your series.
import pandas as pd
from collections import Counter
df = pd.DataFrame({
    "tags": [["Startup"], ["Bitcoin"], ["Startup", "Ethereum"]]
})

c = Counter()
# apply calls c.update once per row, adding that row's tags to the counter.
df["tags"].apply(c.update)
c contains
Counter({'Startup': 2, 'Bitcoin': 1, 'Ethereum': 1})
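As a small follow-up (not part of the original answer), the counter can be turned back into a pandas Series to get output shaped like value_counts():
import pandas as pd

# A Counter is a dict subclass, so it converts directly to a Series; sort to mimic value_counts().
tag_counts = pd.Series(c).sort_values(ascending=False)
print(tag_counts)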