I have a large dataset of football data and would like to work something out. The dataset contains many games, and for every game all the players in the club are listed in a column as a list. I would like to produce output like this:
Player clubs
Tom 3
Car 2
Jon 2
Tex 1
etc.
This is the code I have tried, but I get an error: "unhashable type: 'Series'"
df = pd.DataFrame({'club': ['Bath', 'Bath', 'Bristol', 'Bristol', 'Bristol', 'Swindon'],
                   'Players': [['Tom', 'Jon', 'Tex'], ['Tom', 'Jon', 'Tex'],
                               ['Car', 'Snow', 'Tom'], ['Car', 'Snow', 'Tom'],
                               ['Car', 'Snow', 'Tom'], ['Tom', 'Car', 'Jon']]})
tr = df.groupby('club')
trt = pd.Series([bg for bgs in tr.players_as_list for bg in bgs])
trt.value_counts()
CodePudding user response:
As you have a Series of lists, it will be slow to use explode and drop_duplicates. Here pure Python should be more efficient:
from collections import defaultdict

# map each player to the set of clubs they appear in
d = defaultdict(set)
for c, l in zip(df['club'], df['Players']):
    for k in l:
        d[k].add(c)

# count the distinct clubs per player
out = pd.Series({k: len(s) for k, s in d.items()}).sort_values(ascending=False)
Output:
Tom 3
Jon 2
Car 2
Tex 1
Snow 1
dtype: int64
Comparison on 600K rows:
# explode groupby
605 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# explode value_counts
476 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pure python
259 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
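For reference, the timings could be reproduced with a setup along these lines (a sketch; the original post does not show how the 600K-row frame was built, so tiling the 6-row sample is an assumption):

```python
import pandas as pd

# Hypothetical benchmark setup: tile the 6-row sample frame up to 600K rows.
df = pd.DataFrame({'club': ['Bath', 'Bath', 'Bristol', 'Bristol', 'Bristol', 'Swindon'],
                   'Players': [['Tom', 'Jon', 'Tex'], ['Tom', 'Jon', 'Tex'],
                               ['Car', 'Snow', 'Tom'], ['Car', 'Snow', 'Tom'],
                               ['Car', 'Snow', 'Tom'], ['Tom', 'Car', 'Jon']]})
big = pd.concat([df] * 100_000, ignore_index=True)

# Each approach can then be timed, e.g. with IPython's %timeit:
# %timeit big.explode('Players').groupby('Players')['club'].nunique()
print(len(big))  # 600000
```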
CodePudding user response:
You can try exploding the Players column, then count the unique values of the club column in each group:
out = (df.explode('Players')
         .groupby('Players')['club'].nunique()
         .to_frame('clubs').reset_index())

# or
out = (df.explode('Players').drop_duplicates(['Players', 'club'])
         .groupby('Players')['club'].count()
         .to_frame('clubs').reset_index())

# or
out = (df.explode('Players')
         .groupby('Players')['club'].unique().apply(len)
         .to_frame('clubs').reset_index())

print(out)
Players clubs
0 Car 2
1 Jon 2
2 Snow 1
3 Tex 1
4 Tom 3
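Note that the result above is sorted alphabetically by player. A self-contained sketch (rebuilding the sample frame) of sorting by the club count instead, to match the layout the asker requested:

```python
import pandas as pd

df = pd.DataFrame({'club': ['Bath', 'Bath', 'Bristol', 'Bristol', 'Bristol', 'Swindon'],
                   'Players': [['Tom', 'Jon', 'Tex'], ['Tom', 'Jon', 'Tex'],
                               ['Car', 'Snow', 'Tom'], ['Car', 'Snow', 'Tom'],
                               ['Car', 'Snow', 'Tom'], ['Tom', 'Car', 'Jon']]})

# count distinct clubs per player, then order by that count, descending
out = (df.explode('Players')
         .groupby('Players')['club'].nunique()
         .sort_values(ascending=False)
         .to_frame('clubs').reset_index())
print(out)
```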
CodePudding user response:
You can use value_counts instead of groupby after exploding your dataframe:
out = (df.explode('Players').drop_duplicates(['Players', 'club'])
.value_counts('Players').rename('clubs').reset_index())
print(out)
# Output
Players clubs
0 Tom 3
1 Car 2
2 Jon 2
3 Snow 1
4 Tex 1