I have a large dataset of football data and would like to work something out. The dataset contains many games, and for every game all the players in the club are listed in a column as a list. I would like to produce output like this:
Player clubs
Tom 3
Car 2
Jon 2
Tex 1
etc.
This is the code I have tried, but I get an error: "unhashable type: 'Series'"
df = pd.DataFrame({'club': ['Bath', 'Bath', 'Bristol', 'Bristol', 'Bristol', 'Swindon'],
                   'Players': [['Tom', 'Jon', 'Tex'], ['Tom', 'Jon', 'Tex'],
                               ['Car', 'Snow', 'Tom'], ['Car', 'Snow', 'Tom'],
                               ['Car', 'Snow', 'Tom'], ['Tom', 'Car', 'Jon']]})
tr = df.groupby('club')
trt = pd.Series([bg for bgs in tr.players_as_list for bg in bgs])
trt.value_counts()
CodePudding user response:
As you have a Series of lists, it will be slow to use explode and drop_duplicates. Here pure Python should be more efficient:
from collections import defaultdict

# map each player to the set of clubs they appear in
d = defaultdict(set)
for c, l in zip(df['club'], df['Players']):
    for k in l:
        d[k].add(c)

# count the distinct clubs per player
out = pd.Series({k: len(s) for k, s in d.items()}).sort_values(ascending=False)
Output:
Tom 3
Jon 2
Car 2
Tex 1
Snow 1
dtype: int64
Comparison on 600K rows:
# explode groupby
605 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# explode value_counts
476 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pure python
259 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
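For reference, the timings could be reproduced with a setup along these lines (a sketch; the original post does not show how the 600K-row frame was built, so tiling the 6-row sample is an assumption):

```python
import pandas as pd

# Hypothetical benchmark setup: tile the 6-row sample frame up to 600K rows.
df = pd.DataFrame({'club': ['Bath', 'Bath', 'Bristol', 'Bristol', 'Bristol', 'Swindon'],
                   'Players': [['Tom', 'Jon', 'Tex'], ['Tom', 'Jon', 'Tex'],
                               ['Car', 'Snow', 'Tom'], ['Car', 'Snow', 'Tom'],
                               ['Car', 'Snow', 'Tom'], ['Tom', 'Car', 'Jon']]})
big = pd.concat([df] * 100_000, ignore_index=True)

# Each approach can then be timed, e.g. with IPython's %timeit:
# %timeit big.explode('Players').groupby('Players')['club'].nunique()
print(len(big))  # 600000
```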
CodePudding user response:
You can try exploding the Players column, then count the unique values of the club column in each group:
out = (df.explode('Players')
         .groupby('Players')['club'].nunique()
         .to_frame('clubs').reset_index())

# or
out = (df.explode('Players').drop_duplicates(['Players', 'club'])
         .groupby('Players')['club'].count()
         .to_frame('clubs').reset_index())

# or
out = (df.explode('Players')
         .groupby('Players')['club'].unique().apply(len)
         .to_frame('clubs').reset_index())

print(out)
Players clubs
0 Car 2
1 Jon 2
2 Snow 1
3 Tex 1
4 Tom 3
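Note that the result above is sorted alphabetically by player. A self-contained sketch (rebuilding the sample frame) of sorting by the club count instead, to match the layout the asker requested:

```python
import pandas as pd

df = pd.DataFrame({'club': ['Bath', 'Bath', 'Bristol', 'Bristol', 'Bristol', 'Swindon'],
                   'Players': [['Tom', 'Jon', 'Tex'], ['Tom', 'Jon', 'Tex'],
                               ['Car', 'Snow', 'Tom'], ['Car', 'Snow', 'Tom'],
                               ['Car', 'Snow', 'Tom'], ['Tom', 'Car', 'Jon']]})

# count distinct clubs per player, then order by that count, descending
out = (df.explode('Players')
         .groupby('Players')['club'].nunique()
         .sort_values(ascending=False)
         .to_frame('clubs').reset_index())
print(out)
```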
CodePudding user response:
You can use value_counts instead of groupby after exploding your dataframe:
out = (df.explode('Players').drop_duplicates(['Players', 'club'])
.value_counts('Players').rename('clubs').reset_index())
print(out)
# Output
Players clubs
0 Tom 3
1 Car 2
2 Jon 2
3 Snow 1
4 Tex 1