Get pandas df column value distribution

I am trying to create a new column in a pandas DataFrame from the values of another column. Starting from the goals column, I want to distribute the goal counts per team into an array of 10 positions, where position 0 shows how many players have scored between 0 and 10 goals, and so on up to the last position (9), which counts how many have scored between 90 and 100.

For example:

From a dataframe like this one:

| player_id | team    | goals  |
|-----------|---------|--------|
| ply_1     | Arsenal | 100    |
| ply_2     | Arsenal | 2      |
| ply_3     | Chelsea | 21     | 
| ply_4     | Chelsea | 13     |  
| ply_5     | Arsenal | 50     |

Get one like the following:

| player_id | team    | goals | goals_distribution_by_team| 
|-----------|---------|-------|---------------------------|
| ply_1     | Arsenal | 100   | [1,0,0,0,0,1,0,0,0,1]     | 
| ply_2     | Arsenal | 2     | [1,0,0,0,0,1,0,0,0,1]     | 
| ply_3     | Chelsea | 21    | [0,1,1,0,0,0,0,0,0,0]     | 
| ply_4     | Chelsea | 13    | [0,1,1,0,0,0,0,0,0,0]     | 
| ply_5     | Arsenal | 50    | [1,0,0,0,0,1,0,0,0,1]     |

In this case we can see that for Arsenal the goal counts fall in the ranges 0-10 (ply_2), 50-60 (ply_5) and 90-100 (ply_1).

So far I have achieved this with a for loop that goes through the goals column and checks which range each goal count falls in:


counts = [0] * 10  # one counter per range
for goal in goals:
    if 0 < goal <= 10:
        counts[0] += 1
    if 10 < goal <= 20:
        counts[1] += 1
    # ....
    if 90 < goal <= 100:
        counts[9] += 1

Is there a more pythonic way to achieve this?

Thanks!

CodePudding user response：

Use get_dummies on the goals column after integer division by 10, then add zero-filled columns for the missing ranges with DataFrame.reindex, spread the 1s across each group with GroupBy.transform, and finally convert to lists:

# dtype=int keeps the 0/1 output (pandas >= 2.0 returns booleans by default)
df1 = pd.get_dummies(df['goals'] // 10, dtype=int)
max1 = range(df1.columns.max() + 1)

df['goals_distribution_by_team'] = (df1.reindex(max1, axis=1, fill_value=0)
                                       .groupby(df['team']).transform('max')
                                       .to_numpy().tolist())
print (df)
  player_id     team  goals         goals_distribution_by_team
0     ply_1  Arsenal    100  [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
1     ply_2  Arsenal      2  [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
2     ply_3  Chelsea     21  [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
3     ply_4  Chelsea     13  [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
4     ply_5  Arsenal     50  [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
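
Note that the lists above have 11 positions, because 100 // 10 is 10, while the question asks for 10 positions with 100 counted in the last one. A minimal sketch of one way to get that, assuming goals are non-negative integers and that folding 100 into position 9 is what is wanted (the names N_BINS and bin_idx are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "player_id": ["ply_1", "ply_2", "ply_3", "ply_4", "ply_5"],
    "team": ["Arsenal", "Arsenal", "Chelsea", "Chelsea", "Arsenal"],
    "goals": [100, 2, 21, 13, 50],
})

N_BINS = 10  # assumption: ten positions, with 100 folded into the last one

# Bin index per player: 0-9 -> 0, 10-19 -> 1, ..., 90 and above -> 9.
bin_idx = (df["goals"] // 10).clip(upper=N_BINS - 1)

# One 0/1 column per bin; reindex adds zero-filled columns for unused bins.
dummies = pd.get_dummies(bin_idx, dtype=int).reindex(
    columns=range(N_BINS), fill_value=0
)

# Broadcast each team's per-bin maximum back to every row of that team.
flags = dummies.groupby(df["team"]).transform("max")
df["goals_distribution_by_team"] = flags.to_numpy().tolist()

print(df)
```

With the sample data this should reproduce the table from the question, e.g. [1, 0, 0, 0, 0, 1, 0, 0, 0, 1] for the Arsenal rows.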

CodePudding user response：

One way is to group the dataframe by team, then apply a function to get those ones and zeroes as lists:

out = (df.groupby('team')['goals']
        .apply(lambda x: [int((i==x//10).any()) for i in range(11)])
         )

team
Arsenal    [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
Chelsea    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Name: goals, dtype: object

Once you have it, you can merge it back into the original dataframe:

df.merge(out.rename('goals_distribution_by_team'), on='team')

  player_id     team  goals         goals_distribution_by_team
0     ply_1  Arsenal    100  [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
1     ply_2  Arsenal      2  [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
2     ply_5  Arsenal     50  [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
3     ply_3  Chelsea     21  [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
4     ply_4  Chelsea     13  [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
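
If the original row order matters, the per-team lists can also be looked up with Series.map instead of a merge. This is only a sketch of that alternative, not part of the answer above; it rebuilds the same per-team Series first so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "player_id": ["ply_1", "ply_2", "ply_3", "ply_4", "ply_5"],
    "team": ["Arsenal", "Arsenal", "Chelsea", "Chelsea", "Arsenal"],
    "goals": [100, 2, 21, 13, 50],
})

# Per-team list of 0/1 flags, one per bin 0..10 (same idea as `out` above).
out = df.groupby("team")["goals"].apply(
    lambda x: [int((x // 10 == i).any()) for i in range(11)]
)

# Look up each row's team in `out`; rows stay in their original order.
df["goals_distribution_by_team"] = df["team"].map(out)

print(df)
```

Because map only looks values up by team name, the index and the order of player_id are left untouched.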