Home > OS >  Merge column values as a list in DataFrames based on a condition
Merge column values as a list in DataFrames based on a condition

Time:04-09

I have a DataFrame like

| Item | Match | Score |
| ---- | ----- | ----- |
| A    | B     | 0.9   |
| A    | C     | 0.8   |
| A    | D     | 0.7   |
| A    | E     | 0.6   |
| B    | A     | 0.9   |
| B    | X     | 0.7   |
| C    | Y     | 0.45  |
| C    | Z     | 0.4   |

I want to create a new DataFrame which has all the matches for column - 'Item' as a list for Match, Score.

| Item | MatchList | ScoreList       | Total Matches |
| ---- | --------- | --------------- | ------------- |
| A    | [B,C,D]   | [0.9,0.8,0.7]   | 4             |
| B    | [A,X]     | [0.9,0.7]       | 2             |
| C    | [Y,Z]     | [0.45,0.4]      | 2             |

I also want to know number of such matches but only keep the top 3 based on score. I have already sorted the values of score for each item in descending order. Now, is it possible to get a list created like this for Match, Score and also have the total matches so in case someone wants to look for more than top3 for an item they can do that based on how many matches an item has.

I posted an answer too that is using aggregate. Open to more suggestions.

CodePudding user response:

We can do two groupby , 1st get the top 3 line , 2nd get the agg format

out = df.sort_values('Score',ascending=False).\
            groupby('Item').head(3).\
                groupby('Item').\
                   agg(MatchList = ('Match', list), 
                       ScoreList = ('Score', list), 
                       TotalMatches = ('Score', 'count')).reset_index()
Out[172]: 
  Item  MatchList        ScoreList  TotalMatches
0    A  [B, C, D]  [0.9, 0.8, 0.7]             3
1    B     [A, X]       [0.9, 0.7]             2
2    C     [Y, Z]      [0.45, 0.4]             2

CodePudding user response:

I was able to get the list using aggregate function as:

df_result = df.groupby(['Item']).agg(lambda x: x.tolist())

I can now count the values in the list to get the final column and then filter top 3 for 'match' and 'score'.

  • Related