How to merge rows into one selecting best data from each row?

Time:11-09

I am trying to write a function that goes through a dataframe. The df has a numerical column, let's call it "matched", which identifies the rows that I need to merge, so it looks something like this:

Id  matched  name    animals  flying  hierarchy
1   1        peter   cow      yes     1
2   1        pedro            no      2
3   2        angel   dog      yes     1
4   3        joshua  cat      no      3
5   3        harry            no      1
6   3        senna   bird             2
7   4        maria            no      2
8   4        juan             no      3
9   4        luis    lama     yes     2

The two most important columns are "matched" and "hierarchy": we group on "matched" to find the rows to merge, and "hierarchy" tells us which row has the best data. I would like to have these groups:

Id  matched  name    animals  flying  hierarchy
1   1        peter   cow      yes     1
2   1        pedro            no      2

Id  matched  name    animals  flying  hierarchy
3   2        angel   dog      yes     1

Id  matched  name    animals  flying  hierarchy
4   3        joshua  cat      no      3
5   3        harry            no      1
6   3        senna   bird             2

Id  matched  name    animals  flying  hierarchy
7   4        maria            no      2
8   4        juan             no      3
9   4        luis    lama     yes     2

And in the end I would like this outcome for my data:

Id  matched  name    animals  flying  hierarchy
1   1        peter   cow      yes     1

Id  matched  name    animals  flying  hierarchy
3   2        angel   dog      yes     1

Id  matched  name    animals  flying  hierarchy
5   3        harry   bird     no      1

Id  matched  name    animals  flying  hierarchy
7   4        maria            no      2
9   4        luis    lama     yes     2

It should all end up in one df, so the outcome should really look like this:

Id  matched  name    animals  flying  hierarchy
1   1        peter   cow      yes     1
3   2        angel   dog      yes     1
5   3        harry   bird     no      1
7   4        maria            no      2
9   4        luis    lama     yes     2

As you can see, I want one row per "matched" group, unless two or more rows share the best possible hierarchy in that group. The hierarchy is an int column: 1 is the best, then 2, then 3, which is the worst value. Being the worst doesn't mean I want to remove the row, but it means its data should be replaced if something better exists for that group. Keeping empty values is not a problem.

I would prefer a solution that doesn't use pandas iterrows(), since it is a big dataframe and it would be really slow to go through the links.

I tried looping through the columns, but it is either way too slow or straight up doesn't work as I expected.

CodePudding user response:

You can use:

# is the hierarchy equal to the min per group?
m = df.groupby('matched')['hierarchy'].transform('min').eq(df['hierarchy'])

# keep only the rows where the condition holds
out = df[m]

output:

   Id  matched   name animals flying  hierarchy
0   1        1  peter     cow    yes          1
2   3        2  angel     dog    yes          1
4   5        3  harry     NaN     no          1
6   7        4  maria     NaN     no          2
8   9        4   luis    lama    yes          2

CodePudding user response:

You can try with a join after using groupby():

output = df.merge(df.groupby('matched', as_index=False)['hierarchy'].min(),
                  how='inner',
                  on=['matched', 'hierarchy'])

Returning:

   Id  matched   name animals flying  hierarchy
0   1        1  peter     cow    yes          1
1   3        2  angel     dog    yes          1
2   5        3  harry     NaN     no          1
3   7        4  maria     NaN     no          2
4   9        4   luis    lama    yes          2
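
Note that both answers above keep the best-hierarchy rows but leave harry's animals as NaN, while the expected output in the question shows bird. Below is a minimal sketch of one way to also borrow missing values from lower-ranked rows of the same group. It assumes the blank cells are missing values (None/NaN) and that only "animals" and "flying" need filling; the names is_best and fallback are just illustrative.

```python
import pandas as pd

# Sample data from the question; blank cells are assumed to be missing (None).
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "matched": [1, 1, 2, 3, 3, 3, 4, 4, 4],
    "name": ["peter", "pedro", "angel", "joshua", "harry",
             "senna", "maria", "juan", "luis"],
    "animals": ["cow", None, "dog", "cat", None, "bird", None, None, "lama"],
    "flying": ["yes", "no", "yes", "no", "no", None, "no", "no", "yes"],
    "hierarchy": [1, 2, 1, 3, 1, 2, 2, 3, 2],
})

# Rows whose hierarchy equals the best (lowest) value in their group.
is_best = df.groupby("matched")["hierarchy"].transform("min").eq(df["hierarchy"])
out = df[is_best].copy()

# From the remaining rows, take the first non-null value per column,
# ordered from better to worse hierarchy (GroupBy.first skips missing values).
fallback = (
    df[~is_best]
    .sort_values("hierarchy", kind="stable")
    .groupby("matched")
    .first()
)

# Fill gaps in the kept rows with the fallback value of the same group.
for col in ["animals", "flying"]:  # assumed set of columns worth filling
    out[col] = out[col].fillna(out["matched"].map(fallback[col]))

print(out)
```

With the sample data this should give harry the value bird (taken from senna, hierarchy 2, ahead of joshua's cat, hierarchy 3), while maria's animals stays empty because rows tied for the best hierarchy are not used to fill each other. No iterrows() is involved; the loop only runs over the column names.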