How to merge rows into one, selecting the best data from each row?

I am trying to write a function that goes through a dataframe. The df has a numerical column, let's call it "matched", which identifies the rows that I need to merge. It looks something like this:

Id	matched	name	animals	flying	hierarchy
1	1	peter	cow	yes	1
2	1	pedro		no	2
3	2	angel	dog	yes	1
4	3	joshua	cat	no	3
5	3	harry		no	1
6	3	senna	bird		2
7	4	maria		no	2
8	4	juan		no	3
9	4	luis	lama	yes	2

The two most important columns are "matched" and "hierarchy": we group on "matched" to decide which rows to merge, and "hierarchy" tells us which row holds the best data. I would like to have these groups:

Id	matched	name	animals	flying	hierarchy
1	1	peter	cow	yes	1
2	1	pedro		no	2

Id	matched	name	animals	flying	hierarchy
3	2	angel	dog	yes	1

Id	matched	name	animals	flying	hierarchy
4	3	joshua	cat	no	3
5	3	harry		no	1
6	3	senna	bird		2

Id	matched	name	animals	flying	hierarchy
7	4	maria		no	2
8	4	juan		no	3
9	4	luis	lama	yes	2

And in the end I would like this outcome for my data:

Id	matched	name	animals	flying	hierarchy
1	1	peter	cow	yes	1

Id	matched	name	animals	flying	hierarchy
3	2	angel	dog	yes	1

Id	matched	name	animals	flying	hierarchy
5	3	harry	bird	no	1

Id	matched	name	animals	flying	hierarchy
7	4	maria		no	2
9	4	luis	lama	yes	2

It should all end up in one df, so the outcome should really look like this:

Id	matched	name	animals	flying	hierarchy
1	1	peter	cow	yes	1
3	2	angel	dog	yes	1
5	3	harry	bird	no	1
7	4	maria		no	2
9	4	luis	lama	yes	2

As you can see, I want one row per "matched" group, unless two or more rows share the best hierarchy in that group. The hierarchy is an int column: 1 is the best value, then 2, and 3 is the worst. Being the worst doesn't mean I want to remove that data, only that it should be replaced if something better exists in the group. Keeping empty values is not a problem.

I would prefer a solution that doesn't use pandas iterrows(), since this is a big dataframe and going through it row by row would be really slow.

I tried looping through the columns, but that is either way too slow or straight up doesn't work as I expected.

CodePudding user response：

You can use:

# is the hierarchy equal to the min per group?
m = df.groupby('matched')['hierarchy'].transform('min').eq(df['hierarchy'])

# keep the matching conditions
out = df[m]

Output:

   Id  matched   name animals flying  hierarchy
0   1        1  peter     cow    yes          1
2   3        2  angel     dog    yes          1
4   5        3  harry     NaN     no          1
6   7        4  maria     NaN     no          2
8   9        4   luis    lama    yes          2
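Note that the output above leaves harry's animals as NaN, while the desired outcome in the question shows bird there. A minimal sketch of one way to close that gap, under the assumption that a blank in a best row should be filled only from rows with a strictly worse hierarchy in the same group, taking the best of those first (this reading reproduces the desired table, including maria's empty cell, but the question does not spell the rule out). The names value_cols, is_best and fallback are illustrative, not part of the original answer:

```python
import pandas as pd

# Sample data from the question; blanks are assumed to be missing values.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "matched": [1, 1, 2, 3, 3, 3, 4, 4, 4],
    "name": ["peter", "pedro", "angel", "joshua", "harry",
             "senna", "maria", "juan", "luis"],
    "animals": ["cow", None, "dog", "cat", None, "bird", None, None, "lama"],
    "flying": ["yes", "no", "yes", "no", "no", None, "no", "no", "yes"],
    "hierarchy": [1, 2, 1, 3, 1, 2, 2, 3, 2],
})

# Columns whose blanks may be filled (an assumption about the real data).
value_cols = ["name", "animals", "flying"]

# Same mask as in the answer: rows holding the best (lowest) hierarchy.
is_best = df["hierarchy"].eq(df.groupby("matched")["hierarchy"].transform("min"))

# Among the remaining rows, ordered by hierarchy, take the first
# non-missing value per column and group (GroupBy.first skips missing values).
fallback = (
    df[~is_best]
    .sort_values("hierarchy", kind="stable")
    .groupby("matched")[value_cols]
    .first()
)

# Keep the best rows and fill their blanks from the fallback values.
out = df[is_best].copy()
for col in value_cols:
    out[col] = out[col].fillna(out["matched"].map(fallback[col]))
out = out.reset_index(drop=True)

print(out)
```

This stays vectorised (no iterrows()); the only Python-level loop runs over the handful of column names, not over the rows.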

CodePudding user response：

You can try with a join after using groupby():

output = df.merge(df.groupby('matched', as_index=False)['hierarchy'].min(),
                  how='inner',
                  on=['matched', 'hierarchy'])

Returning:

   Id  matched   name animals flying  hierarchy
0   1        1  peter     cow    yes          1
1   3        2  angel     dog    yes          1
2   5        3  harry     NaN     no          1
3   7        4  maria     NaN     no          2
4   9        4   luis    lama    yes          2
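To try this on the sample data, the frame from the question can be rebuilt by hand. The following is only a sketch: it assumes the blank cells are stored as missing values, and via_mask and via_merge are illustrative names. It compares the merge above with a boolean mask on the group minimum; both should pick the same rows, and the practical difference is the index.

```python
import pandas as pd

# Sample data from the question; blanks are assumed to be missing values.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "matched": [1, 1, 2, 3, 3, 3, 4, 4, 4],
    "name": ["peter", "pedro", "angel", "joshua", "harry",
             "senna", "maria", "juan", "luis"],
    "animals": ["cow", None, "dog", "cat", None, "bird", None, None, "lama"],
    "flying": ["yes", "no", "yes", "no", "no", None, "no", "no", "yes"],
    "hierarchy": [1, 2, 1, 3, 1, 2, 2, 3, 2],
})

# Boolean mask: keep rows whose hierarchy equals the group minimum.
group_min = df.groupby("matched")["hierarchy"].transform("min")
via_mask = df[df["hierarchy"].eq(group_min)]

# Inner join against the per-group minimum, as in the answer above.
best = df.groupby("matched", as_index=False)["hierarchy"].min()
via_merge = df.merge(best, how="inner", on=["matched", "hierarchy"])

# Both approaches select the same rows.
same_rows = via_mask["Id"].tolist() == via_merge["Id"].tolist()

# The mask keeps the original index labels; the merge returns a fresh index.
print(via_mask.index.tolist())   # [0, 2, 4, 6, 8]
print(via_merge.index.tolist())  # [0, 1, 2, 3, 4]
```

If the original row labels matter later on (for example to write results back into df), the mask version is the more convenient of the two; otherwise calling reset_index(drop=True) on it gives the same layout as the merge.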