I am trying to write a function that goes through a dataframe. The df has a numerical column, let's call it "matched", which identifies the rows that I need to merge. It looks something like this:

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
1 | 1 | peter | cow | yes | 1 |
2 | 1 | pedro | | no | 2 |
3 | 2 | angel | dog | yes | 1 |
4 | 3 | joshua | cat | no | 3 |
5 | 3 | harry | | no | 1 |
6 | 3 | senna | bird | | 2 |
7 | 4 | maria | | no | 2 |
8 | 4 | juan | | no | 3 |
9 | 4 | luis | lama | yes | 2 |

The two most important columns are "matched" and "hierarchy": "matched" defines the groups of rows to merge, and "hierarchy" tells us which row holds the best data. I would like to have these groups:

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
1 | 1 | peter | cow | yes | 1 |
2 | 1 | pedro | | no | 2 |

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
3 | 2 | angel | dog | yes | 1 |

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
4 | 3 | joshua | cat | no | 3 |
5 | 3 | harry | | no | 1 |
6 | 3 | senna | bird | | 2 |

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
7 | 4 | maria | | no | 2 |
8 | 4 | juan | | no | 3 |
9 | 4 | luis | lama | yes | 2 |

And in the end I would like this outcome for my data:

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
1 | 1 | peter | cow | yes | 1 |

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
3 | 2 | angel | dog | yes | 1 |

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
5 | 3 | harry | bird | no | 1 |

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
7 | 4 | maria | | no | 2 |
9 | 4 | luis | lama | yes | 2 |

It should all end up in one df, so the outcome should really look like this:

Id | matched | name | animals | flying | hierarchy |
---|---|---|---|---|---|
1 | 1 | peter | cow | yes | 1 |
3 | 2 | angel | dog | yes | 1 |
5 | 3 | harry | bird | no | 1 |
7 | 4 | maria | | no | 2 |
9 | 4 | luis | lama | yes | 2 |

As you can see, I want one row per "matched" group, unless two or more rows share the best hierarchy in that group. The hierarchy is an int column: 1 is the best, then 2, and 3 is the worst. Being the worst doesn't mean I want to remove the value; it means it should be replaced if something better exists in that group. Keeping empty values is not a problem.

I would prefer a solution that doesn't use pandas iterrows(), since this is a big dataframe and iterating over it would run really slowly.

I tried looping through the columns, but it is either way too slow or doesn't work as I expected.
CodePudding user response:
You can use:

    # is the hierarchy equal to the min per group?
    m = df.groupby('matched')['hierarchy'].transform('min').eq(df['hierarchy'])
    # keep the rows that match the condition
    out = df[m]

output:

       Id  matched   name animals flying  hierarchy
    0   1        1  peter     cow    yes          1
    2   3        2  angel     dog    yes          1
    4   5        3  harry     NaN     no          1
    6   7        4  maria     NaN     no          2
    8   9        4   luis    lama    yes          2

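For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample `df` is reconstructed from the table in the question, and it assumes the empty cells are missing values (`NaN`); adjust if your real data stores them differently.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table (empty cells assumed to be NaN).
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "matched": [1, 1, 2, 3, 3, 3, 4, 4, 4],
    "name": ["peter", "pedro", "angel", "joshua", "harry",
             "senna", "maria", "juan", "luis"],
    "animals": ["cow", np.nan, "dog", "cat", np.nan,
                "bird", np.nan, np.nan, "lama"],
    "flying": ["yes", "no", "yes", "no", "no", np.nan, "no", "no", "yes"],
    "hierarchy": [1, 2, 1, 3, 1, 2, 2, 3, 2],
})

# Broadcast each group's best (lowest) hierarchy back onto every row,
# then compare row by row. No Python-level loop over rows is involved.
m = df.groupby("matched")["hierarchy"].transform("min").eq(df["hierarchy"])

# Keep every row that ties for the best hierarchy in its group.
out = df[m]
print(out)
```

Because the comparison keeps all rows equal to the group minimum, ties such as maria and luis (both hierarchy 2) are both retained.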
CodePudding user response:
You can try a join after using groupby():

    output = df.merge(df.groupby('matched', as_index=False)['hierarchy'].min(),
                      how='inner',
                      on=['matched', 'hierarchy'])

Returning:

       Id  matched   name animals flying  hierarchy
    0   1        1  peter     cow    yes          1
    1   3        2  angel     dog    yes          1
    2   5        3  harry     NaN     no          1
    3   7        4  maria     NaN     no          2
    4   9        4   luis    lama    yes          2

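Note that the outputs above leave harry's `animals` as `NaN`, while the desired outcome in the question shows `bird` (taken from senna, the next-best row in that group). If that fill-in matters, one possible sketch is below. It rests on an assumption about the intended rule: a best row only borrows missing values from rows with a strictly worse hierarchy in the same group, which is why maria's `animals` stays empty. The sample `df` and the names `is_best`, `fallback` and `value_cols` are illustrative, not from the original posts.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table (empty cells assumed to be NaN).
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "matched": [1, 1, 2, 3, 3, 3, 4, 4, 4],
    "name": ["peter", "pedro", "angel", "joshua", "harry",
             "senna", "maria", "juan", "luis"],
    "animals": ["cow", np.nan, "dog", "cat", np.nan,
                "bird", np.nan, np.nan, "lama"],
    "flying": ["yes", "no", "yes", "no", "no", np.nan, "no", "no", "yes"],
    "hierarchy": [1, 2, 1, 3, 1, 2, 2, 3, 2],
})

value_cols = ["animals", "flying"]  # columns allowed to borrow values

# Rows that tie for the best (lowest) hierarchy in their group.
is_best = df.groupby("matched")["hierarchy"].transform("min").eq(df["hierarchy"])

# For the remaining rows, take the first non-null value per column and group,
# visiting rows from better to worse hierarchy (GroupBy.first skips NaN).
fallback = (
    df[~is_best]
    .sort_values("hierarchy", kind="stable")
    .groupby("matched")[value_cols]
    .first()
)

out = df[is_best].copy()
for col in value_cols:
    # Look up each best row's group in the fallback table; groups with no
    # worse rows (or no usable value) simply stay NaN.
    out[col] = out[col].fillna(out["matched"].map(fallback[col]))

print(out)
```

On the sample data this keeps the same five rows as the answers above, with harry's `animals` filled from senna. If your rule for borrowing values is different (for example, also borrowing from rows with an equal hierarchy), the `fallback` step is the part to change.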