I have a pandas dataframe like this:
Id A B C D
1 a b c d
2 a b d
2 a c d
3 a d
3 a b c
I want to aggregate the empty values for the columns B-C and D, using the values contained in the other rows, by using the information for the same Id.
The resulting data frame should be the following:
Id A B C D
1 a b c d
2 a b c d
3 a b c d
There can be the possibility to have different values in the first column (A), for the same Id. In this case instead of putting the first instance I prefer to put another value indicating this event.
So for e.g.
Id A B C D
1 a b c d
2 a b d
2 x c d
It becomes:
Id A B C D
1 a b c d
2 f b c d
CodePudding user response:
IIUC, you can use groupby_last
:
>>> df.groupby('Id').last()
A B C D
Id
1 a b c d
2 a b c d
3 a b c d
>>> df.groupby('Id').last()
A B C D
Id
1 a b c d
2 x b c d
You can use another function like first
, sum
, mean
or a custom function with apply
but the idea still stay the same.
CodePudding user response:
The best way I can think to do this is to iterate through each unique Id
, slicing it out of the original dataframe, and constructing a new row as a product of merging the relevant rows:
def aggregate(df):
ids = df['Id'].unique()
rows = []
for id in ids:
relevant = df[df['Id'] == id]
newrow = {c: "" for c in df.columns}
for _, row in relevant.iterrows():
for col in newrow:
if row[col]:
if len(newrow[col]):
if newrow[col][-1] == row[col]:
continue
newrow[col] = row[col]
rows.append(newrow)
return pd.DataFrame(rows)