I have data with Spouses on separate rows, but each spouse shares the same ID. These IDs, in some cases, are on several rows. When IDs match, I need to move the spouse row to a column, so both spouses share one row. I will also then need to sum values.
Input
ID Position Title First Last SpTitle SpFirst SpLast Address Value1 Value2 Value3
0 456 HoH Mr. John Doe NaN NaN NaN 123 street 10 NaN 30
1 456 Spouse Mrs. Jane Doe NaN NaN NaN 123 street 10 NaN 30
2 789 HoH Mrs. Jane Doe NaN NaN NaN 456 road 100 200 300
3 789 HoH Mrs. Jane Doe NaN NaN NaN 456 road 400 500 600
4 789 Spouse Mr. John Doe NaN NaN NaN 456 road NaN 10 30
Desired Output
ID Position Title First Last SpTitle SpFirst SpLast Address Value1 Value2 Value3
0 456 HoH Mr. John Doe Mrs. Jane Doe 123 street 20 NaN 60
1 789 HoH Mrs. Jane Doe Mr. John Doe 456 road 500 710 930
There are thousands of rows. Some IDs are duplicated over dozens of rows. I think it would use groupby
and agg
, but I can't seem to get it working correctly.
When an ID matches, how can I move the spouse to share the same row as the Head of household (HoH) while summing values?
CodePudding user response:
Try:
m = df.Position.eq("Spouse")
df.loc[m, ["SpTitle", "SpFirst", "SpLast"]] = df.loc[
m, ["Title", "First", "Last"]
].values
df[["Value1", "Value2", "Value3"]] = df.groupby("ID", as_index=False)[
["Value1", "Value2", "Value3"]
].transform(np.sum, min_count=1)
df[["SpTitle", "SpFirst", "SpLast"]] = df.groupby("ID", as_index=False)[
["SpTitle", "SpFirst", "SpLast"]
].transform(lambda x: x.ffill().bfill())
df = df[~m].drop_duplicates()
print(df)
Prints:
ID Position Title First Last SpTitle SpFirst SpLast Address Value1 Value2 Value3
0 456 HoH Mr. John Doe Mrs. Jane Doe 123 street 20.0 NaN 60.0
2 789 HoH Mrs. Jane Doe Mr. John Doe 456 road 500.0 710.0 930.0
CodePudding user response:
You can do different aggregations on your columns after a groupby on ID
and in the end you fill the values for rows with spouses
into the aggregated output.
df = df.set_index('ID')
spouses = df.loc[df['Position'].eq('Spouse'), ['Title', 'First', 'Last']].values
agg_dict = {col : 'sum' if col in ['Value1', 'Value2', 'Value3'] else 'first' for col in df.columns.tolist()}
out = df.groupby(level=0).agg(agg_dict).reset_index()
out.loc[:, ['SpTitle', 'SpFirst', 'SpLast']] = spouses
print(out)
Output:
ID Position Title First Last SpTitle SpFirst SpLast Address Value1 Value2 Value3
0 456 HoH Mr. John Doe Mrs. Jane Doe 123 street 20.0 0.0 60
1 789 HoH Mrs. Jane Doe Mr. John Doe 456 road 500.0 710.0 930