In python, if ID matches, move row to column while summing specific columns-CodePudding

I have data with Spouses on separate rows, but each spouse shares the same ID. These IDs, in some cases, are on several rows. When IDs match, I need to move the spouse row to a column, so both spouses share one row. I will also then need to sum values.

Input

   ID  Position  Title  First  Last  SpTitle  SpFirst  SpLast  Address    Value1  Value2  Value3
0  456 HoH       Mr.    John   Doe   NaN      NaN      NaN     123 street  10      NaN     30
1  456 Spouse    Mrs.   Jane   Doe   NaN      NaN      NaN     123 street  10      NaN     30
2  789 HoH       Mrs.   Jane   Doe   NaN      NaN      NaN     456 road    100     200     300
3  789 HoH       Mrs.   Jane   Doe   NaN      NaN      NaN     456 road    400     500     600
4  789 Spouse    Mr.    John   Doe   NaN      NaN      NaN     456 road    NaN     10      30

Desired Output

   ID  Position  Title  First  Last  SpTitle  SpFirst  SpLast  Address    Value1  Value2  Value3
0  456 HoH       Mr.    John   Doe   Mrs.     Jane     Doe     123 street  20      NaN     60
1  789 HoH       Mrs.   Jane   Doe   Mr.      John     Doe     456 road    500     710     930

There are thousands of rows. Some IDs are duplicated over dozens of rows. I think it would use groupby and agg, but I can't seem to get it working correctly.

When an ID matches, how can I move the spouse to share the same row as the Head of household (HoH) while summing values?

CodePudding user response：

Try:

m = df.Position.eq("Spouse")

df.loc[m, ["SpTitle", "SpFirst", "SpLast"]] = df.loc[
    m, ["Title", "First", "Last"]
].values

df[["Value1", "Value2", "Value3"]] = df.groupby("ID", as_index=False)[
    ["Value1", "Value2", "Value3"]
].transform(np.sum, min_count=1)

df[["SpTitle", "SpFirst", "SpLast"]] = df.groupby("ID", as_index=False)[
    ["SpTitle", "SpFirst", "SpLast"]
].transform(lambda x: x.ffill().bfill())

df = df[~m].drop_duplicates()

print(df)

Prints:

    ID Position Title First Last SpTitle SpFirst SpLast     Address  Value1  Value2  Value3
0  456      HoH   Mr.  John  Doe    Mrs.    Jane    Doe  123 street    20.0     NaN    60.0
2  789      HoH  Mrs.  Jane  Doe     Mr.    John    Doe    456 road   500.0   710.0   930.0

CodePudding user response：

You can do different aggregations on your columns after a groupby on ID and in the end you fill the values for rows with spouses into the aggregated output.

df = df.set_index('ID')
spouses = df.loc[df['Position'].eq('Spouse'), ['Title', 'First', 'Last']].values
agg_dict = {col : 'sum' if col in ['Value1', 'Value2', 'Value3'] else 'first' for col in df.columns.tolist()}

out = df.groupby(level=0).agg(agg_dict).reset_index()
out.loc[:, ['SpTitle', 'SpFirst', 'SpLast']] = spouses

print(out)

Output:

    ID Position Title First Last SpTitle SpFirst SpLast     Address  Value1  Value2  Value3
0  456      HoH   Mr.  John  Doe    Mrs.    Jane    Doe  123 street    20.0     0.0      60
1  789      HoH  Mrs.  Jane  Doe     Mr.    John    Doe    456 road   500.0   710.0     930