Home > database >  merge rows into new column value
merge rows into new column value

Time:11-23

I am taking a df that is all dup value pairs and then from the 2nd row take the 2nd column value and add it to the first row in a new column called 'new_amt' then inserting NaN for the second row and new third column. After I'll drop all row that contain NaN.

so the dataframe look like this:

ref_num Amt fy fund_type
row 1 1 10 21 IX
row 2 1 20 21 IX
row 3 2 5 22 III
row 4 2 15 22 III
row 5 3 12 20 VI
row 6 3 7 20 VI

after it should look like this:

ref_num Amt new_Amt fy fund_type
row 1 1 10 20 21 IX
row 2 1 20 NaN 21 IX
row 3 2 5 15 22 III
row 4 2 15 NaN 22 III
row 5 3 12 7 20 VI
row 6 3 7 NaN 20 VI

I thought a lambda function could work where I'd have the else statement return NaN for all the second dup rows but I could figure out the syntax.

df['new_Amt'] = df.apply(lambda x : x['Amt'] if x['ref_num'] == x['ref_num'] else x['new_Amt'] is NaN)

CodePudding user response:

Why not do both operations at once (resolve duplicates as you describe and drop the redundant rows)?

k = 'ref_num'
newdf = df.drop_duplicates(subset=k, keep='first').merge(
    df.drop_duplicates(subset=k, keep='last'), on='ref_num', suffixes=('', '_new'))
>>> newdf
   ref_num  Amt  Amt_new
0        1   10       20
1        2    5       15
2        3   12        7

Another possibility:

gb = df.groupby('ref_num')['Amt']
newdf = pd.concat([gb.first(), gb.last()], axis=1, keys=['Amt', 'new_Amt']).reset_index()
>>> newdf
   ref_num  Amt  new_Amt
0        1   10       20
1        2    5       15
2        3   12        7

Note: in your question it is not clear if 'row 1', 'row 2' etc. are indices, meant to be kept or not, etc. If they are desired in the final output, please let us know if and how they should appear.

Addendum: what if df has more columns?

Here is a way to keep the whole "first" rows, and only add the column new_Amt:

gb = df.groupby('ref_num')
newdf = pd.concat([gb.first(), gb['Amt'].last().to_frame('new_Amt')], axis=1).reset_index()

Example:

df = df.rename_axis(index='foo').reset_index()

# code above

>>> newdf
   ref_num    foo  Amt  new_Amt
0        1  row 1   10       20
1        2  row 3    5       15
2        3  row 5   12        7
  • Related