merge rows into new column value-CodePudding

I am taking a df that is all dup value pairs and then from the 2nd row take the 2nd column value and add it to the first row in a new column called 'new_amt' then inserting NaN for the second row and new third column. After I'll drop all row that contain NaN.

so the dataframe look like this:

	ref_num	Amt	fy	fund_type
row 1	1	10	21	IX
row 2	1	20	21	IX
row 3	2	5	22	III
row 4	2	15	22	III
row 5	3	12	20	VI
row 6	3	7	20	VI

after it should look like this:

	ref_num	Amt	new_Amt	fy	fund_type
row 1	1	10	20	21	IX
row 2	1	20	NaN	21	IX
row 3	2	5	15	22	III
row 4	2	15	NaN	22	III
row 5	3	12	7	20	VI
row 6	3	7	NaN	20	VI

I thought a lambda function could work where I'd have the else statement return NaN for all the second dup rows but I could figure out the syntax.

df['new_Amt'] = df.apply(lambda x : x['Amt'] if x['ref_num'] == x['ref_num'] else x['new_Amt'] is NaN)

CodePudding user response：

Why not do both operations at once (resolve duplicates as you describe and drop the redundant rows)?

k = 'ref_num'
newdf = df.drop_duplicates(subset=k, keep='first').merge(
    df.drop_duplicates(subset=k, keep='last'), on='ref_num', suffixes=('', '_new'))
>>> newdf
   ref_num  Amt  Amt_new
0        1   10       20
1        2    5       15
2        3   12        7

Another possibility:

gb = df.groupby('ref_num')['Amt']
newdf = pd.concat([gb.first(), gb.last()], axis=1, keys=['Amt', 'new_Amt']).reset_index()
>>> newdf
   ref_num  Amt  new_Amt
0        1   10       20
1        2    5       15
2        3   12        7

Note: in your question it is not clear if 'row 1', 'row 2' etc. are indices, meant to be kept or not, etc. If they are desired in the final output, please let us know if and how they should appear.

Addendum: what if df has more columns?

Here is a way to keep the whole "first" rows, and only add the column new_Amt:

gb = df.groupby('ref_num')
newdf = pd.concat([gb.first(), gb['Amt'].last().to_frame('new_Amt')], axis=1).reset_index()

Example:

df = df.rename_axis(index='foo').reset_index()

# code above

>>> newdf
   ref_num    foo  Amt  new_Amt
0        1  row 1   10       20
1        2  row 3    5       15
2        3  row 5   12        7