I am taking a df that is all dup value pairs and then from the 2nd row take the 2nd column value and add it to the first row in a new column called 'new_amt' then inserting NaN for the second row and new third column. After I'll drop all row that contain NaN.
so the dataframe look like this:
ref_num | Amt | fy | fund_type | |
---|---|---|---|---|
row 1 | 1 | 10 | 21 | IX |
row 2 | 1 | 20 | 21 | IX |
row 3 | 2 | 5 | 22 | III |
row 4 | 2 | 15 | 22 | III |
row 5 | 3 | 12 | 20 | VI |
row 6 | 3 | 7 | 20 | VI |
after it should look like this:
ref_num | Amt | new_Amt | fy | fund_type | |
---|---|---|---|---|---|
row 1 | 1 | 10 | 20 | 21 | IX |
row 2 | 1 | 20 | NaN | 21 | IX |
row 3 | 2 | 5 | 15 | 22 | III |
row 4 | 2 | 15 | NaN | 22 | III |
row 5 | 3 | 12 | 7 | 20 | VI |
row 6 | 3 | 7 | NaN | 20 | VI |
I thought a lambda function could work where I'd have the else statement return NaN for all the second dup rows but I could figure out the syntax.
df['new_Amt'] = df.apply(lambda x : x['Amt'] if x['ref_num'] == x['ref_num'] else x['new_Amt'] is NaN)
CodePudding user response:
Why not do both operations at once (resolve duplicates as you describe and drop the redundant rows)?
k = 'ref_num'
newdf = df.drop_duplicates(subset=k, keep='first').merge(
df.drop_duplicates(subset=k, keep='last'), on='ref_num', suffixes=('', '_new'))
>>> newdf
ref_num Amt Amt_new
0 1 10 20
1 2 5 15
2 3 12 7
Another possibility:
gb = df.groupby('ref_num')['Amt']
newdf = pd.concat([gb.first(), gb.last()], axis=1, keys=['Amt', 'new_Amt']).reset_index()
>>> newdf
ref_num Amt new_Amt
0 1 10 20
1 2 5 15
2 3 12 7
Note: in your question it is not clear if 'row 1'
, 'row 2'
etc. are indices, meant to be kept or not, etc. If they are desired in the final output, please let us know if and how they should appear.
Addendum: what if df
has more columns?
Here is a way to keep the whole "first" rows, and only add the column new_Amt
:
gb = df.groupby('ref_num')
newdf = pd.concat([gb.first(), gb['Amt'].last().to_frame('new_Amt')], axis=1).reset_index()
Example:
df = df.rename_axis(index='foo').reset_index()
# code above
>>> newdf
ref_num foo Amt new_Amt
0 1 row 1 10 20
1 2 row 3 5 15
2 3 row 5 12 7