I'm trying to figure out the most efficient way to join two dataframes such as below.
I've tried pd.merge and maybe using the rank function but cannot seem to figure a way.
Thanks in advance
df1
| A | B | C |
| -------- | -------------- |------------|
| TBK1 | 2022-01-01 |2022-04-04 |
| TBK1 | 2022-02-02 |2021-01-09 |
| TBK3 | 2022-05-07 |2023-02-04 |
What I'm trying to achieve is this
df2
| A | B | C | D | E |
| -------- | -------------- |------------|----------------|------------|
| TBK1 | 2022-01-01 |2022-04-04 | 2022-02-02 |2021-01-09 |
| TBK3 | 2022-05-07 |2023-02-04 |NaN |NaN |
CodePudding user response:
You might want to use groupby
with unstack
as advised in this answer:
import pandas as pd
from string import ascii_uppercase
# Reproduce the data
df = pd.DataFrame()
df['A'] = ['TBK1','TBK1', 'TBK3']
df['B'] = ['2022-01-01' , '2022-02-02', '2022-05-07']
df['C'] = ['2022-04-04', '2021-01-09', '2023-02-04']
# Count how many rows exists per unique entry
s = df.groupby(['A']).cumcount()
# Unstack
df1 = df.set_index(['A', s]).unstack().sort_index(level=1, axis=1)
# Rename columns
df1.columns = [l for l in ascii_uppercase[1:len(df1.columns) 1]]
# Flatten columns names (aesthetics)
df1 = df1.reset_index()
print(df1)
A B C D E
0 TBK1 2022-01-01 2022-04-04 2022-02-02 2021-01-09
1 TBK3 2022-05-07 2023-02-04 NaN NaN