Home > Software design >  what would be the most efficient way to do this in pandas
what would be the most efficient way to do this in pandas

Time:06-02

I'm trying to figure out the most efficient way to join two dataframes such as below.

I've tried pd.merge and maybe using the rank function but cannot seem to figure a way.

Thanks in advance

df1

| A        | B              | C          |
| -------- | -------------- |------------|
| TBK1     | 2022-01-01     |2022-04-04  |
| TBK1     | 2022-02-02     |2021-01-09  | 
| TBK3     | 2022-05-07     |2023-02-04  |

What I'm trying to achieve is this

df2

| A        | B              | C          | D              | E          |
| -------- | -------------- |------------|----------------|------------|
| TBK1     | 2022-01-01     |2022-04-04  | 2022-02-02     |2021-01-09  | 
| TBK3     | 2022-05-07     |2023-02-04  |NaN             |NaN         |


CodePudding user response:

You might want to use groupby with unstack as advised in this answer:

import pandas as pd
from string import ascii_uppercase

# Reproduce the data
df = pd.DataFrame()
df['A'] = ['TBK1','TBK1', 'TBK3']
df['B'] = ['2022-01-01' , '2022-02-02', '2022-05-07']
df['C'] = ['2022-04-04', '2021-01-09', '2023-02-04']

# Count how many rows exists per unique entry
s = df.groupby(['A']).cumcount() 
# Unstack
df1 = df.set_index(['A', s]).unstack().sort_index(level=1, axis=1)
# Rename columns
df1.columns = [l for l in ascii_uppercase[1:len(df1.columns) 1]]
# Flatten columns names (aesthetics)
df1 = df1.reset_index()

print(df1)

      A           B           C           D           E
0  TBK1  2022-01-01  2022-04-04  2022-02-02  2021-01-09
1  TBK3  2022-05-07  2023-02-04         NaN         NaN
  • Related