How to expand a column and merge with another dataframe by these columns?

Time:04-13

I have the following two dataframes:

df1

col1  col2
A     5, 3, 4
B     5, 1
C     3, 4

df2

c1   c2
1    XXX
2    CCC
3    AAA
4    DDD
5    FFF 

Script to create df1 and df2:

import pandas as pd

data1 = [['A', '5, 3, 4'], ['B', '5, 1'], ['C', '3, 4']]
df1 = pd.DataFrame(data1, columns = ['col1', 'col2'])

data2 = [['1', 'XXX'], ['2', 'CCC'], ['3', 'AAA'], ['4', 'DDD'], ['5', 'FFF']]
df2 = pd.DataFrame(data2, columns = ['c1', 'c2'])

I need to expand col2 in df1 into separate columns and then look up each expanded value in df2, matching it against c1 to pull in the corresponding c2.

I know how to expand col2, but the number of expanded values differs from row to row, so I don't know how to merge df_expanded with df2.

df_expanded = df1['col2'].str.split(', ', expand=True)

The expected result:

col1  col2_1    col2_2   col2_3   col2_1_title   col2_2_title   col2_3_title
A     5         3        4        FFF            AAA            DDD
B     5         1                 FFF            XXX
C     3         4                 AAA            DDD

How can I get this result?

CodePudding user response:

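Let us do it step by step. This assumes c1 holds strings, as in the script above, so the split values stay strings instead of being cast to float:

# split on ', ' (comma plus space) so the values carry no leading blank
s = df1.col2.str.split(', ').apply(pd.Series).add_prefix('col_')
# replace each id with its title, using c1 -> c2 as the mapping
t = s.replace(df2.set_index('c1')['c2'])

out = df1.join(s).join(t.add_suffix('_title'))

output:
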
  col1     col2 col_0 col_1 col_2 col_0_title col_1_title col_2_title
0    A  5, 3, 4     5     3     4         FFF         AAA         DDD
1    B     5, 1     5     1   NaN         FFF         XXX         NaN
2    C     3, 4     3     4   NaN         AAA         DDD         NaN
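
For anyone who wants the exact layout from the question (1-based col2_1, col2_2, ... names and no original col2 column), here is a minimal self-contained sketch of the same split-then-look-up idea. It uses Series.map per column rather than replace, the helper name expand_and_lookup is made up for illustration, and it assumes c1 holds strings as in the question's script.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'B', 'C'],
                    'col2': ['5, 3, 4', '5, 1', '3, 4']})
df2 = pd.DataFrame({'c1': ['1', '2', '3', '4', '5'],
                    'c2': ['XXX', 'CCC', 'AAA', 'DDD', 'FFF']})


def expand_and_lookup(left, right):
    """Hypothetical helper: expand left.col2 and attach titles from right."""
    # one column per value; shorter rows are padded with missing values
    parts = left['col2'].str.split(', ', expand=True)
    # rename 0, 1, 2 to the 1-based names used in the question
    parts.columns = [f'col2_{i + 1}' for i in range(parts.shape[1])]
    # c1 -> c2 lookup; assumes c1 is unique and holds strings
    lookup = right.set_index('c1')['c2']
    titles = parts.apply(lambda col: col.map(lookup)).add_suffix('_title')
    return left[['col1']].join(parts).join(titles)


result = expand_and_lookup(df1, df2)
print(result)
```

Missing positions come out as missing values rather than the blanks shown in the question; result.fillna('') would blank them out if that matters for display.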

CodePudding user response:

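Here is one approach: stack the split values into long form, merge them with df2, then unstack back to wide form, which gives MultiIndex columns that are flattened afterwards. Note that the column names differ slightly from the expected output: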

df3 = (df1['col2']
 .str.split(', ', expand=True)
 .stack()
#.astype(int) # only if df2 has integers in c1
 .to_frame('col2')
 .merge(df2.set_index('c1').rename(columns={'c2': 'col2_title'}),
        left_on='col2', right_index=True)
 .unstack(1)
)
df3.columns = df3.columns.map(lambda x: f'{x[0]}_{x[1]}')

df_expand = df1[['col1']].join(df3)

output:

  col1 col2_0 col2_1 col2_2 col2_title_0 col2_title_1 col2_title_2
0    A      5      3      4          FFF          AAA          DDD
1    B      5      1    NaN          FFF          XXX          NaN
2    C      3      4    NaN          AAA          DDD          NaN
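
To see what the column-flattening line does in isolation: unstack(1) leaves two-level column labels such as ('col2', 0), and the map call joins each pair with an underscore. A small sketch on a hand-built MultiIndex (the labels are just the ones from this example, built by hand rather than taken from the merge):

```python
import pandas as pd

# two-level labels like those left behind by .unstack(1)
columns = pd.MultiIndex.from_product([['col2', 'col2_title'], [0, 1, 2]])

# each label is a tuple, e.g. ('col2', 0); join the two parts with '_'
flat = columns.map(lambda x: f'{x[0]}_{x[1]}')

print(list(flat))
```

Changing the f-string to f'{x[0]}_{x[1] + 1}' would give 1-based suffixes, if the numbering from the question is preferred.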

CodePudding user response:

I really wanted a prettier way to do this but I don't think this is any better than the other answers.

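# build an id -> title lookup function from the two columns of df2
get = dict(zip(*map(df2.get, df2))).get
d = df1.col2.str.split(', ', expand=True).add_prefix('col2_')
# on pandas >= 2.1, d.map(get) replaces the deprecated applymap
df1[['col1']].join(d).join(d.applymap(get).add_suffix('_title'))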

  col1 col2_0 col2_1 col2_2 col2_0_title col2_1_title col2_2_title
0    A      5      3      4          FFF          AAA          DDD
1    B      5      1   None          FFF          XXX         None
2    C      3      4   None          AAA          DDD         None
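
The one-liner that builds get is dense, so here is a hedged unpacking of it. Iterating a DataFrame yields its column labels, so map(df2.get, df2) produces the c1 and c2 columns, and zip pairs them up row by row. The names columns and lookup are only for illustration:

```python
import pandas as pd

df2 = pd.DataFrame({'c1': ['1', '2', '3', '4', '5'],
                    'c2': ['XXX', 'CCC', 'AAA', 'DDD', 'FFF']})

# iterating df2 gives 'c1', 'c2'; df2.get(label) returns that column
columns = list(map(df2.get, df2))

# zip(c1, c2) pairs each id with its title; dict turns the pairs into a lookup
lookup = dict(zip(*columns))
get = lookup.get

print(lookup)
print(get('5'), get('9'))  # unknown ids fall back to None
```

This only works this neatly because df2 has exactly two columns, key first and value second; dict(zip(df2['c1'], df2['c2'])) says the same thing more explicitly.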

Maybe this isn't exactly what OP wanted output-wise, but it seems better in some subjective way.

s = df1.set_index('col1').col2
m = df2.set_index('c1').c2
pd.concat({
    'number': s, 'title': s.replace(m, regex=True)
}).str.split(', ', expand=True).unstack(0)

          0            1            2      
     number title number title number title
col1                                       
A         5   FFF      3   AAA      4   DDD
B         5   FFF      1   XXX   None  None
C         3   AAA      4   DDD   None  None
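
A hedged note on the regex=True step: with it, replace treats every key of the mapping as a regular expression and substitutes it inside the string, which is why the ids can be swapped for titles before splitting. A minimal sketch, using a plain dict in place of the m Series and only two of the rows:

```python
import pandas as pd

s = pd.Series(['5, 3, 4', '5, 1'], index=['A', 'B'])
# plain dict standing in for the c1 -> c2 mapping (illustrative subset)
m = {'1': 'XXX', '3': 'AAA', '4': 'DDD', '5': 'FFF'}

# each key is applied as a regex, so matches inside the string are replaced
replaced = s.replace(m, regex=True)
print(replaced.tolist())

# the titles can then be split exactly like the numbers
wide = replaced.str.split(', ', expand=True)
print(wide)
```

Because the keys are patterns, a key such as '1' would also match inside '11'. That does not come up with the single-digit ids in the question, but with longer ids the keys would need word boundaries or one of the other approaches.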