Home > Net >  Split dataframe multivalue columns
Split dataframe multivalue columns

Time:02-22

I have pipe delimited file with data as:

col1|col2|col3|col4|col5
row1|r1|src1#src2|val1#val2|val4#val5
row2|r2|src2#src1|val11#val12|val14#val15
row3|r3|src1#src2|val44|val23#val33

col3, col4 and col5 are multi value fields with # as a separator. Now based on the src2 position in col3, I need to consider the value in col4 and col5 for the same position. For example if src2 is in second position, then take only values from col4 and col5 which are at second position as below:

Output :

col1|col2|col3|col4|col5
row1|r1|src2|val2|val5
row2|r2|src2|val11|val14
row3|r3|src2||val33

I have written this code, but i am not getting the desired result with this:

df2 = (df["col3"].str.split("#", expand=True))
df2.loc[df2[0] == 'src2', 'Position'] = '0'

df2.loc[df2[1] == 'src2', 'Position'] = '1'

df.loc[df2['Position']=='0' ,'col4'] = df["col4"].str.split("#", expand=True)[0]

print(df['col4'])

CodePudding user response:

Because you have uneven "sub-list" length, this is not directly achievable with explode.

Here is an approach using itertools.zip_longest and apply:

from itertools import zip_longest

cols = ['col3', 'col4', 'col5']

df2 = df.copy()

df2[cols] = (df2[cols]
             .apply(lambda c: c.str.split('#'))
             .apply(lambda r: next(filter(lambda x: x[0]=='src2',
                                          zip_longest(*r)), float('nan')),
                            axis=1, result_type='expand')
            )

output:

   col1 col2  col3   col4   col5
0  row1   r1  src2   val2   val5
1  row2   r2  src2  val11  val14
2  row3   r3  src2   None  val33
  • Related