I have pipe delimited file with data as:
col1|col2|col3|col4|col5
row1|r1|src1#src2|val1#val2|val4#val5
row2|r2|src2#src1|val11#val12|val14#val15
row3|r3|src1#src2|val44|val23#val33
col3
, col4
and col5
are multi value fields with # as a separator. Now based on the src2
position in col3
, I need to consider the value in col4
and col5
for the same position. For example if src2
is in second position, then take only values from col4
and col5
which are at second position as below:
Output :
col1|col2|col3|col4|col5
row1|r1|src2|val2|val5
row2|r2|src2|val11|val14
row3|r3|src2||val33
I have written this code, but i am not getting the desired result with this:
df2 = (df["col3"].str.split("#", expand=True))
df2.loc[df2[0] == 'src2', 'Position'] = '0'
df2.loc[df2[1] == 'src2', 'Position'] = '1'
df.loc[df2['Position']=='0' ,'col4'] = df["col4"].str.split("#", expand=True)[0]
print(df['col4'])
CodePudding user response:
Because you have uneven "sub-list" length, this is not directly achievable with explode
.
Here is an approach using itertools.zip_longest
and apply
:
from itertools import zip_longest
cols = ['col3', 'col4', 'col5']
df2 = df.copy()
df2[cols] = (df2[cols]
.apply(lambda c: c.str.split('#'))
.apply(lambda r: next(filter(lambda x: x[0]=='src2',
zip_longest(*r)), float('nan')),
axis=1, result_type='expand')
)
output:
col1 col2 col3 col4 col5
0 row1 r1 src2 val2 val5
1 row2 r2 src2 val11 val14
2 row3 r3 src2 None val33