I have one source file with structure as below:
ID|RcrdId|SrcId|Name|Address
row1|r1|src1#src2|val1#val2|val4#val5
row2|r2|src2#src1|val11#val12|val14#val15
row3|r3|src1|val44|val23
I need to include values in Name and Address fields only for Src2 value which is present in SrcID column, in case there is no value for SRC2, then null values should be populated in the Name and address field like below(RecordID=r3):
ID|RcrdId|SrcId|Name|Address
row1|r1|src2|val2|val5
row2|r2|src2|val11|val14
row3|r3|||
I have this sample code, but I am not able to handle the record r3 where src2 is not present and I am getting following error : AttributeError: Can only use .str accessor with string values!
from itertools import zip_longest
cols = ['SrcId', 'Name', 'Address']
df[cols] = (df[cols].fillna(#)
.apply(lambda c: c.str.split('#'))
.apply(lambda r: next(filter(lambda x: x[0]=='src2',
zip_longest(*r)), float('nan')),
axis=1, result_type='expand')
)
CodePudding user response:
Split your values to get a list then explode it into multiple rows. Filter out your rows and fill missing values with original ones:
cols = ['SrcId', 'Name', 'Address']
df[cols] = df[cols].apply(lambda x: x.str.split('#')).explode(cols) \
.query("SrcId == 'src2'").reindex(df.index).fillna(df[cols])
print(df)
# Output
ID RcrdId SrcId Name Address
0 row1 r1 src2 val2 val5
1 row2 r2 src2 val11 val14
2 row3 r3 src1 val44 val23
Step-by-step:
>>> out = df[cols].apply(lambda x: x.str.split('#'))
SrcId Name Address
0 [src1, src2] [val1, val2] [val4, val5]
1 [src2, src1] [val11, val12] [val14, val15]
2 [src1] [val44] [val23]
>>> out = out.explode(cols)
SrcId Name Address
0 src1 val1 val4
0 src2 val2 val5
1 src2 val11 val14
1 src1 val12 val15
2 src1 val44 val23
>>> out = out.query("SrcId == 'src2'")
SrcId Name Address
0 src2 val2 val5
1 src2 val11 val14
>>> out = out.reindex(df.index)
SrcId Name Address
0 src2 val2 val5
1 src2 val11 val14
2 NaN NaN NaN
>>> out = out.fillna(df[cols])
SrcId Name Address
0 src2 val2 val5
1 src2 val11 val14
2 src1 val44 val23