I have a data frame dF with 3 COLUMN A B C
dF =
A B C
navigate to "www.xyz.com" to "www.xyz.com" NA
enters valid username "JOHN" enters "JOHN"
enters password "1234567" enters "1234567"
enters RIGHT destination"YUL" enters "YUL"
clicks Customer Service clicks NA
clicks Booking Information from Booking clicks NA
i want to find the difference between of A ,B C, and rest values will be in D column. i want my data frame to look like this
dF =
A B C D
navigate to "www.xyz.com" to "www.xyz.com" NA navigate
enters valid username "JOHN" enters "JOHN" valid username
enters valid password "1234567" enters "1234567" valid password
enters RIGHT destination"YUL" enters "YUL" RIGHT destination
clicks Customer Service clicks NA Customer Service
clicks Booking Information from Booking clicks NA Booking Information from Booking
i am using:
df['D'] = Final_df[['B', 'C']].agg(' '.join, axis=1).str.split(' ')
df['D'] = df.apply(lambda x: ''.join(set(x['A'].split(' ')) - set(x['D'])), axis=1)
but i am not getting in sequence order in D column.
CodePudding user response:
df = {'A': ['navigate to "www.xyz.com"',
'enters valid username "JOHN"',
'enters password "1234567"',
'enters RIGHT destination"YUL"',
'clicks Customer Service',
'clicks Booking Information from Booking'],
'B': ['to "www.xyz.com"', 'enters', 'enters', 'enters', 'clicks', 'clicks'],
'C': ['NA', '"JOHN"', '"1234567"', '"YUL"', 'NA', 'NA']}
If you are sure that all words are space-separated (which is not true in row #4), then you can use split, but don't convert 'A' to a set to preserve the ordering.
a = df['A'].str.split()
b = df['B'].str.split().apply(set)
c = df['C'].str.split().apply(set)
df['D'] = [' '.join([a2 for a2 in a1 if a2 not in (b1 | c1)]) for a1, b1, c1 in zip(a,b,c)]
Otherwise, you may consider replace
df['D'] = df.apply(lambda r: r['A'].replace(r['B'], '').replace(r['C'], '').strip(), axis=1)