DataFrame join and divide

I have a data frame dF with three columns, A, B and C:

dF =       
           
               A                                      B                  C
        navigate to "www.xyz.com"               to "www.xyz.com"        NA
     enters valid username "JOHN"                enters                "JOHN"
    enters password "1234567"                    enters                "1234567"
    enters  RIGHT destination"YUL"                enters               "YUL"
    clicks Customer Service                      clicks                 NA
    clicks Booking Information from Booking      clicks                 NA

I want to remove the words that appear in B and C from A and put the remaining words in a new column D. I want my data frame to look like this:

dF =       
        
               A                                      B                     C                 D
        navigate to "www.xyz.com"               to "www.xyz.com"        NA              navigate
     enters valid username "JOHN"                enters                "JOHN"           valid username
    enters password "1234567"                     enters              "1234567"         password
    enters  RIGHT destination"YUL"                enters               "YUL"            RIGHT destination
    clicks Customer Service                      clicks                 NA              Customer Service
    clicks Booking Information from Booking      clicks                 NA              Booking Information from Booking

I am using:

df['D'] = df[['B', 'C']].agg(' '.join, axis=1).str.split(' ')

df['D'] = df.apply(lambda x: ''.join(set(x['A'].split(' ')) - set(x['D'])), axis=1)

But the words in column D do not come out in their original order.

CodePudding user response：

import pandas as pd

# Sample data from the question; missing values in C are the string 'NA'
df = pd.DataFrame({
    'A': ['navigate to "www.xyz.com"',
          'enters valid username "JOHN"',
          'enters password "1234567"',
          'enters  RIGHT destination"YUL"',
          'clicks Customer Service',
          'clicks Booking Information from Booking'],
    'B': ['to "www.xyz.com"', 'enters', 'enters', 'enters', 'clicks', 'clicks'],
    'C': ['NA', '"JOHN"', '"1234567"', '"YUL"', 'NA', 'NA']})

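If you are sure that all words are space-separated (which is not true in row #4, where destination"YUL" has no space), then you can use split. Keep 'A' as a list rather than converting it to a set: a set has no defined order and also drops repeated words, which is why your D column comes out scrambled.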

a = df['A'].str.split()
b = df['B'].str.split().apply(set)
c = df['C'].str.split().apply(set)

df['D'] = [' '.join([a2 for a2 in a1 if a2 not in (b1 | c1)]) for a1, b1, c1 in zip(a,b,c)]
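To see the ordering point in isolation, here is a minimal plain-Python sketch of the same idea applied to a single row. The helper name ordered_difference is made up for this illustration and is not part of pandas; it assumes words are separated by whitespace.

```python
def ordered_difference(text, *others):
    """Return the words of `text` that appear in none of `others`, in original order."""
    # Only the words to exclude go into a set; `text` stays a list so its order survives.
    exclude = set()
    for other in others:
        exclude.update(other.split())
    return ' '.join(word for word in text.split() if word not in exclude)


row_a = 'clicks Booking Information from Booking'
print(ordered_difference(row_a, 'clicks', 'NA'))
# Booking Information from Booking
```

Note that the repeated word Booking is kept twice. A set difference on A would have collapsed it to one occurrence in addition to losing the order.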

Otherwise, you may consider Python's str.replace, which removes the text of B and C from A even when there is no space between words:

df['D'] = df.apply(lambda r: r['A'].replace(r['B'], '').replace(r['C'], '').strip(), axis=1)
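One caveat with the one-liner above is that it also replaces the literal string 'NA' wherever it occurs inside A. A slightly more defensive sketch, assuming missing values are stored as the string 'NA' as in the sample data, could look like this. The helper name remove_parts and its missing parameter are hypothetical, chosen only for this example.

```python
import re


def remove_parts(text, *parts, missing='NA'):
    """Remove each substring in `parts` from `text`, skipping the missing-value marker."""
    for part in parts:
        if part and part != missing:
            # Replace with a space so neighbouring words are not glued together.
            text = text.replace(part, ' ')
    # Collapse the leftover runs of whitespace and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()


print(remove_parts('enters  RIGHT destination"YUL"', 'enters', '"YUL"'))
# RIGHT destination
```

With the data frame from this answer it would be applied row by row as df['D'] = df.apply(lambda r: remove_parts(r['A'], r['B'], r['C']), axis=1). Like the original replace call, this matches substrings rather than whole words, so a short value in B or C can also match inside a longer word in A.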