Home > front end >  Reconcile Pandas Dataframe based on same wording Found
Reconcile Pandas Dataframe based on same wording Found

Time:01-05

I have below dataframe,

df = pd.DataFrame([['Coca-Cola','Coca-Cola Ltd Co'], ['BMW','Company BMW Ltd'], ['Nike','Adidas Ltd and Co.']], columns=['Brand','Company Name'])

where I need to reconcile the "status" column based on same wording found in 'brand' and 'company name' columns:

Brand             Company Name

Coca-Cola         Coca-Cola Ltd Co
BMW               Company BMW Ltd
Nike              Adidas Ltd and Co.

Ideally I would like my 'status' column to return 'same' to same brand and company name and 'not' for different ones as below:

    Brand             Company Name     Status

Coca-Cola         Coca-Cola            same
BMW               Company BMW Ltd      same
Nike              Nike Ltd and Co.     not

The current method I have is working but not for all as some company names come in between or differently (with full company name)

my_list=[]
for brands,names in zip(df.Brand,df["Company Name"]):
    if brands==names:
       my_list.append('same')
    else:
       my_list.append('not')

Please share your suggestions on how I can reconcile it based on same wording found in both columns. Thanks.

CodePudding user response:

If brand names aren't multiple words, we can simply split Company Name on white space and look up membership:

df['Status'] = df.apply(lambda x: 'same' if x['Brand'] in x['Company Name'].split() else 'not', axis=1)

Output:

       Brand        Company Name Status
0  Coca-Cola    Coca-Cola Ltd Co   same
1        BMW     Company BMW Ltd   same
2       Nike  Adidas Ltd and Co.    not

CodePudding user response:

You can use List comprehension for better timings. zip the columns into one and check whether Brand exists in Company Name or not:

In [251]: df['Status'] = ['same' if x in y else 'not' for x,y in zip(df['Brand'], df['Company Name'].str.split())]

In [252]: df
Out[252]: 
       Brand        Company Name Status
0  Coca-Cola    Coca-Cola Ltd Co   same
1        BMW     Company BMW Ltd   same
2       Nike  Adidas Ltd and Co.    not

Timings:

In [261]: def f1():
     ...:     df['Status'] = df.apply(lambda x: 'same' if x['Brand'] in x['Company Name'].split() else 'not', axis=1)
     ...: 

In [259]: def f2(): 
     ...:     df['Status'] = ['same' if x in y else 'not' for x,y in zip(df['Brand'], df['Company Name'].str.split())]

In [262]: %timeit f1() # @Manlai's solution
468 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [263]: %timeit f2() # my solution
360 µs ± 6.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  •  Tags:  
  • Related