I have two Data Frames:
**df1-- data1 = {'Company':['ADIDAS','NIKE','PUMA','NEW BALANCE','UNDER ARMOUR'],
'Keywords':['COPA, PREDATOR, ORIGINALS, SPEEDFLOW','MERCURIAL, SUPERSTAR, VAPOR','ULTRA, FUTURE, RAPIDO','FURON','TEKELA'],
'Suppliers':['', '', '[STADIUM, JD]', '', '']}
| Company |Keywords | Suppliers |
| ---------- |--------------------------------------|-------------|
| ADIDAS |[COPA, PREDATOR, ORIGINALS, SPEEDFLOW]| <NA> |
| NIKE |[MERCURIAL, SUPERSTAR, VAPOR] | <NA> |
| PUMA |[ULTRA, FUTURE, RAPIDO] |[STADIUM, JD]|
|NEW BALANCE |[FURON] | <NA> |
|UNDER ARMOUR|[TEKELA] | <NA> |
**df2 --data2 = {'Names':['ADIDAS PREDATOR 17.1','NIKE MERCURIAL 2020','NIKE VAPOR 2021','NEW BALANCE FURON','PUMA RAPIDO 21.3', 'PUMA RAPIDO 21.4'],
'Supplier':['ADIDAS','NIKE','NIKE','JD','STADIUM', 'JD'], 'Company.1': ['', '', '', '', '', '']}**
| Names | Supplier | Company.1 |
| -------------------- | ---------|-----------|
| ADIDAS PREDATOR 17.1 |ADIDAS | <NA> |
| NIKE MERCURIAL 2020 |NIKE | <NA> |
| NIKE VAPOR 2021 |NIKE | <NA> |
| NEW BALANCE FURON |JD | <NA> |
| PUMA RAPIDO 21.3 |STADIUM | <NA> |
| PUMA RAPIDO 21.4 |JD | <NA> |
Goal is to check if df2[Names] contains any word from df1[keywords], if YES then check if df1[suppliers] and df2[supplier] are equal, if YES then assign df2[Company.1] as df1[Company]. (If df1[Suppliers] is empty then no need to check for supplier)
Here is some code I've tried.(Print statement is just a reference for me)
for i in range(len(df1["Keywords"])):
for j in range(len(df1["Keywords"][i])):
for name_index in range(len(df2["Product_name"])):
if df1["Keywords"][i][j].strip() in df2["Product_name"][name_index]:
print("YES " df1["Keywords"][i][j] " in " df2["Product_name"][name_index])
# Now need to check if suppliers are same
Expected Output:
| Names | Supplier | Company.1 |
| -------------------- | ---------|-----------|
| ADIDAS PREDATOR 17.1 |ADIDAS | ADIDAS |
| NIKE MERCURIAL 2020 |NIKE | NIKE |
| NIKE VAPOR 2021 |NIKE | NIKE |
| NEW BALANCE FURON |JD | NEW BALANCE|
| PUMA RAPIDO 21.3 |STADIUM | PUMA |
How to add Company name to Company.1 using the satisfied condition?
CodePudding user response:
Is Suppliers
in data1
supposed to have <NA>
values, except for [STADIUM, JD]
? If so, I'm unsure how you've gotten the Company.1
values in your expected output. None of the values in data2
's Supplier
are <NA>
, and — of the one that is not <NA>
in data1
— the Keywords
do not match the Names
in data2
.
Regardless, I believe I have the gist of what you're looking for.
keywords: str = "Keywords"
names: str = "Names"
# 1 - Compare the values of data2.Name to data1.Keywords
data1[keywords] = [i.split(", ") for i in data1.get(keywords)]
data2[names] = [i.split() for i in data2.get(names)]
data2["_match_keywords"] = [any(i in name for i in keyword) for name, keyword in zip(data2.get(names), data1.get(keywords))]
# Out - data2
# {'Names': [['ADIDAS', 'PREDATOR', '17.1'], ['NIKE', 'MERCURIAL', '2020'], #['NIKE', 'VAPOR', '2021'],
# ['NEW', 'BALANCE', 'FURON'], ['PUMA', 'RAPIDO', '21.3']],
# 'Supplier': ['ADIDAS', 'NIKE', 'NIKE', 'JD', 'STADIUM'], 'Company.1': ['', # '', '', '', ''],
# '_match_keywords': [True, True, False, True, False]}
# 2 - Compare data2.Supplier to data1.Suppliers
data2["_match_supplier"] = [any(i in s1 for i in s2) for s2, s1 in zip(data2.get("Supplier"), data1.get("Suppliers"))]
# Out
# {'Names': [['ADIDAS', 'PREDATOR', '17.1'], ['NIKE', 'MERCURIAL', '2020'], ['NIKE', 'VAPOR', '2021'],
# ['NEW', 'BALANCE', 'FURON'], ['PUMA', 'RAPIDO', '21.3']],
# 'Supplier': ['ADIDAS', 'NIKE', 'NIKE', 'JD', 'STADIUM'], 'Company.1': ['', '', '', '', ''],
# '_match_keywords': [True, True, False, True, False], '_match_supplier': [False, False, True, False, False]}
# 3 - If Keyword and Supplier are equal, then assign data1.Company to data2.Company.1
for match, org in zip(data2.get("_match_keywords"), data2.get("_match_supplier")):
if match == org:
data2["Company.1"] = org
# 4 - Make the frames and drop the helper columns (_match_keywords, _match_supplier)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2 = df2[[col for col in df2.columns if not col.startswith("_")]]
df1
Company | Keywords | Suppliers |
---|---|---|
ADIDAS | ['COPA', 'PREDATOR', 'ORIGINALS', 'SPEEDFLOW'] | |
NIKE | ['MERCURIAL', 'SUPERSTAR', 'VAPOR'] | |
PUMA | ['ULTRA', 'FUTURE', 'RAPIDO'] | [STADIUM, JD] |
NEW BALANCE | ['FURON'] | |
UNDER ARMOUR | ['TEKELA'] |
df2 w/conditions
Names | Supplier | Company.1 | _match_keywords | _match_supplier |
---|---|---|---|---|
['ADIDAS', 'PREDATOR', '17.1'] | ADIDAS | True | False | |
['NIKE', 'MERCURIAL', '2020'] | NIKE | True | False | |
['NIKE', 'VAPOR', '2021'] | NIKE | False | True | |
['NEW', 'BALANCE', 'FURON'] | JD | True | False | |
['PUMA', 'RAPIDO', '21.3'] | STADIUM | False | False |
df2 w/helpers dropped
| Names | Supplier | Company.1 |
|--------------------------------|------------|-------------|
| ['ADIDAS', 'PREDATOR', '17.1'] | ADIDAS | |
| ['NIKE', 'MERCURIAL', '2020'] | NIKE | |
| ['NIKE', 'VAPOR', '2021'] | NIKE | |
| ['NEW', 'BALANCE', 'FURON'] | JD | |
| ['PUMA', 'RAPIDO', '21.3'] | STADIUM | |
CodePudding user response:
Here is the code that is working for me. If anybody has any suggestion on improving my code, I would appreciate it!
for i in range(len(df1["Keywords"])):
for j in range(len(df1["Keywords"][i])):
for name_index in range(len(df2["Product_name"])):
if df1["Keywords"][i][j].strip() in df2["Product_name"][name_index]:
print("YES," df1["Keywords"][i][j] " in " df2["Product_name"][name_index])
print("---Checking Supplier-------")
df1["Suppliers"].fillna("Empty", inplace = True)
if df1["Suppliers"][i] == "Empty":
print("---Supplier empty so addding brand name")
df2["Company.1"][name_index] = df1["Company"][i]
print("---Brand added--")
else:
print("---Supplier not empty so looking for match")
for suppliers in df1["Suppliers"][i]:
if df2["Supplier"][name_index] in suppliers:
print("Supplier matched", end =" ")
print(df2["Supplier"][name_index], suppliers)
df2["Company.1"][name_index] = df1["Company"][i]
print("Brand added")
break
else:
df2["Company.1"][name_index] = ("Unmapped")
print("Supplier not matched so unmapped.", end = " ")
print(df2["Supplier"][name_index], suppliers)