I am using nested apply() functions to iterate one dataframe against another. The objective is to create a similarity score between the new items and the inventory. Is there a better way to put this data into a new dataframe? The warning I get when running this says to use pandas.concat, but I am unsure how to apply that in this scenario.
The warning I am receiving each iteration:
/tmp/ipykernel_126/2064736442.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df_test.head()
item
0 paintbrush
1 mop #2
2 red bucket
3 o-light flashlight
item_test.head()
item_desc
0 broom
1 mop
2 bucket
3 flashlight
temp = df_test.apply(lambda x: item_test.apply(lambda y: temp.append({'New Item':x['ITEM_DESC'],'Inventory Item':y['Item_Desc'],'Similarity':fuzz.ratio(str(x['ITEM_DESC']).lower(), str(y['Item_Desc']).lower())},ignore_index=True), axis=1), axis=1)
Running the above code shows numbered columns, and I'm not sure why:
1 2 3 4 5 6 7 8 9 10 ... 90 91 92 93 94 95 96 97 98 99
1 Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... ... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne...
End goal: I'd like to pass each row of df_test through item_test and then have a new dataframe that shows:
test.head()
new item inventory item similarity
0 paintbrush broom 22
1 paintbrush mop 15
2 paintbrush bucket 45
3 paintbrush flashlight 4
Update: I've switched to a dictionary for storing the new data, which has eliminated the warnings, outputs the correct format, and improved the speed. However, I am only seeing one row now. Is update() the correct way to add multiple rows?
dict = {}
temp = df_test.apply(
lambda x: item_test.apply(
lambda y: dict.update(
{
"New Item": x["ITEM_DESC"],
"Inventory Item": y["Item_Desc"],
"Similarity": fuzz.ratio(
str(x["ITEM_DESC"]).lower(), str(y["Item_Desc"]).lower()
),
},
ignore_index=True,
),
axis=1,
),
axis=1,
)
dict
Output:
{'New Item': 'mop',
'Inventory Item': 'paintbrush',
'Similarity': 23,
'ignore_index': True}
CodePudding user response:
Here is your "one-liner", in more readable form.
temp = df_test.apply(
lambda x: item_test.apply(
lambda y: temp.append(
{
"New Item": x["ITEM_DESC"],
"Inventory Item": y["Item_Desc"],
"Similarity": fuzz.ratio(
str(x["ITEM_DESC"]).lower(),
str(y["Item_Desc"]).lower(),
),
},
ignore_index=True,
),
axis=1,
),
axis=1,
)
The inner loop of that code creates 3-element dicts, each containing a pair of items and their similarity.
You'd be better off producing a list of dicts, which you can hand to pd.DataFrame([...]). Pandas / numpy memory management goes much more smoothly when the number of rows is known in advance, and the list container reveals that information. Continual appends / reallocations would be inefficient and slow.
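A minimal sketch of that approach on the question's data, with difflib.SequenceMatcher standing in for fuzz.ratio (swap in thefuzz or rapidfuzz's fuzz.ratio if you have one installed):

```python
from difflib import SequenceMatcher

import pandas as pd

df_test = pd.DataFrame(
    {"ITEM_DESC": ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]}
)
item_test = pd.DataFrame({"Item_Desc": ["broom", "mop", "bucket", "flashlight"]})

def ratio(a: str, b: str) -> int:
    # Stand-in for fuzz.ratio: a 0-100 similarity score.
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

# Build every (new item, inventory item) pair as a dict, then hand the
# whole list to the DataFrame constructor in one shot -- no appends.
rows = [
    {
        "New Item": new,
        "Inventory Item": inv,
        "Similarity": ratio(new, inv),
    }
    for new in df_test["ITEM_DESC"]
    for inv in item_test["Item_Desc"]
]
result = pd.DataFrame(rows)
print(result.head())
```

This produces one row per pair (16 rows for the 4x4 sample data), in the three-column shape the question asks for.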
Simple example of creating dataframe from list of dicts:
df = pd.DataFrame([
dict(x=6, y=56, sim=0.4),
dict(x=8, y=58, sim=0.6),
])
output
x y sim
0 6 56 0.4
1 8 58 0.6
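If you'd rather stay inside pandas, a cross join builds the same all-pairs frame (this sketch assumes pandas >= 1.2 for merge(how="cross"); frame and column names are taken from the question):

```python
import pandas as pd

df_test = pd.DataFrame(
    {"ITEM_DESC": ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]}
)
item_test = pd.DataFrame({"Item_Desc": ["broom", "mop", "bucket", "flashlight"]})

# Cross join: every row of df_test paired with every row of item_test.
pairs = df_test.merge(item_test, how="cross")
pairs = pairs.rename(
    columns={"ITEM_DESC": "New Item", "Item_Desc": "Inventory Item"}
)
print(pairs.head())
```

From there a single apply over the paired frame adds the score column, e.g. pairs.apply(lambda r: fuzz.ratio(r["New Item"].lower(), r["Inventory Item"].lower()), axis=1), rather than nesting one apply inside another.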