I am using nested apply() functions to iterate one dataframe against another. The objective is to create a similarity score between the new items and the inventory. Is there a better way to put this data into a new dataframe? The warning I get when running this says to use pandas.concat, but I am unsure how to apply that in this scenario.
The warning I am receiving each iteration:
/tmp/ipykernel_126/2064736442.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df_test.head()
item
0 paintbrush
1 mop #2
2 red bucket
3 o-light flashlight
item_test.head()
item_desc
0 broom
1 mop
2 bucket
3 flashlight
temp = df_test.apply(lambda x: item_test.apply(lambda y: temp.append({'New Item':x['ITEM_DESC'],'Inventory Item':y['Item_Desc'],'Similarity':fuzz.ratio(str(x['ITEM_DESC']).lower(), str(y['Item_Desc']).lower())},ignore_index=True), axis=1), axis=1)
Running the above code shows numbered columns, and I'm not sure why:
1 2 3 4 5 6 7 8 9 10 ... 90 91 92 93 94 95 96 97 98 99
1 Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... ... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne... Ne...
End goal: I'd like to pass each row of df_test through item_test and then have a new dataframe that shows:
test.head()
new item inventory item similarity
0 paintbrush broom 22
1 paintbrush mop 15
2 paintbrush bucket 45
3 paintbrush flashlight 4
Update: I've switched to a dictionary for storing the new data, which has eliminated the warnings, outputs the correct format, and improved the speed. However, I am only seeing one row now. Is update() the correct way to add multiple rows?
dict = {}
temp = df_test.apply(
lambda x: item_test.apply(
lambda y: dict.update(
{
"New Item": x["ITEM_DESC"],
"Inventory Item": y["Item_Desc"],
"Similarity": fuzz.ratio(
str(x["ITEM_DESC"]).lower(), str(y["Item_Desc"]).lower()
),
},
ignore_index=True,
),
axis=1,
),
axis=1,
)
dict
Output:
{'New Item': 'mop',
'Inventory Item': 'paintbrush',
'Similarity': 23,
'ignore_index': True}
CodePudding user response:
Here is your "one-liner", in more readable form.
temp = df_test.apply(
lambda x: item_test.apply(
lambda y: temp.append(
{
"New Item": x["ITEM_DESC"],
"Inventory Item": y["Item_Desc"],
"Similarity": fuzz.ratio(
str(x["ITEM_DESC"]).lower(),
str(y["Item_Desc"]).lower(),
),
},
ignore_index=True,
),
axis=1,
),
axis=1,
)
The inner loop of that code creates 3-element dicts, each containing a pair of items and their similarity.
You'd be better off producing a list of dicts, which you can hand to pd.DataFrame([...]). Pandas / numpy memory management goes much more smoothly when the number of rows is known in advance, and the list container reveals that information. Continual appends / reallocations would be inefficient and slow.
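A minimal sketch of that approach on the question's data, with difflib.SequenceMatcher standing in for fuzz.ratio (swap in thefuzz or rapidfuzz's fuzz.ratio if you have one installed):

```python
from difflib import SequenceMatcher

import pandas as pd

df_test = pd.DataFrame(
    {"ITEM_DESC": ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]}
)
item_test = pd.DataFrame({"Item_Desc": ["broom", "mop", "bucket", "flashlight"]})

def ratio(a: str, b: str) -> int:
    # Stand-in for fuzz.ratio: a 0-100 similarity score.
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

# Build every (new item, inventory item) pair as a dict, then hand the
# whole list to the DataFrame constructor in one shot -- no appends.
rows = [
    {
        "New Item": new,
        "Inventory Item": inv,
        "Similarity": ratio(new, inv),
    }
    for new in df_test["ITEM_DESC"]
    for inv in item_test["Item_Desc"]
]
result = pd.DataFrame(rows)
print(result.head())
```

This produces one row per pair (16 rows for the 4x4 sample data), in the three-column shape the question asks for.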
Simple example of creating dataframe from list of dicts:
df = pd.DataFrame([
dict(x=6, y=56, sim=0.4),
dict(x=8, y=58, sim=0.6),
])
output
x y sim
0 6 56 0.4
1 8 58 0.6
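If you'd rather stay inside pandas, a cross join builds the same all-pairs frame (this sketch assumes pandas >= 1.2 for merge(how="cross"); frame and column names are taken from the question):

```python
import pandas as pd

df_test = pd.DataFrame(
    {"ITEM_DESC": ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]}
)
item_test = pd.DataFrame({"Item_Desc": ["broom", "mop", "bucket", "flashlight"]})

# Cross join: every row of df_test paired with every row of item_test.
pairs = df_test.merge(item_test, how="cross")
pairs = pairs.rename(
    columns={"ITEM_DESC": "New Item", "Item_Desc": "Inventory Item"}
)
print(pairs.head())
```

From there a single apply over the paired frame adds the score column, e.g. pairs.apply(lambda r: fuzz.ratio(r["New Item"].lower(), r["Inventory Item"].lower()), axis=1), rather than nesting one apply inside another.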