Converting for loop to numpy calculation for pandas dataframes-CodePudding

So I have a python script that compares two dataframes and works to find any rows that are not in both dataframes. It currently iterates through a for loop which is slow.

I want to improve the speed of the process, and know that iteration is the problem. However, I haven't been having much luck using various numpy methods such as merge and where.

Couple of caveats:

The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.

df_new represents new information to be checked against what is currently on file (df_current)

My current code:

set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)

for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue

# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)

Here are some simple examples of dataframes I would be working with:

df_current |partno|description|category|cost|price|upc|brand|color|size|year| |:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---| |123|Logo T-Shirt||25|49.99||apple|red|large|2021|| |456|Knitted Shirt||35|69.99||apple|green|medium|2021||

df_new |mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year| |:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---| |456|Knitted Shirt||35|69.99|||apple|green|medium|2021| |789|Logo Vest||20|39.99|||apple|yellow|small|2022|

There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.

I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.

CodePudding user response：

It's still a unclear what you want without providing examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.

Find the IDs that exist in the second dataframe

test_ids = df2['cola_id'].unique().tolist()

the filter the first dataframe for those IDs.

df1[df1['keep_id'].isin(test_ids)]

CodePudding user response：

Here is the answer that works - was supplied to me by someone much smarter.

df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]