Insert colA into DF1 with vals from DF2['colB'] by matching colC in both DFs-CodePudding

I have two CSV files, CSV_A and CSV_B.csv. I must insert the column (Category) from CSV_B into CSV_A.

The two CSVs share a common column: StockID, and I must add the correct category onto each row by matching the StockID columns.

This can be done using merge, like this:

dfa.merge(dfb, how='left', on='StockID')

but I only want to add the one column, not join the two dataframes.

CSV_A (indexed on StockID):

StockID,Brand,ToolName,Price
ABC123,Maxwell,ToolA,1.25
BCD234,Charton,ToolB,2.22
CDE345,Bingley,ToolC,3.33
DEF789,Charton,ToolD,1.44

CSV_B:

PurchDate,Supplier,StockID,Category
20201005,Sigmat,BCD234,CatShop
20210219,Zorbak,AAA111,CatWares
20210307,Phillips
20210417,Tandey,CDE345,CatThings
20210422,Stapek,BBB222,CatElse
20210502,Zorbak,ABC123,CatThis
20210512,Zorbak,CCC999,CatThings
20210717,Phillips,DEF789,CatShop

My task is to insert a Cat field into CSV_A, matching each inserted Category with its correct StockID.

Note1: CSV_A is indexed on the StockID column. CSV_B has the default indexing.
Note2: There are some columns in CSV_B (e.g. row 3) that do not have complete information. Note3: Adding column "Category" from CSV_B into CSV_A but calling it "Cat" in CSV_A

CodePudding user response：

Use Series.map to map 'Category' based on 'StockID'

df_a['Cat'] = df_a['StockID'].map(dict(zip(df_b['StockID'], df_b['Category'])))

Note that for this specific question (i.e. with CSV_A indexed on StockID), the code must be:

df_a['Cat'] = df_a.index.map(dict(zip(df_b['StockID'], df_b['Category'])))
                   ^^^^^

CodePudding user response：

While creating the question I discovered the solution, so decided to post it rather than just deleting the question.

import pandas as pd
dfa = pd.read_csv('csv_a.csv')
dfa.set_index('StockID', inplace=True)
dfb = pd.read_csv('csv_b.csv')

#remove incomplete rows (i.e. without Category/StockID columns)
dfb_tmp = dfb[dfb['StockID'].notnull()]

def myfunc(row):
    # NB: Use row.name because row['StockID'] is the index
    if row.name in list(dfb_tmp['StockID']):
        return dfb_tmp.loc[dfb_tmp['StockID'] == row.name]['Category'].values[0]
dfa['Cat'] = dfa.apply(lambda row: myfunc(row), axis=1)

print(dfa)

Result:

StockID Brand   ToolName    Price   Cat
ABC123  Maxwell ToolA       1.25    CatThis
BCD234  Charton ToolB       2.22    CatShop
CDE345  Bingley ToolC       3.33    CatThings
DEF789  Charton ToolD       1.44    CatShop