replacing a value from df1['colA'] with df2['ColB'] using a unique identifier?-CodePudding

Hi I am trying to replace values in a df1 column A with values from df2 column B, by matching them with df2 column A. Basically if the string of row x in df1['a'] is equal to a string of row y in df2['a'] I want to replace the value of df1['a'] with df2['b']. I have tried a couple things but for some reason this isn't working properly. I also wants to replace values that aren't in df2['a'] with None.

my sample data is:

df1 = pd.DataFrame({'a': ['a','b','a','d','e','f','g', 'h', 'i'],
                    'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie', "alpha", "beta"],
                    'c': ['elephant', "zebra",'elephant', "zebra",'elephant', "zebra",'elephant','elephant', "zebra"]})

df2 = pd.DataFrame({'a': ['a','b','c','d','e','f','g'], 
                    'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie'],
                    'c': ['elephant', "zebra",'elephant', "zebra",'elephant', "zebra",'elephant']})
df1['UID'] = df1['a']  df1['b'] df1['c']
df2['UID'] = df2['a']  df2['b'] df2['c']

df1['a'].loc[df1['UID'].isin(df2['UID'])] = df2['c']
animals = ['elephant','zebra']
df1.loc[~df1['a'].isin(animals), "a"] = "None"

This works in my sample data but isn't working in my actual data set which is much larger. Any ideas on how to do something similar to this?

CodePudding user response：

I think the explanation is not quite correct. Based on your code attempt, I suspect that what you mean is:

For each row i of df1 that matches (for all fields (a, b, c)) a row j of df2, then replace df1.loc[i, 'a'] by df2.loc[j, 'c'].

If that is the correct interpretation of your question, then:

First, it is safer to use a tuple of the row values as UID for the row, instead of the string concatenation: imagine a row '_', 'foo', 'bar' and another '_', 'fooba', 'r' -- they are most certainly distinct. The second advantage of tuple is that it works with other types, not just strings. Thus:

df1['UID'] = df1[['a', 'b', 'c']].apply(tuple, axis=1)
df2['UID'] = df2[['a', 'b', 'c']].apply(tuple, axis=1)

Then, the expected result can be obtained by merging on UID:

df = df1.assign(
    a=df1.merge(
        df2[['UID', 'c']], on='UID', how='left',
        suffixes=['', '_y'])['c_y'].fillna('None')
)

>>> df
          a        b         c                     UID
0  elephant    alpha  elephant    (a, alpha, elephant)
1     zebra    alpha     zebra       (b, alpha, zebra)
2  elephant    alpha  elephant    (a, alpha, elephant)
3     zebra     beta     zebra        (d, beta, zebra)
4  elephant     beta  elephant     (e, beta, elephant)
5     zebra  charlie     zebra     (f, charlie, zebra)
6  elephant  charlie  elephant  (g, charlie, elephant)
7      None    alpha  elephant    (h, alpha, elephant)
8      None     beta     zebra        (i, beta, zebra)