Create column on DataFrame comparing any match between two Series-CodePudding

I have two different DataFrames

>> df = pd.DataFrame([
    {'id': 'ab-c', 'id_plain': 'ABC'},
    {'id': 'ab-c', 'id_plain': 'ABC'},
    {'id': 'd_ef', 'id_plain': 'DEF'},
    {'id': 'gh.i', 'id_plain': 'GHI'},
    {'id': 'ab-c', 'id_plain': 'ABC'},
    {'id': 'ab-c', 'id_plain': 'ABC'},
    {'id': 'd_ef', 'id_plain': 'DEF'},
    {'id': 'gh.i', 'id_plain': 'GHI'}
])

>> df
    id      id_plain
0   ab-c    ABC
1   ab-c    ABC
2   d_ef    DEF
3   gh.i    GHI
4   ab-c    ABC
5   ab-c    ABC
6   d_ef    DEF
7   gh.i    GHI

>> codes = pd.DataFrame([
    {'id': 'AB_C', 'id_plain': 'ABC', 'code': 1},
    {'id': 'd_ef', 'id_plain': 'DEF', 'code': 2},
    {'id': 'GHI', 'id_plain': 'GHI', 'code': 3}
])

>> codes
    id      id_plain    code
0   ab_c    ABC         1
1   d_ef    DEF         2
2   ghi     GHI         3

id_plain has been calculated using

df['id_plain'] = df['id'].map(lambda x: re.sub('[\W_] ', '', x).upper())

The codes DataFrame acts as a key book having only uniques values. I want to check, for every row in df, if its id_plain value is contained on the id_plain Series of codes DataFrame, and then get that matching row code value and insert it into a newly created column on df.

The desired output for df would be:

    id      id_plain     code
0   ab-c    ABC           1
1   ab-c    ABC           1
2   d_ef    DEF           2
3   gh.i    GHI           3
4   ab-c    ABC           1
5   ab-c    ABC           1
6   d_ef    DEF           2
7   gh.i    GHI           3

CodePudding user response：

Use Series.str.replace with Series.str.upper and then mapping by Series from codes by Series.map:

df['id_plain'] = df['id'].str.replace('[\W_] ', '', regex=True).str.upper()
df['code'] = df['id_plain'].map(codes.set_index('id_plain')['code'])
print (df)
     id id_plain  code
0  ab-c      ABC     1
1  ab-c      ABC     1
2  d_ef      DEF     2
3  gh.i      GHI     3
4  ab-c      ABC     1
5  ab-c      ABC     1
6  d_ef      DEF     2
7  gh.i      GHI     3

CodePudding user response：

Alternatively, you could use merge in pandas:

value_if_na = -1 # this is a value you can set
df.merge(codes[["id_plain","code"]], on=["id_plain"], how="left").fillna(value_if_na).astype({"code":int})

OUTPUT

     id id_plain  code
0  ab-c      ABC     1
1  ab-c      ABC     1
2  d_ef      DEF     2
3  gh.i      GHI     3
4  ab-c      ABC     1
5  ab-c      ABC     1
6  d_ef      DEF     2
7  gh.i      GHI     3

If one or more of the plain_code in df are not available in codes, the solution will allow to output a predetermined value. For example:

df = df.append({"id":"hd.k", "id_plain":"HDK"}, ignore_index=True) # let's add a novel value
df.merge(codes[["id_plain","code"]], on=["id_plain"], how="left").fillna(value_if_na).astype({"code": int})

OUTPUT

     id id_plain  code
0  ab-c      ABC     1
1  ab-c      ABC     1
2  d_ef      DEF     2
3  gh.i      GHI     3
4  ab-c      ABC     1
5  ab-c      ABC     1
6  d_ef      DEF     2
7  gh.i      GHI     3
8  hd.k      HDK    -1

If you don't include the last part .fillna(value_if_na).astype({"code": int}):

df.merge(codes[["id_plain","code"]], on=["id_plain"], how="left")

OUTPUT

     id id_plain  code
0  ab-c      ABC   1.0
1  ab-c      ABC   1.0
2  d_ef      DEF   2.0
3  gh.i      GHI   3.0
4  ab-c      ABC   1.0
5  ab-c      ABC   1.0
6  d_ef      DEF   2.0
7  gh.i      GHI   3.0
8  hd.k      HDK   NaN