I have two different DataFrames
>> df = pd.DataFrame([
{'id': 'ab-c', 'id_plain': 'ABC'},
{'id': 'ab-c', 'id_plain': 'ABC'},
{'id': 'd_ef', 'id_plain': 'DEF'},
{'id': 'gh.i', 'id_plain': 'GHI'},
{'id': 'ab-c', 'id_plain': 'ABC'},
{'id': 'ab-c', 'id_plain': 'ABC'},
{'id': 'd_ef', 'id_plain': 'DEF'},
{'id': 'gh.i', 'id_plain': 'GHI'}
])
>> df
id id_plain
0 ab-c ABC
1 ab-c ABC
2 d_ef DEF
3 gh.i GHI
4 ab-c ABC
5 ab-c ABC
6 d_ef DEF
7 gh.i GHI
>> codes = pd.DataFrame([
{'id': 'AB_C', 'id_plain': 'ABC', 'code': 1},
{'id': 'd_ef', 'id_plain': 'DEF', 'code': 2},
{'id': 'GHI', 'id_plain': 'GHI', 'code': 3}
])
>> codes
id id_plain code
0 ab_c ABC 1
1 d_ef DEF 2
2 ghi GHI 3
id_plain
has been calculated using
df['id_plain'] = df['id'].map(lambda x: re.sub('[\W_] ', '', x).upper())
The codes
DataFrame acts as a key book having only uniques values. I want to check, for every row in df
, if its id_plain
value is contained on the id_plain
Series of codes
DataFrame, and then get that matching row code
value and insert it into a newly created column on df
.
The desired output for df
would be:
id id_plain code
0 ab-c ABC 1
1 ab-c ABC 1
2 d_ef DEF 2
3 gh.i GHI 3
4 ab-c ABC 1
5 ab-c ABC 1
6 d_ef DEF 2
7 gh.i GHI 3
CodePudding user response:
Use Series.str.replace
with Series.str.upper
and then mapping by Series
from codes
by Series.map
:
df['id_plain'] = df['id'].str.replace('[\W_] ', '', regex=True).str.upper()
df['code'] = df['id_plain'].map(codes.set_index('id_plain')['code'])
print (df)
id id_plain code
0 ab-c ABC 1
1 ab-c ABC 1
2 d_ef DEF 2
3 gh.i GHI 3
4 ab-c ABC 1
5 ab-c ABC 1
6 d_ef DEF 2
7 gh.i GHI 3
CodePudding user response:
Alternatively, you could use merge
in pandas
:
value_if_na = -1 # this is a value you can set
df.merge(codes[["id_plain","code"]], on=["id_plain"], how="left").fillna(value_if_na).astype({"code":int})
OUTPUT
id id_plain code
0 ab-c ABC 1
1 ab-c ABC 1
2 d_ef DEF 2
3 gh.i GHI 3
4 ab-c ABC 1
5 ab-c ABC 1
6 d_ef DEF 2
7 gh.i GHI 3
If one or more of the plain_code
in df
are not available in codes
, the solution will allow to output a predetermined value. For example:
df = df.append({"id":"hd.k", "id_plain":"HDK"}, ignore_index=True) # let's add a novel value
df.merge(codes[["id_plain","code"]], on=["id_plain"], how="left").fillna(value_if_na).astype({"code": int})
OUTPUT
id id_plain code
0 ab-c ABC 1
1 ab-c ABC 1
2 d_ef DEF 2
3 gh.i GHI 3
4 ab-c ABC 1
5 ab-c ABC 1
6 d_ef DEF 2
7 gh.i GHI 3
8 hd.k HDK -1
If you don't include the last part .fillna(value_if_na).astype({"code": int})
:
df.merge(codes[["id_plain","code"]], on=["id_plain"], how="left")
OUTPUT
id id_plain code
0 ab-c ABC 1.0
1 ab-c ABC 1.0
2 d_ef DEF 2.0
3 gh.i GHI 3.0
4 ab-c ABC 1.0
5 ab-c ABC 1.0
6 d_ef DEF 2.0
7 gh.i GHI 3.0
8 hd.k HDK NaN