Say I have a dataframe
d = {
"cid": [1, 3, 3],
"txt": ["Kakashi is GOAT", "Eleven is El", "Hello agian"],
"anime": ['Kakashi ', 'el', 'mouse']
}
df = pd.DataFrame(data=d)
df['code'] = df['anime'].astype('category').cat.codes
I need to create a new column code
which will contain value from code if anime is present in txt else 999
This is my mask
mask = df.apply(lambda x: x.anime.lower() in x.txt.lower(), axis=1)
df['newCol'] = 999
df['newCol'] = df.loc[mask, 'code']
but this gives me float values and the replaces the 999
as NaN
Output :
df
cid txt anime code newCol
0 1 Kakashi is GOAT Kakashi 0 0.0
1 3 Eleven is El el 1 1.0
2 3 Hello agian mouse 2 NaN
Expected :
df
cid txt anime code newCol
0 1 Kakashi is GOAT Kakashi 0 0
1 3 Eleven is El el 1 1
2 3 Nothing mouse 2 999
Note that I need to do it the masking way there are various methods in which this can be achieved though
CodePudding user response:
You could skip the mask
step and just assign the newCol
values using (almost) the same lambda function:
df['newCol'] = df.apply(lambda x: x.code if x.anime.lower() in x.txt.lower() else 999, axis=1)
Output:
cid txt anime code newCol
0 1 Kakashi is GOAT Kakashi 0 0
1 3 Eleven is El el 1 1
2 3 Hello agian mouse 2 999
Note that el
matches against Eleven
and El
in the string Eleven is El
. If you want to avoid that, you should use a regex match instead, using word boundaries to ensure el
can only match El
, not Eleven
:
import re
df['newCol'] = df.apply(lambda x: x.code if re.search(rf'\b{x.anime}\b', x.txt, re.I) else 999, axis=1)
The output is the same. Note that by using the re.I
(case-insensitive) flag, we can avoid the need to convert both strings to lower-case.