I have a pandas dataframe with a series of strings. For each string, I want to extract the number after the letter "H", and put this value, as an integer, into a new column.
import pandas as pd
import numpy as np
inp = [{"H-Type": np.nan, 'SomeValue': "Influenza A(H1N1)pdm09 Virus"},
{"H-Type": np.nan, 'SomeValue': "Influencer A(H3N2) Virus"},
{"H-Type": 1, 'SomeValue': "Influenza A Virus"},
{"H-Type": np.nan, 'SomeValue': "Influenza A Virus"}]
df = pd.DataFrame(inp)
Out:
H-Type SomeValue
0 NaN Influenza A(H1N1)pdm09 Virus
1 NaN Influencer A(H3N2) Virus
2 1.0 Influenza A Virus
3 NaN Influenza A Virus
What I need is the H-Type values like so:
H-Type SomeValue
0 1 Influenza A(H1N1)pdm09 Virus
1 3 Influencer A(H3N2) Virus
2 1 Influenza A Virus
3 NaN Influenza A Virus
I can use slice to get the correct value for the first row, like so:
df["H-Type"].where(~df["H-Type"].isna(), df["SomeValue"].str.slice(start=13, stop=14))
I can define the correct start (and stop) positions like so:
df["SomeValue"].str.find("H") + 1
But when I try to define the start and stop values using .str.find, it just returns NaN:
df["H-Type"].where(~df["H-Type"].isna(), df["SomeValue"].str.slice(
start=(df["SomeValue"].str.find("H") + 1), stop=(df["SomeValue"].str.find("H") + 2)))
Is there an alternative to .str.find to define the start and stop positions? Or am I applying this wrongly?
CodePudding user response:
Use Series.str.extract to build a new Series, and use it to replace the missing values:
df["H-Type"] = df["H-Type"].fillna(df["SomeValue"].str.extract(r'H(\d)', expand=False))
print (df)
H-Type SomeValue
0 1 Influenza A(H1N1)pdm09 Virus
1 3 Influencer A(H3N2) Virus
2 1.0 Influenza A Virus
3 NaN Influenza A Virus
If you need integers alongside missing values, cast the extracted digits to floats and then to Int64:
df["H-Type"] = (df["H-Type"].fillna(df["SomeValue"].str.extract(r'H(\d)', expand=False)
.astype(float)).astype('Int64')
)
print (df)
H-Type SomeValue
0 1 Influenza A(H1N1)pdm09 Virus
1 3 Influencer A(H3N2) Virus
2 1 Influenza A Virus
3 <NA> Influenza A Virus
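As a self-contained variation on the cast above, here is a minimal sketch that swaps .astype(float) for pd.to_numeric; that substitution is my assumption, not part of the original answer. It rebuilds the question's frame so it can be run on its own.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "H-Type": [np.nan, np.nan, 1, np.nan],
    "SomeValue": [
        "Influenza A(H1N1)pdm09 Virus",
        "Influencer A(H3N2) Virus",
        "Influenza A Virus",
        "Influenza A Virus",
    ],
})

# Extract the digit after "H" as text, then convert it to a numeric Series.
extracted = pd.to_numeric(df["SomeValue"].str.extract(r"H(\d)", expand=False))

# Keep existing H-Type values, fill the gaps, and cast to the nullable
# integer dtype so the remaining missing value becomes <NA>.
result = df["H-Type"].fillna(extracted).astype("Int64")
print(result)
```

The nullable Int64 dtype is what lets the column hold whole numbers and a missing value at the same time; a plain int64 column cannot store NaN.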
Your solution with where, using extract instead of slice:
df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(),
df["SomeValue"].str.extract(r'H(\d)', expand=False))
But if you need to extract values by position, use this solution (it does not check for digits, it just takes the character after H):
s = df["SomeValue"].apply(lambda x: x[x.find("H") + 1: x.find("H") + 2]
if x.find('H') != -1 else np.nan)
df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(), s.astype(float))
print (df)
H-Type SomeValue
0 1.0 Influenza A(H1N1)pdm09 Virus
1 3.0 Influencer A(H3N2) Virus
2 1.0 Influenza A Virus
3 NaN Influenza A Virus
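On why the attempt in the question returned NaN: as far as I can tell, str.slice is documented to take one integer start and stop for the whole column, not a per-row Series of positions. A rough sketch of slicing each row at its own position, pairing every string with its find result (the names pos and after_h are just illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([
    "Influenza A(H1N1)pdm09 Virus",
    "Influencer A(H3N2) Virus",
    "Influenza A Virus",
])

# Position of the first "H" in each string; -1 where there is none.
pos = s.str.find("H")

# Slice each string at its own position; rows without "H" become NaN.
chars = [
    text[p + 1:p + 2] if p != -1 else np.nan
    for text, p in zip(s, pos)
]
after_h = pd.Series(chars, index=s.index)
print(after_h)
```

Like the apply version, this takes whatever character follows the first "H" without checking that it is a digit.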
CodePudding user response:
Rather use extract:
# priority on SomeValue
df['H-Type'] = (df['SomeValue'].str.extract(r'H(\d)N\d', expand=False)
.fillna(df['H-Type'], downcast='infer')
)
# priority on original H-Type
df['H-Type'] = df['H-Type'].fillna(
df['SomeValue'].str.extract(r'H(\d)N\d', expand=False),
downcast='infer')
With boolean indexing (more efficient here than where):
# select the rows where H-Type is still missing
m = df['H-Type'].isna()
df.loc[m, 'H-Type'] = df.loc[m, 'SomeValue'].str.extract(r'H(\d)N\d', expand=False)
Output:
H-Type SomeValue
0 1 Influenza A(H1N1)pdm09 Virus
1 3 Influencer A(H3N2) Virus
2 1.0 Influenza A Virus
3 NaN Influenza A Virus
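A self-contained sketch of the boolean-indexing variant follows. It converts the extracted text with pd.to_numeric before assigning, which is my addition: the H-Type column is float, and I would expect recent pandas versions to complain about writing strings into it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "H-Type": [np.nan, np.nan, 1, np.nan],
    "SomeValue": [
        "Influenza A(H1N1)pdm09 Virus",
        "Influencer A(H3N2) Virus",
        "Influenza A Virus",
        "Influenza A Virus",
    ],
})

# Mask of rows whose H-Type is still missing.
m = df["H-Type"].isna()

# Run the regex only on those rows and write numeric values back.
df.loc[m, "H-Type"] = pd.to_numeric(
    df.loc[m, "SomeValue"].str.extract(r"H(\d)N\d", expand=False)
)
print(df)
```

Here the column stays float, so the values print as 1.0 and 3.0; cast to Int64 afterwards if whole numbers are needed.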
regex:
H # match literal H
(\d) # capture a digit
N\d # match N followed by digit
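To try the pattern outside pandas, here is a small sketch with the standard re module. The h_type helper and the "H10N8" test string are illustrative, not from the question, and the pattern uses \d+ instead of \d on the assumption that two-digit subtypes may appear in the data.

```python
import re

# H, one or more captured digits, then N followed by digits.
pattern = re.compile(r"H(\d+)N\d+")

def h_type(text):
    """Return the number after H as an int, or None if there is no HxNy tag."""
    match = pattern.search(text)
    return int(match.group(1)) if match else None

print(h_type("Influenza A(H1N1)pdm09 Virus"))
print(h_type("Influencer A(H3N2) Virus"))
print(h_type("Influenza A Virus"))
print(h_type("Influenza A(H10N8) Virus"))
```

Requiring the trailing N and digits makes the match a little stricter than a bare H(\d), since an unrelated "H" followed by a digit elsewhere in the text is not picked up.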
CodePudding user response:
df['H-Type'] = df['H-Type'].fillna(df['SomeValue'].str.extract(r'H(\d)')[0])
df
H-Type SomeValue
0 1 Influenza A(H1N1)pdm09 Virus
1 3 Influencer A(H3N2) Virus
2 1.0 Influenza A Virus
3 NaN Influenza A Virus