In a pandas series of text values, how can I slice out a single character following a specific marke

Time:12-16

I have a pandas dataframe with a series of strings. For each string, I want to extract the number after the letter "H", and put this value, as an integer, into a new column.

import pandas as pd
import numpy as np
inp = [{"H-Type": np.nan, 'SomeValue': "Influenza A(H1N1)pdm09 Virus"},
       {"H-Type": np.nan, 'SomeValue': "Influencer A(H3N2) Virus"},
       {"H-Type": 1, 'SomeValue': "Influenza A Virus"}, 
       {"H-Type": np.nan, 'SomeValue': "Influenza A Virus"}]
df = pd.DataFrame(inp)
Out:
   H-Type                     SomeValue
0     NaN  Influenza A(H1N1)pdm09 Virus
1     NaN      Influencer A(H3N2) Virus
2     1.0             Influenza A Virus
3     NaN             Influenza A Virus

What I need is the H-Type values like so:

   H-Type                     SomeValue
0       1  Influenza A(H1N1)pdm09 Virus
1       3      Influencer A(H3N2) Virus
2       1             Influenza A Virus
3     NaN             Influenza A Virus

I can use slice to get the correct value for the first row, like so:

df["H-Type"].where(~df["H-Type"].isna(), df["SomeValue"].str.slice(start=13, stop=14))

I can define the correct start (and stop) positions like so:

df["SomeValue"].str.find("H") + 1

But when I try to define the start and stop values using .str.find, it just returns NaN:

df["H-Type"].where(~df["H-Type"].isna(), df["SomeValue"].str.slice(
    start=(df["SomeValue"].str.find("H") + 1), stop=(df["SomeValue"].str.find("H") + 2)))

Is there an alternative to .str.find for defining the start and stop positions, or am I applying it incorrectly?

CodePudding user response:

Use Series.str.extract and fill the missing values from the resulting Series:

df["H-Type"] = df["H-Type"].fillna(df["SomeValue"].str.extract(r'H(\d)', expand=False))

print (df)
  H-Type                     SomeValue
0      1  Influenza A(H1N1)pdm09 Virus
1      3      Influencer A(H3N2) Virus
2    1.0             Influenza A Virus
3    NaN             Influenza A Virus

If you need integers alongside missing values, cast the extracted digits to float and then convert to the nullable Int64 dtype:

df["H-Type"] = (df["H-Type"].fillna(df["SomeValue"].str.extract(r'H(\d)', expand=False)
                                    .astype(float)).astype('Int64')
                )

print (df)
   H-Type                     SomeValue
0       1  Influenza A(H1N1)pdm09 Virus
1       3      Influencer A(H3N2) Virus
2       1             Influenza A Virus
3    <NA>             Influenza A Virus
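
The same cast chain can be tried on two small standalone Series. This is only a sketch: the names existing and extracted are made up for illustration and stand in for the H-Type column and the str.extract result.

```python
import pandas as pd

# Hypothetical stand-ins: 'existing' plays the role of df["H-Type"],
# 'extracted' plays the role of the strings returned by str.extract.
existing = pd.Series([float("nan"), float("nan"), 1.0, float("nan")])
extracted = pd.Series(["1", "3", None, None])

# Strings -> float (None becomes NaN), fill the gaps, then convert the
# whole-number floats to the nullable integer dtype.
result = existing.fillna(extracted.astype(float)).astype("Int64")
print(result)
```

Int64 (capital I) is the pandas nullable integer dtype, so the last row stays missing instead of forcing the column back to float.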

Your solution with where, using str.extract for the replacement values:

df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(),
                                  df["SomeValue"].str.extract(r'H(\d)', expand=False))

If you need to extract values by position instead, use this solution (it does not check for digits; it takes whatever character follows H):

s = df["SomeValue"].apply(lambda x: x[x.find("H") + 1: x.find("H") + 2]
                                    if x.find('H') != -1 else np.nan)
df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(), s.astype(float))
print (df)
   H-Type                     SomeValue
0     1.0  Influenza A(H1N1)pdm09 Virus
1     3.0      Influencer A(H3N2) Virus
2     1.0             Influenza A Virus
3     NaN             Influenza A Virus
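
The per-row logic in that lambda may be easier to follow as a named function. The sketch below is plain Python; the helper name char_after is hypothetical and not part of pandas. It would be applied row by row (for example through Series.apply) because the .str.slice parameters are documented as integers, not as a Series of per-row positions.

```python
def char_after(text, marker="H"):
    """Return the single character right after the first `marker`, or None."""
    pos = text.find(marker)
    # str.find returns -1 when the marker is absent; also guard against
    # the marker being the very last character of the string.
    if pos == -1 or pos + 1 >= len(text):
        return None
    return text[pos + 1]


samples = [
    "Influenza A(H1N1)pdm09 Virus",
    "Influencer A(H3N2) Virus",
    "Influenza A Virus",
]
results = [char_after(s) for s in samples]
print(results)
```

Like the lambda, this takes whatever follows the marker, so char_after("Hello") gives "e" rather than a digit.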

CodePudding user response:

Use str.extract instead:

# priority on Somevalue
df['H-Type'] = (df['SomeValue'].str.extract(r'H(\d)N\d', expand=False)
                .fillna(df['H-Type'], downcast='infer')
               )

# priority on original H-type
df['H-Type'] = df['H-Type'].fillna(
                df['SomeValue'].str.extract(r'H(\d)N\d', expand=False),
                downcast='infer')

With boolean indexing (more efficient here than where):

m = df['H-Type'].isna()  # rows where H-Type is still missing
df.loc[m, 'H-Type'] = df.loc[m, 'SomeValue'].str.extract(r'H(\d)N\d', expand=False)

Output:

  H-Type                     SomeValue
0      1  Influenza A(H1N1)pdm09 Virus
1      3      Influencer A(H3N2) Virus
2    1.0             Influenza A Virus
3    NaN             Influenza A Virus

regex:

H       # match literal H
(\d)    # capture a digit
N\d     # match N followed by digit
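
The pattern can be checked outside pandas with the standard re module. This is a sketch on the question's sample strings; the helper name h_digit is hypothetical.

```python
import re

# Same pattern as above: literal H, one captured digit, then N and a digit.
pattern = re.compile(r"H(\d)N\d")


def h_digit(text):
    """Return the captured digit as a string, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


samples = [
    "Influenza A(H1N1)pdm09 Virus",
    "Influencer A(H3N2) Virus",
    "Influenza A Virus",
]
results = [h_digit(s) for s in samples]
print(results)
```

Requiring the trailing N\d makes the pattern stricter than H(\d) alone: a string that merely contains an H followed by a digit, without the N part, is not matched.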

CodePudding user response:

df['H-Type'] = df['H-Type'].fillna(df['SomeValue'].str.extract(r'H(\d)')[0])

df

  H-Type                     SomeValue
0      1  Influenza A(H1N1)pdm09 Virus
1      3      Influencer A(H3N2) Virus
2    1.0             Influenza A Virus
3    NaN             Influenza A Virus