Home > Software design >  Copying substring that starts with "C" and ends with all the values after "D" in
Copying substring that starts with "C" and ends with all the values after "D" in

Time:10-31

df_input

The above photo is data stored in df_input.

I would like extract the "C#D#" part from the 'Visit' column and place it into the column of a new dataframe I created (df_output['VISIT']).

Additionally, there could be up to two numeric values that follow after the "D".

I'm not sure if I am supposed to use '.str.extract' and how I would capture all the numeric values that follow right after the "D"

The output I would like to get is:

C1D1

C1D1

" "

C1D1

Please note df_input[Visit] does not only have "C1D1". It has variations of the C#D# structure so it could be "C1D12" or "C2D9".

CodePudding user response:

You can use a simple regex to recognize your pattern and then you can apply a function to dataframe to apply the recognizer to the whole column:

import pandas as pd
import re

def extract(year):
    matches = re.findall('C\dD\d{1,2}', year)
    if matches:
        return matches[0]  # Assuming you only want to retrieve the first occurrence


df_input = pd.DataFrame(data=['C1D1-Pre', 'C1D12-2hr Post', 'test'], columns=['VISIT'])

df_output = pd.DataFrame()
df_output['VISIT'] = df_input['VISIT'].apply(lambda x: extract(x))
print(df_output)

The output will be:

   VISIT
0   C1D1
1  C1D12
2   None

If you want empty string instead of None, you have to edit the extract function:

def extract(year):
    matches = re.findall('C\dD\d{1,2}', year)
    if matches:
        return matches[0]
    return ""
  • Related