The above photo is data stored in df_input.
I would like extract the "C#D#" part from the 'Visit' column and place it into the column of a new dataframe I created (df_output['VISIT']).
Additionally, there could be up to two numeric values that follow after the "D".
I'm not sure if I am supposed to use '.str.extract' and how I would capture all the numeric values that follow right after the "D"
The output I would like to get is:
C1D1
C1D1
" "
C1D1
Please note df_input[Visit] does not only have "C1D1". It has variations of the C#D# structure so it could be "C1D12" or "C2D9".
CodePudding user response:
You can use a simple regex to recognize your pattern and then you can apply a function to dataframe to apply the recognizer to the whole column:
import pandas as pd
import re
def extract(year):
matches = re.findall('C\dD\d{1,2}', year)
if matches:
return matches[0] # Assuming you only want to retrieve the first occurrence
df_input = pd.DataFrame(data=['C1D1-Pre', 'C1D12-2hr Post', 'test'], columns=['VISIT'])
df_output = pd.DataFrame()
df_output['VISIT'] = df_input['VISIT'].apply(lambda x: extract(x))
print(df_output)
The output will be:
VISIT
0 C1D1
1 C1D12
2 None
If you want empty string instead of None, you have to edit the extract function:
def extract(year):
matches = re.findall('C\dD\d{1,2}', year)
if matches:
return matches[0]
return ""