I have a dataframe that looks like this:
ID RESULT
1 Pivot (Triage) Form Entered On: 12/30/2022 23:20 EST Performed On: 12/30/2022 23:16 EST
I would like to extract both datetime variables so the new dataframe looks like this:
ID END_TIME START_TIME
1 12/30/2022 23:20 12/30/2022 23:16
I have tried several methods, but the 'END_TIME' and 'START_TIME' columns always come out as 'NA'.
TEST['END_TIME']=TEST['RESULT'].str.extract("Entered On: (\d ) EST")
TEST['START_TIME']=TEST['RESULT'].str.extract("Performed On: (\d ) EST")
CodePudding user response:
Assuming there are always exactly two timestamps, and using a more generic regex pattern, we can try:
test[["END_TIME", "START_TIME"]] = test["RESULT"].str.extract(r'Entered On:\s*(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2} [A-Z]{3})\s+Performed On:\s*(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2} [A-Z]{3})')
Here is a regex demo showing that the regex pattern and capture groups are working correctly.
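For a self-contained version of the idea above, here is a minimal sketch. The sample frame is rebuilt from the question. The named groups, and the choice to leave the time zone outside the capture so the result matches the desired output, are my own assumptions rather than part of the answer above.

```python
import pandas as pd

# Sample data rebuilt from the question.
test = pd.DataFrame({
    "ID": [1],
    "RESULT": [
        "Pivot (Triage) Form Entered On: 12/30/2022 23:20 EST "
        "Performed On: 12/30/2022 23:16 EST"
    ],
})

# One pattern with two named groups; str.extract names the
# output columns after the groups. The three-letter time zone
# is matched but deliberately kept outside the capture (assumption).
pattern = (
    r"Entered On:\s*(?P<END_TIME>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}) [A-Z]{3}\s+"
    r"Performed On:\s*(?P<START_TIME>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}) [A-Z]{3}"
)

extracted = test["RESULT"].str.extract(pattern)

# Keep only ID plus the two extracted columns.
result = pd.concat([test[["ID"]], extracted], axis=1)
print(result)
```

Rows that do not match the whole pattern get NaN in both columns, so this only suits data where both labels are always present and in this order.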
CodePudding user response:
We first build the following test dataframe before applying the regex (I assume the end time always appears before the start time in the string):
import pandas as pd
import re
### We build dataframe test first ###
s = "Pivot (Triage) Form Entered On: 12/30/2022 23:20 EST Performed On: 12/30/2022 23:16 EST"
df = pd.DataFrame([('1', s)], columns=['ID', 'RESULT'])
### ----------------------------- ###
ID RESULT
0 1 Pivot (Triage) Form Entered On: 12/30/2022 23...
You can either use the regex below in your own code or use one of the complete scripts that follow, whichever suits you best:
regex = r'\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}'
Adapted to your script:
import pandas as pd
import re
### We build dataframe test first ###
s = "Pivot (Triage) Form Entered On: 12/30/2022 23:20 EST Performed On: 12/30/2022 23:16 EST"
df = pd.DataFrame([('1', s)], columns=['ID', 'RESULT'])
### ----------------------------- ###
# We define regex
regex = r'Form Entered On: (\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}:\d{1,2}) EST'
df['END_TIME'] = df['RESULT'].str.extract(regex)
regex = r'Performed On: (\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}:\d{1,2}) EST'
df['START_TIME'] = df['RESULT'].str.extract(regex)
Another way :
import pandas as pd
import re
### We build dataframe test first ###
s = "Pivot (Triage) Form Entered On: 12/30/2022 23:20 EST Performed On: 12/30/2022 23:16 EST"
df = pd.DataFrame([('1', s)], columns=['ID', 'RESULT'])
### ----------------------------- ###
# We define regex
regex = r'\d{1,2}\/\d{1,2}\/\d{4} \d{1,2}:\d{1,2}'
# findall returns the list of matches for each row (assumes exactly two per row)
df[['END_TIME', 'START_TIME']] = df['RESULT'].str.findall(regex).tolist()
df['END_TIME'] :
0 12/30/2022 23:20
Name: END_TIME, dtype: object
df['START_TIME'] :
0 12/30/2022 23:16
Name: START_TIME, dtype: object
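If the frame has more than one row, or some rows contain no timestamp at all, a variant based on str.extractall may be more forgiving. This is only a sketch under assumptions: the second sample row is hypothetical and added purely for illustration, and the conversion with pd.to_datetime (which drops the time zone) is an optional extra that the question did not ask for.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2"],
    "RESULT": [
        "Pivot (Triage) Form Entered On: 12/30/2022 23:20 EST "
        "Performed On: 12/30/2022 23:16 EST",
        "no timestamps here",  # hypothetical row, for illustration only
    ],
})

regex = r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}"

# extractall yields one row per match, indexed by (original row, match number).
# Unstacking the "match" level gives one column per match position.
matches = df["RESULT"].str.extractall(f"({regex})")[0].unstack("match")

# Rows without any match are missing from `matches`; bring them back as NaN.
matches = matches.reindex(df.index)

# First match is the end time, second is the start time (same assumption as above).
fmt = "%m/%d/%Y %H:%M"
df["END_TIME"] = pd.to_datetime(matches[0], format=fmt)
df["START_TIME"] = pd.to_datetime(matches[1], format=fmt)
print(df[["ID", "END_TIME", "START_TIME"]])
```

Rows without timestamps end up as NaT instead of raising an error. Skip the pd.to_datetime calls if you want to keep the values as strings.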