I have a time-series pivot table with a struct timestamp column that holds the start and end of each record's time frame, as follows:
import pandas as pd
pd.set_option('max_colwidth', 400)
df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
                   "X1": [25],
                   "X2": [33],
                   })
df
# timestamp X1 X2
#0 {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"} 25 33
Since I will later use the timestamps as the index for time-series analysis, I need to convert this column into plain timestamps holding just the end / start value.
I tried to solve this with a regex, based on this post, but without success:
df[["start_timestamp", "end_timestamp"]] = (
    df["timestamp"].str.extractall(r"(\d+\.\d+\.\d+)").unstack().ffill(axis=1)
)
but I get:
ValueError: Columns must be same length as key
The dataframe I am trying to get is:
df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
                   'start_timestamp': ['2022-01-19T00:00:00.000 0000'],
                   'end_timestamp': ['2022-01-20T00:00:00.000 0000'],
                   "X1": [25],
                   "X2": [33]})
df
# timestamp start_timestamp end_timestamp X1 X2
#0 {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"} 2022-01-19T00:00:00.000 0000 2022-01-20T00:00:00.000 0000 25 33
CodePudding user response:
You can extract both values with a single extract call:
df[["start_timestamp", "end_timestamp"]] = df["timestamp"].str.extract(r'"start":"([^"]*)","end":"([^"]+)')
The "start":"([^"]*)","end":"([^"]+) regex matches "start":", then captures zero or more chars other than " into Group 1 (the start column value), then matches ","end":", and finally captures one or more chars other than " into Group 2 (the end column value).
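To see what the two capture groups pick up, the pattern described above can be tried on a single string with Python's built-in re module. This is only an illustrative sketch: the names sample and pattern are made up here, and the sample string is copied from the question.

```python
import re

# Sample value copied from the question's "timestamp" column
sample = '{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'

# Group 1: the chars between the quotes that follow "start":
# Group 2: the chars between the quotes that follow "end":
pattern = re.compile(r'"start":"([^"]*)","end":"([^"]+)')

match = pattern.search(sample)
start, end = match.groups()
print(start)  # 2022-01-19T00:00:00.000 0000
print(end)    # 2022-01-20T00:00:00.000 0000
```

Series.str.extract applies the same kind of match to every row and puts each capture group into its own column.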
Also, if the data you have is valid JSON, you can parse the JSON instead of using a regex:
import json

def extract_startend(x):
    # Parse the JSON string and return its two fields as a Series
    j = json.loads(x)
    return pd.Series([j["start"], j["end"]])

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].apply(extract_startend)
Output of print(df.to_string()):
timestamp X1 X2 start_timestamp end_timestamp
0 {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:......... 25 33 2022-01-19T00:00:00.000 0000 2022-01-20T00:00:00.000 0000
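If every row holds valid JSON, the JSON approach could also be written by parsing each string once and letting pandas build the two columns from the resulting dicts. A minimal sketch, assuming the dataframe from the question; parsed is a name introduced only for this illustration.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "timestamp": ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
    "X1": [25],
    "X2": [33],
})

# Parse every JSON string into a dict, then build a DataFrame
# whose columns are the JSON keys ("start" and "end")
parsed = pd.DataFrame(df["timestamp"].map(json.loads).tolist(), index=df.index)

# Rename the columns to start_timestamp / end_timestamp and attach them
df = df.join(parsed.add_suffix("_timestamp"))
print(df[["start_timestamp", "end_timestamp"]])
```

This avoids creating one pd.Series per row, which may matter on larger frames, but it will raise an error on any row that is not valid JSON.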
CodePudding user response:
This may not be the most efficient approach, but it works:
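# Split the JSON string at the comma into its "start" and "end" halves
df[['start_timestamp', 'end_timestamp']] = df['timestamp'].str.split(',', expand=True)
# Pull the timestamp out of each half (raw string avoids invalid-escape warnings)
ts_pattern = r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3} \d{4})'
df['start_timestamp'] = df['start_timestamp'].str.extract(ts_pattern, expand=False)
df['end_timestamp'] = df['end_timestamp'].str.extract(ts_pattern, expand=False)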