How can I convert a struct timestamp column with start and end into a normal Pythonic timestamp column?

I have a time-series pivot table with a struct timestamp column that holds the start and end of each record's time frame, as follows:

import pandas as pd
pd.set_option('max_colwidth', 400)
df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
                   "X1": [25],
                   "X2": [33],
                   })
df 
#                                                                       timestamp   X1  X2
#0  {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}   25  33

Since I will later use the timestamps as the index for time-series analysis, I need to convert this column into plain timestamps holding just the start or the end. I tried a regex-based solution, adapted from this post, without success:

df[["start_timestamp", "end_timestamp"]] = (
    df["timestamp"].str.extractall(r"(\d+\.\d+\.\d+)").unstack().ffill(axis=1)
)

but I get:

ValueError: Columns must be same length as key

The dataframe I am trying to reach is:

df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
                   'start_timestamp': ['2022-01-19T00:00:00.000 0000'],
                   'end_timestamp': ['2022-01-20T00:00:00.000 0000'],
                   "X1": [25],
                   "X2": [33]})
df 
#                                                                       timestamp   start_timestamp                 end_timestamp                   X1  X2
#0  {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}   2022-01-19T00:00:00.000 0000    2022-01-20T00:00:00.000 0000    25  33

CodePudding user response：

You can extract both values with a single Series.str.extract call:

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].str.extract(r'"start":"([^"]*)","end":"([^"]+)')

The "start":"([^"]*)","end":"([^"]+) regex matches "start":", captures zero or more characters other than " into Group 1 (the start column value), then matches ","end":" and captures one or more characters other than " into Group 2 (the end column value).
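To see what the two capture groups pick up, here is a minimal sketch that applies the pattern described above to the question's sample string using only the standard re module. The names raw, pattern, start_value and end_value are illustrative and not part of the original answer.

```python
import re

# Sample value copied from the question's "timestamp" column.
raw = '{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'

# Group 1 captures the start value, Group 2 captures the end value.
pattern = re.compile(r'"start":"([^"]*)","end":"([^"]+)')

match = pattern.search(raw)
start_value, end_value = match.groups()

print(start_value)  # 2022-01-19T00:00:00.000 0000
print(end_value)    # 2022-01-20T00:00:00.000 0000
```

Series.str.extract applies the same kind of search to every row and puts each group into its own column.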

Also, if the data you have is valid JSON, you can parse the JSON instead of using a regex:

import json

def extract_startend(x):
    # Parse the JSON string and return its two fields as a Series
    j = json.loads(x)
    return pd.Series([j["start"], j["end"]])

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].apply(extract_startend)

Output of print(df.to_string()):

                                                                   timestamp  X1  X2               start_timestamp                 end_timestamp
0  {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:.........  25  33  2022-01-19T00:00:00.000 0000  2022-01-20T00:00:00.000 0000
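Since the question mentions using the timestamps as an index later, the extracted strings may still need converting to real datetimes. The following is only a sketch under one loud assumption: that the space before the trailing 0000 stands for a "+" sign of a UTC offset (as in +0000). If the data really contains a space with some other meaning, the to_utc helper (a hypothetical name, not from the answer) would need adjusting.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "timestamp": ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
    "X1": [25],
    "X2": [33],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
parsed = pd.DataFrame(df["timestamp"].map(json.loads).tolist(), index=df.index)

def to_utc(series):
    # ASSUMPTION: the space before the 4-digit offset replaces a "+" sign.
    fixed = series.str.replace(" ", "+", regex=False)
    return pd.to_datetime(fixed, format="%Y-%m-%dT%H:%M:%S.%f%z", utc=True)

df["start_timestamp"] = to_utc(parsed["start"])
df["end_timestamp"] = to_utc(parsed["end"])

# Use the end of each time frame as the index for time-series work.
indexed = df.set_index("end_timestamp")
print(indexed[["X1", "X2"]])
```

Indexing by end_timestamp is an arbitrary choice here; start_timestamp works the same way.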

CodePudding user response：

This may not be the most efficient approach, but it works:

df[['start_timestamp', 'end_timestamp']] = df['timestamp'].str.split(',', expand=True)
df['start_timestamp'] = df['start_timestamp'].str.extract(r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3} \d{4})')
df['end_timestamp'] = df['end_timestamp'].str.extract(r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3} \d{4})')
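If the split-then-extract steps feel heavy, one possible alternative is a single str.extract call with named groups, which lets pandas name the new columns directly. This is a sketch, not the answerer's method; the group names and the extracted and result variables are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
    "X1": [25],
    "X2": [33],
})

# Named groups become the column names of the extracted DataFrame.
extracted = df["timestamp"].str.extract(
    r'"start":"(?P<start_timestamp>[^"]*)","end":"(?P<end_timestamp>[^"]+)'
)

# Attach the new columns to the original frame by index.
result = df.join(extracted)
print(result.to_string())
```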