How can I convert a struct timestamp column with start and end into a normal Python timestamp column?

Time:02-20

I have a time-series pivot table with a struct timestamp column that holds the start and end of each record's time frame, as follows:

import pandas as pd
pd.set_option('max_colwidth', 400)
df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
                   "X1": [25],
                   "X2": [33],
                   })
df 
#                                                                       timestamp   X1  X2
#0  {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}   25  33

Since I will later use the timestamps as the index for time-series analysis, I need to convert this column into plain timestamp columns holding just the start and end values. I tried, unsuccessfully, to solve this with a regex based on this post:

df[["start_timestamp", "end_timestamp"]] = (
    df["timestamp"].str.extractall(r"(\d+\.\d+\.\d+)").unstack().ffill(axis=1)
)

but I get:

ValueError: Columns must be same length as key

The dataframe I expect to end up with is:

df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'],
                   'start_timestamp': ['2022-01-19T00:00:00.000 0000'],
                   'end_timestamp': ['2022-01-20T00:00:00.000 0000'],
                   "X1": [25],
                   "X2": [33]})
df 
#                                                                       timestamp   start_timestamp                 end_timestamp                   X1  X2
#0  {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}   2022-01-19T00:00:00.000 0000    2022-01-20T00:00:00.000 0000    25  33

CodePudding user response:

You can extract both values with an extract call:

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].str.extract(r'"start":"([^"]*)","end":"([^"]+)')

The "start":"([^"]*)","end":"([^"]+) regex matches "start":", then captures zero or more characters other than " into Group 1 (the start column value), then matches ","end":", and finally captures one or more characters other than " into Group 2 (the end column value).
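As a variation on the same idea, here is a minimal self-contained sketch that uses named capture groups, so that str.extract labels the resulting columns directly. The group names and the use of join are my own choices for illustration, not part of the original question; the sample row is copied from the question as displayed.

```python
import pandas as pd

# Sample row copied from the question as displayed.
raw = '{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'
df = pd.DataFrame({"timestamp": [raw], "X1": [25], "X2": [33]})

# Named groups become the column names of the frame returned by str.extract.
pattern = r'"start":"(?P<start_timestamp>[^"]*)","end":"(?P<end_timestamp>[^"]+)"'
extracted = df["timestamp"].str.extract(pattern)

# Attach the two new columns; rows are aligned on the index.
df = df.join(extracted)
print(df[["start_timestamp", "end_timestamp"]])
```

Rows where the pattern does not match get NaN in both new columns, which makes malformed entries easy to spot.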

Also, if the data you have is valid JSON, you can parse the JSON instead of using a regex:

import json

def extract_startend(x):
    # Parse the JSON string and return its two values as a Series
    j = json.loads(x)
    return pd.Series([j["start"], j["end"]])

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].apply(extract_startend)

Output of print(df.to_string()):

                                                                   timestamp  X1  X2               start_timestamp                 end_timestamp
0  {"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:.........  25  33  2022-01-19T00:00:00.000 0000  2022-01-20T00:00:00.000 0000
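Since the question mentions using the timestamps as an index later on, the following is a rough sketch of one way to go from the JSON strings to real datetime values. It rests on an assumption about the data: the first 23 characters of each value hold the date and time with milliseconds, and the trailing four digits are a zero UTC offset that can be dropped. If your offsets are not always zero, this slicing is not appropriate. The name ts_df and the choice of end_timestamp as the index are illustrative only.

```python
import json
import pandas as pd

# Sample row copied from the question as displayed.
raw = '{"start":"2022-01-19T00:00:00.000 0000","end":"2022-01-20T00:00:00.000 0000"}'
df = pd.DataFrame({"timestamp": [raw], "X1": [25], "X2": [33]})

# Parse every JSON string into a dict, then build a frame that shares df's index.
parsed = pd.DataFrame(df["timestamp"].map(json.loads).tolist(), index=df.index)

# Assumption: keep only "YYYY-MM-DDTHH:MM:SS.mmm" (23 characters), drop the offset.
fmt = "%Y-%m-%dT%H:%M:%S.%f"
for col in ("start", "end"):
    df[f"{col}_timestamp"] = pd.to_datetime(parsed[col].str[:23], format=fmt)

# Hypothetical final step: index the frame by the end of each time frame.
ts_df = df.drop(columns="timestamp").set_index("end_timestamp")
print(ts_df)
```

Building the intermediate frame from a list of dicts avoids calling pd.Series once per row, which tends to be slow on large tables.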

CodePudding user response:

This may not be the most efficient approach, but it works:

# Raw string avoids invalid-escape warnings; the pattern matches the timestamp as displayed above
ts_pattern = r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3} \d{4})'
df[['start_timestamp', 'end_timestamp']] = df['timestamp'].str.split(',', expand=True)
df['start_timestamp'] = df['start_timestamp'].str.extract(ts_pattern)
df['end_timestamp'] = df['end_timestamp'].str.extract(ts_pattern)