Pandas splitting data into multiple columns by year-CodePudding

I am working with (15 minute) data looking like the following:

                  records
year1-01-01 00:00        1
year1-01-01 00:15        2
...
year2-01-01 00:00        3
year2-01-01 00:15        4
...
year3-01-01 00:00        5
year3-01-01 00:15        6
...
...

And, if possible, I would like to split this data into multiple columns by year for analysis:

              year1  year2  year3  ...
01-01 00:00       1      3      5
01-01 00:15       2      4      6
...                                ...

The data is not guaranteed to contain all 15 minute timestamps for every year, so if one of the years is missing a timestamp, I would like that timestamp to be omitted from the result for all years.

I have tried various combinations of df.groupby(df.index.year) and pd.merge or pd.concat to perform an inner join on the groups, but I have not gotten it to work properly.

What would be a clean way to implement this?

Many thanks in advance.

============================

Code to generate example dataframe:

records = {"records": [1, 2, 3, 4, 5, 6]}
dates = [
    "2020-01-01 00:00:00",
    "2020-01-01 00:15:00",
    # ...
    "2021-01-01 00:00:00",
    "2021-01-01 00:15:00",
    # ...
    "2022-01-01 00:00:00",
    "2022-01-01 00:15:00",
    # ...
]
df = pd.DataFrame(data=records, index=pd.DatetimeIndex(dates))

CodePudding user response：

First create MultiIndex by DatetimeIndex.strftime and DatetimeIndex.year and reshape by Series.unstack:

df.index = [df.index.strftime('%m-%d %H:%M:%S'), df.index.year]

df = df['records'].unstack()
print (df)
                2020  2021  2022
01-01 00:00:00     1     3     5
01-01 00:15:00     2     4     6

CodePudding user response：

My answer is for sure neither pretty nor elegant but it should get the job done

records = [1, 2, 3, 4, 5, 6, 7, 8]

dates = [
    "2020-01-01 00:00:00",
    "2020-01-01 00:15:00",

    "2021-01-01 00:00:00",
    "2021-01-01 00:15:00",
    "2021-01-01 00:45:00",

    "2022-01-01 00:00:00",
    "2022-01-01 00:15:00",
    "2022-01-01 00:30:00",
]

data_list = [[],[],[]]

for subListIndex, each_year in enumerate(range(2020,2023)):

    for each_hour in range(0,24):
        each_hour = str(each_hour).zfill(2)

        for each_quarter in range(0,60,15):
            each_quarter = str(each_quarter).zfill(2)

            date = str(each_year) "-01-01 " each_hour ":" each_quarter ":00"

            for index, each_date in enumerate(dates):
                if each_date == date:

                    data_list[subListIndex].append(records[index])
                    break

            else:
                data_list[subListIndex].append(0)


print(data_list)

Use can directly put this List into a pandas dataframe

import pandas as pd
pd.DataFrame(data_list)