Formatting data sheet datetime values with pandas and numpy Python-CodePudding

The code below calculates the mean','median','max','min' of the vals values that are in correlation to the dates month_changes . The code calculates the mean','median','max','min' by separating the intervals in years. I want to implement a bit to the code where it adds zeroes to the mean','median','max','min' values for the years that start from the starting_year variable, also to the years that are in between one another like 2020 in this example it will also add zeroes. How would I be able to do that?

import numpy as np 
import pandas as pd 

month_changes = np.array(["2018-04-01 00:00:00", "2018-05-01 00:00:00", "2019-03-01 00:00:00", "2019-04-01 00:00:00","2019-08-01 00:00:00", "2019-11-01 00:00:00", "2019-12-01 00:00:00","2021-01-01 00:00:00"]) 
vals = np.array([10, 23, 45, 4,5,12,4,-6])
starting_year = 2016

def YearlyIntervals(vals):
    data = pd.DataFrame({"Date": month_changes, "Averages": vals})
    data["Date"] = pd.to_datetime(data["Date"])
    out=(data.groupby(data["Date"].dt.year)
         .agg(['mean','median','max','min'])
         .droplevel(0,1)
         .rename(columns=lambda x:'Average' if x=='mean' else x.title())
        )
    return out

PnL_YearlyFilter= YearlyIntervals(vals)

Output

          Average     Median    Max  Min 
Date                                                                         
2018      16.5        16.5      23   10
2019      14.0        5.0       45    4 
2021      -6.0       -6.0       -6   -6

Expected Output

          Average     Median    Max  Min 
Date
2016      0           0         0     0
2017      0           0         0     0
2018      16.5        16.5      23   10
2019      14.0        5.0       45    4
2020      0           0         0     0
2021      -6.0       -6.0       -6   -6

CodePudding user response：

Use reindex with a fill_value=0.

One way it can be added to the function is like:

def yearly_intervals(mc, vs, start_year=None, end_year=None):
    data = pd.DataFrame({
        "Date": pd.to_datetime(mc),  # Convert to_datetime immediately
        "Averages": vs
    })
    out = (
        data.groupby(data["Date"].dt.year)["Averages"]  # Access Series
            .agg(['mean', 'median', 'max', 'min'])
            .rename(columns=lambda x: 'Average' if x == 'mean' else x.title())
    )
    # If start_year
    if start_year is not None:
        # Reindex to ensure index contains all years in range
        out = out.reindex(range(
            start_year,
            # Use last year (maximum value) from index or user defined arg
            (end_year if end_year is not None else out.index.max())   1
        ), fill_value=0)
    return out

Some changes made:

PEP8 guidelines state that function names should "should be lowercase, with words separated by underscores as necessary". Function and Variable Names
The numpy array can be converted directly to_datetime when building the DataFrame without needing to build the DataFrame then convert the Series afterwards.
Slight modifications to the groupby aggregation using SeriesGroupBy.aggregate instead of DataFrame groupby to ensure more consistent behaviour across versions.
start_year and end_year kwargs with default parameters allow for more flexibility of the function so that any range of years can be used. (More argument processing could be added to ensure that end_year is always larger than start_year)
end_year is optional. If no end_year is provided it will use the Index.max (the largest value in the index)

Sample function call:

month_changes = np.array(
    ["2018-04-01 00:00:00", "2018-05-01 00:00:00", "2019-03-01 00:00:00",
     "2019-04-01 00:00:00", "2019-08-01 00:00:00", "2019-11-01 00:00:00",
     "2019-12-01 00:00:00", "2021-01-01 00:00:00"])
vals = np.array([10, 23, 45, 4, 5, 12, 4, -6])
starting_year = 2016

PnL_YearlyFilter = yearly_intervals(month_changes, vals, starting_year)

PnL_YearlyFilter:

      Average  Median  Max  Min
Date                           
2016      0.0     0.0    0    0
2017      0.0     0.0    0    0
2018     16.5    16.5   23   10
2019     14.0     5.0   45    4
2020      0.0     0.0    0    0
2021     -6.0    -6.0   -6   -6