Python Pandas aggregate past rows when a condition is met for time series-CodePudding

I have a time series problem and I want to aggregate some data based on the values that appear in a certain column. to illustrate, consider the following table

Date	colA	colB	colC
2019-01-01	1	-10	Null
2019-01-02	2	-5	Null
2019-01-03	3	0	101
2019-01-04	4	5	101
2019-01-05	5	10	101
2019-01-06	6	15	Null
2019-01-07	7	20	101

I want to accomplish the following:

the moment the value of colC is not null, aggregate the values up to that row and get the delta for the date column
if the element X of colC is not null but the element (X-1) is also not null, just disregard row X.

For the previous table, the result would be

agg(colC)	avg(colA)	avg(colB)	delta(Date) [in days]
101	2	-5	2
101	6.5	17.5	1

I could not find any way to accomplish that so far

CodePudding user response：

Try with groupby:

#convert Date column to datetime if needed
df["Date"] = pd.to_datetime(df["Date"])

#keep only rows where there aren't consecutive non-null values
df2 = df[~(df["colC"].notnull()&df["colC"].shift().notnull())]

#groupby consecutive null values and aggregate
output = df2.groupby(df2["colC"].notnull().shift().cumsum().fillna(0)) \
            .agg({"colA": "mean", \
                  "colB": "mean", \
                  "colC": "first", \
                  "Date": lambda x: (x.max()-x.min()).days}) \
            .rename_axis(None) \
            .rename(columns={"Date": "Delta"})

>>> output
     colA  colB   colC  Delta
0.0   2.0  -5.0  101.0      2
1.0   6.5  17.5  101.0      1

CodePudding user response：

You can set groups of colC with cumsum(), and then group by the groups by .groupby(), as follows:

(Assuming the Null values are NaN or None):

# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# disregard rows if the element X of colC is not null but the element (X-1) is also not null
df2 = df.loc[df['colC'].isna() | df['colC'].shift().isna()]

# set grouping for `colC`
group = (df2['colC'].shift(1).notna() & df2['colC'].isna()).cumsum()

df_out = (df.groupby(group, as_index=False)
            .agg(**{'agg(colC)':('colC', 'last'), 
                    'avg(colA)':('colA', 'mean'),
                    'avg(colB)':('colB', 'mean'), 
                    'delta(Date)':('Date', lambda x: (x.iloc[-1] - x.iloc[0]).days)})
         )

Data Input:

data = {'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-07'],
 'colA': [1, 2, 3, 4, 5, 6, 7],
 'colB': [-10, -5, 0, 5, 10, 15, 20],
 'colC': [np.nan, np.nan, 101.0, 101.0, 101.0, np.nan, 101.0]}
df = pd.DataFrame(data)

         Date  colA  colB   colC
0  2019-01-01     1   -10    NaN
1  2019-01-02     2    -5    NaN
2  2019-01-03     3     0  101.0
3  2019-01-04     4     5  101.0
4  2019-01-05     5    10  101.0
5  2019-01-06     6    15    NaN
6  2019-01-07     7    20  101.0

Result:

print(df_out)

   agg(colC)  avg(colA)  avg(colB)  delta(Date)
0      101.0        2.0       -5.0            2
1      101.0        6.5       17.5            1