I have a time series problem and I want to aggregate some data based on the values that appear in a certain column. to illustrate, consider the following table
Date | colA | colB | colC |
---|---|---|---|
2019-01-01 | 1 | -10 | Null |
2019-01-02 | 2 | -5 | Null |
2019-01-03 | 3 | 0 | 101 |
2019-01-04 | 4 | 5 | 101 |
2019-01-05 | 5 | 10 | 101 |
2019-01-06 | 6 | 15 | Null |
2019-01-07 | 7 | 20 | 101 |
I want to accomplish the following:
- the moment the value of colC is not null, aggregate the values up to that row and get the delta for the date column
- if the element X of colC is not null but the element (X-1) is also not null, just disregard row X.
For the previous table, the result would be
agg(colC) | avg(colA) | avg(colB) | delta(Date) [in days] |
---|---|---|---|
101 | 2 | -5 | 2 |
101 | 6.5 | 17.5 | 1 |
I could not find any way to accomplish that so far
CodePudding user response:
Try with groupby
:
#convert Date column to datetime if needed
df["Date"] = pd.to_datetime(df["Date"])
#keep only rows where there aren't consecutive non-null values
df2 = df[~(df["colC"].notnull()&df["colC"].shift().notnull())]
#groupby consecutive null values and aggregate
output = df2.groupby(df2["colC"].notnull().shift().cumsum().fillna(0)) \
.agg({"colA": "mean", \
"colB": "mean", \
"colC": "first", \
"Date": lambda x: (x.max()-x.min()).days}) \
.rename_axis(None) \
.rename(columns={"Date": "Delta"})
>>> output
colA colB colC Delta
0.0 2.0 -5.0 101.0 2
1.0 6.5 17.5 101.0 1
CodePudding user response:
You can set groups of colC
with cumsum()
, and then group by the groups by .groupby()
, as follows:
(Assuming the Null values are NaN
or None):
# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# disregard rows if the element X of colC is not null but the element (X-1) is also not null
df2 = df.loc[df['colC'].isna() | df['colC'].shift().isna()]
# set grouping for `colC`
group = (df2['colC'].shift(1).notna() & df2['colC'].isna()).cumsum()
df_out = (df.groupby(group, as_index=False)
.agg(**{'agg(colC)':('colC', 'last'),
'avg(colA)':('colA', 'mean'),
'avg(colB)':('colB', 'mean'),
'delta(Date)':('Date', lambda x: (x.iloc[-1] - x.iloc[0]).days)})
)
Data Input:
data = {'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-07'],
'colA': [1, 2, 3, 4, 5, 6, 7],
'colB': [-10, -5, 0, 5, 10, 15, 20],
'colC': [np.nan, np.nan, 101.0, 101.0, 101.0, np.nan, 101.0]}
df = pd.DataFrame(data)
Date colA colB colC
0 2019-01-01 1 -10 NaN
1 2019-01-02 2 -5 NaN
2 2019-01-03 3 0 101.0
3 2019-01-04 4 5 101.0
4 2019-01-05 5 10 101.0
5 2019-01-06 6 15 NaN
6 2019-01-07 7 20 101.0
Result:
print(df_out)
agg(colC) avg(colA) avg(colB) delta(Date)
0 101.0 2.0 -5.0 2
1 101.0 6.5 17.5 1