Home > Blockchain >  Group by sum date and fill all missing values with excedents from past dates untill count = 1
Group by sum date and fill all missing values with excedents from past dates untill count = 1

Time:06-16

I have a dataframe with a grouped date and a count, if there is any gap in this time series I have to fill them with excess of previous stacks, and if no gaps extend the series until all counts = 1. These examples happen all same month

NOTE: day_date is a timestamp with daily frequency where missing values are 0, did integer for simplicity in example

An example with missing gaps but no previous stacks:

| day_date | stack |
| -------- | ----- |
|     1    |   0   |
|     2    |   2   |

Produces

| day_date | stack |
| -------- | ----- |
|     1    |   0   |
|     2    |   1   | #
|     3    |   1   | # The entire period flattents to a day frequency with value = 1

An example of days being over stacked and filling gaps:

| day_date | stack |
| -------- | ----- |
|     1    |   0   |
|     2    |   2   | #this row wont be able to fill until the 6th
|     6    |   3   | #this row and below will craete overlap
|     8    |   2   |
|     15   |   1   | # there is a big gap here that will get filled as much as possible from previous overlap

Produces:

| day_date | stack |
| -------- | ----- |
|     1    |   0   |
|     2    |   1   |
|     3    |   1   |
|     4    |   0   | # the previous staack coverd only until the 3rd.
|     5    |   0   |
|     6    |   1   |
|     7    |   1   | 
|     8    |   1   | #Here is an overal of last stack from 6 and 2 days from 8, this results on the two days from 8 moving forward to fill gaps as the day is covered from past stack.
|     9    |   1   | # there is a big gap here that will get filled as much as possible from previous overlap from the 8th, which is 2 days that fill 9th and 10th.
|     10   |   1   |
|     11   |   0   |
|     12   |   0   |
|     13   |   0   |
|     14   |   0   |
|     15   |   1   | #last stack.

Note that the reason 9th and 10th have a 1 is because the excess from the date 8 which was covered since the big refill that happened the 6th and covered from 6th to 8th.

CodePudding user response:

EDIT: using timestamps

Maybe a more readable solution (for beginners) using for loops and a bunch of if statements:

import pandas as pd
lst = [[pd.Timestamp(year=2017, month=1, day=1), 0],
       [pd.Timestamp(year=2017, month=1, day=2), 2],
       [pd.Timestamp(year=2017, month=1, day=10), 3],
       [pd.Timestamp(year=2017, month=2, day=1), 2],
       [pd.Timestamp(year=2017, month=2, day=3), 2]]

df = pd.DataFrame(lst, columns=['day_date', 'stack'])
n_days = (df.day_date.max() - df.day_date.min()).days   1

stack = 0
for index in range(n_days):
    stack  = df.loc[index, 'stack']
    # insert new day
    if index   1 < len(df): # if you are not at the end of the dataframe
        next_day = df.loc[index 1].day_date # compute the next day in dataframe
        this_day = df.loc[index].day_date # compute this day
        if df.loc[index, 'stack'] >= 1:
            df.loc[index, 'stack'] = 1
            stack -= 1
        if this_day   pd.DateOffset(1) != next_day: # if there is a gap in days
            for new_day in range(1, (next_day - this_day).days):
                if stack > 0:
                    df.loc[len(df)] = [this_day   pd.DateOffset(new_day), 1]
                    stack -= 1
                else:
                    df.loc[len(df)] = [this_day   pd.DateOffset(new_day), 0]
            df = df.sort_values('day_date').reset_index(drop=True)
    else:
        if df.loc[index, 'stack'] >= 1:
            df.loc[index, 'stack'] = 1
            stack -= 1
        while stack >= 1:
            this_day = df.loc[len(df)-1].day_date
            df.loc[len(df)] = [this_day   pd.DateOffset(1), 1]
            stack -= 1

CodePudding user response:

This is not such an easy task (if needed to perform in a vectorial way).

You can calculate first the remainder days to carry them to the next date, then use reindexing to duplicate/fill the rows:

remainder = (df['stack'].add(df['day_date'].diff(-1))
             .fillna(0, downcast='infer').clip(lower=0)
            )

df2 = (df
 # shift extra "stack" to next stack
 .assign(stack=df['stack'].sub(remainder).add(remainder.shift(fill_value=0)))
 # repeat rows using "stack" value with a minimum of 1
 .loc[lambda d: d.index.repeat(d['stack'].clip(lower=1))]
 # make stack>1 equal to 1
 # and increment the days per group
 .assign(stack=lambda d: d['stack'].clip(upper=1),
         day_date=lambda d: d['day_date'].add(
               (m:=d['day_date'].duplicated())
                .astype(int)
                .groupby((~m).cumsum())
                .cumsum()
         )
        )
 # fill missing days (all remaining lines)
 .set_index('day_date')
 .reindex(range(df['day_date'].min(), df['day_date'].max() 1))
 .fillna(0, downcast='infer')
 .reset_index()
)

output:

    day_date  stack
0          1      0
1          2      1
2          3      1
3          4      0
4          5      0
5          6      1
6          7      1
7          8      1
8          9      1
9         10      1
10        11      0
11        12      0
12        13      0
13        14      0
14        15      1
  • Related