Problem Statement

I want to create a function that generates a new DataFrame from my original, complete DataFrame with n-consecutive rows missing at random. For example:

Original DataFrame (Input)

Temperature and RH measurements made at 2 minute intervals

    timestamp           temp_c      rh
0   2020-06-11 13:52:00 24.037599   42.064286
1   2020-06-11 13:54:00 24.077255   42.061364
2   2020-06-11 13:56:00 24.113462   42.058696
3   2020-06-11 13:58:00 24.146652   42.056250
4   2020-06-11 14:00:00 24.177187   42.054000
5   2020-06-11 14:02:00 24.205373   42.051923
6   2020-06-11 14:04:00 24.231471   42.050000
7   2020-06-11 14:06:00 24.255705   42.048214
8   2020-06-11 14:08:00 24.278268   42.046552
9   2020-06-11 14:10:00 24.299326   42.045000
10  2020-06-11 14:12:00 24.363610   42.042222
11  2020-06-11 14:14:00 24.427894   42.075556
12  2020-06-11 14:16:00 24.492178   42.108889
13  2020-06-11 14:18:00 24.556462   42.142222
14  2020-06-11 14:20:00 24.604675   42.175556
15  2020-06-11 14:22:00 24.636817   42.205556
16  2020-06-11 14:24:00 24.668959   42.238889
17  2020-06-11 14:26:00 24.701101   42.272222
18  2020-06-11 14:28:00 24.733243   42.305556
19  2020-06-11 14:30:00 24.765385   42.338889

New DataFrame (Output)

Called the function with n=3, so a certain number of instances of 3 consecutive rows are removed at random.

    timestamp           temp_c      rh
0   2020-06-11 13:52:00 24.037599   42.064286
1   2020-06-11 13:54:00 24.077255   42.061364
2   2020-06-11 13:56:00 24.113462   42.058696
3   2020-06-11 13:58:00 24.146652   42.056250
7   2020-06-11 14:06:00 24.255705   42.048214
8   2020-06-11 14:08:00 24.278268   42.046552
9   2020-06-11 14:10:00 24.299326   42.045000
10  2020-06-11 14:12:00 24.363610   42.042222
11  2020-06-11 14:14:00 24.427894   42.075556
12  2020-06-11 14:16:00 24.492178   42.108889
13  2020-06-11 14:18:00 24.556462   42.142222
17  2020-06-11 14:26:00 24.701101   42.272222
18  2020-06-11 14:28:00 24.733243   42.305556
19  2020-06-11 14:30:00 24.765385   42.338889

In the new DataFrame, 2 instances of 3-consecutive, non-overlapping rows were removed. I would like to specify the number of instances of n-consecutive rows to remove by including a percent parameter as the input. Therefore the number of instances to remove from the DataFrame would be calculated as percent*len(df_in)/n.
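As a quick check of that formula, here is a minimal sketch. The helper name `count_instances` is hypothetical, `percent` is assumed to be a fraction (0.3 for 30%), and truncating to a whole number of instances is also an assumption.

```python
def count_instances(n_rows, n, percent):
    """Number of n-row instances to remove so that roughly `percent` of rows go missing.

    Assumes `percent` is a fraction (0.3 means 30%) and truncates to a whole count.
    """
    return int(percent * n_rows / n)


# 20 rows, instances of 3 rows, 30% missing -> 2 instances (6 rows in total)
print(count_instances(20, 3, 0.3))
```

With the 20-row example above this gives 2 instances, which matches the two removed blocks (rows 4-6 and rows 14-16).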

In the end, the function would take two inputs in addition to the original DataFrame:

def remove_n_consecutive_rows(df_in,n,percent):

Answer:

You could try something like this:

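import random

def remove_n_consecutive_rows(frame, n, percent):
    # percent is given on a 0-100 scale, e.g. 30 for 30%
    chunks_to_remove = int(percent/100*frame.shape[0]/n)
    # build every window of n+2 consecutive row positions
    chunks = [list(range(i, i+n+2)) for i in range(0, frame.shape[0]-n)]
    drop_indices = list()
    for _ in range(chunks_to_remove):
        if not chunks:
            break  # no non-overlapping window left to pick from
        indices = random.choice(chunks)
        # drop only the n middle positions of the chosen window
        drop_indices += indices[1:-1]
        # remove all chunks which contain overlapping values with indices
        chunks = [c for c in chunks if not any(idx in indices for idx in c)]
    # translate positions to labels so a non-default index also works
    return frame.drop(frame.index[drop_indices])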

>>> remove_n_consecutive_rows(df, 3, 30)
             timestamp     temp_c         rh
0  2020-06-11 13:52:00  24.037599  42.064286
4  2020-06-11 14:00:00  24.177187  42.054000
5  2020-06-11 14:02:00  24.205373  42.051923
9  2020-06-11 14:10:00  24.299326  42.045000
10 2020-06-11 14:12:00  24.363610  42.042222
14 2020-06-11 14:20:00  24.604675  42.175556
15 2020-06-11 14:22:00  24.636817  42.205556
16 2020-06-11 14:24:00  24.668959  42.238889
17 2020-06-11 14:26:00  24.701101  42.272222
18 2020-06-11 14:28:00  24.733243  42.305556
19 2020-06-11 14:30:00  24.765385  42.338889
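To confirm that a result only has gaps of the expected length, one option is to measure the runs of missing row labels. This is a small sketch; `missing_run_lengths` is a hypothetical helper, and it assumes the original DataFrame had the default integer index 0..n_rows-1.

```python
def missing_run_lengths(n_rows, kept_labels):
    """Lengths of each run of consecutive labels in 0..n_rows-1 absent from kept_labels."""
    kept = set(kept_labels)
    runs, current = [], 0
    for label in range(n_rows):
        if label in kept:
            if current:
                runs.append(current)  # a run of missing labels just ended
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)  # run that reaches the last row
    return runs


# Index labels that remain in the output shown above
kept = [0, 4, 5, 9, 10, 14, 15, 16, 17, 18, 19]
print(missing_run_lengths(20, kept))  # [3, 3, 3]
```

With a real result, `kept` would be `list(result.index)`.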

How it works.

Assume n=3 and percent = 30 and the input DataFrame has 20 rows as in your example.

chunks is a list of all runs of consecutive indices of size n+2: [[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], ... [16, 17, 18, 19, 20]]
In each iteration of the loop:

1. Pick one chunk at random. Assume the first iteration picked [5, 6, 7, 8, 9]. The indices to drop are then [6, 7, 8], the center of the picked chunk.
2. Update chunks to remove all sublists that contain any of [5, 6, 7, 8, 9]. Excluding the boundary values 5 and 9 as well ensures that the next run to be dropped neither overlaps nor directly touches this one.
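The filtering step can be illustrated in isolation. In this sketch the picked chunk is fixed rather than random, purely for illustration; the variable names `picked`, `to_drop` and `remaining` are not part of the function above.

```python
n = 3
n_rows = 20

# All windows of n + 2 consecutive positions, built as in the function above
chunks = [list(range(i, i + n + 2)) for i in range(0, n_rows - n)]

# Pretend random.choice returned this window (fixed here for illustration)
picked = [5, 6, 7, 8, 9]
to_drop = picked[1:-1]  # center of the window

# Keep only the windows that share no position with the picked one
picked_set = set(picked)
remaining = [c for c in chunks if picked_set.isdisjoint(c)]

print(to_drop)                     # [6, 7, 8]
print(remaining[0], remaining[1])  # [0, 1, 2, 3, 4] [10, 11, 12, 13, 14]
```

Any window still available drops rows at or before position 3, or at or after position 11, so at least two rows are kept on each side of the dropped run [6, 7, 8].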

Answer:

Here is a way to select starting row positions that are far enough apart from each other, and then build every row position to drop from those starting positions.

import numpy as np
import pandas as pd

# dummy dataframe
df = pd.DataFrame({'a':range(20)},
                  index=pd.date_range('2021-09-22', freq='2min', periods=20))

# parameter
n = 3
percent=0.3

# function to select the start positions of the runs to drop
def idx_drop (_df, _n, _percent):
    _len = len(_df)
    vals = np.sort(
        # draw as many distinct start positions as needed so that
        # (number of starts) * (rows dropped per start)
        # matches the requested fraction of the DataFrame
        np.random.choice(range(0,_len-_n-1),
                          size=int(round(_len*_percent/_n,0)),
                          replace=False)
    )
    # check that the starts are at least n rows apart, else retry
    return vals if (np.diff(vals)>=_n).all() else idx_drop (_df, _n, _percent)

np.random.seed(2) # for reproducibility
arr = idx_drop (_df=df, _n=n, _percent=percent)
print(arr)
# [ 4 12]
# create an array with all row positions to drop
new_df = df.drop(df.index[np.hstack([arr + i for i in range(n)])])
print(new_df)
#                       a
# 2021-09-22 00:00:00   0
# 2021-09-22 00:02:00   1
# 2021-09-22 00:04:00   2
# 2021-09-22 00:06:00   3 #no 4-5-6 rows
# 2021-09-22 00:14:00   7
# 2021-09-22 00:16:00   8
# 2021-09-22 00:18:00   9
# 2021-09-22 00:20:00  10
# 2021-09-22 00:22:00  11 #no 12-13-14 rows
# 2021-09-22 00:30:00  15
# 2021-09-22 00:32:00  16
# 2021-09-22 00:34:00  17
# 2021-09-22 00:36:00  18
# 2021-09-22 00:38:00  19
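
The same idea can be wrapped into the signature asked for in the question. This is only a sketch under stated assumptions: `percent` is a fraction, and the `seed` and `max_tries` parameters are additions for illustration. It also differs from the code above in two ways: it retries in a loop instead of recursing, and it requires starts to be more than n apart so that two runs never merge into one longer gap.

```python
import numpy as np
import pandas as pd


def remove_n_consecutive_rows(df_in, n, percent, seed=None, max_tries=1000):
    """Drop about `percent` (a fraction) of rows as separate runs of n consecutive rows."""
    rng = np.random.default_rng(seed)
    n_runs = int(round(len(df_in) * percent / n))
    if n_runs == 0:
        return df_in.copy()
    for _ in range(max_tries):
        # candidate start positions; every run must fit inside the frame
        starts = np.sort(rng.choice(len(df_in) - n + 1, size=n_runs, replace=False))
        # accept only if at least one kept row separates consecutive runs
        if (np.diff(starts) > n).all():
            # expand each start into its n consecutive positions
            positions = (starts[:, None] + np.arange(n)).ravel()
            return df_in.drop(df_in.index[positions])
    raise RuntimeError("could not place the requested runs; try a smaller percent")


df = pd.DataFrame({'a': range(20)},
                  index=pd.date_range('2021-09-22', freq='2min', periods=20))
out = remove_n_consecutive_rows(df, n=3, percent=0.3, seed=0)
print(len(out))  # 14 rows remain: 2 runs of 3 rows were dropped
```

Capping the number of attempts avoids hitting Python's recursion limit when the requested percentage is so high that valid placements are rare.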