Problem Statement
I want to create a function that generates a new DataFrame from my original, complete DataFrame with n-consecutive rows missing at random. For example:
Original DataFrame (Input)
Temperature and RH measurements made at 2 minute intervals
timestamp temp_c rh
0 2020-06-11 13:52:00 24.037599 42.064286
1 2020-06-11 13:54:00 24.077255 42.061364
2 2020-06-11 13:56:00 24.113462 42.058696
3 2020-06-11 13:58:00 24.146652 42.056250
4 2020-06-11 14:00:00 24.177187 42.054000
5 2020-06-11 14:02:00 24.205373 42.051923
6 2020-06-11 14:04:00 24.231471 42.050000
7 2020-06-11 14:06:00 24.255705 42.048214
8 2020-06-11 14:08:00 24.278268 42.046552
9 2020-06-11 14:10:00 24.299326 42.045000
10 2020-06-11 14:12:00 24.363610 42.042222
11 2020-06-11 14:14:00 24.427894 42.075556
12 2020-06-11 14:16:00 24.492178 42.108889
13 2020-06-11 14:18:00 24.556462 42.142222
14 2020-06-11 14:20:00 24.604675 42.175556
15 2020-06-11 14:22:00 24.636817 42.205556
16 2020-06-11 14:24:00 24.668959 42.238889
17 2020-06-11 14:26:00 24.701101 42.272222
18 2020-06-11 14:28:00 24.733243 42.305556
19 2020-06-11 14:30:00 24.765385 42.338889
New DataFrame (Output)
Called the function with n=3, so a certain number of instances of 3 consecutive rows are removed at random.
timestamp temp_c rh
0 2020-06-11 13:52:00 24.037599 42.064286
1 2020-06-11 13:54:00 24.077255 42.061364
2 2020-06-11 13:56:00 24.113462 42.058696
3 2020-06-11 13:58:00 24.146652 42.056250
7 2020-06-11 14:06:00 24.255705 42.048214
8 2020-06-11 14:08:00 24.278268 42.046552
9 2020-06-11 14:10:00 24.299326 42.045000
10 2020-06-11 14:12:00 24.363610 42.042222
11 2020-06-11 14:14:00 24.427894 42.075556
12 2020-06-11 14:16:00 24.492178 42.108889
13 2020-06-11 14:18:00 24.556462 42.142222
17 2020-06-11 14:26:00 24.701101 42.272222
18 2020-06-11 14:28:00 24.733243 42.305556
19 2020-06-11 14:30:00 24.765385 42.338889
In the new DataFrame, 2 instances of 3-consecutive, non-overlapping rows were removed. I would like to specify the number of instances of n-consecutive rows to remove by including a percent parameter as the input. Therefore the number of instances to remove from the DataFrame would be calculated as percent*len(df_in)/n.
In the end, the function would take two inputs in addition to the original DataFrame:
def remove_n_consecutive_rows(df_in,n,percent):
CodePudding user response:
You could try something like this:
import random

def remove_n_consecutive_rows(frame, n, percent):
    chunks_to_remove = int(percent/100*frame.shape[0]/n)
    # split the indices into chunks of length n+2
    chunks = [list(range(i, i+n+2)) for i in range(0, frame.shape[0]-n)]
    drop_indices = list()
    for i in range(chunks_to_remove):
        if not chunks:
            # no non-overlapping chunk is left to pick from
            break
        indices = random.choice(chunks)
        drop_indices += indices[1:-1]
        # remove all chunks which contain overlapping values with indices
        chunks = [c for c in chunks if not any(idx in indices for idx in c)]
    # drop by label; assumes the default 0..len-1 integer index
    return frame.drop(drop_indices)
>>> remove_n_consecutive_rows(df, 3, 30)
timestamp temp_c rh
0 2020-06-11 13:52:00 24.037599 42.064286
4 2020-06-11 14:00:00 24.177187 42.054000
5 2020-06-11 14:02:00 24.205373 42.051923
9 2020-06-11 14:10:00 24.299326 42.045000
10 2020-06-11 14:12:00 24.363610 42.042222
14 2020-06-11 14:20:00 24.604675 42.175556
15 2020-06-11 14:22:00 24.636817 42.205556
16 2020-06-11 14:24:00 24.668959 42.238889
17 2020-06-11 14:26:00 24.701101 42.272222
18 2020-06-11 14:28:00 24.733243 42.305556
19 2020-06-11 14:30:00 24.765385 42.338889
How it works.
Assume n=3 and percent=30 and the input DataFrame has 20 rows as in your example.
- chunks is a list of all runs of consecutive indices of size n+2: [[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], ... [16, 17, 18, 19, 20]]
- In each iteration of the loop:
  - Pick one chunk at random. Assume the first iteration picked [5, 6, 7, 8, 9]. The indices to drop are then [6, 7, 8], the center of the picked chunk.
  - Update chunks to remove all sublists that contain any of [5, 6, 7, 8, 9]. Removing 5 and 9 as well ensures that the next chunk picked does not overlap with this one.
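To see the filtering step in isolation, here is a minimal sketch for the 20-row, n=3 case described above. The picked chunk is hard-coded to [5, 6, 7, 8, 9] for illustration (an assumption; the function picks it with random.choice), and the names picked, to_drop and remaining are only used in this sketch.

```python
n = 3
n_rows = 20

# all windows of length n + 2, built the same way as in the function above
chunks = [list(range(i, i + n + 2)) for i in range(0, n_rows - n)]

# assumed pick for illustration (the function uses random.choice(chunks))
picked = [5, 6, 7, 8, 9]
to_drop = picked[1:-1]  # the center of the window

# keep only the windows that share no index with the picked one
remaining = [c for c in chunks if not any(idx in picked for idx in c)]

print(to_drop)                      # [6, 7, 8]
print(len(chunks), len(remaining))  # 17 8
print(remaining[0])                 # [0, 1, 2, 3, 4]
```

Of the 17 windows, the 9 that start at positions 1 through 9 touch the picked window and are discarded, so the next pick can only come from [0, 1, 2, 3, 4] or from the windows starting at 10 or later.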
CodePudding user response:
Here is a way to select starting row positions that are far enough apart from each other, then build all the row positions to drop from those starting positions.
import numpy as np
import pandas as pd

# dummy dataframe
df = pd.DataFrame({'a':range(20)},
                  index=pd.date_range('2021-09-22', freq='2min', periods=20))
# parameters
n = 3
percent = 0.3
# function to select the row positions where each dropped run starts
def idx_drop(_df, _n, _percent):
    _len = len(_df)
    vals = np.sort(
        # select enough integers such that this number times
        # the number of consecutive rows to drop
        # will be the percentage wanted
        np.random.choice(range(0, _len-_n-1),
                         size=int(round(_len*_percent/_n, 0)),
                         replace=False)
    )
    # check that each point is at least n rows apart else retry
    return vals if (np.diff(vals)>=_n).all() else idx_drop(_df, _n, _percent)
np.random.seed(2) # for reproducibility
arr = idx_drop(_df=df, _n=n, _percent=percent)
print(arr)
# [ 4 12]
# create an array with all rows to drop
new_df = df.drop(df.index[np.hstack([arr+i for i in range(n)])])
print(new_df)
# a
# 2021-09-22 00:00:00 0
# 2021-09-22 00:02:00 1
# 2021-09-22 00:04:00 2
# 2021-09-22 00:06:00 3 #no 4-5-6 rows
# 2021-09-22 00:14:00 7
# 2021-09-22 00:16:00 8
# 2021-09-22 00:18:00 9
# 2021-09-22 00:20:00 10
# 2021-09-22 00:22:00 11 #no 12-13-14 rows
# 2021-09-22 00:30:00 15
# 2021-09-22 00:32:00 16
# 2021-09-22 00:34:00 17
# 2021-09-22 00:36:00 18
# 2021-09-22 00:38:00 19
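The same idea could be wrapped into a function with the signature from the question. The following is only a sketch under a few assumptions: percent is a fraction (0.3 for 30 %), the retry is a bounded loop instead of recursion, every valid start position is allowed (slightly wider than the range used above), and rows are dropped by position with a boolean mask. The seed and max_tries parameters are hypothetical extras, not part of the original request.

```python
import numpy as np
import pandas as pd


def remove_n_consecutive_rows(df_in, n, percent, seed=None, max_tries=10_000):
    """Drop non-overlapping runs of n consecutive rows, chosen at random.

    percent is a fraction of the rows to remove (0.3 means 30 %).
    seed and max_tries are hypothetical extras added for this sketch.
    """
    rng = np.random.default_rng(seed)
    n_runs = int(round(len(df_in) * percent / n))
    if n_runs == 0:
        return df_in.copy()
    # a run starting at position s covers s .. s + n - 1
    n_starts = len(df_in) - n + 1
    if n_starts < n_runs:
        raise ValueError("not enough rows for the requested number of runs")
    for _ in range(max_tries):
        starts = np.sort(rng.choice(n_starts, size=n_runs, replace=False))
        # accept the draw only if consecutive starts are at least n apart
        if (np.diff(starts) >= n).all():
            keep = np.ones(len(df_in), dtype=bool)
            keep[np.hstack([starts + i for i in range(n)])] = False
            return df_in[keep]
    raise RuntimeError("no non-overlapping placement found; lower percent or n")


df = pd.DataFrame({"a": range(20)},
                  index=pd.date_range("2021-09-22", freq="2min", periods=20))
out = remove_n_consecutive_rows(df, n=3, percent=0.3, seed=0)
print(len(out))  # 14
```

With 20 rows, n=3 and percent=0.3, two runs of three rows are removed, so 14 rows remain whichever runs are drawn. Rejection sampling like this gets slow when percent is high, because most random draws then overlap.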