Applying a mask to a dataframe, but only over a certain range inside the dataframe

Time:07-12

I currently have some code that uses a mask to calculate the mean of the overload values and the mean of the baseline values. It does this over the entire length of the dataframe. Now I want to apply it only to a certain range of the dataframe column, between first and last (i.e. a region of the column specified by user input). Here is my code as it stands:


import numpy as np
import pandas as pd

mask_number = 5
no_overload_cycles = 1
hyst = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})

# Every mask_number-th index starts a block of no_overload_cycles overload cycles
list_test = []
for i in range(0, len(hyst) - 1, mask_number):
    for x in range(no_overload_cycles):
        list_test.append(i + x)

mask = np.array(list_test)

print(mask)
[ 0  5 10 15]

first = 4
last = 17
regression_area = hyst.iloc[first:last]

mean_range_overload = regression_area.loc[np.where(mask == regression_area.index)]['test'].mean()
mean_range_baseline = regression_area.drop(mask[first:last])['test'].mean()

So the overload mean would be the mean of cycles 5, 10 and 15 in test, and the baseline mean would be taken over positions 4 to 17, excluding 5, 10 and 15. This is the output I expect:

print(mean_range_overload)
4

print(mean_range_baseline)
4.545454

However, no_overload_cycles can change. If it were 3, for example, the mask would become:


mask_number = 5
no_overload_cycles = 3
hyst = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})

list_test = []
for i in range(0, len(hyst) - 1, mask_number):
    for x in range(no_overload_cycles):
        list_test.append(i + x)

mask = np.array(list_test)

print(mask)
[ 0  1  2  5  6  7 10 11 12 15 16 17]

So mean_range_overload would be the mean of the values at 5, 6, 7, 10, 11, 12, 15, 16 and 17, and mean_range_baseline would be the mean of the values in between these, within the range from first to last in the dataframe column.

Any help on this would be greatly appreciated!

CodePudding user response:

Assuming no_overload_cycles == 1 always, you can simply use slice objects to index the DataFrame.

Say that, in your example, you wish to pick specifically cycles 5, 10 and 15 and use them as overload. You can get them with df.loc[5:15:5]. If you instead wish to pick the 5th, 10th and 15th cycles from the range you selected, you can get them with df.iloc[5:15 + 1:5] (iloc does not include the right index, so we add one). No loops required.
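To make the difference between the two slices concrete, here is a minimal sketch. It assumes the data from the question and a hypothetical sub-range named region covering positions 4 to 17; none of these names come from the answer itself.

```python
import pandas as pd

df = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})

# Hypothetical user-selected region: positions 4..17 inclusive
region = df.iloc[4:17 + 1]

# Label-based slice: picks index labels 5, 10, 15 (right endpoint included)
by_label = region.loc[5:15:5]

# Position-based slice: counts from the start of region, right endpoint excluded
by_position = region.iloc[5:15 + 1:5]

print(list(by_label.index), by_label["test"].mean())
print(list(by_position.index))
```

Because region starts at label 4, the positional slice lands on different rows than the label slice, so which one is appropriate depends on whether cycles are counted from the start of the whole column or from the start of the selected range.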

As mentioned in the comments, your question is slightly confusing, and it would help if you gave a better description and some expected results. In general I'd also advise you to decouple the domain-specific part of your problem before asking on a forum, since not everyone knows what you mean by "overload", "baseline", "cycles" etc. I'm posting this as an answer rather than a comment because I don't yet have enough reputation to comment.

CodePudding user response:

I renamed a few of the variables, so what I called a "mask" is not exactly what you called a mask, but I reckon this is what you were trying to make:

import pandas as pd

mask_length = 5
overload_cycles_per_mask = 3
df = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})

selected_range = (4, 17)
overload_indices = []
baseline_indices = []
# `range` does not include the right hand side so we add one
# ideally you would specify the range as (4, 18) instead
for i in range(selected_range[0], selected_range[1] + 1):
    if i % mask_length < overload_cycles_per_mask:
        overload_indices.append(i)
    else:
        baseline_indices.append(i)


print(overload_indices)
print(df.iloc[overload_indices].test.mean())
print(baseline_indices)
print(df.iloc[baseline_indices].test.mean())

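Basically, each position i inside selected_range is classified by i % mask_length: positions whose remainder is below overload_cycles_per_mask are marked as overload, and all others as baseline. In other words, the rows are divided into segments of length mask_length that start at multiples of mask_length, and the first overload_cycles_per_mask elements of each segment are overload.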

With that, you get two lists of positions, which you can pass directly to df.iloc; according to the documentation, it accepts a list of integers.
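The same split can probably also be written without an explicit loop. The following is only a sketch, assuming NumPy is available and that the range is inclusive at both ends as in the loop above; the names positions and is_overload are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

mask_length = 5
overload_cycles_per_mask = 3
df = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})

first, last = 4, 17  # assumed inclusive on both ends

# All integer positions inside the selected range
positions = np.arange(first, last + 1)

# True where the position falls in the overload part of its segment
is_overload = positions % mask_length < overload_cycles_per_mask

# iloc accepts an integer array, so the boolean array is used to pick positions first
overload_mean = df["test"].iloc[positions[is_overload]].mean()
baseline_mean = df["test"].iloc[positions[~is_overload]].mean()

print(overload_mean)
print(baseline_mean)
```

The boolean array is applied to positions rather than to the dataframe itself, so it never needs to be as long as the whole column.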

Here is the output for mask_length = 5 and overload_cycles_per_mask = 1:

[5, 10, 15]
4.0
[4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17]
4.545454545454546

And here is for mask_length = 5 and overload_cycles_per_mask = 3:

[5, 6, 7, 10, 11, 12, 15, 16, 17]
3.6666666666666665
[4, 8, 9, 13, 14]
5.8

I do believe calling this a single mask makes things more confusing. In any case, I would tuck the logic for building the index lists into a function separate from the one that calculates the means.
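One possible way to separate the two concerns is sketched below. The function names split_indices and range_means and their parameters are hypothetical, chosen only for illustration, and the range is again assumed to be inclusive at both ends.

```python
import pandas as pd


def split_indices(first, last, mask_length, overload_cycles_per_mask):
    """Return (overload, baseline) positions for the inclusive range [first, last]."""
    overload, baseline = [], []
    for i in range(first, last + 1):
        if i % mask_length < overload_cycles_per_mask:
            overload.append(i)
        else:
            baseline.append(i)
    return overload, baseline


def range_means(df, column, first, last, mask_length, overload_cycles_per_mask):
    """Return (overload mean, baseline mean) of df[column] over the range."""
    overload, baseline = split_indices(first, last, mask_length, overload_cycles_per_mask)
    values = df[column]
    return values.iloc[overload].mean(), values.iloc[baseline].mean()


df = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})
overload_mean, baseline_mean = range_means(df, "test", 4, 17, 5, 1)
print(overload_mean)
print(baseline_mean)
```

Keeping split_indices free of any pandas dependency means the classification rule can be checked on plain integers before it is applied to real data.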
