Home > Software engineering >  Using numpy.where to calculate new pandas column, with multiple conditions
Using numpy.where to calculate new pandas column, with multiple conditions

Time:06-23

I have a problem with regards as to how to appropriately code this condition. I'm currently creating a new pandas column in my dataframe, new_column, which performs a subtraction on the values in column test, based on what index of the data we are at. I'm currently using this code to get it to subtract a different value every 4 times:

subtraction_value = 3
subtraction_value = 6

data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]} 


data['new_column'] = np.where(data.index%4,
                              data['test']-subtraction_value,
                              data['test']-subtraction_value_2)
print (data['new_column']


[6,1,2,1,-5,0,-1,3,4,6]

However, I now wish to get it performing the higher subtraction on the first two positions in the column, and then 3 subtractions with the original value, another two with the higher subtraction value, 3 small subtractions, and so forth. I thought I could do it this way, with an | condition in my np.where statement:

data['new_column'] = np.where((data.index%4) | (data.index%5),
                              data['test']-subtraction_value,
                              data['test']-subtraction_value_2)

However, this didn't work, and I feel my maths may be slightly off. My desired output would look like this:

print(data['new_column'])

[6,-2,2,1,-2,-3,-4,3,7,6])

As you can see, this slightly shifts the pattern. Can I still use numpy.where() here, or do I have to take a new approach? Any help would be greatly appreciated!

CodePudding user response:

As mentioned in the comment section, the output should equal [6,-2,2,1,-2,-3,-4,2,7,6] instead of [6,-2,2,1,-2,-3,-4,3,7,6] according to your logic. Given that, you can do the following:

import pandas as pd
import numpy as np

from itertools import chain

subtraction_value = 3
subtraction_value_2 = 6

data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})

index_pos_large_subtraction = list(chain.from_iterable((data.index[i], data.index[i 1]) for i in range(0, len(data)-1, 5)))
data['new_column'] = np.where(~data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value, data['test']-subtraction_value_2)

# The next line is equivalent to the previous one
# data['new_column'] = np.where(data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value_2, data['test']-subtraction_value)
---------------------------------------------
    test  new_column
0   12    6
1   4     -2
2   5     2
3   4     1
4   1     -2
5   3     -3
6   2     -4
7   5     2
8   10    7
9   9     6
---------------------------------------------

As you can see, np.where works fine. Your masking condition is the problem and needs to be adjusted, you are not selecting rows according to your logic.

  • Related