Home > Software engineering >  Creating categories in df based on other columns
Creating categories in df based on other columns

Time:12-20

I have been experiencing several problems when trying to create a Categories column in my dataframe based on if-statements linked to one of the other columns.

My dataframe (called 'kio') looks as follows when calling the .tail() function:

               High      Low     Open    Close    Volume  Adj Close  Daily % Change
Date                                                                               
2021-12-07  48326.0  46566.0  47000.0  48245.0  300090.0    48245.0        6.168302
2021-12-08  49662.0  46662.0  49870.0  46799.0  304657.0    46799.0       -2.997202
2021-12-09  47395.0  45045.0  47001.0  45332.0  267214.0    45332.0       -3.134682
2021-12-10  46899.0  43850.0  46899.0  43997.0  221268.0    43997.0       -2.944940
2021-12-13  46285.0  44150.0  46600.0  44150.0  208391.0    44150.0        0.347751

The final column ('Daily % Change') I added myself using the following code:

for n in kio['Close']:
    kio['Daily % Change'] = kio['Close'].pct_change() * 100

Not sure if the format or dtype of this column is not causing an issue?

What I am trying to do is have every element in the 'Daily % Change' column be categorised as a number between 1 and 6 (absolute values). I have attempted this through thee below code:

def mvt_cat(row = kio):
    if row[(row['Daily % Change'] > 0) & (row['Daily % Change'] < 1)]:
        val = 1
    elif row[(row['Daily % Change'] < 0) & (row['Daily % Change'] > -1)]:
        val = -1
    elif row[(row['Daily % Change'] >= 1) & (row['Daily % Change'] < 2)]:
        val = 2
    elif row[(row['Daily % Change'] <= -1) & (row['Daily % Change'] > 2)]:
        val = -2
    elif row[(row['Daily % Change'] >= 2) & (row['Daily % Change'] < 3)]:
        val = 3
    elif row[(row['Daily % Change'] <= -2) & (row['Daily % Change'] > -3)]:
        val = -3
    elif row[(row['Daily % Change'] >= 3) & (row['Daily % Change'] < 4)]:
        val = 4
    elif row[(row['Daily % Change'] <= -3) & (row['Daily % Change'] > 4)]:
        val = -4
    elif row[(row['Daily % Change'] >= 4) & (row['Daily % Change'] < 5)]:
        val = 5
    elif row[(row['Daily % Change'] <= -4) & (row['Daily % Change'] > -5)]:
        val = -5
    elif row[(row['Daily % Change'] >= 5)]:
        val = 6
    else:
        val = -6
    return val

kio['Move Category'] = kio.apply(mvt_cat, axis=1)

I unfortunately get various errors - the mains ones are these:

KeyError: False ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I have tried the possible solutions found on links such as Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() and KeyError: False in pandas dataframe

Any assistance will be greatly appreciated!

CodePudding user response:

Try this:

import numpy as np

ceil = np.ceil(df['Daily % Change'].abs())

kio['Move Category'] = kio['Daily % Change'].ge(0).mul(2).sub(1) \
                                            .mul(ceil).clip(-6, 6).astype(int)

Output:

>>> kio[['Daily % Change', 'Move Category']]
   Daily % Change  Move Category
0        6.168302              6
1       -2.997202             -3
2       -3.134682             -4
3       -2.944940             -3
4        0.347751              1

Details:

>>> ceil
0    7.0  # 6.168302
1    3.0  # -2.997202
2    4.0  # -3.134682
3    3.0  # -2.944940
4    1.0  # 0.347751
Name: Daily % Change, dtype: float64

>>> kio['Daily % Change'].ge(0).mul(2).sub(1)
0    1
1   -1
2   -1
3   -1
4    1
Name: Daily % Change, dtype: int64

CodePudding user response:

Use pd.cut

foo = pd.Series([ 6.168302, -2.997202, -3.134682, -2.944940,  0.347751])
pd.cut(abs(foo), range(0,7))

Result:

0           NaN
1    (2.0, 3.0]
2    (3.0, 4.0]
3    (2.0, 3.0]
4    (0.0, 1.0]
dtype: category
Categories (6, interval[int64, right]): [(0, 1] < (1, 2] < (2, 3] < (3, 4] < (4, 5] < (5, 6]]

CodePudding user response:

I would recommend using pandas.cut for binning rather than what you are trying to do:

groups = [n for n in range(-6,8,1)]
df['Move Category'] = pd.cut(df['Daily % Change'], groups)

print(df)

         Date   High    Low  ...  Adj Close  Daily % Change  Move Category
0  07/12/2021  48326  46566  ...      48245        6.168302         (6, 7]
1  08/12/2021  49662  46662  ...      46799       -2.997202       (-3, -2]
2  09/12/2021  47395  45045  ...      45332       -3.134682       (-4, -3]
3  10/12/2021  46899  43850  ...      43997       -2.944940       (-3, -2]
4  13/12/2021  46285  44150  ...      44150        0.347751         (0, 1]

[5 rows x 9 columns]

CodePudding user response:

You're getting that error because for example, row[(row['Daily % Change'] >= 4) & (row['Daily % Change'] < 5)] is a Series, it doesn't make sense to ask if it is true or not.

You can apply mvt_cat to a single row as:

def mvt_cat(row):
    if (row > 0) & (row < 1):
        val = 1
    elif (row < 0) & (row > -1):
        val = -1
    elif (row >= 1) & (row < 2):
        val = 2
    elif (row <= -1) & (row > 2):
        val = -2
    elif (row >= 2) & (row < 3):
        val = 3
    elif (row <= -2) & (row > -3):
        val = -3
    elif (row >= 3) & (row < 4):
        val = 4
    elif (row <= -3) & (row > 4):
        val = -4
    elif (row >= 4) & (row < 5):
        val = 5
    elif (row <= -4) & (row > -5):
        val = -5
    elif (row >= 5):
        val = 6
    else:
        val = -6
    return val


kio['Move Category'] = kio['Daily % Change'].apply(mvt_cat)

Note that some of the categories don't make sense. For example, in your evaluation,

elif row[(row['Daily % Change'] <= -1) & (row['Daily % Change'] > 2)]:

will always be empty, I assume it should be:

elif (row <= -1) & (row > -2):

instead.

  • Related