I have a DataFrame and need to perform calculations between columns. Can my function do_something be vectorised?

The columns ['1min', '2min', '5min', '15min', '30min', '1hour', '2hour', '4hour', '1day', '2day', '7day'] need to be checked in order: '1min' is compared with price, and each later column is compared with the column before it. If the column is no greater than the value it is compared with, min_sig is assigned the column's value and min_bar is assigned the column's name. If the check fails at '1min', min_sig and min_bar are both set to False; if it fails at any later column, the loop stops and the values assigned so far are kept.

My code achieves the effect I want. Can the function do_something() be optimized by vectorization?

My code is as follows:

import pandas as pd
import numpy as np


def generate_data():
    code = ['a', 'b', 'c', 'd']
    price = [72, 50.8, 77.8, 54.6]
    min1 = [69.78, 49.21, 79.75, 56.21]
    min2 = [69.9, 49.22, 79.4, 55.85]
    min5 = [73.36, 51.81, 74.78, 52]
    min15 = [79.07, 56.25, 67.86, 46.9]
    min30 = [77.1, 54.86, 70.38, 48.91]
    hour1 = [75.12, 53.49, 72.84, 51.29]
    hour2 = [74.1, 52.75, 73.51, 51.79]
    hour4 = [72.18, 51.69, 77.83, 55.96]
    day1 = [78.13, 56.76, 73.47, 52.37]
    day2 = [80.42, 58.72, 71.88, 51.78]
    day7 = [110.79, 84.6, 83.73, 65.48]
    dict1 = {'code': code, 'price': price, '1min': min1, '2min': min2, '5min': min5, '15min': min15, '30min': min30,
             '1hour': hour1, '2hour': hour2, '4hour': hour4, '1day': day1, '2day': day2, '7day': day7, }
    df = pd.DataFrame(dict1)
    df['min_bar'] = np.nan  # np.NAN was removed in NumPy 2.0
    df['min_sig'] = np.nan
    col = ['code', 'price', 'min_bar', 'min_sig', '1min', '2min', '5min', '15min', '30min', '1hour', '2hour', '4hour',
           '1day', '2day', '7day', ]
    df = df[col]
    return df


def do_something(a):
    list1 = ['1min', '2min', '5min', '15min', '30min', '1hour', '2hour', '4hour',
             '1day', '2day', '7day', ]
    for i in range(len(list1)):
        bar = list1[i]
        if i == 0:
            if a['price'] >= a[bar]:
                a['min_sig'] = a[bar]
                a['min_bar'] = bar
            else:
                a['min_sig'] = False
                a['min_bar'] = False
                break
        else:
            if a['min_sig'] >= a[bar]:
                a['min_sig'] = a[bar]
                a['min_bar'] = bar
            else:
                break
    return a


def main():
    df = generate_data()
    print('Dataframe before running generate_data():')
    print(df)
    df = df.apply(do_something, axis=1)
    print('The result after running is the result I want:')
    print(df)


if __name__ == '__main__':
    main()

Dataframe before running generate_data():
  code  price  min_bar  min_sig   1min  ...  2hour  4hour   1day   2day    7day
0    a   72.0      NaN      NaN  69.78  ...  74.10  72.18  78.13  80.42  110.79
1    b   50.8      NaN      NaN  49.21  ...  52.75  51.69  56.76  58.72   84.60
2    c   77.8      NaN      NaN  79.75  ...  73.51  77.83  73.47  71.88   83.73
3    d   54.6      NaN      NaN  56.21  ...  51.79  55.96  52.37  51.78   65.48

[4 rows x 15 columns]
The result after running is the result I want:
  code  price min_bar min_sig   1min  ...  2hour  4hour   1day   2day    7day
0    a   72.0    1min   69.78  69.78  ...  74.10  72.18  78.13  80.42  110.79
1    b   50.8    1min   49.21  49.21  ...  52.75  51.69  56.76  58.72   84.60
2    c   77.8   False   False  79.75  ...  73.51  77.83  73.47  71.88   83.73
3    d   54.6   False   False  56.21  ...  51.79  55.96  52.37  51.78   65.48

[4 rows x 15 columns]

%timeit df.apply(do_something,axis=1)
4.88 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answer:

IIUC, it looks like you want the idxmin and min of those columns, masked with False when the first value ('1min') is greater than price.

You can use numpy to get both operations at once:

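list1 = ['1min', '2min', '5min', '15min', '30min', '1hour', '2hour', '4hour',
         '1day', '2day', '7day']
vals = df[list1].values
m = np.argmin(vals, axis=1)  # position of each row's minimum
(pd.DataFrame({'min_bar':  np.take(list1, m),
               # pick per row; np.take(vals, m) would index the flattened array
               'min_sig':  vals[np.arange(len(vals)), m]})
   .mask(df['price'].lt(df[list1[0]]), False)
)

(Then join or assign to the original df)

Output:

  min_bar min_sig
0    1min   69.78
1    1min   49.21
2   False   False
3   False   False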
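
For the "join or assign" step, note that the question's df already contains placeholder min_bar and min_sig columns, so a plain df.join(res) would raise on the overlapping names. A minimal sketch of two ways around that, assuming the two-column result is held in a variable called res (a name made up here, as are the shortened sample rows):

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the question's frame, placeholders included.
df = pd.DataFrame({'code': ['a', 'c'], 'price': [72.0, 77.8],
                   'min_bar': np.nan, 'min_sig': np.nan,
                   '1min': [69.78, 79.75]})

# Stand-in for the two-column result computed by the numpy approach.
res = pd.DataFrame({'min_bar': ['1min', False],
                    'min_sig': [69.78, False]}, index=df.index)

# Option 1: build a new frame by dropping the placeholders, then joining.
out = df.drop(columns=['min_bar', 'min_sig']).join(res)

# Option 2: overwrite the placeholder columns in place.
df[['min_bar', 'min_sig']] = res

print(df)
```

Option 2 keeps the original column order; option 1 leaves df untouched and puts the two result columns at the end.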

using pandas

This requires searching for the minimum twice, though

m = df['price'].lt(df[list1[0]])

df['min_bar'] = df[list1].idxmin(axis=1).mask(m, False)
df['min_sig'] = df[list1].min(axis=1).mask(m, False)
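
One caveat, offered as a reading of the question's code rather than a definitive fix: the loop in do_something stops at the first column that is larger than the one before it, so it returns the end of the initial non-increasing run, which is not always the row-wise minimum. The two agree on the sample data. If the loop's exact behaviour matters, a minimal sketch could look like this; first_run_min and the demo frame are made-up names and values used only for illustration:

```python
import numpy as np
import pandas as pd


def first_run_min(df, cols):
    """Hypothetical helper: vectorised version of the row-wise loop."""
    vals = df[cols].to_numpy(dtype=float)
    rows = np.arange(len(vals))

    # True where a column is <= the column immediately before it.
    non_increasing = vals[:, 1:] <= vals[:, :-1]
    # Count the leading True values per row; that count is the position
    # of the last column the loop would have accepted.
    run_end = np.cumprod(non_increasing, axis=1).sum(axis=1)

    # Object arrays so that False can sit alongside strings and floats.
    min_bar = np.array(cols, dtype=object)[run_end]
    min_sig = vals[rows, run_end].astype(object)

    # Rows whose first column is above price get False, as in the loop.
    rejected = (df['price'] < df[cols[0]]).to_numpy()
    min_bar[rejected] = False
    min_sig[rejected] = False

    return pd.DataFrame({'min_bar': min_bar, 'min_sig': min_sig},
                        index=df.index)


# Made-up rows: values fall until 'c', rise at 'd', then drop lower at 'e'.
demo = pd.DataFrame({'price': [10.0, 1.0],
                     'a': [5.0, 5.0], 'b': [4.0, 4.0], 'c': [3.0, 3.0],
                     'd': [6.0, 6.0], 'e': [1.0, 1.0]})
result = first_run_min(demo, ['a', 'b', 'c', 'd', 'e'])
print(result)
```

In the first demo row the loop-style result is column 'c' (3.0), whereas idxmin would pick 'e' (1.0); the second row is rejected because price is below the first column. NaN handling is not considered in this sketch.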