Home > Software engineering >  Dataframe column numpy
Dataframe column numpy

Time:05-03

Need some help here: I want to have column1 of df2 returning the pertinent Cat values. Below the code and actual output and expected output:

import pandas as pd
import numpy as np

data = {'cat': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        'date': ['2021-12-30', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06',\
                '2022-01-09', '2022-01-10', '2021-12-30', '2022-01-02', '2022-01-03', '2022-01-04', \
                '2022-01-05', '2022-01-06', '2022-01-09', '2022-01-10','2021-12-30', '2022-01-02', \
                '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-09', '2022-01-10'],
        'value': [99435, 99401, 100113, 100528, 100428, 100734, 99035, 100077, 100018, 99425, 100728, 100863, \
                99930, 100076, 100995, 100640, 99158, 100868, 100819, 99247, 100851, 100500, 100082, 99089],
        'act': [1.0000, 0.7981, 0.7785, 0.3563, 0.1916, 0.0000, 0.0000, 0.0233, 1.0000, 0.5625, 0.5774, \
                0.6777, 0.7300, 0.1951, 0.1966, 0.6413, 1.0000, 0.7905, 0.7867, 0.000, 0.8769, 0.4683, 0.7122, 0.7183]
    }

df =pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['prev'] = df.groupby('cat')['act'].shift(1)
df['ln_return'] = np.log(df['act']/df['prev'].clip(0.000001)).clip(np.log(0.000001))
df['ln_return'] = df['ln_return'].mask(df['prev'].eq(0)&df['act'].eq(0), 0)
df = df.dropna()

a_avg = df.groupby('cat')['value'].mean().to_numpy()
a_test  = a_avg /1000
df2 = pd.DataFrame({'col2': a_avg, 'col3':a_test})
print(df2)

Out:                             Expected:
            col2        col3         col1             col2        col3
0  100045.142857  100.045143     0      1    100045.142857  100.045143
1  100379.571429  100.379571     1      2    100379.571429  100.379571
2  100208.000000  100.208000     2      3    100208.000000  100.208000

CodePudding user response:

You shouldn't drop down to numpy because that can create issues with alignment. Insted work fully within pandas. In this case you can groupby agg, then eval to create your column divided by 1000, and rename the cat column to what you want.

The eval can be avoided if you separate it into multiple statements and instead just do the math.

(df.groupby('cat', as_index=False)
   .agg(col2=('value', 'mean'))
   .eval('col3 = col2/1000')
   .rename(columns={'cat': 'col1'}))

   col1           col2        col3
0     1  100045.142857  100.045143
1     2  100379.571429  100.379571
2     3  100208.000000  100.208000
  • Related