I have a dataset with a large number of columns. I want to apply the same computation to every column, combine the results into a single value per row, and store that value in a new column.
For example, I have a data frame like the one below:
A1 A2 A3 ... A120
0 0.12 0.03 0.43 ... 0.56
1 0.24 0.53 0.01 ... 0.98
. ... ... ... ... ...
200 0.11 0.22 0.31 ... 0.08
I want to add a new column calc to this data frame, computed per row as:
calc = (A1**2 - A1) + (A2**2 - A2) + ... + (A120**2 - A120)
The final data frame should be like this
A1 A2 A3 ... A120 calc
0 0.12 0.03 0.43 ... 0.56 x
1 0.24 0.53 0.01 ... 0.98 y
. ... ... ... ... ... ...
200 0.11 0.22 0.31 ... 0.08 n
I tried to do this in Python as follows:
import pandas as pd

df = pd.read_csv('sample.csv')

def construct_matrix():
    temp_sumsqc = 0
    for i in range(len(df.columns)):
        column_name_construct = 'A' f'{i}'
        temp_sumsqc = df[column_name_construct] ** 2 - (df[column_name_construct])
    df["sumsqc"] = temp_sumsqc

construct_matrix()
print(df.to_string())
But this throws KeyError: 'A1'.
Writing out df["A1"]**2 - df["A1"] + df["A2"]**2 - df["A2"] + ... by hand is impractical, since there are 120 columns.
Since the way I attempted didn't work, I wonder whether there's a better way to do this?
CodePudding user response:
There is no need for a for loop; a vectorized approach works here:
df['calc'] = df.pow(2).sub(df).sum(1)
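As a minimal sketch of that one-liner, here it is on a small made-up frame. It assumes the real CSV might also hold columns that are not part of the computation (the `id` column below is hypothetical), so the A columns are selected first with `filter`. That also keeps `calc` itself out of the sum if the line is run twice.

```python
import pandas as pd

# Small made-up frame standing in for sample.csv; the A values are taken
# from the first two rows of the question, and `id` is a hypothetical extra column.
df = pd.DataFrame({
    "id": [10, 11],
    "A1": [0.12, 0.24],
    "A2": [0.03, 0.53],
    "A3": [0.43, 0.01],
})

# Keep only columns named A<number>, so `id` (and `calc` on a rerun) is excluded.
a_cols = df.filter(regex=r"^A\d+$")

# Element-wise x**2 - x, then sum across the columns of each row.
df["calc"] = a_cols.pow(2).sub(a_cols).sum(axis=1)
print(df)
```

If every column in the frame is one of the A columns, the selection step can be dropped and the original one-liner used as is.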
CodePudding user response:
You can use df.apply to execute code for each column, and then use sum(axis=1) to sum the resulting values across the columns of each row:
df['sumsqc'] = df.apply(lambda col: (col ** 2) - col).sum(axis=1)
Output:
>>> df
A1 A2 A3 A120 sumsqc
0 0.12 0.03 0.43 0.56 -0.6262
1 0.24 0.53 0.01 0.98 -0.4610
200 0.11 0.22 0.31 0.08 -0.5570
Note that A1**2 - A1 is equivalent to A1 * (A1 - 1), so you could do
df['sumsqc'] = df.apply(lambda col: col * (col - 1)).sum(axis=1)
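A quick check, on made-up values, that the two forms agree. It also shows that the same arithmetic can be written directly on the frame without `apply`, since pandas applies these operators element-wise. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values taken from the first two rows of the question.
df = pd.DataFrame({"A1": [0.12, 0.24], "A2": [0.03, 0.53], "A3": [0.43, 0.01]})

# Column-by-column with apply, using the factored form x * (x - 1).
via_apply = df.apply(lambda col: col * (col - 1)).sum(axis=1)

# The same two expressions applied to the whole frame at once.
factored = (df * (df - 1)).sum(axis=1)
squared = (df ** 2 - df).sum(axis=1)

# Compare with a tolerance, since the two forms can differ by floating-point rounding.
same = np.allclose(via_apply, factored) and np.allclose(factored, squared)
print(same)
```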