Pandas apply function by columns

Time:12-27

I have a dataframe with dates (30/09/2022 to 31/11/2022) and the prices of 15 stocks (only 5 are shown below for reference) for each of these dates, excluding weekends.

Current Data:

   DATES   |  A  |  B  |  C  |  D  |  E  |
 30/09/22 |100.5|151.3|233.4|237.2|38.42|
 01/10/22 |101.5|148.0|237.6|232.2|38.54|
 02/10/22 |102.2|147.6|238.3|231.4|39.32|
 03/10/22 |103.4|145.7|239.2|232.2|39.54|

I wanted to get the Pearson correlation matrix, so I did this:

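import numpy as np
import pandas as pd

df = pd.read_excel(file_path, sheet_name)
df = df.dropna()  # Remove dates that do not have prices for all stocks
# Log returns: log(P_t / P_{t-1}); the first row is NaN because it has no previous price
log_df = df.set_index("DATES").pipe(lambda d: np.log(d.div(d.shift()))).reset_index()
corrM = log_df.set_index("DATES").corr()  # keep the DATES column out of the correlation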

Now I want to build the Pearson Uncentered Correlation Matrix, so I have the following function:

def uncentered_correlation(x, y):

    x_dim = len(x)
    y_dim = len(y)
    
    xy = 0
    xx = 0
    yy = 0
    for i in range(x_dim):
        xy = xy + x[i] * y[i]
        xx = xx + x[i]**2.0
        yy = yy + y[i]**2.0
        
    corr = xy/np.sqrt(xx*yy)
    return(corr)

However, I do not know how to apply this function to each possible pair of columns of the dataframe to get the correlation matrix.

CodePudding user response:

  • First compute a list of the possible column combinations. You can use the itertools library for that.
  • Then use pandas.DataFrame.apply() over multiple columns.

Here is a simple code example:

import pandas as pd
import itertools

data = {'col1': [1,3], 'col2': [2,4], 'col3': [5,6]}
df = pd.DataFrame(data)

def add(num1,num2):
    return num1 + num2

cols = list(df)
combList = list(itertools.combinations(cols, 2))

for tup in combList:
    firstCol = tup[0]
    secCol = tup[1]
    df[f'sum_{firstCol}_{secCol}'] = df.apply(lambda x: add(x[firstCol], x[secCol]), axis=1)
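
To connect this back to the question, the same pairwise loop can fill a square matrix instead of adding sum columns. This is only a sketch: the `returns` frame holds made-up numbers standing in for the log returns, and `uncentered_correlation` is rewritten here with NumPy sums rather than copied from the question.

```python
import itertools

import numpy as np
import pandas as pd


def uncentered_correlation(x, y):
    # Same formula as in the question: sum(x*y) / sqrt(sum(x^2) * sum(y^2)).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(x * y) / np.sqrt(np.sum(x * x) * np.sum(y * y))


# Hypothetical log returns for three stocks (made-up numbers).
returns = pd.DataFrame({
    "A": [0.010, -0.020, 0.015, 0.005],
    "B": [0.020, -0.010, 0.010, 0.000],
    "C": [-0.005, 0.010, -0.020, 0.010],
})

cols = list(returns.columns)
# Start from the identity: each column has uncentered correlation 1 with itself.
ucorr = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)

for first, second in itertools.combinations(cols, 2):
    value = uncentered_correlation(returns[first], returns[second])
    # The measure is symmetric, so fill both triangles.
    ucorr.loc[first, second] = value
    ucorr.loc[second, first] = value

print(ucorr)
```

Using combinations rather than every ordered pair roughly halves the number of calls, since each off-diagonal value is computed once and mirrored.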

CodePudding user response:

Try this. It is not very elegant, but it may work for you. :)

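from itertools import product

import pandas as pd

def iter_product(a, b):
    # All ordered pairs (column from a, column from b)
    return list(product(a, b))

df = 'your dataframe here'  # placeholder: replace with your DataFrame of numeric columns
re_dict = {}
iter_re = iter_product(df.columns, df.columns)
for i in iter_re:
    # Pass plain arrays so that x[i] inside the function is positional,
    # even if rows were dropped and the index has gaps
    result = uncentered_correlation(df[i[0]].to_numpy(), df[i[1]].to_numpy())
    re_dict[i] = result
re_df = pd.DataFrame(re_dict, index=[0]).stack()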
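
Two shorter alternatives, offered as a sketch rather than a definitive answer: `DataFrame.corr` accepts a callable as `method`, and the whole matrix can also be obtained from a single matrix product. The `returns` frame below contains made-up numbers and assumes DATES is no longer among the columns and the NaN row has been dropped.

```python
import numpy as np
import pandas as pd


def uncentered_correlation(x, y):
    # x and y arrive as 1-D NumPy arrays when called through DataFrame.corr.
    return np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))


# Hypothetical log returns for three stocks (made-up numbers).
returns = pd.DataFrame({
    "A": [0.010, -0.020, 0.015, 0.005],
    "B": [0.020, -0.010, 0.010, 0.000],
    "C": [-0.005, 0.010, -0.020, 0.010],
})

# Option 1: let pandas loop over the column pairs.
ucorr_pandas = returns.corr(method=uncentered_correlation)

# Option 2: one matrix product, X'X scaled by the column norms.
X = returns.to_numpy(dtype=float)
gram = X.T @ X
norms = np.sqrt(np.diag(gram))
ucorr_numpy = pd.DataFrame(
    gram / np.outer(norms, norms),
    index=returns.columns,
    columns=returns.columns,
)

print(ucorr_pandas)
print(ucorr_numpy)
```

With a callable, pandas sets the diagonal to 1 and mirrors the result, which matches the uncentered correlation. The matrix version does not skip NaN values, so drop them first.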