Home > OS >  Combine several columns into an array, find correlations, and save to another column
Combine several columns into an array, find correlations, and save to another column

Time:04-07

Assuming I have the following toy dataframe:

Firm  num1   num2    num3      

A      0.1    0.2    0.3    
B      0.4    1.5    9.7    
C      2.1    3.7    1.5     
D      6.2    2.3    5.5

I want to combine columns num1, num2, num3 into arrays (lists), so that A = [0.1, 0.2, 0.3] and create a new column that is this new array. My method so far involves converting them to strings first.

Firm           arrone      

A      [0.1, 0.2, 0.3]   
B      [0.4, 1.5,9.7 ]   
C      [2.1, 3.7, 1.5]     
D      [6.2, 2.3, 5.5]

Next, I want to create combinations using itertools. This has been achieved here

out = df.merge(df,how='cross',suffixes = ('_1', '_2')).query('base1_1<base1_2')

Firm_1   Firm_2           arrone            arrtwo 

A             B   [0.1, 0.2, 0.3]   [0.4, 1.5, 9.7]
A             C   [0.1, 0.2, 0.3]   [2.1, 3.7, 1.5]
A             D   [0.1, 0.2, 0.3]   [6.2, 2.3, 5.5]
B             C   [0.4, 1.5,9.7 ]   [2.1, 3.7, 1.5]  
B             D   [0.4, 1.5,9.7 ]   [6.2, 2.3, 5.5] 
C             D   [2.1, 3.7, 1.5]   [6.2, 2.3, 5.5]
  

Finally, I want to create df['corrcoef'] = correlation between arrone and arrtwo. This link has been helpful. However, the manually inputted toy model has the column in list form. My working dataset is loaded from a csv file. For that, I get the following error when I try the above:

ValueError: x and y must have the same length.

I wish to have a final dataset that looks like this:

Firm_1   Firm_2           arrone            arrtwo    corrcoef

A             B   [0.1, 0.2, 0.3]   [0.4, 1.5, 9.7]        0.8
A             C   [0.1, 0.2, 0.3]   [2.1, 3.7, 1.5]        0.3
A             D   [0.1, 0.2, 0.3]   [6.2, 2.3, 5.5]        0.2
B             C   [0.4, 1.5,9.7 ]   [2.1, 3.7, 1.5]        0.5  
B             D   [0.4, 1.5,9.7 ]   [6.2, 2.3, 5.5]        0.7
C             D   [2.1, 3.7, 1.5]   [6.2, 2.3, 5.5]        0.9
  

CodePudding user response:

If you want to convert your initial DataFrame of columns into one column with lists, you could use:

df[['Firm']].join(pd.Series(df.filter(like='num').to_numpy().tolist(),
                            index=df.index, name='arrone'))

output:

  Firm           arrone
0    A  [0.1, 0.2, 0.3]
1    B  [0.4, 1.5, 9.7]
2    C  [2.1, 3.7, 1.5]
3    D  [6.2, 2.3, 5.5]
  • Related