How to calculate correlation between adjacent columns throughout a dataframe and add it to the dataf

Time:08-05

I'm new to NLP and text analysis. I have a dataframe of tokens and their tf-idf scores from some text data I am working with. For example:

input 
df=
|article |token1|token2|token3|token4|token5|
|article1|.00   |.04   |.03   |.00   |.10   |
|article2|.07   |.00   |.14   |.04   |.00   |

The tokens are in alphabetical order; I'm trying to get the correlation between adjacent columns throughout the dataframe and append it to the dataframe. The output would look something like this:

desired output
df=
|article |token1 |token2 |token3 |token4 |token5 |
|article1|.00    |.04    |.03    |.00    |.10    |
|article2|.07    |.00    |.14    |.04    |.00    |
|Corr    |Corr1-2|Corr2-3|Corr3-4|Corr4-5|NaN    |

I know that I could use df.corr(), but that won't yield the expected output. I would think that looping over columns could get there, but I'm not really sure where to start. Does anyone have an idea on how to achieve this?

CodePudding user response:

Use:

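# Keep only the numeric token columns; 'article' becomes the index.
df2 = df.set_index('article')
# shift(-1, axis=1) moves each column one place to the left, so corrwith
# (which pairs columns by label) correlates every column with its right neighbour.
df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
print(df2)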
          token1  token2  token3  token4  token5
article                                         
article1    0.00    0.04    0.03    0.00     0.1
article2    0.07    0.00    0.14    0.04     0.0
Corr       -1.00   -1.00    1.00   -1.00     NaN
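
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the question's two-row table by hand and cross-checks the `corrwith`/`shift` result against an explicit pairwise `Series.corr`. The names `shifted`, `adjacent_corr`, `cols` and `manual` are only illustrative and not part of the answer above.

```python
import pandas as pd

# Data copied from the example table in the question.
df = pd.DataFrame({
    "article": ["article1", "article2"],
    "token1": [0.00, 0.07],
    "token2": [0.04, 0.00],
    "token3": [0.03, 0.14],
    "token4": [0.00, 0.04],
    "token5": [0.10, 0.00],
})

df2 = df.set_index("article")

# After the shift, the values of token2 sit under the label token1, and so on;
# the last column has no right-hand neighbour and becomes all NaN.
shifted = df2.shift(-1, axis=1)
adjacent_corr = df2.corrwith(shifted)

# Cross-check with an explicit loop over neighbouring column pairs
# (illustrative only; assumes the columns are already in the desired order).
cols = list(df2.columns)
manual = pd.Series(
    [df2[a].corr(df2[b]) for a, b in zip(cols, cols[1:])] + [float("nan")],
    index=cols,
)

df2.loc["Corr"] = adjacent_corr
print(df2)
```

Note that with only two articles every defined Pearson correlation comes out as +1 or -1 (two points always lie on a line), so the values only become informative with more rows.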

CodePudding user response:

Sample dataframe:

import pandas as pd

df = pd.DataFrame({
    "article": ["article1", "article2", "article3", "article4"],
    "token1": [0.00, 0.03, 0.04, 0.00],
    "token2": [0.07, 0.00, 0.01, 0.05],
    "token3": [0.09, 0.08, 0.07, 0.06],
    "token4": [0.00, 0.03, 0.05, 0.08],
    "token5": [0.01, 0.04, 0.01, 0.02],
    "token6": [0.00, 0.02, 0.04, 0.06],
})

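Calculate the correlation for each pair of adjacent columns:

# Start at 2 so that i-1 skips column 0 ("article") and begins at token1.
for i in range(2, len(df.columns)):
    sub_df = df.iloc[:, [i-1, i]]   # two neighbouring token columns
    print(sub_df.columns)
    print(sub_df.corr())            # 2x2 correlation matrix for the pair
    print("\n")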

Sample result (first lines truncated):

...
token3  0.195366  1.000000


Index(['token3', 'token4'], dtype='object')
          token3    token4
token3  1.000000 -0.997054
token4 -0.997054  1.000000


Index(['token4', 'token5'], dtype='object')
          token4    token5
token4  1.000000  0.070014
token5  0.070014  1.000000


Index(['token5', 'token6'], dtype='object')
              token5        token6
token5  1.000000e+00 -9.897366e-17
token6 -9.897366e-17  1.000000e+00
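
The loop above only prints each 2x2 matrix. To get the single "Corr" row the question asks for, one possible sketch is to collect the off-diagonal value for each pair and append it as a row. This uses the same sample dataframe; the names `tokens`, `cols` and `corr_row` are made up for illustration, and the last column is assumed to get NaN because it has no right-hand neighbour.

```python
import pandas as pd

df = pd.DataFrame({
    "article": ["article1", "article2", "article3", "article4"],
    "token1": [0.00, 0.03, 0.04, 0.00],
    "token2": [0.07, 0.00, 0.01, 0.05],
    "token3": [0.09, 0.08, 0.07, 0.06],
    "token4": [0.00, 0.03, 0.05, 0.08],
    "token5": [0.01, 0.04, 0.01, 0.02],
    "token6": [0.00, 0.02, 0.04, 0.06],
})

# Move 'article' into the index so only numeric columns remain.
tokens = df.set_index("article")
cols = list(tokens.columns)

# Correlate each column with its right-hand neighbour. This is computed
# before the new row is added, so the "Corr" row never feeds back into it.
corr_row = {a: tokens[a].corr(tokens[b]) for a, b in zip(cols, cols[1:])}
corr_row[cols[-1]] = float("nan")  # last column has no neighbour

# Append the correlations as a new row labelled "Corr".
tokens.loc["Corr"] = pd.Series(corr_row)
print(tokens.round(6))
```

The values in the "Corr" row should match the off-diagonal entries printed by the loop, e.g. the token3 entry corresponds to the token3/token4 matrix.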