I'm new to NLP and text analysis. I have a dataframe of tokens and their tf-idf scores from some text data I am working with. For example:
input
df=
|article |token1|token2|token3|token4|token5|
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
The tokens are in alphabetical order. I'm trying to compute the correlation between each pair of adjacent columns and append the results as a row at the bottom of the dataframe. The output would look something like this:
desired output
df=
|article |token1 |token2 |token3 |token4 |token5 |
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
|Corr    |Corr1-2|Corr2-3|Corr3-4|Corr4-5|NaN    |
I know that I could use df.corr(), but that won't yield the expected output. I would think that looping over columns could get there, but I'm not really sure where to start. Does anyone have an idea on how to achieve this?
CodePudding user response:
Use:
df2 = df.set_index('article')
# Shift the columns one step left, then correlate each column with its
# right-hand neighbour; the last column has no neighbour, so it gets NaN.
df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
print(df2)
token1 token2 token3 token4 token5
article
article1 0.00 0.04 0.03 0.00 0.1
article2 0.07 0.00 0.14 0.04 0.0
Corr -1.00 -1.00 1.00 -1.00 NaN
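To see why the shift trick works, here is a minimal runnable sketch that rebuilds the question's example frame (the intermediate `shifted` name is just for illustration). After `shift(-1, axis=1)`, the column labelled `token1` actually holds the original `token2` values, so `corrwith` (which pairs columns by name) ends up correlating each column with its right-hand neighbour:

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "article": ["article1", "article2"],
    "token1": [0.00, 0.07],
    "token2": [0.04, 0.00],
    "token3": [0.03, 0.14],
    "token4": [0.00, 0.04],
    "token5": [0.10, 0.00],
})
df2 = df.set_index("article")

# Each column of `shifted` holds its right-hand neighbour's values.
shifted = df2.shift(-1, axis=1)

# corrwith pairs columns by name, so corr(token1, shifted["token1"])
# is really corr(token1, token2): the adjacent-pair correlation.
adjacent = df2.corrwith(shifted)
df2.loc["Corr"] = adjacent
print(df2)
```

Note that with only two articles every defined correlation is exactly ±1, which is why the output above is all -1.00 and 1.00; with more rows you would see intermediate values.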
CodePudding user response:
A random dataframe:
df = pd.DataFrame({
"article": ["article1", "article2", "article3", "article4"],
"token1": [0.00, 0.03, 0.04, 0.00],
"token2": [0.07, 0.00, 0.01, 0.05],
"token3": [0.09, 0.08, 0.07, 0.06],
"token4": [0.00, 0.03, 0.05, 0.08],
"token5": [0.01, 0.04, 0.01, 0.02],
"token6": [0.00, 0.02, 0.04, 0.06],
})
Calculate the correlation for each adjacent pair of columns as a sub-dataframe:
for i in range(2, len(df.columns)):
    sub_df = df.iloc[:, [i-1, i]]
    print(sub_df.columns)
    print(sub_df.corr())
    print("\n")
Sample result (the first pair's output is omitted):
Index(['token2', 'token3'], dtype='object')
token2 token3
token2 1.000000 0.195366
token3 0.195366 1.000000
Index(['token3', 'token4'], dtype='object')
token3 token4
token3 1.000000 -0.997054
token4 -0.997054 1.000000
Index(['token4', 'token5'], dtype='object')
token4 token5
token4 1.000000 0.070014
token5 0.070014 1.000000
Index(['token5', 'token6'], dtype='object')
token5 token6
token5 1.000000e+00 -9.897366e-17
token6 -9.897366e-17 1.000000e+00
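The loop above prints a full 2x2 matrix per pair, but the question only needs the single off-diagonal value appended as a row. A sketch of that last step, reusing the random dataframe above and `Series.corr` to pull out just the pairwise coefficient (the `float("nan")` filler for the last column is my choice, matching the NaN in the desired output):

```python
import pandas as pd

# The example frame from the answer above.
df = pd.DataFrame({
    "article": ["article1", "article2", "article3", "article4"],
    "token1": [0.00, 0.03, 0.04, 0.00],
    "token2": [0.07, 0.00, 0.01, 0.05],
    "token3": [0.09, 0.08, 0.07, 0.06],
    "token4": [0.00, 0.03, 0.05, 0.08],
    "token5": [0.01, 0.04, 0.01, 0.02],
    "token6": [0.00, 0.02, 0.04, 0.06],
})

tokens = df.columns[1:]  # skip the "article" column

# Series.corr gives the single pairwise coefficient directly,
# i.e. the off-diagonal entry of each 2x2 matrix printed above.
corrs = [df[a].corr(df[b]) for a, b in zip(tokens, tokens[1:])]

# Append as a row; the last token column has no right-hand neighbour.
df.loc[len(df)] = ["Corr"] + corrs + [float("nan")]
print(df)
```

This reproduces the coefficients shown in the matrices above (e.g. 0.195366 for token2/token3, -0.997054 for token3/token4) in a single appended row.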