I'm new to NLP and text analysis. I have a dataframe of tokens and their tf-idf scores from some text data I am working with. For example:
input
df=
|article |token1|token2|token3|token4|token5|
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
The tokens are in alphabetical order. I'm trying to compute the correlation between each pair of adjacent columns and append the results as a row at the bottom of the dataframe. The output would look something like this:
desired output
df=
|article |token1 |token2 |token3 |token4 |token5 |
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
|Corr    |Corr1-2|Corr2-3|Corr3-4|Corr4-5|NaN    |
I know that I could use df.corr(), but that won't yield the expected output. I would think that looping over columns could get there, but I'm not really sure where to start. Does anyone have an idea on how to achieve this?
CodePudding user response:
Use:
df2 = df.set_index('article')
# Shift the columns one step left, then correlate each column with its
# right-hand neighbour; the last column has no neighbour, so it gets NaN.
df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
print(df2)
token1 token2 token3 token4 token5
article
article1 0.00 0.04 0.03 0.00 0.1
article2 0.07 0.00 0.14 0.04 0.0
Corr -1.00 -1.00 1.00 -1.00 NaN
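To see why the shift trick works, here is a minimal runnable sketch that rebuilds the question's example frame (the intermediate `shifted` name is just for illustration). After `shift(-1, axis=1)`, the column labelled `token1` actually holds the original `token2` values, so `corrwith` (which pairs columns by name) ends up correlating each column with its right-hand neighbour:

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "article": ["article1", "article2"],
    "token1": [0.00, 0.07],
    "token2": [0.04, 0.00],
    "token3": [0.03, 0.14],
    "token4": [0.00, 0.04],
    "token5": [0.10, 0.00],
})
df2 = df.set_index("article")

# Each column of `shifted` holds its right-hand neighbour's values.
shifted = df2.shift(-1, axis=1)

# corrwith pairs columns by name, so corr(token1, shifted["token1"])
# is really corr(token1, token2): the adjacent-pair correlation.
adjacent = df2.corrwith(shifted)
df2.loc["Corr"] = adjacent
print(df2)
```

Note that with only two articles every defined correlation is exactly ±1, which is why the output above is all -1.00 and 1.00; with more rows you would see intermediate values.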
CodePudding user response:
A random dataframe:
df = pd.DataFrame({
"article": ["article1", "article2", "article3", "article4"],
"token1": [0.00, 0.03, 0.04, 0.00],
"token2": [0.07, 0.00, 0.01, 0.05],
"token3": [0.09, 0.08, 0.07, 0.06],
"token4": [0.00, 0.03, 0.05, 0.08],
"token5": [0.01, 0.04, 0.01, 0.02],
"token6": [0.00, 0.02, 0.04, 0.06],
})
Calculate the correlation for each adjacent pair of columns as a sub-dataframe:
for i in range(2, len(df.columns)):
    sub_df = df.iloc[:, [i-1, i]]
    print(sub_df.columns)
    print(sub_df.corr())
    print("\n")
Sample result (the first pair's output is omitted):
Index(['token2', 'token3'], dtype='object')
token2 token3
token2 1.000000 0.195366
token3 0.195366 1.000000
Index(['token3', 'token4'], dtype='object')
token3 token4
token3 1.000000 -0.997054
token4 -0.997054 1.000000
Index(['token4', 'token5'], dtype='object')
token4 token5
token4 1.000000 0.070014
token5 0.070014 1.000000
Index(['token5', 'token6'], dtype='object')
token5 token6
token5 1.000000e+00 -9.897366e-17
token6 -9.897366e-17 1.000000e+00
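The loop above prints a full 2x2 matrix per pair, but the question only needs the single off-diagonal value appended as a row. A sketch of that last step, reusing the random dataframe above and `Series.corr` to pull out just the pairwise coefficient (the `float("nan")` filler for the last column is my choice, matching the NaN in the desired output):

```python
import pandas as pd

# The example frame from the answer above.
df = pd.DataFrame({
    "article": ["article1", "article2", "article3", "article4"],
    "token1": [0.00, 0.03, 0.04, 0.00],
    "token2": [0.07, 0.00, 0.01, 0.05],
    "token3": [0.09, 0.08, 0.07, 0.06],
    "token4": [0.00, 0.03, 0.05, 0.08],
    "token5": [0.01, 0.04, 0.01, 0.02],
    "token6": [0.00, 0.02, 0.04, 0.06],
})

tokens = df.columns[1:]  # skip the "article" column

# Series.corr gives the single pairwise coefficient directly,
# i.e. the off-diagonal entry of each 2x2 matrix printed above.
corrs = [df[a].corr(df[b]) for a, b in zip(tokens, tokens[1:])]

# Append as a row; the last token column has no right-hand neighbour.
df.loc[len(df)] = ["Corr"] + corrs + [float("nan")]
print(df)
```

This reproduces the coefficients shown in the matrices above (e.g. 0.195366 for token2/token3, -0.997054 for token3/token4) in a single appended row.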