Why is the correlation one when values differ?-CodePudding

I have a dataframe book_matrix with users as rows, books as columns, and ratings as values. When I use corrwith() to compute the correlation between 'The Lord of the Rings' and 'The Silmarillion' the result is 1.0, but the values are clearly different.

The non-null values [10, 3] and [10, 9] have correlation 1.0. I would expect them to be exactly the same when the correlation is equal to one. How can this happen?

CodePudding user response：

Correlation means the values have a certain relationship with one another, for example linear combination of factors. Here's an illustration:

import pandas as pd
  
df1 = pd.DataFrame({"A":[1, 2, 3, 4], 
                    "B":[5, 8, 4, 3],
                    "C":[10, 4, 9, 3]})
  
df2 = pd.DataFrame({"A":[2, 4, 6, 8],
                    "B":[-5, -8, -4, -3],
                    "C":[4, 3, 8, 5]})

df1.corrwith(df2, axis=0)

A    1.000000
B   -1.000000
C    0.395437
dtype: float64

So you can see that [1, 2, 3, 4] and [2, 4, 6, 8] have correlation 1.0

The next column [5, 8, 4, 3] and [-5, -8, -4, -3] have extreme negative correlation -1.0

In the last column, [10, 4, 9, 3] and [4, 3, 8, 5] are somewhat correlated 0.395437, because both exhibits high-low-high-low sequence but with varying vertical scaling factors.

So in your case both books 'The Lord of the Rings' and 'The Silmarillion' only has 2 ratings each, and both ratings are having high-low sequence. Even if I illustrate with more data points, they have the same vertical scaling factor.

df1 = pd.DataFrame({"A": [10, 3, 10, 3, 10, 3],
                    "B": [10, 3, 10, 3, 10, 3]})
df2 = pd.DataFrame({"A": [10, 9, 10, 9, 10, 9],
                    "B": [10, 10, 10, 9, 9, 9]})

df1.corrwith(df2, axis=0)

A    1.000000
B    0.333333
dtype: float64

So you can see that [10, 3, 10, 3, 10, 3] and [10, 9, 10, 9, 10, 9] are also correlated perfectly at 1.0.

But if I rearrange the sequence a little, [10, 3, 10, 3, 10, 3] and [10, 10, 10, 9, 9, 9] are not perfectly correlated anymore at 0.333333

So going forward, you need more data, and more variations in the data! Hope that helps