Correlation Function syntax

Hello,

I'm trying to write a function that takes two arguments: the name of the platform (for instance, "PS2") and the type of review score (for instance, "critic_score").

The function's goal is to compute the Pearson correlation between a platform's total sales and the chosen score type ('critic_score'):

def corfunc(Platform, score_type):
    df5 = df_new[(df_new.platform==Platform)&(df_new[score_type].notna())][['total_sales',score_type]]
    df5.plot(x=score_type, y='total_sales', kind='hexbin', gridsize=20, sharex=False, alpha=1)
    correlation=df5.corr(method='pearson')
    
    if [[correlation > 0.7]]:
        result=print(correlation)
        print('There is a strong positive connection')
    elif [[correlation < -0.7]]:
        print('There is a strong negative connection')
    else:
        result=print('No/weak connection')
    print(result)

Here's an example of the data which is already filtered:

pita=df_new[(df_new.platform=='PS3')&(df_new['critic_score'].notna())][['total_sales','critic_score']]


pita.head(3)

    total_sales  critic_score
16        21.05          97.0
34        13.79          83.0
37        13.33          88.0

I would appreciate some help with the syntax of the if/else statements. When I check the function output, it doesn't seem to take the correlation values into account and always reports a strong positive connection. I would like my function to take into account multiple correlations, rather than just checking the correlation of a variable with itself. Thanks!

corfunc('PS3', 'critic_score')

              total_sales  critic_score
total_sales      1.000000      0.379961
critic_score     0.379961      1.000000
There is a strong positive connection
None

CodePudding user response：

To get the correlation of one variable with another, use Series.corr:

correlation = df["critic_score"].corr(df["total_sales"])

As already noted, the correlation of a variable with itself is always 1.

The reason for the problems in your current code is that df.corr() returns a dataframe. df.corr() > .7 does not return a single True/False value, but another dataframe, containing True where a value in the original dataframe is > .7 and False where it is not.

For this reason df.corr() > .7 cannot be used as the condition in an if statement. You would need to select one of the values within the dataframe, or use (correlation > .7).any(axis=None) to check if any of the values are above .7 (though with a full correlation matrix the 1s on the diagonal make that check true every time). Putting the dataframe in square brackets prevents an error being raised but does not do anything useful, because a non-empty list always counts as true. It's easier to calculate the correlation between the two series instead, as shown above.
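
A minimal sketch of the points in this answer. The frame below is made-up data for illustration, not the asker's df_new:

```python
import pandas as pd

# Made-up data for illustration only; not the asker's df_new.
df = pd.DataFrame({
    "total_sales": [21.05, 13.79, 13.33, 5.10, 2.40],
    "critic_score": [97.0, 83.0, 88.0, 70.0, 75.0],
})

corr_matrix = df.corr(method="pearson")

# Comparing a DataFrame with a number gives a DataFrame of booleans.
mask = corr_matrix > 0.7
print(type(mask).__name__)

# A non-empty list is always truthy, so `if [[mask]]:` always takes the branch.
always_true = bool([[mask]])

# Using the frame itself as a condition raises ValueError.
ambiguous_error = ""
try:
    if mask:
        pass
except ValueError as exc:
    ambiguous_error = str(exc)

# The diagonal of a correlation matrix is 1, so this is True here
# regardless of the off-diagonal value.
any_above = bool(mask.any(axis=None))

# Option 1: pick the single off-diagonal value out of the matrix.
r_from_matrix = corr_matrix.loc["total_sales", "critic_score"]

# Option 2: compute the pairwise correlation directly.
r_direct = df["critic_score"].corr(df["total_sales"])

print(r_from_matrix, r_direct)
```

Both options give the same single number, which can then be compared against 0.7 in an ordinary if statement.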

CodePudding user response：

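import warnings

import numpy as np
from scipy import linalg, special
# Assumption: this code targets an older SciPy release, where these two
# warning classes were exported by scipy.stats; adjust for your version.
from scipy.stats import PearsonRConstantInputWarning, PearsonRNearConstantInputWarning

def pearsonr(x, y):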
    n = len(x)
    if n != len(y):
        raise ValueError('x and y must have the same length.')

    if n < 2:
        raise ValueError('x and y must have length at least 2.')

    x = np.asarray(x)
    y = np.asarray(y)

    # If an input is constant, the correlation coefficient is not defined.
    if (x == x[0]).all() or (y == y[0]).all():
        warnings.warn(PearsonRConstantInputWarning())
        return np.nan, np.nan

    # dtype is the data type for the calculations.  This expression ensures
    # that the data type is at least 64 bit floating point.  It might have
    # more precision if the input is, for example, np.longdouble.
    dtype = type(1.0 + x[0] + y[0])

    if n == 2:
        return dtype(np.sign(x[1] - x[0])*np.sign(y[1] - y[0])), 1.0

    xmean = x.mean(dtype=dtype)
    ymean = y.mean(dtype=dtype)

    # By using `astype(dtype)`, we ensure that the intermediate calculations
    # use at least 64 bit floating point.
    xm = x.astype(dtype) - xmean
    ym = y.astype(dtype) - ymean

    # Unlike np.linalg.norm or the expression sqrt((xm*xm).sum()),
    # scipy.linalg.norm(xm) does not overflow if xm is, for example,
    # [-5e210, 5e210, 3e200, -3e200]
    normxm = linalg.norm(xm)
    normym = linalg.norm(ym)

    threshold = 1e-13
    if normxm < threshold*abs(xmean) or normym < threshold*abs(ymean):
        # If all the values in x (likewise y) are very close to the mean,
        # the loss of precision that occurs in the subtraction xm = x - xmean
        # might result in large errors in r.
        warnings.warn(PearsonRNearConstantInputWarning())

    r = np.dot(xm/normxm, ym/normym)

    # Presumably, if abs(r) > 1, then it is only some small artifact of
    # floating point arithmetic.
    r = max(min(r, 1.0), -1.0)

    ab = n/2 - 1
    prob = 2*special.btdtr(ab, ab, 0.5*(1 - abs(np.float64(r))))

    return r, prob
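
The function above looks like the implementation of scipy.stats.pearsonr from an older SciPy release (an assumption; the answer does not say where it comes from). If SciPy is installed, calling the library function directly is usually simpler. A short sketch using the three sample rows shown in the question:

```python
import pandas as pd
from scipy import stats

# The three sample rows from pita.head(3) in the question.
sample = pd.DataFrame(
    {"total_sales": [21.05, 13.79, 13.33], "critic_score": [97.0, 83.0, 88.0]},
    index=[16, 34, 37],
)

# pearsonr returns the coefficient r and a two-sided p-value.
r, p_value = stats.pearsonr(sample["critic_score"], sample["total_sales"])

# pandas computes the same coefficient, without the p-value.
r_pandas = sample["critic_score"].corr(sample["total_sales"])

print(f"r = {r:.3f}, p = {p_value:.3f}, pandas r = {r_pandas:.3f}")
```

With only three rows the coefficient says little about the full PS3 data; the sketch only shows that the two calls agree.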

CodePudding user response：

The correlation of a variable with itself will always be 1.0, as it is compared against the exact same data. As in your example:

              total_sales  critic_score
total_sales      1.000000      0.379961
critic_score     0.379961      1.000000

The correlation between total_sales and critic_score is 0.379, while the 1s on the diagonal are not very interesting, as they compare each variable with itself.

One way to make it a bit more intuitive is to use heatmaps to visualize the correlations rather than just look at the pure numbers.

I often use seaborn heatmap which can take a DF like this:

import seaborn as sns
# With annot=True you get the values within the squares
# (dataframe here is the correlation matrix, e.g. the result of .corr())
sns.heatmap(dataframe, annot=True)

To incorporate the correct pairwise functionality with the sentences, it would be something like this (as Stuart mentioned):

def corfunc(Platform, score_type):
    df5 = df_new[(df_new.platform==Platform)&(df_new[score_type].notna())][['total_sales',score_type]]
    df5.plot(x=score_type, y='total_sales', kind='hexbin', gridsize=20, sharex=False, alpha=1)
    # Correlate the chosen score column with sales, both from the filtered frame.
    correlation = df5[score_type].corr(df5["total_sales"], method='pearson')

    if correlation > 0.7:
        result = 'There is a strong positive connection'
    elif correlation < -0.7:
        result = 'There is a strong negative connection'
    else:
        result = 'No/weak connection'
    print(correlation)
    print(result)
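
To cover several score columns in one call, as the question asks, the threshold check can be moved into a small helper and applied per column. This is only a sketch under assumptions: describe_strength, correlations_for and the user_score column are hypothetical names, and the frame below is made-up data standing in for df_new.

```python
import pandas as pd

def describe_strength(r, threshold=0.7):
    """Map a Pearson coefficient to the wording used in the question."""
    if r > threshold:
        return "There is a strong positive connection"
    if r < -threshold:
        return "There is a strong negative connection"
    return "No/weak connection"

def correlations_for(data, platform, score_types):
    """Return {score_type: (r, description)} for one platform."""
    subset = data[data["platform"] == platform]
    results = {}
    for score_type in score_types:
        # Series.corr skips rows where either value is NaN.
        r = subset[score_type].corr(subset["total_sales"], method="pearson")
        results[score_type] = (r, describe_strength(r))
    return results

# Made-up stand-in for df_new (hypothetical values).
demo = pd.DataFrame({
    "platform": ["PS3"] * 5 + ["PS2"] * 2,
    "total_sales": [1.0, 2.0, 3.0, 4.0, 5.0, 9.0, 1.0],
    "critic_score": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "user_score": [5.0, 4.0, 3.0, 2.0, None, 8.0, 8.0],
})

summary = correlations_for(demo, "PS3", ["critic_score", "user_score"])
for name, (r, text) in summary.items():
    print(name, round(r, 3), text)
```

Returning the values instead of only printing them also avoids the stray None in the original output, which came from assigning the return value of print to result.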