How to pass the whole dataframe and the index of the row being operated upon to the apply() method

How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?

Specifically, I have a dataframe correlation_df with the following data:

id	scores	cosine
1	100	0.8
2	75	0.7
3	50	0.4
4	25	0.05

I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.

My understanding is that I should do this with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can exclude that row from the correlation calculation.

NB. Problem code:

import numpy as np
import pandas as pd

score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
    {
        "score": score,
        "cosine": cosine,
    }
)
corr = correlation_df.corr().values[0, 1]

[Edit] Roundabout solution that I'm sure can be improved:

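# row["index"] below only exists once the row labels are moved into a column.
correlation_df = correlation_df.reset_index()

def my_fuct(row):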
    i = int(row["index"])
    r = list(range(correlation_df.shape[0]))
    r.remove(i)
    subset = correlation_df.iloc[r, :].copy()
    subset = subset.set_index("index")
    return subset.corr().values[0, 1]

correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)

CodePudding user response：

Your problem can be simplified to the following, where df is your correlation_df:

>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
   score  cosine  diff_correlations
0    100    0.80           0.999015
1     75    0.70           0.988522
2     50    0.40           0.977951
3     25    0.05           0.960769

A more efficient variant computes only the score/cosine correlation, so the whole correlation matrix isn't built for every row (the := operator requires Python 3.8 or later):

df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
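
If you would rather pass the dataframe in explicitly than rely on the lambda closing over df, a minimal sketch is below. It uses the args parameter of DataFrame.apply to forward the frame to a named function. The function name leave_one_out_corr and its frame parameter are my own, not something from your code.

```python
import numpy as np
import pandas as pd

correlation_df = pd.DataFrame(
    {
        "score": np.array([100, 75, 50, 25]),
        "cosine": np.array([0.8, 0.7, 0.4, 0.05]),
    }
)


def leave_one_out_corr(row, frame):
    """Correlate score and cosine over every row of `frame` except `row`."""
    # With axis=1, row.name is the index label of the current row.
    rest = frame.drop(index=row.name)
    return rest["score"].corr(rest["cosine"])


# `args` forwards extra positional arguments to the function after the row.
result = correlation_df.apply(leave_one_out_corr, axis=1, args=(correlation_df,))
correlation_df["diff_correlations"] = result
print(correlation_df)
```

This should give the same values as the one-liners above; it just makes both inputs (the row and the whole frame) visible in the function signature.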

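Inside apply, the index is available as .name or .index, depending on the axis: with axis=1, x.name is the label of the current row; with axis=0, x.index holds the row labels of the current column: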

>>> correlation_df.apply(lambda x: x.name, axis=1)
0    0
1    1
2    2
3    3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
   score  cosine
0      0       0
1      1       1
2      2       2
3      3       3
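
Note that .name is the row label, not its position, which is why it can be handed straight to drop(). A small sketch with string labels makes the difference visible; the frame labelled and its index values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string labels, so labels and positions differ.
labelled = pd.DataFrame(
    {"score": [100, 75, 50], "cosine": [0.8, 0.7, 0.4]},
    index=["a", "b", "c"],
)

# x.name is the row label of the row currently being processed.
names = labelled.apply(lambda x: x.name, axis=1)

# drop() takes labels, so x.name removes exactly the current row.
remaining = labelled.apply(lambda x: len(labelled.drop(x.name)), axis=1)

print(names.tolist())
print(remaining.tolist())
```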

CodePudding user response：

Using

correlation_df = correlation_df.reset_index()

gives you a new column named index that holds the row labels, i.e. what was previously your index. When using DataFrame.apply with axis=1, access it via:

correlation_df.apply(lambda r: r["index"], axis=1)

After you are done you could do:

correlation_df = correlation_df.set_index("index")

to get your previous format back.
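
Put together, the round trip might look like the sketch below, applied to the data from the question. The names with_index, corr_without_row and restored are placeholders of mine. One caveat: after set_index("index") the index carries the name "index", whereas the original index was unnamed.

```python
import pandas as pd

correlation_df = pd.DataFrame(
    {"score": [100, 75, 50, 25], "cosine": [0.8, 0.7, 0.4, 0.05]}
)

# Move the row labels into a regular column named "index".
with_index = correlation_df.reset_index()


def corr_without_row(row):
    # The row is upcast to float because of the cosine column, so cast back.
    i = int(row["index"])
    # Keep every row except the one currently being processed.
    rest = with_index[with_index["index"] != i]
    return rest["score"].corr(rest["cosine"])


with_index["diff_correlations"] = with_index.apply(corr_without_row, axis=1)

# Restore the labels; the index is now named "index" rather than unnamed.
restored = with_index.set_index("index")
print(restored)
```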