How to pass the whole dataframe and the index of the row being operated upon to the apply() method

Time:07-25

How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?

Specifically, I have a dataframe correlation_df with the following data:

id scores cosine
1 100 0.8
2 75 0.7
3 50 0.4
4 25 0.05

I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.

My understanding is that I should do this with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.

NB. Problem code:

import numpy as np
import pandas as pd

score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
    {
        "score": score,
        "cosine": cosine,
    }
)
corr = correlation_df.corr().values[0, 1]

[Edit] Roundabout solution that I'm sure can be improved:

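# my_fuct reads an "index" column, so move the row index into a column first.
correlation_df = correlation_df.reset_index()

def my_fuct(row):
    i = int(row["index"])  # position of the row to leave out
    r = list(range(correlation_df.shape[0]))
    r.remove(i)
    subset = correlation_df.iloc[r, :].copy()
    subset = subset.set_index("index")
    return subset.corr().values[0, 1]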

correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)

CodePudding user response:

Your problem can be simplified to the following, where df is your correlation_df:

>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
   score  cosine  diff_correlations
0    100    0.80           0.999015
1     75    0.70           0.988522
2     50    0.40           0.977951
3     25    0.05           0.960769

A more efficient method computes only the score/cosine correlation, so the whole correlation matrix isn't built for every row (the := operator requires Python 3.8 or later):

df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
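
For reference, here is a self-contained sketch of the same idea written with a named function instead of a lambda, using the data from the question. The helper name loo_corr is only illustrative; the sketch assumes the default integer index and checks that the Series-based version agrees with the matrix-based one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "score": [100, 75, 50, 25],
        "cosine": [0.8, 0.7, 0.4, 0.05],
    }
)


def loo_corr(row):
    """Correlation of score and cosine with the current row left out."""
    # With axis=1, row.name is the index label of the row being processed.
    rest = df.drop(index=row.name)
    return rest["score"].corr(rest["cosine"])


# Series-based version: only the one correlation we need is computed.
loo = df.apply(loo_corr, axis=1)

# Matrix-based version from the first snippet, for comparison.
matrix_based = df.apply(lambda x: df.drop(x.name).corr().iat[0, 1], axis=1)

assert np.allclose(loo, matrix_based)

df["diff_correlations"] = loo
print(df)
```

Both versions close over df rather than receiving it as an argument, which is usually enough; if you prefer to pass it explicitly, apply forwards extra keyword arguments to the function.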

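The index can be accessed inside apply with .name or .index, depending on the axis. With axis=1 each row is passed as a Series whose .name is that row's index label; with axis=0 each column is passed as a Series whose .index holds all the row labels: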

>>> correlation_df.apply(lambda x: x.name, axis=1)
0    0
1    1
2    2
3    3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
   score  cosine
0      0       0
1      1       1
2      2       2
3      3       3
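
Because .name is the row's label rather than its position, the same pattern should also work with a non-default index. A minimal sketch, assuming made-up string labels "a" to "d" (they are not part of the question's data):

```python
import pandas as pd

labelled = pd.DataFrame(
    {
        "score": [100, 75, 50, 25],
        "cosine": [0.8, 0.7, 0.4, 0.05],
    },
    index=["a", "b", "c", "d"],  # hypothetical labels, for illustration only
)

# With axis=1, row.name is the index label of each row.
names = labelled.apply(lambda row: row.name, axis=1)
print(names.tolist())


def corr_without(row):
    # drop() works on labels, so row.name can be passed straight to it.
    rest = labelled.drop(index=row.name)
    return rest["score"].corr(rest["cosine"])


labelled["diff_correlations"] = labelled.apply(corr_without, axis=1)
print(labelled)
```

Note that drop removes every row carrying the given label, so this relies on the index labels being unique.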

CodePudding user response:

Using

correlation_df = correlation_df.reset_index()

gives you a new column named index that holds each row's label, namely what was previously your index. Now, when using DataFrame.apply with axis=1, access it via:

correlation_df.apply(lambda r: r["index"], axis=1)

After you are done you could do:

correlation_df = correlation_df.set_index("index")

to get your previous format back.
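
Put together, the round trip might look like the sketch below. The names with_index, corr_without_row and restored are illustrative, and the data is the question's; treat it as one possible arrangement rather than the only way to do it.

```python
import pandas as pd

correlation_df = pd.DataFrame(
    {
        "score": [100, 75, 50, 25],
        "cosine": [0.8, 0.7, 0.4, 0.05],
    }
)

# Step 1: move the row index into an ordinary column called "index".
with_index = correlation_df.reset_index()


def corr_without_row(row):
    # A row mixing ints and floats is upcast to float, hence the int().
    i = int(row["index"])
    rest = with_index[with_index["index"] != i]
    return rest["score"].corr(rest["cosine"])


# Step 2: use the "index" column inside apply (axis=1 passes rows).
with_index["diff_correlations"] = with_index.apply(corr_without_row, axis=1)

# Step 3: turn the "index" column back into the row index.
restored = with_index.set_index("index")
print(restored)
```

After set_index the index is named "index"; calling rename_axis(None) on the result removes that name if you want the original unnamed index back.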
