Applying a function to a DataFrame with a vector returns an axis-related error

I have the following function, DataFrame, and vector. Why am I getting an error?

import pandas as pd
import numpy as np

def vanilla_vec_similarity(x, y):
  x.drop('request_id', axis=1, inplace=True).values.flatten().tolist()
  y.drop('request_id', axis=1, inplace=True).values.flatten().tolist()
  res = (np.array(x) == np.array(y)).astype(int)
  return res.mean()


test_df = pd.DataFrame({'request_id': [55, 42, 13], 'a': ['x','y','z'], 'b':[1,2,3], 'c': [1.0, -1.8, 19.113]})
test_vec = pd.DataFrame([[123,'x',1.1, -1.8]], columns=['request_id', 'a', 'b', 'c'])

test_df['similarity'] = test_df.apply(lambda x: vanilla_vec_similarity(x, test_vec), axis=1)



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    367         try:
--> 368             return cls._AXIS_TO_AXIS_NUMBER[axis]
    369         except KeyError:

KeyError: 1

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
10 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    368             return cls._AXIS_TO_AXIS_NUMBER[axis]
    369         except KeyError:
--> 370             raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
    371 
    372     @classmethod

ValueError: No axis named 1 for object type Series

CodePudding user response:

You can make this code work with the following changes:

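def vanilla_vec_similarity(x, y):
    # drop() returns a new object, so keep the result instead of discarding it;
    # the values stay as object arrays so strings and numbers are not coerced
    x = x.drop('request_id', axis=1).values.flatten()
    y = y.drop('request_id', axis=1).values.flatten()
    # element-wise comparison; the mean is the fraction of matching values
    res = (np.array(x) == np.array(y)).astype(int)
    return res.mean()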


test_df = pd.DataFrame({'request_id': [55, 42, 13], 'a': ['x','y','z'], 'b':[1,2,3], 'c': [1.0, -1.8, 19.113]})
test_vec = pd.DataFrame([[123,'x',1.1, -1.8]], columns=['request_id', 'a', 'b', 'c'])

test_df['similarity'] = test_df.apply(lambda x: vanilla_vec_similarity(x.to_frame().T, test_vec), axis=1)

Explanation:

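First, when you call test_df.apply(lambda x: vanilla_vec_similarity(x, test_vec), axis=1), each row is passed to the function as a Series whose index holds the column names.
The code breaks because a Series has only one axis (axis 0), so x.drop('request_id', axis=1) raises "ValueError: No axis named 1 for object type Series". Converting the row back to a one-row DataFrame with x.to_frame().T makes axis=1 valid again.
You should also not use inplace=True here: with inplace=True, drop returns None, so the chained .values call would fail.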
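The points above can be checked in isolation. The following is a minimal sketch, assuming a cut-down two-column version of test_df; the names row_types, axis_error, remaining_labels and inplace_result are illustrative and not part of the original code.

```python
import pandas as pd

# Cut-down stand-in for test_df (assumption: two columns are enough here).
test_df = pd.DataFrame({'request_id': [55, 42, 13], 'a': ['x', 'y', 'z']})

# With axis=1, apply hands each row to the function as a Series.
row_types = test_df.apply(lambda row: type(row).__name__, axis=1).tolist()

row = test_df.iloc[0]  # a single row, also a Series

# A Series has only axis 0, so asking for axis=1 raises ValueError.
try:
    row.drop('request_id', axis=1)
    axis_error = None
except ValueError as exc:
    axis_error = str(exc)

# Without axis=1 the label is removed from the Series index.
remaining_labels = row.drop('request_id').index.tolist()

# inplace=True makes drop return None, so nothing can be chained after it.
frame = test_df.copy()
inplace_result = frame.drop('request_id', axis=1, inplace=True)

print(row_types)
print(axis_error)
print(remaining_labels)
print(inplace_result)
```

So inside the applied function, either drop the label without axis=1 or convert the row to a one-row DataFrame first.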

Or you can just use:

test_df['similarity'] = test_df.apply(lambda x: x.iloc[1:].eq(test_vec.loc[0].iloc[1:]).mean(), axis=1)

Or, if you define test_vec as a Series instead of a DataFrame:

test_vec = pd.Series([123,'x',1.1, -1.8], index=['request_id', 'a', 'b', 'c'])
test_df['similarity'] = test_df.apply(lambda x: x.iloc[1:].eq(test_vec.iloc[1:]).mean(), axis=1)
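If the row-wise apply becomes slow on a larger frame, the same comparison can probably be done in one step with NumPy broadcasting. This is a sketch of an alternative that the answer above does not propose; it assumes test_vec is the Series form, and feature_cols, rows, target and similarity are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'request_id': [55, 42, 13], 'a': ['x', 'y', 'z'],
                        'b': [1, 2, 3], 'c': [1.0, -1.8, 19.113]})
test_vec = pd.Series([123, 'x', 1.1, -1.8], index=['request_id', 'a', 'b', 'c'])

# Compare every column except the identifier.
feature_cols = [col for col in test_df.columns if col != 'request_id']

# Object dtype keeps strings and numbers side by side without coercion.
rows = test_df[feature_cols].to_numpy(dtype=object)     # shape (3, 3)
target = test_vec[feature_cols].to_numpy(dtype=object)  # shape (3,)

# Broadcasting compares the target vector against every row at once.
matches = (rows == target).astype(float)

# Fraction of matching features per row.
similarity = matches.mean(axis=1)
test_df['similarity'] = similarity
print(test_df)
```

On the sample data, the rows with request_id 55 and 42 each match one of the three features and the last row matches none, which agrees with the apply-based versions.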