Filter pandas dataframe by condition on column of np arrays-CodePudding

I have a dataframe where one column is a column of arrays. For the particular example below, I have a column called price_array where each row (unique by supplier) has an array of prices with length 3 representing 3 items. The function I'm creating should work on a variable number of items which is why I like the prices in an array rather than a column for price0, price1, price2.

df = pd.DataFrame({'supplier':['Abc Co','XYZ Inc','Bob and Sam'],
                   'price_array':[[2.00,3.50,1.00],[7.00,5.00,1.05],[2.00,10.50,3.40]],
                  'price_sum':[6.50,13.05,15.90],
                  'price0':[2.00,7.00,2.00]})

I'd like to be able to filter the dataframe on conditions of price_array. For illustration, I made extra columns for the sum of the array and the first element of the array. With those columns, I can filter like:

display(df[df.price_sum>10])
display(df[df.price0==2.00])

What I want to do is something like below, but I can't get the syntax to work (admittedly I am a python novice):

display(df[np.sum(df.price_array)>10.00])
display(df[df.price_array[0]==2.00])

Is something like this possible? What would be the correct syntax? Another option would be to not have the array column and instead create a variable number of columns with variable names (price0,price1,price2, etc.) but that seems like it could be cumbersome to reference, I'd want to be able to reference those columns based on a stored variable of the item number.

CodePudding user response：

Following code is how you this is done:

import pandas as pd
import numpy as np


df = pd.DataFrame({'supplier':['Abc Co','XYZ Inc','Bob and Sam'],
                   'price_array':[[2.00,3.50,1.00],[7.00,5.00,1.05],[2.00,10.50,3.40]],
                  'price_sum':[6.50,13.05,15.90],
                  'price0':[2.00,7.00,2.00]})
df['price_sum'] = df['price_array'].apply(lambda x: sum(x))
df['price0'] = df['price_array'].apply(lambda x: x[0])
print(df)


#          supplier       price_array  price_sum  price0
#  0       Abc Co   [2.0, 3.5, 1.0]       6.50     2.0
#  1      XYZ Inc  [7.0, 5.0, 1.05]      13.05     7.0
#  2  Bob and Sam  [2.0, 10.5, 3.4]      15.90     2.0

CodePudding user response：

These are my two solutions.

df = df.sort_values(by='price_array', key=lambda x: x[1])
bool_mask = [x[0] == 2 for x in df['price_array']]
df = df[bool_mask]

The first one returns a dataframe sorted by the values in the array using a lambda.

The second one creates a boolean mask by checking each row for a condition. It looks like this after [True, False, True]. Using that on the dataframe creates a frame where only the True rows are in the frame.

CodePudding user response：

I found a similar question at here, take a look and I hope this will help you.