I have a dataframe where one column is a column of arrays. For the particular example below, I have a column called price_array where each row (unique by supplier) has an array of prices with length 3 representing 3 items. The function I'm creating should work on a variable number of items which is why I like the prices in an array rather than a column for price0, price1, price2.
df = pd.DataFrame({'supplier':['Abc Co','XYZ Inc','Bob and Sam'],
'price_array':[[2.00,3.50,1.00],[7.00,5.00,1.05],[2.00,10.50,3.40]],
'price_sum':[6.50,13.05,15.90],
'price0':[2.00,7.00,2.00]})
I'd like to be able to filter the dataframe on conditions of price_array. For illustration, I made extra columns for the sum of the array and the first element of the array. With those columns, I can filter like:
display(df[df.price_sum>10])
display(df[df.price0==2.00])
What I want to do is something like below, but I can't get the syntax to work (admittedly I am a python novice):
display(df[np.sum(df.price_array)>10.00])
display(df[df.price_array[0]==2.00])
Is something like this possible? What would be the correct syntax? Another option would be to not have the array column and instead create a variable number of columns with variable names (price0,price1,price2, etc.) but that seems like it could be cumbersome to reference, I'd want to be able to reference those columns based on a stored variable of the item number.
CodePudding user response:
Following code is how you this is done:
import pandas as pd
import numpy as np
df = pd.DataFrame({'supplier':['Abc Co','XYZ Inc','Bob and Sam'],
'price_array':[[2.00,3.50,1.00],[7.00,5.00,1.05],[2.00,10.50,3.40]],
'price_sum':[6.50,13.05,15.90],
'price0':[2.00,7.00,2.00]})
df['price_sum'] = df['price_array'].apply(lambda x: sum(x))
df['price0'] = df['price_array'].apply(lambda x: x[0])
print(df)
# supplier price_array price_sum price0
# 0 Abc Co [2.0, 3.5, 1.0] 6.50 2.0
# 1 XYZ Inc [7.0, 5.0, 1.05] 13.05 7.0
# 2 Bob and Sam [2.0, 10.5, 3.4] 15.90 2.0
CodePudding user response:
These are my two solutions.
df = df.sort_values(by='price_array', key=lambda x: x[1])
bool_mask = [x[0] == 2 for x in df['price_array']]
df = df[bool_mask]
The first one returns a dataframe sorted by the values in the array using a lambda.
The second one creates a boolean mask by checking each row for a condition. It looks like this after [True, False, True]. Using that on the dataframe creates a frame where only the True rows are in the frame.
CodePudding user response:
I found a similar question at here, take a look and I hope this will help you.