I have a column that holds lists. If any single element of a row's list matches an element from my list, I want to return the entire row. For example, we have a data frame:
index x
0 [apple, orange, strawberry]
1 [blueberry, pear, watermelon]
2 [apple, banana, strawberry]
3 [apple]
4 [strawberry]
And we have our list:
a = ['apple', 'strawberry']
# I am trying to return index 0,2,3 and 4. But currently I am only able to return index 3 and 4
new_DF = df[df['x'].isin(a)]
# This function is getting the user input for list 'a'.
# This is for reference of what I am actually trying to do.
def filter_Industries():
    num_of_industries = int(input('How many industries would you like to filter by?\n'))
    list_industries = []
    # Ask for one industry per iteration; range() handles the counting.
    for _ in range(num_of_industries):
        industry = input("Enter the industry:\n")
        list_industries.append(industry)
    return list_industries
a = filter_Industries()
# This is where I am trying to match the elements from the user's list to the data set.
new_DF = df[df['x'].isin(a)]
CodePudding user response:
You can use the Series.apply(function) method on the column. In this case we check, for every row, whether its list has an element in common with the list "a". Let's create the function:
a = ["apple", "strawberry"]
a_set = set(a)
def hasCommon(x):
    return len(set(x) & a_set) > 0
So if there is a common element, it will return True. Let's create some dummy data:
import pandas as pd
data = {
    "calories": [["apple", "orange", "strawberry"], ["blueberry", "pear", "watermelon"], ["strawberry", "pear", "watermelon"]],
    "duration": [50, 40, 120]
}
# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
And we can use it like this:
df[df["calories"].apply(hasCommon)]
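Applied to the data from the question, the same idea can be sketched as follows. This is only a minimal example: it assumes every cell of column x really is a list of strings, and the helper name has_common is illustrative. It uses set.isdisjoint, which is equivalent to testing that the intersection is non-empty.

```python
import pandas as pd

# Data frame from the question (assumption: each cell of "x" is a list).
df = pd.DataFrame({
    "x": [
        ["apple", "orange", "strawberry"],
        ["blueberry", "pear", "watermelon"],
        ["apple", "banana", "strawberry"],
        ["apple"],
        ["strawberry"],
    ]
})

a = ["apple", "strawberry"]
a_set = set(a)  # build the set once instead of once per row

def has_common(items):
    # True when at least one element of `items` is also in `a_set`.
    return not a_set.isdisjoint(items)

# apply() returns a boolean Series, which is then used as a row mask.
new_DF = df[df["x"].apply(has_common)]
print(new_DF.index.tolist())  # [0, 2, 3, 4]
```

The result keeps the whole row, so any other columns in the data frame come along with it.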
CodePudding user response:
When you call isin(a) on the values at index 0, 1 and 2, the function tries to compare a whole list (e.g., [apple, orange, strawberry]) against the elements of the list a, so nothing matches. It worked for index 3 and 4 because there it compares a single element with the whole list.
I suggest converting both the list a and each row's list to sets and intersecting them, with this code:
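matching_rows = []
for i in range(len(df)):
    # keep the row position when its list shares an element with a
    if set(a) & set(df['x'].iloc[i]) != set():
        matching_rows.append(i)
new_DF = df.iloc[matching_rows]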
This keeps in new_DF only the rows whose intersection with a is not an empty set.
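Since isin compares one cell value at a time, another option is to flatten the lists first so that each cell holds a single element. The sketch below does this with Series.explode; it assumes the index of df is unique and that column x holds lists, and the names exploded and mask are only illustrative.

```python
import pandas as pd

# Data frame from the question (assumption: each cell of "x" is a list).
df = pd.DataFrame({
    "x": [
        ["apple", "orange", "strawberry"],
        ["blueberry", "pear", "watermelon"],
        ["apple", "banana", "strawberry"],
        ["apple"],
        ["strawberry"],
    ]
})

a = ["apple", "strawberry"]

# One row per list element; the original index label is repeated.
exploded = df["x"].explode()

# isin now compares single strings; any() collapses back to one flag per row.
mask = exploded.isin(a).groupby(level=0).any()

new_DF = df[mask]
print(new_DF.index.tolist())  # [0, 2, 3, 4]
```

This avoids the explicit Python loop, at the cost of building a longer intermediate Series.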