Home > Enterprise >  Pandas - check if a value is in a column made of lists
Pandas - check if a value is in a column made of lists

Time:03-11

I have a DataFrame with a names columns which is a list of names, and I want to drop all rows if a name is not in the list.

I wrote :

df["John" in df.names]

But this doesn't work at all.

Is there a way to do so ? I couldn't find any answer over the Internet.

Thanks !

CodePudding user response:

Use lambda function with in:

df1 = df[df.names.map(lambda x: "John" in x)]
#alternative
df1 = df[df.names.apply(lambda x: "John" in x)]

Or list comprehension:

 df1 = df[["John" in x for x in df.names]]

Or use set intersection:

df1 = df[df.names.map(set) & set(['John'])]

Or use DataFrame constructor:

df1 = df[pd.DataFrame(df.names.tolist(), index=df.index).eq('John').any(axis=1)]

Performance in small 2k rows, best test in real data:

df = pd.DataFrame({'value': [10, 11], 'names': [['John', 'Paul'], ['Louis', 'Alex']]})
df = pd.concat([df] * 1000, ignore_index=True)

print (df)

In [61]: %timeit df[["John" in x for x in df.names]]
863 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [62]: %timeit df[df.names.map(lambda x: "John" in x)]
834 µs ± 3.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [63]: %timeit df[df.names.apply(lambda x: "John" in x)]
910 µs ± 72.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [65]: %timeit df[df.names.map(set) & set(['John'])]
1.66 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [66]: %timeit df[pd.DataFrame(df.names.tolist(), index=df.index).eq('John').any(axis=1)]
1.49 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [67]: %timeit df.loc[df['names'].explode().eq('John').loc[lambda x: x].index]
2.47 ms ± 70.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CodePudding user response:

You can explode your list of names and keep rows that contain the name:

out = df.loc[df['names'].explode().eq('John').loc[lambda x: x].index]
print(out)

# Output
   value         names
0     10  [John, Paul]

Setup:

data = {'value': [10, 11],
        'names': [['John', 'Paul'],
                  ['Louis', 'Alex']]}
df = pd.DataFrame(data)
print(df)

# Output
   value          names
0     10   [John, Paul]
1     11  [Louis, Alex]
  • Related