How to filter a list column in a dataframe?-CodePudding

I have a data frame like this:

df:
ID    Names
3    [Ally, Ben, Cris]
5    [Bruno, Coleen, Flyn]
2    [Dave, Bob]
7    [Rob, Ally, Bob]
11   [Jill, Tom, Sal]

The Names column is a list of names. Some of them could be repeated.

I want to filter the data frame on Names columns where the names start with either A or B or D.

So my output should look like this:

ID    Names
3    [Ally, Ben]
5    [Bruno]
2    [Dave, Bob]
7    [Ally, Bob]

Reproducible input:

df = pd.DataFrame({'ID': [3, 5, 2, 7, 11],
                   'Names': [['Ally', 'Ben', 'Cris'],
                             ['Bruno', 'Coleen', 'Flyn'],
                             ['Dave', 'Bob'],
                             ['Rob', 'Ally', 'Bob'],
                             ['Jill', 'Tom', 'Sal']]
                  })

CodePudding user response：

You can use a list comprehension to filter the names, and boolean indexing to filter the rows:

target = {'A', 'B', 'D'}
df['Names'] = [[n for n in l if n[0] in target] for l in df['Names']]

df = df[df['Names'].str.len().gt(0)]

Or using explode, and groupby.agg:

s = (df['Names']
     .explode()
     .loc[lambda x: x.str[0].isin(['A', 'B', 'D'])]
     .groupby(level=0).agg(list)
    )

df = df.loc[s.index].assign(Names=s)

output:

   ID        Names
0   3  [Ally, Ben]
1   5      [Bruno]
2   2  [Dave, Bob]
3   7  [Ally, Bob]

CodePudding user response：

just a slight variance of this solution:

df = (df.explode('Names').
      query("Names.str[0].isin(['A','B','D'])",engine='python').
      groupby('ID').
      agg(list).
      reset_index())

print(df)
'''
   ID        Names
0   2  [Dave, Bob]
1   3  [Ally, Ben]
2   5      [Bruno]
3   7  [Ally, Bob]