I have a dataframe such as:
The_list=["A","B","D"]
Groups Values
G1 A
G1 B
G1 C
G1 D
G2 A
G2 B
G2 A
G2 D
G3 A
G3 D
G4 Z
G4 D
G4 E
G4 C
G5 A
G5 B
G5 D
And I would like to subset only the Groups where all Values elements are within The_list, and no other elements are present in that group that are not in The_list.
Here I should then subset only:
Groups Values
G2 A
G2 B
G2 A
G2 D
G5 A
G5 B
G5 D
So far I tried:
df.loc[df.Values.str.contains["A" & "B" & "D"].groupby(df.Groups)]
CodePudding user response:
You can use pandas.groupby on column Groups, then compare the set of each group's values with set(The_list), and keep all rows of the groups where the comparison is True:
The_list = ["A", "B", "D"]
mask_rows = df.groupby('Groups')['Values'].transform(
    lambda x: set(x) == set(The_list)
)
print(df[mask_rows])
A shorter answer, suggested by Paul H, using filter after groupby:
print(df.groupby("Groups").filter(lambda g: set(g["Values"]) == set(The_list)))
Output:
Groups Values
G2 A
G2 B
G2 A
G2 D
G5 A
G5 B
G5 D
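For reference, the whole approach can be run end to end (a self-contained sketch; the DataFrame is rebuilt from the question's sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'Groups': ['G1', 'G1', 'G1', 'G1', 'G2', 'G2', 'G2', 'G2', 'G3', 'G3',
               'G4', 'G4', 'G4', 'G4', 'G5', 'G5', 'G5'],
    'Values': ['A', 'B', 'C', 'D', 'A', 'B', 'A', 'D', 'A', 'D',
               'Z', 'D', 'E', 'C', 'A', 'B', 'D'],
})
The_list = ["A", "B", "D"]

# Keep only the rows of groups whose set of values equals set(The_list)
mask_rows = df.groupby('Groups')['Values'].transform(
    lambda x: set(x) == set(The_list)
)
result = df[mask_rows]
print(result)  # rows of G2 and G5 only
```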
CodePudding user response:
Masking the values which belong to The_list first (here turned into a set named values_set for better performance), and only then determining the groups whose values are all in that set, should scale better than I'mahdi's approach:
mask = (
    df.eval('Values in @values_set')
    .groupby(df['Groups'])
    .transform('all')
)
df[mask]
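One caveat worth noting: this mask tests membership ("every value of the group is in values_set"), not set equality, so a group whose values are a strict subset of values_set, like G3 in the question, is also kept. A minimal sketch of the difference (using isin instead of eval for simplicity; the sample data is trimmed to three groups):

```python
import pandas as pd

df = pd.DataFrame({
    'Groups': ['G2', 'G2', 'G2', 'G2', 'G3', 'G3', 'G5', 'G5', 'G5'],
    'Values': ['A', 'B', 'A', 'D', 'A', 'D', 'A', 'B', 'D'],
})
values_set = {"A", "B", "D"}

# Membership: all of the group's values are in values_set -> also keeps G3
member_mask = (
    df['Values'].isin(values_set)
    .groupby(df['Groups'])
    .transform('all')
)

# Equality: the group's set of values is exactly values_set -> drops G3
equal_mask = df.groupby('Groups')['Values'].transform(
    lambda x: set(x) == values_set
)

print(sorted(df.loc[member_mask, 'Groups'].unique()))  # ['G2', 'G3', 'G5']
print(sorted(df.loc[equal_mask, 'Groups'].unique()))   # ['G2', 'G5']
```

Both variants match the question's expected output on this data only under the equality reading, so pick the mask that matches the semantics you actually want.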
Basic benchmark
Setup:
df = pd.DataFrame({
    'Groups': ['G1', 'G1', 'G1', 'G1', 'G2', 'G2', 'G2', 'G2', 'G3', 'G3', 'G4', 'G4', 'G4', 'G4', 'G5', 'G5', 'G5'],
    'Values': ['A', 'B', 'C', 'D', 'A', 'B', 'A', 'D', 'A', 'D', 'Z', 'D', 'E', 'C', 'A', 'B', 'D']
})
values_set = {"A", "B", "D"}
n = 50_000
df = pd.concat([df]*n) # simulate a big dataset
print(df.shape) # shape (850000, 2)
I'mahdi's solution (changed set(The_list) to values_set to be fair):
%%timeit -n 10
mask_rows = df.groupby('Groups')['Values'].transform(lambda x : set(x) == values_set)
df[mask_rows]
# Output: 357 ms ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Mine:
%%timeit -n 10
mask = (
    df.eval('Values in @values_set')
    .groupby(df['Groups'])
    .transform('all')
)
df[mask]
# Output: 138 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
Try with two steps:
out = df.groupby('Groups')['Values'].unique().map(tuple)==tuple(The_list)
out = df.loc[df.Groups.isin(out.index[out])]
Output:
Groups Values
4 G2 A
5 G2 B
6 G2 A
7 G2 D
14 G5 A
15 G5 B
16 G5 D