I have a dataframe such as:
The_list=["A","B","D"]
Groups Values
G1 A
G1 B
G1 C
G1 D
G2 A
G2 B
G2 A
G2 D
G3 A
G3 D
G4 Z
G4 D
G4 E
G4 C
G5 A
G5 B
G5 D
And I would like to subset only the Groups where all Values elements are within The_list, and no other elements are present in that group that are not in The_list.
Here I should then subset only:
Groups Values
G2 A
G2 B
G2 A
G2 D
G5 A
G5 B
G5 D
So far I tried:
df.loc[df.Values.str.contains["A" & "B" & "D"].groupby(df.Groups)]
CodePudding user response:
You can use pandas.groupby on column Groups, then compare the set of each group's values with set(The_list), and keep all rows of the groups where the comparison is True:
The_list = ["A", "B", "D"]
mask_rows = df.groupby('Groups')['Values'].transform(
    lambda x: set(x) == set(The_list)
)
print(df[mask_rows])
A shorter answer, suggested by Paul H, using filter after groupby:
print(df.groupby("Groups").filter(lambda g: set(g["Values"]) == set(The_list)))
Output:
Groups Values
G2 A
G2 B
G2 A
G2 D
G5 A
G5 B
G5 D
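For reference, the whole approach can be run end to end (a self-contained sketch; the DataFrame is rebuilt from the question's sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'Groups': ['G1', 'G1', 'G1', 'G1', 'G2', 'G2', 'G2', 'G2', 'G3', 'G3',
               'G4', 'G4', 'G4', 'G4', 'G5', 'G5', 'G5'],
    'Values': ['A', 'B', 'C', 'D', 'A', 'B', 'A', 'D', 'A', 'D',
               'Z', 'D', 'E', 'C', 'A', 'B', 'D'],
})
The_list = ["A", "B", "D"]

# Keep only the rows of groups whose set of values equals set(The_list)
mask_rows = df.groupby('Groups')['Values'].transform(
    lambda x: set(x) == set(The_list)
)
result = df[mask_rows]
print(result)  # rows of G2 and G5 only
```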
CodePudding user response:
Masking the values which belong to The_list first (here turned into a set named values_set for better performance), and only then determining the groups whose values are all in that set, should scale better than I'mahdi's approach:
mask = (
    df.eval('Values in @values_set')
    .groupby(df['Groups'])
    .transform('all')
)
df[mask]
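One caveat worth noting: this mask tests membership ("every value of the group is in values_set"), not set equality, so a group whose values are a strict subset of values_set, like G3 in the question, is also kept. A minimal sketch of the difference (using isin instead of eval for simplicity; the sample data is trimmed to three groups):

```python
import pandas as pd

df = pd.DataFrame({
    'Groups': ['G2', 'G2', 'G2', 'G2', 'G3', 'G3', 'G5', 'G5', 'G5'],
    'Values': ['A', 'B', 'A', 'D', 'A', 'D', 'A', 'B', 'D'],
})
values_set = {"A", "B", "D"}

# Membership: all of the group's values are in values_set -> also keeps G3
member_mask = (
    df['Values'].isin(values_set)
    .groupby(df['Groups'])
    .transform('all')
)

# Equality: the group's set of values is exactly values_set -> drops G3
equal_mask = df.groupby('Groups')['Values'].transform(
    lambda x: set(x) == values_set
)

print(sorted(df.loc[member_mask, 'Groups'].unique()))  # ['G2', 'G3', 'G5']
print(sorted(df.loc[equal_mask, 'Groups'].unique()))   # ['G2', 'G5']
```

Both variants match the question's expected output on this data only under the equality reading, so pick the mask that matches the semantics you actually want.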
Basic benchmark
Setup:
df = pd.DataFrame({
    'Groups': ['G1', 'G1', 'G1', 'G1', 'G2', 'G2', 'G2', 'G2', 'G3', 'G3', 'G4', 'G4', 'G4', 'G4', 'G5', 'G5', 'G5'],
    'Values': ['A', 'B', 'C', 'D', 'A', 'B', 'A', 'D', 'A', 'D', 'Z', 'D', 'E', 'C', 'A', 'B', 'D']
})
values_set = {"A", "B", "D"}
n = 50_000
df = pd.concat([df]*n) # simulate a big dataset
print(df.shape) # shape (850000, 2)
I'mahdi's solution (changed set(The_list) to values_set to be fair):
%%timeit -n 10
mask_rows = df.groupby('Groups')['Values'].transform(lambda x : set(x) == values_set)
df[mask_rows]
# Output: 357 ms ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Mine:
%%timeit -n 10
mask = (
    df.eval('Values in @values_set')
    .groupby(df['Groups'])
    .transform('all')
)
df[mask]
# Output: 138 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
Try with two steps:
out = df.groupby('Groups')['Values'].unique().map(tuple)==tuple(The_list)
out = df.loc[df.Groups.isin(out.index[out])]
Output:
Groups Values
4 G2 A
5 G2 B
6 G2 A
7 G2 D
14 G5 A
15 G5 B
16 G5 D