I have the following dataframe (sample):
import pandas as pd
data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data = data, columns = ['group', 'date'])
group date
0 A 2022-09-01
1 A 2022-09-03
2 A 2022-09-07
3 A 2022-09-08
4 B 2022-09-03
5 B 2022-09-07
6 B 2022-09-08
7 B 2022-09-09
8 C 2022-09-01
9 C 2022-09-03
10 C 2022-09-07
11 C 2022-09-10
12 D 2022-09-01
13 D 2022-09-03
14 D 2022-09-05
15 D 2022-09-07
I would like to keep only the dates which are available across all groups. For example, the date "2022-09-03" is available in groups A, B, C, and D, i.e. in all groups. The date "2022-09-01" is only available in groups A, C, and D, which means it is missing in group B. Here is the desired output:
data = [['A', '2022-09-03'], ['A', '2022-09-07'], ['B', '2022-09-03'], ['B', '2022-09-07'],
['C', '2022-09-03'], ['C', '2022-09-07'], ['D', '2022-09-03'], ['D', '2022-09-07']]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date'])
group date
0 A 2022-09-03
1 A 2022-09-07
2 B 2022-09-03
3 B 2022-09-07
4 C 2022-09-03
5 C 2022-09-07
6 D 2022-09-03
7 D 2022-09-07
I know how to filter groups where all values within the group are the same, but here I want to keep only the dates that are available in every group. Does anyone know how to perform this using pandas?
CodePudding user response:
You can get all dates which exist in every group with pd.crosstab, then filter by the column names:
df1 = pd.crosstab(df['group'], df['date'])
df = df[df['date'].isin(df1.columns[df1.ne(0).all()])]
print(df)
group date
1 A 2022-09-03
2 A 2022-09-07
4 B 2022-09-03
5 B 2022-09-07
9 C 2022-09-03
10 C 2022-09-07
13 D 2022-09-03
15 D 2022-09-07
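For completeness, here is a self-contained version of this approach using the sample data from the question; the intermediate crosstab has one row per group and one column per date, holding occurrence counts:

```python
import pandas as pd

data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
        ['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
        ['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
        ['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data, columns=['group', 'date'])

# rows: groups, columns: dates, values: how often each date occurs in each group
df1 = pd.crosstab(df['group'], df['date'])

# a date is "available in every group" exactly when its column has no zero entries
common = df1.columns[df1.ne(0).all()]
out = df[df['date'].isin(common)]
print(out)
```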
CodePudding user response:
One option is to group on the dates, count the number of unique groups per date, and keep only the rows of the original dataframe whose date appears in every group:
df.loc[df.groupby('date').group.transform('nunique').eq(df.group.nunique())]
group date
1 A 2022-09-03
2 A 2022-09-07
4 B 2022-09-03
5 B 2022-09-07
9 C 2022-09-03
10 C 2022-09-07
13 D 2022-09-03
15 D 2022-09-07
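To see why this works, the intermediate transform can be inspected on the question's sample data (a sketch):

```python
import pandas as pd

data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
        ['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
        ['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
        ['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data, columns=['group', 'date'])

# per row: number of distinct groups that contain this row's date
counts = df.groupby('date')['group'].transform('nunique')

# keep rows whose date is present in all df['group'].nunique() == 4 groups
out = df.loc[counts.eq(df['group'].nunique())]
print(out)
```

For example, "2022-09-01" gets a count of 3 (groups A, C, D), which falls short of 4, so those rows are dropped.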
CodePudding user response:
You can use set operations:
# which dates are common to all groups?
keep = set.intersection(*df.groupby('group')['date'].agg(set))
# {'2022-09-03', '2022-09-07'}
# keep only the matching ones
out = df[df['date'].isin(keep)]
output:
group date
1 A 2022-09-03
2 A 2022-09-07
4 B 2022-09-03
5 B 2022-09-07
9 C 2022-09-03
10 C 2022-09-07
13 D 2022-09-03
15 D 2022-09-07
Comparison of approaches:
# set operations
669 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# nunique
750 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 2D reshaping (crosstab)
5.45 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# on 200k rows (random or like original)
# set operations
21.1 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# nunique
26.7 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# crosstab
47.8 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
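The timings above can be reproduced with a harness along these lines (a sketch: the original benchmark's exact data generation and machine were not shown, so absolute numbers will differ):

```python
import timeit

import numpy as np
import pandas as pd

# hypothetical larger test frame: random groups and dates
rng = np.random.default_rng(0)
n = 200_000
dates = np.array(pd.date_range('2022-09-01', periods=30).astype(str))
df = pd.DataFrame({'group': rng.choice(list('ABCD'), n),
                   'date': rng.choice(dates, n)})

def by_sets(df):
    keep = set.intersection(*df.groupby('group')['date'].agg(set))
    return df[df['date'].isin(keep)]

def by_nunique(df):
    return df.loc[df.groupby('date')['group'].transform('nunique')
                    .eq(df['group'].nunique())]

def by_crosstab(df):
    ct = pd.crosstab(df['group'], df['date'])
    return df[df['date'].isin(ct.columns[ct.ne(0).all()])]

for name, func in [('set operations', by_sets),
                   ('nunique', by_nunique),
                   ('crosstab', by_crosstab)]:
    t = timeit.timeit(lambda: func(df), number=10) / 10
    print(f'{name}: {t * 1000:.1f} ms per loop')
```

All three functions select the same rows in the same order, so any of them can be swapped in; only the runtime differs.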