I have the following dataframe (sample):
import pandas as pd
data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data = data, columns = ['group', 'date'])
group date
0 A 2022-09-01
1 A 2022-09-03
2 A 2022-09-07
3 A 2022-09-08
4 B 2022-09-03
5 B 2022-09-07
6 B 2022-09-08
7 B 2022-09-09
8 C 2022-09-01
9 C 2022-09-03
10 C 2022-09-07
11 C 2022-09-10
12 D 2022-09-01
13 D 2022-09-03
14 D 2022-09-05
15 D 2022-09-07
I would like to keep only the dates which are available across all groups. For example, the date "2022-09-03" is available in groups A, B, C, and D, i.e. in all groups. The date "2022-09-01" is only available in groups A, C, and D, which means it is missing in group B. Here is the desired output:
data = [['A', '2022-09-03'], ['A', '2022-09-07'], ['B', '2022-09-03'], ['B', '2022-09-07'],
['C', '2022-09-03'], ['C', '2022-09-07'], ['D', '2022-09-03'], ['D', '2022-09-07']]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date'])
group date
0 A 2022-09-03
1 A 2022-09-07
2 B 2022-09-03
3 B 2022-09-07
4 C 2022-09-03
5 C 2022-09-07
6 D 2022-09-03
7 D 2022-09-07
I know how to filter groups where all values within the group are the same, but here I want to keep only the dates that are available in every group. Does anyone know how to perform this using pandas?
CodePudding user response:
You can get all dates which exist in every group with pd.crosstab, then filter by the column names:
df1 = pd.crosstab(df['group'], df['date'])
df = df[df['date'].isin(df1.columns[df1.ne(0).all()])]
print(df)
group date
1 A 2022-09-03
2 A 2022-09-07
4 B 2022-09-03
5 B 2022-09-07
9 C 2022-09-03
10 C 2022-09-07
13 D 2022-09-03
15 D 2022-09-07
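For completeness, here is a self-contained version of this approach using the sample data from the question; the intermediate crosstab has one row per group and one column per date, holding occurrence counts:

```python
import pandas as pd

data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
        ['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
        ['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
        ['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data, columns=['group', 'date'])

# rows: groups, columns: dates, values: how often each date occurs in each group
df1 = pd.crosstab(df['group'], df['date'])

# a date is "available in every group" exactly when its column has no zero entries
common = df1.columns[df1.ne(0).all()]
out = df[df['date'].isin(common)]
print(out)
```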
CodePudding user response:
One option is to group on the dates, count the number of unique groups per date, and keep only the rows of the original dataframe whose date appears in every group:
df.loc[df.groupby('date').group.transform('nunique').eq(df.group.nunique())]
group date
1 A 2022-09-03
2 A 2022-09-07
4 B 2022-09-03
5 B 2022-09-07
9 C 2022-09-03
10 C 2022-09-07
13 D 2022-09-03
15 D 2022-09-07
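To see why this works, the intermediate transform can be inspected on the question's sample data (a sketch):

```python
import pandas as pd

data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
        ['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
        ['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
        ['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data, columns=['group', 'date'])

# per row: number of distinct groups that contain this row's date
counts = df.groupby('date')['group'].transform('nunique')

# keep rows whose date is present in all df['group'].nunique() == 4 groups
out = df.loc[counts.eq(df['group'].nunique())]
print(out)
```

For example, "2022-09-01" gets a count of 3 (groups A, C, D), which falls short of 4, so those rows are dropped.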
CodePudding user response:
You can use set operations:
# which dates are common to all groups?
keep = set.intersection(*df.groupby('group')['date'].agg(set))
# {'2022-09-03', '2022-09-07'}
# keep only the matching ones
out = df[df['date'].isin(keep)]
output:
group date
1 A 2022-09-03
2 A 2022-09-07
4 B 2022-09-03
5 B 2022-09-07
9 C 2022-09-03
10 C 2022-09-07
13 D 2022-09-03
15 D 2022-09-07
Comparison of approaches:
# set operations
669 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# nunique
750 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 2D reshaping (crosstab)
5.45 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# on 200k rows (random or like original)
# set operations
21.1 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# nunique
26.7 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# crosstab
47.8 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
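The timings above can be reproduced with a harness along these lines (a sketch: the original benchmark's exact data generation and machine were not shown, so absolute numbers will differ):

```python
import timeit

import numpy as np
import pandas as pd

# hypothetical larger test frame: random groups and dates
rng = np.random.default_rng(0)
n = 200_000
dates = np.array(pd.date_range('2022-09-01', periods=30).astype(str))
df = pd.DataFrame({'group': rng.choice(list('ABCD'), n),
                   'date': rng.choice(dates, n)})

def by_sets(df):
    keep = set.intersection(*df.groupby('group')['date'].agg(set))
    return df[df['date'].isin(keep)]

def by_nunique(df):
    return df.loc[df.groupby('date')['group'].transform('nunique')
                    .eq(df['group'].nunique())]

def by_crosstab(df):
    ct = pd.crosstab(df['group'], df['date'])
    return df[df['date'].isin(ct.columns[ct.ne(0).all()])]

for name, func in [('set operations', by_sets),
                   ('nunique', by_nunique),
                   ('crosstab', by_crosstab)]:
    t = timeit.timeit(lambda: func(df), number=10) / 10
    print(f'{name}: {t * 1000:.1f} ms per loop')
```

All three functions select the same rows in the same order, so any of them can be swapped in; only the runtime differs.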