How to reduce pandas dataframe to only those individuals with all timepoints-CodePudding

I am trying to conduct a mixed model analysis but would like to only include individuals who have data in all timepoints available. Here is an example of what my dataframe looks like:

import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

df = pd.DataFrame({'id':ids,
                   'timepoint':timepoint,
                   'outcome':outcome})
print(df)

    id  timepoint  outcome
0    1          1        2
1    1          2        3
2    1          3        4
3    1          4        5
4    1          5        6
5    1          6        7
6    2          1        3
7    2          2        4
8    2          3        1
9    2          4        2
10   2          5        3
11   2          6        4
12   3          1        5
13   3          2        4
14   3          4        5
15   4          1        8
16   4          2        4
17   4          3        5
18   4          4        6
19   4          5        2
20   4          6        3

I want to only keep individuals in the id column who have all 6 timepoints. I.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).

Here's the ideal output:

    id  timepoint  outcome
0    1          1        2
1    1          2        3
2    1          3        4
3    1          4        5
4    1          5        6
5    1          6        7
6    2          1        3
7    2          2        4
8    2          3        1
9    2          4        2
10   2          5        3
11   2          6        4
12   4          1        8
13   4          2        4
14   4          3        5
15   4          4        6
16   4          5        2
17   4          6        3

Any help much appreciated.

CodePudding user response：

You can count the number of unique timepoints you have, and then filter your dataframe accordingly with transform('nunique') and loc keeping only the ID's that contain all 6 of them:

t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]

Prints:

    id  timepoint  outcome
0    1          1        2
1    1          2        3
2    1          3        4
3    1          4        5
4    1          5        6
5    1          6        7
6    2          1        3
7    2          2        4
8    2          3        1
9    2          4        2
10   2          5        3
11   2          6        4
15   4          1        8
16   4          2        4
17   4          3        5
18   4          4        6
19   4          5        2
20   4          6        3