Home > Software design >  How to reduce pandas dataframe to only those individuals with all timepoints
How to reduce pandas dataframe to only those individuals with all timepoints

Time:11-03

I am trying to conduct a mixed model analysis but would like to only include individuals who have data in all timepoints available. Here is an example of what my dataframe looks like:

import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

df = pd.DataFrame({'id':ids,
                   'timepoint':timepoint,
                   'outcome':outcome})
print(df)

    id  timepoint  outcome
0    1          1        2
1    1          2        3
2    1          3        4
3    1          4        5
4    1          5        6
5    1          6        7
6    2          1        3
7    2          2        4
8    2          3        1
9    2          4        2
10   2          5        3
11   2          6        4
12   3          1        5
13   3          2        4
14   3          4        5
15   4          1        8
16   4          2        4
17   4          3        5
18   4          4        6
19   4          5        2
20   4          6        3

I want to only keep individuals in the id column who have all 6 timepoints. I.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).

Here's the ideal output:

    id  timepoint  outcome
0    1          1        2
1    1          2        3
2    1          3        4
3    1          4        5
4    1          5        6
5    1          6        7
6    2          1        3
7    2          2        4
8    2          3        1
9    2          4        2
10   2          5        3
11   2          6        4
12   4          1        8
13   4          2        4
14   4          3        5
15   4          4        6
16   4          5        2
17   4          6        3

Any help much appreciated.

CodePudding user response:

You can count the number of unique timepoints you have, and then filter your dataframe accordingly with transform('nunique') and loc keeping only the ID's that contain all 6 of them:

t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]

Prints:

    id  timepoint  outcome
0    1          1        2
1    1          2        3
2    1          3        4
3    1          4        5
4    1          5        6
5    1          6        7
6    2          1        3
7    2          2        4
8    2          3        1
9    2          4        2
10   2          5        3
11   2          6        4
15   4          1        8
16   4          2        4
17   4          3        5
18   4          4        6
19   4          5        2
20   4          6        3
  • Related