Home > database >  Pandas dataframe intersection with varying groups
Pandas dataframe intersection with varying groups

Time:05-03

I have a large pandas dataframe with varying rows and columns but looks more or less like:

time   id    angle  ...
0.0    a1    33.67  ...
0.0    b2    35.90  ...
0.0    c3    42.01  ...
0.0    d4    45.00  ...
0.1    a1    12.15  ...
0.1    b2    15.35  ...
0.1    c3    33.12  ...
0.2    a1    65.28  ...
0.2    c3    87.43  ...
0.3    a1    98.85  ...
0.3    c3    100.12 ...
0.4    a1    11.11  ...
0.4    c3    83.22  ...
...

I am trying to aggregate the id's together and then find id's that have in common time-intervals. I have tried using pandas groupby and can easily group them by id and get their respective groups with information. How can I then take it a step further to find id's that also have the same time stamps?

Ideally I'd like to return intersection of certain fixed time intervals (2-3 seconds) for similar id's with the fixed time interval overlap:

time  id  angle  ...
0.0   a1  33.67  ...
0.1   a1  12.15  ...
0.2   a1  65.28  ...
0.3   a1  98.85  ...

0.0   c3  42.01  ...
0.1   c3  33.12  ...
0.2   c3  87.43  ...
0.3   c3  100.12 ...

Code tried so far:

#create pandas grouped by id
df1 = df.groupby(['id'], as_index=False)

Which outputs:

time  id  angle ...
(0.0   a1  33.67
...
0.4   a1  11.11)

(0.0  b2  35.90
0.1   b2  15.35)

(0.0  c3  42.01
...
0.4   c3  83.22)

(0.0  d4  45.00)

But I'd like to return only a dataframe where id and time are the same for a fixed interval, in the above example .4 seconds.

Any ideas on a fairly simple way to achieve this with pandas dataframes?

CodePudding user response:

If need filter rows by some intervals - e.g. here between 0 and 0.4 and get all id which overlap use boolean indexing with Series.between first, then DataFrame.pivot:

df1 = df[df['time'].between(0, 0.4)].pivot('time','id','angle')
print (df1)
id       a1     b2      c3    d4
time                            
0.0   33.67  35.90   42.01  45.0
0.1   12.15  15.35   33.12   NaN
0.2   65.28    NaN   87.43   NaN
0.3   98.85    NaN  100.12   NaN
0.4   11.11    NaN   83.22   NaN

There are missing values for non overlap id, so remove columns with any NaNs by DataFrame.any and reshape to 3 columns by DataFrame.unstack and Series.reset_index:

print (df1.dropna(axis=1))
id       a1      c3
time               
0.0   33.67   42.01
0.1   12.15   33.12
0.2   65.28   87.43
0.3   98.85  100.12
0.4   11.11   83.22

df2 = df1.dropna(axis=1).unstack().reset_index(name='angle')
print (df2)
   id  time   angle
0  a1   0.0   33.67
1  a1   0.1   12.15
2  a1   0.2   65.28
3  a1   0.3   98.85
4  a1   0.4   11.11
5  c3   0.0   42.01
6  c3   0.1   33.12
7  c3   0.2   87.43
8  c3   0.3  100.12
9  c3   0.4   83.22

CodePudding user response:

There are many ways to define the filter you're asking for:

df.groupby('id').filter(lambda x: len(x) > 4)
    # OR
df.groupby('id').filter(lambda x: x['time'].eq(0.4).any())
    # OR
df.groupby('id').filter(lambda x: x['time'].max() == 0.4)

Output:

    time  id   angle
0    0.0  a1   33.67
2    0.0  c3   42.01
4    0.1  a1   12.15
6    0.1  c3   33.12
7    0.2  a1   65.28
8    0.2  c3   87.43
9    0.3  a1   98.85
10   0.3  c3  100.12
11   0.4  a1   11.11
12   0.4  c3   83.22
  • Related