Home > front end >  Some ide on how to filter a pandas dataframe
Some ide on how to filter a pandas dataframe

Time:01-13

I want to filter a pandas data-frame. Obviously I can do that with a few loops but will not be efficient for huge amounts of data, so I wanted to ask for some idea on how to do it.

I have data organised for example as,

c1=t c2=detector(1,2,or3)  c3
1     1
1     2
2     1
2     2
2     3
3     2
3     3
4     1
4     3
...

It can be seen as times of a simulation, c1=t, and c2 is the type of detector that is simulated. And for some times you do not have data of one of the detectors randomly, so I want to divide the data in dataframes each for same-detectors times.

The first column is a index increasing column, and the second column can have only 3 values (1,2,3), I want to separate this dataframe in 3,

  • One that contains all rows where for a same c1 where you have 1, and 2 values
  • One that contains all rows where for a same c1 where you have 1, and 3 values
  • One that contains all rows where for a same c1 where you have 2, and 3 values
  • The last that contain all three values 1,2,3

There are no unique c1 rows.

i want to know how will be the pythonic way to filter this dataframe.

CodePudding user response:

You can aggregate sets, so ordering of values in c2 is not important:

s = df.groupby('c1')['c2'].agg(set)
print (s)
c1
1       {1, 2}
2    {1, 2, 3}
3       {2, 3}
4       {1, 3}
Name: c2, dtype: object

Then filter c1 by matching sets:

df1 = df[df['c1'].isin(s.index[s.eq(set([1,2]))])]
df2 = df[df['c1'].isin(s.index[s.eq(set([1,3]))])]
df3 = df[df['c1'].isin(s.index[s.eq(set([2,3]))])]
df4 = df[df['c1'].isin(s.index[s.eq(set([1,2,3]))])]

Another idea is create dictionary of DataFrames:

s = df.groupby('c1')['c2'].agg(set)
print (s)

c1
1       {1, 2}
2    {1, 2, 3}
3       {2, 3}
4       {1, 3}
Name: c2, dtype: object

vals = [(1,2),(1,3),(2,3),(1,2,3)]

dfs = {'_'.join(map(str, x)): df[df['c1'].isin(s.index[s.eq(set(x))])] for x in vals}
print (dfs)
{'1_2':    c1  c2
0   1   1
1   1   2, '1_3':    c1  c2
7   4   1
8   4   3, '2_3':    c1  c2
5   3   2
6   3   3, '1_2_3':    c1  c2
2   2   1
3   2   2
4   2   3}

print (dfs['1_2'])
print (dfs['1_3'])
print (dfs['2_3'])
print (dfs['1_2_3'])
  • Related