Home > Software design >  How do you drop rows from Dask where the value count doesn't meet a certain threshold?
How do you drop rows from Dask where the value count doesn't meet a certain threshold?

Time:05-12

I'm working with a fairly large dataset. The uncompressed CSV is about 20 GB. I'm trying to use Dask, but am not very familiar with it. I usually use Pandas. I'm trying to drop rows where the number of instances of a particular value in a column are less than a certain threshold. Here is an example:

Original dataset:

|icao   | callsign | reg    | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720   | ET-ASJ | B738     |
|abcdef | ETH720   | ET-ASJ | B738     |
|abcdef | ETH720   | ET-ASJ | B738     |
|123456 | IBE6827  | EC-LUK | A333     |
|123456 | IBE6827  | EC-LUK | A333     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |

If the threshold is 3, the resulting dataframe would be:

|icao   | callsign | reg    | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720   | ET-ASJ | B738     |
|abcdef | ETH720   | ET-ASJ | B738     |
|abcdef | ETH720   | ET-ASJ | B738     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |

I found a way to do it, but it seems very convoluted. I feel like there should be a simpler way. In Pandas it would be this:

threshold = 3
inputFrame = inputFrame.groupby('icao').filter(lambda x: len(x) >= threshold)

However, there is no filter() in Dask. Here is the convoluted code that I got to work:

threshold = 3
a = inputFrame.groupby('icao').count().reg
a = a.to_frame()
a = a.rename(columns={'reg':'count'})
inputFrame = inputFrame.merge(a, how='left', on='icao')
inputFrame = inputFrame[(inputFrame['count'] >= threshold )]

Is there an easier way to do this?

CodePudding user response:

The answer to this question may depend on the definition of 'easier', but here are two alternative ways to do it:

Strategy #1: Build icao series up with dummy column for groupby and count, join with initial df, then drop dummy column.

threshold = 3
inputFrame = inputFrame.join(
    inputFrame.icao.to_frame().assign(DROPME=0).
        groupby('icao').count().query(f'DROPME >= {threshold}'),
    on='icao', how='inner').drop(columns='DROPME').compute()

Strategy #2: Do groupby and count, then drop all remaining columns from initial df and join with initial df.

threshold = 3
inputFrame = inputFrame.join(
    inputFrame.
        groupby('icao').count().query(f'reg >= {threshold}').
            drop(columns=set(inputFrame.columns) - set(['icao'])),
    on='icao', how='inner').compute()

Test input:

      icao    callsign       reg    acftType
1  abcdef    ETH720      ET-ASJ    B738
2  abcdef    ETH720      ET-ASJ    B738
3  abcdef    ETH720      ET-ASJ    B738
4  123456    IBE6827     EC-LUK    A333
5  123456    IBE6827     EC-LUK    A333
6  789ghi    FRH571      OO-ACE    B744
7  789ghi    FRH571      OO-ACE    B744
8  789ghi    FRH571      OO-ACE    B744
9  789ghi    FRH571      OO-ACE    B744

Output:

      icao    callsign       reg    acftType
1  abcdef    ETH720      ET-ASJ    B738
2  abcdef    ETH720      ET-ASJ    B738
3  abcdef    ETH720      ET-ASJ    B738
6  789ghi    FRH571      OO-ACE    B744
7  789ghi    FRH571      OO-ACE    B744
8  789ghi    FRH571      OO-ACE    B744
9  789ghi    FRH571      OO-ACE    B744
  • Related