How do you drop rows from Dask where the value count doesn't meet a certain threshold?-CodePudding

I'm working with a fairly large dataset. The uncompressed CSV is about 20 GB. I'm trying to use Dask, but am not very familiar with it. I usually use Pandas. I'm trying to drop rows where the number of instances of a particular value in a column are less than a certain threshold. Here is an example:

Original dataset:

|icao   | callsign | reg    | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720   | ET-ASJ | B738     |
|abcdef | ETH720   | ET-ASJ | B738     |
|abcdef | ETH720   | ET-ASJ | B738     |
|123456 | IBE6827  | EC-LUK | A333     |
|123456 | IBE6827  | EC-LUK | A333     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |

If the threshold is 3, the resulting dataframe would be:

|icao   | callsign | reg    | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720   | ET-ASJ | B738     |
|abcdef | ETH720   | ET-ASJ | B738     |
|abcdef | ETH720   | ET-ASJ | B738     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |
|789ghi | FRH571   | OO-ACE | B744     |

I found a way to do it, but it seems very convoluted. I feel like there should be a simpler way. In Pandas it would be this:

threshold = 3
inputFrame = inputFrame.groupby('icao').filter(lambda x: len(x) >= threshold)

However, there is no filter() in Dask. Here is the convoluted code that I got to work:

threshold = 3
a = inputFrame.groupby('icao').count().reg
a = a.to_frame()
a = a.rename(columns={'reg':'count'})
inputFrame = inputFrame.merge(a, how='left', on='icao')
inputFrame = inputFrame[(inputFrame['count'] >= threshold )]

Is there an easier way to do this?

CodePudding user response：

The answer to this question may depend on the definition of 'easier', but here are two alternative ways to do it:

Strategy #1: Build icao series up with dummy column for groupby and count, join with initial df, then drop dummy column.

threshold = 3
inputFrame = inputFrame.join(
    inputFrame.icao.to_frame().assign(DROPME=0).
        groupby('icao').count().query(f'DROPME >= {threshold}'),
    on='icao', how='inner').drop(columns='DROPME').compute()

Strategy #2: Do groupby and count, then drop all remaining columns from initial df and join with initial df.

threshold = 3
inputFrame = inputFrame.join(
    inputFrame.
        groupby('icao').count().query(f'reg >= {threshold}').
            drop(columns=set(inputFrame.columns) - set(['icao'])),
    on='icao', how='inner').compute()

Test input:

      icao    callsign       reg    acftType
1  abcdef    ETH720      ET-ASJ    B738
2  abcdef    ETH720      ET-ASJ    B738
3  abcdef    ETH720      ET-ASJ    B738
4  123456    IBE6827     EC-LUK    A333
5  123456    IBE6827     EC-LUK    A333
6  789ghi    FRH571      OO-ACE    B744
7  789ghi    FRH571      OO-ACE    B744
8  789ghi    FRH571      OO-ACE    B744
9  789ghi    FRH571      OO-ACE    B744

Output:

      icao    callsign       reg    acftType
1  abcdef    ETH720      ET-ASJ    B738
2  abcdef    ETH720      ET-ASJ    B738
3  abcdef    ETH720      ET-ASJ    B738
6  789ghi    FRH571      OO-ACE    B744
7  789ghi    FRH571      OO-ACE    B744
8  789ghi    FRH571      OO-ACE    B744
9  789ghi    FRH571      OO-ACE    B744