I'm working with a fairly large dataset. The uncompressed CSV is about 20 GB. I'm trying to use Dask, but am not very familiar with it. I usually use Pandas. I'm trying to drop rows where the number of instances of a particular value in a column are less than a certain threshold. Here is an example:
Original dataset:
|icao | callsign | reg | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|123456 | IBE6827 | EC-LUK | A333 |
|123456 | IBE6827 | EC-LUK | A333 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
If the threshold is 3, the resulting dataframe would be:
|icao | callsign | reg | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
I found a way to do it, but it seems very convoluted. I feel like there should be a simpler way. In Pandas it would be this:
threshold = 3
inputFrame = inputFrame.groupby('icao').filter(lambda x: len(x) >= threshold)
However, there is no filter() in Dask. Here is the convoluted code that I got to work:
threshold = 3
a = inputFrame.groupby('icao').count().reg
a = a.to_frame()
a = a.rename(columns={'reg':'count'})
inputFrame = inputFrame.merge(a, how='left', on='icao')
inputFrame = inputFrame[(inputFrame['count'] >= threshold )]
Is there an easier way to do this?
CodePudding user response:
The answer to this question may depend on the definition of 'easier', but here are two alternative ways to do it:
Strategy #1: Build icao
series up with dummy column for groupby and count, join with initial df, then drop dummy column.
threshold = 3
inputFrame = inputFrame.join(
inputFrame.icao.to_frame().assign(DROPME=0).
groupby('icao').count().query(f'DROPME >= {threshold}'),
on='icao', how='inner').drop(columns='DROPME').compute()
Strategy #2: Do groupby and count, then drop all remaining columns from initial df and join with initial df.
threshold = 3
inputFrame = inputFrame.join(
inputFrame.
groupby('icao').count().query(f'reg >= {threshold}').
drop(columns=set(inputFrame.columns) - set(['icao'])),
on='icao', how='inner').compute()
Test input:
icao callsign reg acftType
1 abcdef ETH720 ET-ASJ B738
2 abcdef ETH720 ET-ASJ B738
3 abcdef ETH720 ET-ASJ B738
4 123456 IBE6827 EC-LUK A333
5 123456 IBE6827 EC-LUK A333
6 789ghi FRH571 OO-ACE B744
7 789ghi FRH571 OO-ACE B744
8 789ghi FRH571 OO-ACE B744
9 789ghi FRH571 OO-ACE B744
Output:
icao callsign reg acftType
1 abcdef ETH720 ET-ASJ B738
2 abcdef ETH720 ET-ASJ B738
3 abcdef ETH720 ET-ASJ B738
6 789ghi FRH571 OO-ACE B744
7 789ghi FRH571 OO-ACE B744
8 789ghi FRH571 OO-ACE B744
9 789ghi FRH571 OO-ACE B744