I have bank account records where each row is the monthly balance of an account:
Acct.Number Date Balance
1111 2020-03-31 1000
1111 2020-04-30 1300
1111 2020-05-31 1100
1111 2020-06-30 1400
2222 2020-03-31 31000
2222 2020-04-30 32000
2222 2020-05-31 31100
2222 2020-06-30 31300
. . .
. . .
9999 2020-03-31 8000
9999 2020-04-30 8800
9999 2020-05-31 8100
9999 2020-06-30 8500
Assume there are 10 million accounts and 10 years of data. I need to apply a function to each account.
For example, for each account I need to take the mean balance over the 10 years and calculate the difference between each month's balance and that mean. For account 1111 it looks like this (the mean is 1200):
Acct.Number Date Balance Difference
1111 2020-03-31 1000 -200
1111 2020-04-30 1300 100
1111 2020-05-31 1100 -100
1111 2020-06-30 1400 200
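In plain pandas the computation itself is a one-liner with a grouped transform; here is a minimal sketch reproducing the table above (the sample frame and column names mirror the question, everything else is illustrative):

import pandas as pd

# Sample frame mirroring the layout above
df = pd.DataFrame({
    "Acct.Number": [1111] * 4 + [2222] * 4,
    "Date": pd.to_datetime(["2020-03-31", "2020-04-30",
                            "2020-05-31", "2020-06-30"] * 2),
    "Balance": [1000, 1300, 1100, 1400, 31000, 32000, 31100, 31300],
})

# Broadcast each account's mean back to its rows, then subtract
df["Difference"] = df["Balance"] - df.groupby("Acct.Number")["Balance"].transform("mean")
print(df[df["Acct.Number"] == 1111])

For account 1111 the mean is 1200, giving differences -200, 100, -100, 200 as in the table. The question is how to do this at scale with Dask.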
My thinking is this: once the data is split across multiple Dask partitions, with each partition holding N complete accounts, I can use map_partitions to process the accounts in each partition in parallel.
This only works if each account, with all of its dates, sits in a single partition. If I let Dask split the data into partitions arbitrarily, that is not guaranteed. Is there a way to tell Dask which data should go into each partition?
As a side note, I create the Dask dataframe and its partitions with Dask's read_sql_query. The data is ordered by account and date.
CodePudding user response:
Since the data is sorted by account, specifying index_col="Acct.Number" when loading the dataframe with read_sql_query
will ensure that each partition contains the rows for a contiguous range of account numbers. It is still possible for a specific account's rows to span more than one partition, but operations on these partitions will be faster, because Dask will know the divisions and therefore which partition holds which accounts.
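Here is a sketch of how that could look end to end, under stated assumptions: query is a SQLAlchemy Select over your balances table (read_sql_query requires a SQLAlchemy selectable, not a raw string), the connection string is a placeholder, and npartitions is illustrative. The per-partition function is hypothetical and groups on the index, which is Acct.Number after loading:

import dask.dataframe as dd

# Assumed: `query` is a SQLAlchemy Select over the balances table,
# and "Acct.Number" is an indexed, sorted column in the database.
ddf = dd.read_sql_query(
    sql=query,
    con="postgresql://user:pass@host/db",   # placeholder connection string
    index_col="Acct.Number",                # gives Dask known divisions
    npartitions=500,                        # illustrative value
)

# Each partition holds a contiguous range of account numbers, so a
# per-account computation inside map_partitions is safe.
def add_difference(pdf):
    pdf = pdf.copy()
    # Group on the index (Acct.Number) and subtract each account's mean
    pdf["Difference"] = pdf["Balance"] - pdf.groupby(level=0)["Balance"].transform("mean")
    return pdf

result = ddf.map_partitions(add_difference)

Because the index is set and the divisions are known, each worker only touches its own partition and no shuffle is needed, which is exactly the parallel pattern described in the question.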