I have bank account records where each row is the monthly balance of an account:
Acct.Number Date Balance
1111 2020-03-31 1000
1111 2020-04-30 1300
1111 2020-05-31 1100
1111 2020-06-30 1400
2222 2020-03-31 31000
2222 2020-04-30 32000
2222 2020-05-31 31100
2222 2020-06-30 31300
. . .
. . .
9999 2020-03-31 8000
9999 2020-04-30 8800
9999 2020-05-31 8100
9999 2020-06-30 8500
Assume there are 10 million accounts and 10 years of data. I need to apply a function to each account.
For example, for each account I need to take the mean balance over the 10 years and calculate the difference between each month's balance and that mean. For account 1111 it looks like this (the mean is 1200):
Acct.Number Date Balance Difference
1111 2020-03-31 1000 -200
1111 2020-04-30 1300 100
1111 2020-05-31 1100 -100
1111 2020-06-30 1400 200
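In plain pandas the computation itself is a one-liner with a grouped transform; here is a minimal sketch reproducing the table above (the sample frame and column names mirror the question, everything else is illustrative):

import pandas as pd

# Sample frame mirroring the layout above
df = pd.DataFrame({
    "Acct.Number": [1111] * 4 + [2222] * 4,
    "Date": pd.to_datetime(["2020-03-31", "2020-04-30",
                            "2020-05-31", "2020-06-30"] * 2),
    "Balance": [1000, 1300, 1100, 1400, 31000, 32000, 31100, 31300],
})

# Broadcast each account's mean back to its rows, then subtract
df["Difference"] = df["Balance"] - df.groupby("Acct.Number")["Balance"].transform("mean")
print(df[df["Acct.Number"] == 1111])

For account 1111 the mean is 1200, giving differences -200, 100, -100, 200 as in the table. The question is how to do this at scale with Dask.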
My thinking is this: once the data is split across multiple Dask partitions, with each partition holding N complete accounts, I can use map_partitions to process the accounts in each partition in parallel.
This only works if each account, with all of its dates, sits in a single partition. If I let Dask split the data into partitions arbitrarily, that is not guaranteed. Is there a way to tell Dask which data should go into each partition?
As a side note, I create the Dask dataframe and its partitions with Dask's read_sql_query. The data is ordered by account and date.
CodePudding user response:
Since the data is sorted by account, specifying index_col="Acct.Number" when loading the dataframe with read_sql_query
will ensure that each partition contains the rows for a contiguous range of account numbers. It is still possible for a specific account's rows to span more than one partition, but operations on these partitions will be faster, because Dask will know the divisions and therefore which partition holds which accounts.
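Here is a sketch of how that could look end to end, under stated assumptions: query is a SQLAlchemy Select over your balances table (read_sql_query requires a SQLAlchemy selectable, not a raw string), the connection string is a placeholder, and npartitions is illustrative. The per-partition function is hypothetical and groups on the index, which is Acct.Number after loading:

import dask.dataframe as dd

# Assumed: `query` is a SQLAlchemy Select over the balances table,
# and "Acct.Number" is an indexed, sorted column in the database.
ddf = dd.read_sql_query(
    sql=query,
    con="postgresql://user:pass@host/db",   # placeholder connection string
    index_col="Acct.Number",                # gives Dask known divisions
    npartitions=500,                        # illustrative value
)

# Each partition holds a contiguous range of account numbers, so a
# per-account computation inside map_partitions is safe.
def add_difference(pdf):
    pdf = pdf.copy()
    # Group on the index (Acct.Number) and subtract each account's mean
    pdf["Difference"] = pdf["Balance"] - pdf.groupby(level=0)["Balance"].transform("mean")
    return pdf

result = ddf.map_partitions(add_difference)

Because the index is set and the divisions are known, each worker only touches its own partition and no shuffle is needed, which is exactly the parallel pattern described in the question.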