I have a dataframe which looks like this:
vid  sid  pid  ts
1    101  123  ...
2    102  125
3    102  125
4    102  125
Essentially, vid is a visitor ID and sid is a session ID.
I am trying to partition the df, which has about 1.7 million rows, into smaller dataframes of roughly 100k rows each:
for i in range(0, len(df), s):
    sdf = df.iloc[i:i + s]
However, I do not want to slice the dataframe in the middle of a session (i.e., the last row of a sliced portion must also be the last row of its session).
For example, the slice below would be a problem because it cuts the dataframe while the session id sid is still occurring:
vid     sid  pid  ts
99999   101  144  ...
99999   102  145
100000  102  145
--------------------------
100001  102  145
I'm looking for some way to detect when the cut-off falls mid-session and, in that case, push the cut-off forward until the sids are no longer the same, something like:
for i in range(0, len(df), s):
    if df['sid'].iloc[i + s - 1] != df['sid'].iloc[i + s]:
        sdf = df.iloc[i:i + s]
    else:
        # keep extending the slice until the sessions are no longer equal
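The "push the cut-off forward" logic described above can be sketched in plain pandas like this (the helper name session_chunks is mine, not from the question; it assumes rows belonging to one session are contiguous in the frame):

```python
import pandas as pd

def session_chunks(df, s):
    # Yield slices of roughly s rows, pushing each cut forward so that
    # no session (a run of equal sid values) is split across two chunks.
    # Assumes rows of a session are contiguous in df.
    start, n = 0, len(df)
    while start < n:
        end = min(start + s, n)
        # extend the slice while the row at the boundary still belongs
        # to the same session as the row just before it
        while end < n and df['sid'].iloc[end] == df['sid'].iloc[end - 1]:
            end += 1
        yield df.iloc[start:end]
        start = end
```

Each chunk is at least s rows (except possibly the last), and concatenating the chunks reproduces the original frame.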
CodePudding user response:
You can use dask for that
import dask.dataframe as dd
ddf = dd.from_pandas(df.set_index('sid'), npartitions=17).reset_index()
Note that the number of partitions is not enforced to always be 17: dask might decide to partition differently in order to keep equal indices within one partition - which is exactly what you want. Alternatively, you could specify a size or number of rows, I think.
Then you can either retrieve a single partition with something like
ddf.get_partition(3).compute()
or use dask directly for the distributed computation, as that is what it was made for.