I have a dataframe which looks like this:
vid  sid  pid  ts
1    101  123  ...
2    102  125
3    102  125
4    102  125
Essentially, vid is a visitor ID and sid is a session ID.
I am trying to partition the df, which has about 1.7 million rows, into smaller dataframes of roughly 100k rows each:
for i in range(0, len(df), s):
    sdf = df.iloc[i:i + s]
However, I do not want to slice the dataframe in the middle of a session (i.e., the last row of a sliced portion must also be the last row of its session).
For example, the slice below would be a problem because it cuts the dataframe while the session id sid is still occurring:
vid     sid  pid  ts
99999   101  144  ...
99999   102  145
100000  102  145
--------------------------
100001  102  145
I'm looking for some way to detect when the cut-off falls mid-session and, in that case, push the cut-off forward until the sids are no longer the same, something like:
for i in range(0, len(df), s):
    if df['sid'].iloc[i + s - 1] != df['sid'].iloc[i + s]:
        sdf = df.iloc[i:i + s]
    else:
        # keep extending the slice until the sessions are no longer equal
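The "push the cut-off forward" logic described above can be sketched in plain pandas like this (the helper name session_chunks is mine, not from the question; it assumes rows belonging to one session are contiguous in the frame):

```python
import pandas as pd

def session_chunks(df, s):
    # Yield slices of roughly s rows, pushing each cut forward so that
    # no session (a run of equal sid values) is split across two chunks.
    # Assumes rows of a session are contiguous in df.
    start, n = 0, len(df)
    while start < n:
        end = min(start + s, n)
        # extend the slice while the row at the boundary still belongs
        # to the same session as the row just before it
        while end < n and df['sid'].iloc[end] == df['sid'].iloc[end - 1]:
            end += 1
        yield df.iloc[start:end]
        start = end
```

Each chunk is at least s rows (except possibly the last), and concatenating the chunks reproduces the original frame.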
CodePudding user response:
You can use dask for that
import dask.dataframe as dd
ddf = dd.from_pandas(df.set_index('sid'), npartitions=17).reset_index()
Note that the number of partitions is not enforced to always be 17: dask might decide to partition differently in order to keep equal indices within one partition - which is exactly what you want. Alternatively, you could specify a size or number of rows, I think.
Then you can either retrieve a single partition with something like
ddf.get_partition(3).compute()
or use dask directly for the distributed computation, as that is what it was made for.