Slicing operation with Dask in an optimal way with Python

Time:08-21

I have a few questions about slicing operations. In pandas we can do the following:

df["A"].iloc[0]
df["B"].iloc[-1]

# here df["A"] and df["B"] are sorted

As far as I can tell, we can't do this (slicing and multi-column sorting) with Dask (I am not 100% sure), so I used another way to do it:

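# sort the whole frame by A, then take the first value of A
sorted_by_a = df.sort_values(by=['A'])
first = list(sorted_by_a["A"])[0]
# sort the whole frame by B, then take the last value of B
sorted_by_b = df.sort_values(by=['B'])
end = list(sorted_by_b["B"])[-1]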

This approach is really time-consuming when the dataframe is large. Is there any other way to do this operation?

https://docs.dask.org/en/latest/dataframe-indexing.html

https://docs.dask.org/en/latest/array-slicing.html

I tried working from these pages, but could not get it to work.

CodePudding user response:

The index of a Dask dataframe is different from a Pandas one, because Pandas has a global ordering of the data. A Dask dataframe is typically indexed from 0 to N-1 within each partition (for example when it is read from CSV files), so there are multiple rows with an index value of 0. This is why iloc on a row is disallowed, I think.

For this specifically, use

first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html

last: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html

Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizable operations: they can be computed per partition and then applied again to the per-partition results.

CodePudding user response:

It's possible to get almost .iloc-like behaviour with dask dataframes, but it requires one pass through the whole dataset. For very large datasets, this might be a meaningful time cost.

The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero or using this answer), and 2) swap .iloc[N] for .loc[N].

This won't help with relative syntax like .iloc[-1]; however, if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc.
