How to get hash of string column in Polars or Pyarrow-CodePudding

I have a Pandas DataFrame/Polars dataframe / Pyarrow table with a string key column. You can assume the strings are random. I want to partition that dataframe into N smaller dataframes based on this key column.

With an integer column, I can just use df1 = df[df.key % N == 1], df2 = df[df.key % N == 2] etc.

My best guess at how you are going to do that with a string column is apply a hash function (e.g. summing the ascii values of the string) to convert it to an integer column, then use the modulus.

Please let me know what's the most efficient way this can be done in either Pandas, Polars or Pyarrow, ideally with pure columnar operations within the API. Doing a df.apply is likely too slow for my use case.

CodePudding user response：

I would try using hash_rows to see how it performs on your dataset and computing platform. (Note that in the calculation, I'm effectively selecting only the key field and running the hash_rows on that)

N = 50
df = df.with_column(
    pl.lit(df.select(['key']).hash_rows() % N).alias('hash')
)

I just ran this on dataset with almost 49 million records on a 32-core system, and it completed within seconds. (The 'key' field in my dataset was last names of people.)

I should also note, there's a partition_by method that may be of help in the partitioning.