I suppose someone might have asked this already, but for the life of me I cannot find what I need after some searching; possibly my Python level is too low. I saw several questions with answers using globals() and exec(), with comments that it's a bad idea; other answers suggest using dictionaries or lists. At this point I got a bit loopy about what to use here, and any help would be very welcome.
What I need is roughly this:
- I have a pandas DataFrame, say called dftest.
- I'd like to split dftest into, say, 6 parts of similar size.
- Then I'd like to iterate over them (or possibly parallelise?) and run some steps calling spatial functions that use parameters (param0, param1, ..., param5) over each of the rows of each part to add more columns, preferably exporting each result to a CSV (as it takes a long time to complete one part, I wouldn't want to lose the result of each iteration).
- Then I'd like to put the parts back together into one DataFrame, say dfresult (possibly with concat), and continue doing the next thing with it.
To keep it simple, this is what a toy dftest looks like (the original df has more rows and columns):
print(dftest)
# rowid type lon year
# 1 1 Tomt NaN 2021
# 2 2 Lägenhet 12.72 2022
# 3 3 Lägenhet NaN 2017
# 4 4 Villa 17.95 2016
# 5 5 Radhus 17.95 2021
# 6 6 Villa 17.95 2016
# 7 7 Fritidshus 18.64 2020
# 8 8 Villa 18.64 2019
# 9 9 Villa 18.63 2021
# 10 10 Villa 18.63 2019
# 11 11 Villa 17.66 2017
# 12 12 Radhus 17.66 2022
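In case anyone wants to reproduce it, the toy frame can be built like this (values copied from the printout above):

import numpy as np
import pandas as pd

dftest = pd.DataFrame(
    {
        "rowid": range(1, 13),
        "type": ["Tomt", "Lägenhet", "Lägenhet", "Villa", "Radhus", "Villa",
                 "Fritidshus", "Villa", "Villa", "Villa", "Villa", "Radhus"],
        "lon": [np.nan, 12.72, np.nan, 17.95, 17.95, 17.95,
                18.64, 18.64, 18.63, 18.63, 17.66, 17.66],
        "year": [2021, 2022, 2017, 2016, 2021, 2016,
                 2020, 2019, 2021, 2019, 2017, 2022],
    },
    index=range(1, 13),
)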
So here is what I tried:
dfs = np.array_split(dftest, 6)
for j in range(0, 6):
    print(f'dfs[{j}] has', len(dfs[j].index), 'obs', min(dfs[j].index), 'to', max(dfs[j].index))
where I get output:
# dfs[0] has 2 obs 1 to 2
# dfs[1] has 2 obs 3 to 4
# dfs[2] has 2 obs 5 to 6
# dfs[3] has 2 obs 7 to 8
# dfs[4] has 2 obs 9 to 10
# dfs[5] has 2 obs 11 to 12
So now I'd like to iterate over each part and create more columns. I tried a hardcoded test, one part at a time, something like this:
from tqdm import tqdm

for row in tqdm(dfs[0].itertuples()):
    x = row.type
    y = foo.bar(x, param="param0")   # one of my spatial functions
    i = row.Index                    # the row's index label
    dfs[0].loc[i, 'anotherColumn'] = baz(y)
    # ... some more functions ...
dfs[0].to_csv("/projectPath/dfs0.csv")
I suppose this should be possible to automate, or even run in parallel (how?).
And in the end I'll try putting them together (no clue if this would work), possibly something like this:
pd.concat([dfs[0],dfs[1],dfs[2],dfs[3],dfs[4],dfs[5]])
If I had 100 parts, perhaps something like dfs[0]:dfs[5] would work... but I'm still stuck on the previous step.
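To make it concrete, here is a rough, untested sketch of what I imagine the automated version could look like (foo.bar and baz stand in for my spatial functions, and I'm assuming each part gets its own parameter):

import numpy as np
import pandas as pd
from tqdm import tqdm

n_parts = 6  # could just as well be 100
processed = []

for j, part in enumerate(np.array_split(dftest, n_parts)):
    part = part.copy()  # work on a copy so each part can be modified safely
    for row in tqdm(part.itertuples()):
        y = foo.bar(row.type, param=f"param{j}")       # placeholder spatial function
        part.loc[row.Index, 'anotherColumn'] = baz(y)  # placeholder
    part.to_csv(f"/projectPath/dfs{j}.csv")            # save each part as soon as it's done
    processed.append(part)

dfresult = pd.concat(processed)  # put all parts back together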
PS. I'm using a Jupyter notebook on localhost with Python3.
CodePudding user response:
As far as I understand, you can use the chunk_apply function of the parallel-pandas library. This function splits the DataFrame into chunks, applies a custom function to each chunk, and then concatenates the results. Everything runs in parallel. Toy example:
#pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
# n_cpu - number of cores to use (and number of chunks to split into)
ParallelPandas.initialize(n_cpu=8)

def foo(df):
    # do something with df
    df['new_col'] = df.sum(axis=1)
    return df

if __name__ == '__main__':
    ROW = 10000
    COL = 10
    df = pd.DataFrame(np.random.random((ROW, COL)))
    res = df.chunk_apply(foo, axis=0)
    print(res.head())
Out:
0 1 2 ... 8 9 new_col
0 0.735248 0.393912 0.966608 ... 0.261675 0.207216 6.276589
1 0.256962 0.461601 0.341175 ... 0.688134 0.607418 5.297881
2 0.335974 0.093897 0.622115 ... 0.442783 0.115127 3.102827
3 0.488585 0.709927 0.209429 ... 0.942065 0.126600 4.367873
4 0.619996 0.704085 0.685806 ... 0.626539 0.145320 4.901926
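If you want to keep the per-chunk CSV export from your question, the function you pass to chunk_apply can also write its chunk to disk before returning it. A rough, untested sketch (foo.bar, baz and the path are the placeholders from your question; since the chunk number isn't passed to the function, the file is named after the chunk's first index label):

def process_chunk(chunk):
    chunk = chunk.copy()
    # per-row spatial work, as in the question
    for row in chunk.itertuples():
        y = foo.bar(row.type, param="param0")
        chunk.loc[row.Index, 'anotherColumn'] = baz(y)
    # save this chunk so a long run doesn't lose finished work
    chunk.to_csv(f"/projectPath/chunk_{chunk.index[0]}.csv")
    return chunk

dfresult = dftest.chunk_apply(process_chunk, axis=0)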