Pandas DataFrame, how to make it faster?

Time:09-23

I have two DataFrames like the following:

bdata=
               A     B     C     D
2018-01-01   NaN   NaN   NaN   NaN
2018-01-02   NaN   NaN   NaN   NaN
2018-01-03   NaN   NaN   NaN   NaN
2018-01-04   NaN   NaN   NaN   NaN
2018-01-05   NaN   NaN  97.0   NaN
2018-01-06  92.0  69.0  20.0  75.0
2018-01-07  49.0  60.0  56.0   NaN
2018-01-08  56.0   NaN  95.0  15.0
2018-01-09  72.0   NaN   NaN  84.0
2018-01-10   NaN   NaN  74.0   NaN

sdata=
              A    B   C    D
2018-01-01  NaN  NaN NaN  NaN
2018-01-02  NaN  NaN NaN  NaN
2018-01-03  NaN  NaN NaN  NaN
2018-01-04  NaN  NaN NaN  NaN
2018-01-05  7.0  NaN NaN  NaN
2018-01-06  NaN  NaN NaN  NaN
2018-01-07  NaN  NaN NaN  3.0
2018-01-08  NaN  NaN NaN  NaN
2018-01-09  NaN  NaN NaN  NaN
2018-01-10  NaN  5.0 NaN  0.0

Starting from each day, for each column I want the ratio of the first non-NaN element in sdata whose index is strictly greater than that of its counterpart (the first non-NaN element in bdata) to that first non-NaN element in bdata. The calculation should be made for each day; for example, on 2018-01-05 the ratios should be:

  A          B          C          D
NaN/92.0   5.0/69.0   NaN/97.0   3.0/75.0

on 2018-01-06, the ratio should be:

  A          B          C          D
NaN/92.0   5.0/69.0   NaN/20.0   3.0/75.0

on 2018-01-07, the ratio should be:

  A          B          C          D
NaN/49.0   5.0/60.0   NaN/56.0   0.0/15.0

I'm using the following code:

ratiosum = 0
for idate in bdata.index[:-1]:
    bdata1 = bdata.loc[bdata.index >= idate].ffill()
    sdata1 = sdata.loc[bdata.index >= idate][bdata1.shift(1) > 0]  # bdata is always > 0
    bvalue1 = bdata1.bfill().iloc[0]
    svalue1 = sdata1.bfill().iloc[0]
    ratiosum += (svalue1 / bvalue1).sum()  # actually, I only want the sum of all ratios

It works, but it is very slow because I have thousands of days and columns. Can someone help me improve the performance of this calculation? Thanks!

CodePudding user response:

The performance of your code is penalized by the for loop, which involves, for each date:

  • logical indexing (query and mask);
  • data creation (fill).

I have the feeling that the following MCVE will perform the desired computations in a single pass just by properly aligning data for the division.

First let's recreate your datasets:

import io
import pandas as pd

bdata = pd.read_fwf(io.StringIO("""      Time     A     B     C     D
2018-01-01   NaN   NaN   NaN   NaN
2018-01-02   NaN   NaN   NaN   NaN
2018-01-03   NaN   NaN   NaN   NaN
2018-01-04   NaN   NaN   NaN   NaN
2018-01-05   NaN   NaN  97.0   NaN
2018-01-06  92.0  69.0  20.0  75.0
2018-01-07  49.0  60.0  56.0   NaN
2018-01-08  56.0   NaN  95.0  15.0
2018-01-09  72.0   NaN   NaN  84.0
2018-01-10   NaN   NaN  74.0   NaN
"""))

sdata = pd.read_fwf(io.StringIO("""      Time    A    B   C    D
2018-01-01  NaN  NaN NaN  NaN
2018-01-02  NaN  NaN NaN  NaN
2018-01-03  NaN  NaN NaN  NaN
2018-01-04  NaN  NaN NaN  NaN
2018-01-05  7.0  NaN NaN  NaN
2018-01-06  NaN  NaN NaN  NaN
2018-01-07  NaN  NaN NaN  3.0
2018-01-08  NaN  NaN NaN  NaN
2018-01-09  NaN  NaN NaN  NaN
2018-01-10  NaN  5.0 NaN  0.0
"""))

bdata["Time"] = pd.to_datetime(bdata["Time"])
sdata["Time"] = pd.to_datetime(sdata["Time"])

Then we fill and shift to align data for the division and perform the operation:

bdata = bdata.set_index("Time").ffill().bfill()
sdata = sdata.set_index("Time").shift(-1).bfill()

ratios = sdata.div(bdata)

#                    A         B   C     D
# Time                                    
# 2018-01-01  0.076087  0.072464 NaN  0.04
# 2018-01-02  0.076087  0.072464 NaN  0.04
# 2018-01-03  0.076087  0.072464 NaN  0.04
# 2018-01-04  0.076087  0.072464 NaN  0.04
# 2018-01-05       NaN  0.072464 NaN  0.04
# 2018-01-06       NaN  0.072464 NaN  0.04
# 2018-01-07       NaN  0.083333 NaN  0.00
# 2018-01-08       NaN  0.083333 NaN  0.00
# 2018-01-09       NaN  0.083333 NaN  0.00
# 2018-01-10       NaN       NaN NaN   NaN

Those results are at least correct w.r.t. your expected output.
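As a side note, `div` works in a single pass because pandas aligns both operands on index and columns before dividing. A minimal sketch with toy frames (hypothetical labels and values, not the OP's data) illustrating that alignment:

```python
import pandas as pd

# Two small frames sharing an index; div aligns on index and columns,
# producing NaN wherever a label exists in only one operand.
num = pd.DataFrame({"A": [1.0, 2.0]}, index=["x", "y"])
den = pd.DataFrame({"A": [2.0, 4.0], "B": [1.0, 1.0]}, index=["x", "y"])

out = num.div(den)
# Column "B" is all NaN because it is missing from the numerator.
print(out)
```

This is why no explicit reindexing is needed before the division, as long as both frames share the same index and column labels.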

Then getting the sum of ratios by date is just a matter of aggregating over the right axis:

sums = ratios.sum(axis=1)

# Time
# 2018-01-01    0.188551
# 2018-01-02    0.188551
# 2018-01-03    0.188551
# 2018-01-04    0.188551
# 2018-01-05    0.112464
# 2018-01-06    0.112464
# 2018-01-07    0.083333
# 2018-01-08    0.083333
# 2018-01-09    0.083333
# 2018-01-10    0.000000
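Note that the all-NaN 2018-01-10 row sums to 0.000000 rather than NaN because `sum` skips NaN by default (`skipna=True`). A small sketch showing the behavior, and `min_count` if NaN is preferred for empty rows:

```python
import numpy as np
import pandas as pd

row = pd.Series([np.nan, np.nan, np.nan])

# Default: NaN values are skipped, so an all-NaN row sums to 0.0.
default_sum = row.sum()

# min_count=1 requires at least one valid value, yielding NaN otherwise.
strict_sum = row.sum(min_count=1)

print(default_sum, strict_sum)
```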

More generally, to improve performance:

  • Avoid writing for loops to iterate over a DataFrame; there is almost always a vectorized way to avoid the loop;
  • Don't use logical indexing within a loop; if you feel you have to, prefer group-by operations instead, as they are probably more appropriate;
  • Don't recreate data at each iteration; it wastes resources.
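As a toy illustration of the first point (hypothetical data, unrelated to the OP's frames), a vectorized call replaces an explicit Python-level loop and yields the same result, typically orders of magnitude faster on large frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=list("ABC"))

# Loop version: one Python-level iteration per row.
loop_sums = [row.sum() for _, row in df.iterrows()]

# Vectorized version: a single call operating on the whole block at once.
vec_sums = df.sum(axis=1)

assert loop_sums == vec_sums.tolist()
```

The vectorized form pushes the iteration down into compiled NumPy code instead of interpreting it row by row.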

CodePudding user response:

Based on jlandercy's answer, I made small changes; the result matches my previous for-loop version, but runs much faster:

import io
import pandas as pd

bdata = pd.read_fwf(io.StringIO("""      Time     A     B     C     D
2018-01-01   NaN   NaN   NaN   NaN
2018-01-02   NaN   NaN   NaN   NaN
2018-01-03   NaN   NaN   NaN   NaN
2018-01-04   NaN   NaN   NaN   NaN
2018-01-05   NaN   NaN  97.0   NaN
2018-01-06  92.0  69.0  20.0  75.0
2018-01-07  49.0  60.0  56.0   NaN
2018-01-08  56.0   NaN  95.0  15.0
2018-01-09  72.0   NaN   NaN  84.0
2018-01-10   NaN   NaN  74.0   NaN
"""))

sdata = pd.read_fwf(io.StringIO("""      Time    A    B   C    D
2018-01-01  NaN  NaN NaN  NaN
2018-01-02  NaN  NaN NaN  NaN
2018-01-03  NaN  NaN NaN  NaN
2018-01-04  NaN  NaN NaN  NaN
2018-01-05  7.0  NaN NaN  NaN
2018-01-06  NaN  NaN NaN  NaN
2018-01-07  NaN  NaN NaN  3.0
2018-01-08  NaN  NaN NaN  NaN
2018-01-09  NaN  NaN NaN  NaN
2018-01-10  NaN  5.0 NaN  0.0
"""))

bdata["Time"] = pd.to_datetime(bdata["Time"])
sdata["Time"] = pd.to_datetime(sdata["Time"])
bdata = bdata.set_index("Time")
sdata = sdata.set_index("Time")

sdata = sdata.shift(-1).bfill()
ratiosum = (sdata / bdata).bfill().sum().sum()
print(ratiosum)

I do appreciate your help, jlandercy!
