I have two DataFrames like the followings:
bdata=
A B C D
2018-01-01 NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN
2018-01-05 NaN NaN 97.0 NaN
2018-01-06 92.0 69.0 20.0 75.0
2018-01-07 49.0 60.0 56.0 NaN
2018-01-08 56.0 NaN 95.0 15.0
2018-01-09 72.0 NaN NaN 84.0
2018-01-10 NaN NaN 74.0 NaN
sdata=
A B C D
2018-01-01 NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN
2018-01-05 7.0 NaN NaN NaN
2018-01-06 NaN NaN NaN NaN
2018-01-07 NaN NaN NaN 3.0
2018-01-08 NaN NaN NaN NaN
2018-01-09 NaN NaN NaN NaN
2018-01-10 NaN 5.0 NaN 0.0
I want to calculate the ratio of the first non-NaN element in sdata which has a larger index than its counterpart in bdata to the first non-NaN element in bdata in the same column. Calculation should be made on each day, for example on 2018-01-05, the ratio should be:
A B C D
NaN/92.0 5.0/69.0 NaN/97.0 3.0/75.0
on 2018-01-06, the ratio should be:
A B C D
NaN/92.0 5.0/69.0 NaN/20.0 3.0/75.0
on 2018-01-07, the ratio should be:
A B C D
NaN/49.0 5.0/60.0 NaN/56.0 0.0/15.0
I'm using the following codes:
ratiosum=0
for idate in bdata.index[:-1]:
bdata1=bdata.loc[bdata.index>=idate].fillna(method='ffill')
sdata1=sdata.loc[bdata.index>=idate][bdata1.shift(1)>0] #bdata is always >0
bvalue1=bdata1.fillna(method='bfill').iloc[0]
svalue1=sdata1.fillna(method='bfill').iloc[0]
ratiosum =(svalue1/bvalue1).sum() #Actually, I only want the sum of all ratios
It works, but very slow because I have thousands of days and columns. I don't know if someone can improve the performance of this calculation. Thanks!
CodePudding user response:
Performance of your code is penalized by the for loop which involves for each date:
- logical indexing (query and mask);
- data creation (fill).
I have the feeling that the following MCVE will perform the desired computations in a single pass just by properly aligning data for the division.
First let's recreate your datasets:
import io
import pandas as pd
bdata = pd.read_fwf(io.StringIO(""" Time A B C D
2018-01-01 NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN
2018-01-05 NaN NaN 97.0 NaN
2018-01-06 92.0 69.0 20.0 75.0
2018-01-07 49.0 60.0 56.0 NaN
2018-01-08 56.0 NaN 95.0 15.0
2018-01-09 72.0 NaN NaN 84.0
2018-01-10 NaN NaN 74.0 NaN
"""))
sdata = pd.read_fwf(io.StringIO(""" Time A B C D
2018-01-01 NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN
2018-01-05 7.0 NaN NaN NaN
2018-01-06 NaN NaN NaN NaN
2018-01-07 NaN NaN NaN 3.0
2018-01-08 NaN NaN NaN NaN
2018-01-09 NaN NaN NaN NaN
2018-01-10 NaN 5.0 NaN 0.0
"""))
bdata["Time"] = pd.to_datetime(bdata["Time"])
sdata["Time"] = pd.to_datetime(sdata["Time"])
Then we fill and shift to align data for the division and perform the operation:
bdata = bdata.set_index("Time").ffill().bfill()
sdata = sdata.set_index("Time").shift(-1).bfill()
ratios = sdata.div(bdata)
# A B C D
# Time
# 2018-01-01 0.076087 0.072464 NaN 0.04
# 2018-01-02 0.076087 0.072464 NaN 0.04
# 2018-01-03 0.076087 0.072464 NaN 0.04
# 2018-01-04 0.076087 0.072464 NaN 0.04
# 2018-01-05 NaN 0.072464 NaN 0.04
# 2018-01-06 NaN 0.072464 NaN 0.04
# 2018-01-07 NaN 0.083333 NaN 0.00
# 2018-01-08 NaN 0.083333 NaN 0.00
# 2018-01-09 NaN 0.083333 NaN 0.00
# 2018-01-10 NaN NaN NaN NaN
Those results are at least correct w.r.t. your expected output.
Then having the sum of ratios by date is just about aggregating over the right axis:
sums = ratios.sum(axis=1)
# Time
# 2018-01-01 0.188551
# 2018-01-02 0.188551
# 2018-01-03 0.188551
# 2018-01-04 0.188551
# 2018-01-05 0.112464
# 2018-01-06 0.112464
# 2018-01-07 0.083333
# 2018-01-08 0.083333
# 2018-01-09 0.083333
# 2018-01-10 0.000000
More generally, to improve performance:
- Avoid creating for loop to iterate over a DataFrame there are almost always other ways to prevent the loop usage;
- Don't use logical indexing within a loop, if you feel you have to then prefer group by operations instead they are probably more appropriate;
- Don't recreate data, it wastes resources.
CodePudding user response:
Based on jlandercy's answer, I made small changes and the results are the same as my previous for-loop results, but much faster:
import io
import pandas as pd
bdata = pd.read_fwf(io.StringIO(""" Time A B C D
2018-01-01 NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN
2018-01-05 NaN NaN 97.0 NaN
2018-01-06 92.0 69.0 20.0 75.0
2018-01-07 49.0 60.0 56.0 NaN
2018-01-08 56.0 NaN 95.0 15.0
2018-01-09 72.0 NaN NaN 84.0
2018-01-10 NaN NaN 74.0 NaN
"""))
sdata = pd.read_fwf(io.StringIO(""" Time A B C D
2018-01-01 NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN
2018-01-05 7.0 NaN NaN NaN
2018-01-06 NaN NaN NaN NaN
2018-01-07 NaN NaN NaN 3.0
2018-01-08 NaN NaN NaN NaN
2018-01-09 NaN NaN NaN NaN
2018-01-10 NaN 5.0 NaN 0.0
"""))
bdata["Time"] = pd.to_datetime(bdata["Time"])
sdata["Time"] = pd.to_datetime(sdata["Time"])
bdata = bdata.set_index("Time")
sdata = sdata.set_index("Time")
sdata=sdata.shift(-1).bfill()
ratiosum=(sdata/bdata).bfill().sum().sum()
print(ratiosum)
I do appreciate your help, jlandercy!