I have a similar, but larger data set with more dates and over ten thousand rows. Usually, it takes 3mins or longer to run the code and plot. I think the problem comes from loop. Looping is time-consuming in python. In this case, would be appreciated if someone knows how to rewrite the code to make it faster.
data = {'Date' : ["2022-07-01"]*5000 ["2022-07-02"]*5000 ["2022-07-03"]*5000,
'OB1' : range(1,15001),
'OB2' : range(1,15001)}
df = pd.DataFrame(data)
# multi-indexing
df = df.set_index(['Date'])
# loop for plot
i = 1
fig, axs = plt.subplots(nrows = 1, ncols = 3, sharey = True)
fig.subplots_adjust(wspace=0)
for j, sub_df in df.groupby(level=0):
plt.subplot(130 i)
x = sub_df['OB1']
y = sub_df['OB2']
plt.barh(x, y)
i = i 1
plt.show()
CodePudding user response:
The slowness comes from the barh
function, which involves drawing many rectangles. While your example is already pretty slow (a minute on my laptop), this one runs in less than a second. I replaced barh
with fill_betweenx
, which fills the area between two curves (here 0 and the height of bars) instead of drawing rectangles. It goes much faster but is not strictly the same. Also, I use the option step=post
, so if you zoom, you will have a bar-style graph.
import pandas as pd
import matplotlib.pyplot as plt
data = {
"Date": ["2022-07-01"] * 5000
["2022-07-02"] * 5000
["2022-07-03"] * 5000,
"OB1": range(1, 15001),
"OB2": range(1, 15001),
}
df = pd.DataFrame(data)
# multi-indexing
df = df.set_index(["Date"])
# loop for plot
i = 1
fig, axs = plt.subplots(nrows=1, ncols=3, sharey=True)
fig.subplots_adjust(wspace=0)
for j, sub_df in df.groupby(level=0):
plt.subplot(130 i)
x = sub_df["OB1"]
y = sub_df["OB2"]
# plt.barh(x, y)
plt.fill_betweenx(y, 0, x, step="post")
i = i 1
plt.show()