I would like to iterate through the rows of a dataframe df_mask (4368 rows x 232 columns), generate a Pandas Series series for each row, and rebuild a dataframe container from those Series. My problem with the code below is that it takes several minutes to complete.
How can I speed up the code execution?
df_prices = get_prices_df()
container = pd.DataFrame()
for idx, row in df_mask.iterrows():
    cols = row[row == True].index
    series = df_prices.loc[idx, cols].rank(axis=0, ascending=False, na_option='bottom').le(10)
    df = pd.DataFrame([series])
    container = pd.concat([container, df], axis=0).fillna(False)
CodePudding user response:
Assuming your input data is similar to this:
import numpy as np
import pandas as pd
np.random.seed(10)
df_prices = pd.DataFrame(np.random.choice(list(range(10)), size=100).reshape(10,-1))
df_mask = pd.DataFrame(np.random.choice([True, False], size=100).reshape(10,-1))
then you can create container without a for loop by using where directly on the full dataframe df_prices with the dataframe df_mask. Then rank along the columns (axis=1), since with this method you no longer iterate row by row, compare to 10 (here 5 for the example), and fillna with False (although I don't think that last step is necessary, I haven't had time to check).
container_fast = (
    df_prices.where(df_mask)
    .rank(axis=1, ascending=False, na_option='bottom')
    .le(5)  # replace by 10, but with the input used here that makes everything True
    .fillna(False)
)
print(container_fast)
       0      1      2      3      4      5      6      7      8      9
0   True   True  False  False  False  False   True   True   True  False
1   True   True   True  False  False   True  False  False  False   True
2   True   True  False   True  False   True  False  False  False  False
3  False  False   True  False  False  False  False  False  False   True
4   True  False  False   True  False  False  False   True  False   True
5  False   True   True  False  False   True  False  False  False   True
6  False  False   True  False  False  False   True  False   True   True
7  False  False  False  False  False   True   True  False  False   True
8  False   True  False  False   True   True  False  False   True  False
9   True  False   True  False   True  False   True  False  False   True
Creating container the way you do and then running (container == container_fast).all().all() gives True.
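To make that comparison easy to reproduce, here is a minimal sketch on a tiny hand-made input. The column names, the values and the cutoff n = 1 are made up for illustration, and the loop collects its rows in a list instead of calling pd.concat on every iteration. It only shows that the two approaches agree on this toy data, not on every possible input.

```python
import pandas as pd

# Hypothetical toy data (not the asker's real frames).
df_prices = pd.DataFrame({"a": [3.0, 1.0], "b": [2.0, 5.0], "c": [1.0, 4.0]})
df_mask = pd.DataFrame({"a": [True, False], "b": [True, True], "c": [False, True]})
n = 1  # stands in for the 10 used in the question

# Row-by-row version, close to the question's loop.
rows = []
for idx, row in df_mask.iterrows():
    cols = row[row].index  # labels of the columns selected by the mask
    flags = df_prices.loc[idx, cols].rank(ascending=False, na_option="bottom").le(n)
    # Columns that were masked out are set to False.
    rows.append(flags.reindex(df_prices.columns, fill_value=False))
container = pd.DataFrame(rows).astype(bool)

# Vectorized version: mask, rank across each row, compare to the cutoff.
container_fast = (
    df_prices.where(df_mask)
    .rank(axis=1, ascending=False, na_option="bottom")
    .le(n)
)

# Compare the raw values to sidestep any label-alignment issues.
same = bool((container.to_numpy() == container_fast.to_numpy()).all())
print(same)
```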
CodePudding user response:
Instead of iterating through rows (which is inefficient), try using the apply function. You can find the documentation here.
You can do something like this:
def helper_function(row):
    row.name  # Gives you the index label of the row
    row.index  # Gives you the column labels (the row is passed as a Series)
    row['column_name']  # Gives you the value of a specific column in that row
    # do some logic in here
    return something

new_series = your_df.apply(helper_function, axis=1)
With axis=1 the function is applied to each row; with axis=0 (the default) it is applied to each column.
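As a rough sketch of how the helper could look for the problem in the question, the example below applies a row-wise function that keeps only the masked columns, ranks them, and flags the top n. The data, the column names, the function name top_n_in_row and the cutoff n = 1 are all hypothetical. Note that apply with axis=1 still calls the Python function once per row, so it may not be much faster than iterrows on a large frame.

```python
import pandas as pd

# Hypothetical toy inputs, not the asker's data.
df_prices = pd.DataFrame(
    {"x": [10.0, 7.0, 2.0], "y": [8.0, 9.0, 6.0], "z": [5.0, 3.0, 4.0]}
)
df_mask = pd.DataFrame(
    {"x": [True, True, False], "y": [False, True, True], "z": [True, False, True]}
)

def top_n_in_row(row, n=1):
    # row.name is the index label of the current row in df_prices.
    selected = row[df_mask.loc[row.name]]
    # Rank the selected prices from highest to lowest and keep the top n.
    flags = selected.rank(ascending=False, na_option="bottom").le(n)
    # Put back the masked-out columns as False.
    return flags.reindex(row.index, fill_value=False)

result = df_prices.apply(top_n_in_row, axis=1)
print(result)
```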