How can I speed up the process of summing values in this Python code?

I have df.Ah, a column of a dataframe, which contains only positive values or zeros. I want to store in another dataframe the sum of each run of values that precedes a zero, followed by that zero.

For example, df.Ah = [1, 2, 3, 0, 46, 0, 24, 1] gives as output [6, 0, 46, 0, 25].

My attempt was:

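import numpy as np
import pandas as pd

lst = df.Ah
lst1 = pd.DataFrame(index=np.arange(100000000))  # if I overwrite a pre-existing column instead, the code speeds up

summ = 0
for i, elem in enumerate(lst):
    if elem != 0:
        summ = summ + elem
    else:
        if summ:
            # store the sum of the run on the row just before the zero,
            # so that the zero written below does not overwrite it
            lst1.loc[i - 1, 'Ah'] = summ
        lst1.loc[i, 'Ah'] = elem
        summ = 0
...

if summ:
    lst1.iloc[i + 1, 0] = summ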

# If I print the index i on each iteration, I get about 100,000 prints per minute,
# which means it would take around five hours to go through the 31 million values
# of my dataframe, and I don't have that much time for such a basic operation.

Is there a way to speed up this code?

CodePudding user response：

You can use the cumsum method from pandas to give every run of non-zero values, and every zero, its own group label, and then sum df.Ah within each group. This avoids the explicit loop, which is slow on large DataFrames.

Here's an example of how you could do this:

# Mark the zeros in df.Ah
is_zero = df.Ah.eq(0)

# Start a new group at every zero and at every value that follows a zero
group = (is_zero | is_zero.shift(fill_value=False)).cumsum()

# Sum within each group: a run of non-zero values collapses to its sum, a zero stays 0
df1 = df.Ah.groupby(group).sum().reset_index(drop=True).to_frame('Ah')

This should be much faster than using a loop, since cumsum and the grouped sum run in compiled code rather than in the Python interpreter.
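To make the grouping idea concrete, here is a minimal self-contained sketch run on the example from the question. The function name collapse_runs is hypothetical, and the sketch assumes the rule is "replace each run of non-zero values by its sum and keep every zero"; it is one way to do it, not necessarily the only one.

```python
import pandas as pd

def collapse_runs(values: pd.Series) -> pd.Series:
    """Replace each run of non-zero values by its sum; keep every zero."""
    is_zero = values.eq(0)
    # A new group starts at every zero and at the first value after a zero.
    starts = is_zero | is_zero.shift(fill_value=False)
    # The running count of group starts labels each row with its group.
    # For the sample below the labels are [0, 0, 0, 1, 2, 3, 4, 4].
    group = starts.cumsum()
    return values.groupby(group).sum().reset_index(drop=True)

ah = pd.Series([1, 2, 3, 0, 46, 0, 24, 1])
result = collapse_runs(ah)
print(result.tolist())  # [6, 0, 46, 0, 25]
```

Consecutive zeros each get their own group, so they are all kept in the output.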

CodePudding user response：

There are a few ways you can improve the performance of the code you provided. One is to avoid label-based loc assignment inside the loop: loc has to look each row up by label, which adds overhead on every call. Indexing by position with iloc skips that lookup, although the gain is usually modest because every single-cell write still goes through pandas. Here is how you can modify your code to use the iloc method instead of the loc method:

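lst = df.Ah
# create the 'Ah' column up front so that iloc[i, 0] has a column to write to
lst1 = pd.DataFrame({'Ah': np.nan}, index=df.index)

summ = 0
for i, elem in enumerate(lst):
    if elem != 0:
        summ = summ + elem
    else:
        if summ:
            # store the sum of the run on the row just before the zero
            lst1.iloc[i - 1, 0] = summ
        lst1.iloc[i, 0] = elem
        summ = 0

if summ:
    # the last row belongs to the trailing run, so it is free to hold its sum
    lst1.iloc[i, 0] = summ

# rows that were never written stay NaN; lst1.dropna() gives the compact result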
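Whichever indexer is used, writing one DataFrame cell per iteration tends to dominate the running time. If a loop is kept, a common alternative is to accumulate the results in a plain Python list and build the DataFrame once at the end. The sketch below assumes that approach; the name sum_runs_loop is hypothetical.

```python
import pandas as pd

def sum_runs_loop(values) -> list:
    """Plain-Python version: sum each run of non-zero values, keep every zero."""
    out = []
    run_total = 0
    for value in values:
        if value != 0:
            run_total += value
        else:
            if run_total:
                out.append(run_total)  # close the run that ended at this zero
            out.append(0)              # keep the zero itself
            run_total = 0
    if run_total:
        out.append(run_total)          # trailing run with no zero after it
    return out

ah = pd.Series([1, 2, 3, 0, 46, 0, 24, 1])
# Iterate over a plain list and create the DataFrame in one go.
df_out = pd.DataFrame({"Ah": sum_runs_loop(ah.tolist())})
print(df_out["Ah"].tolist())  # [6, 0, 46, 0, 25]
```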

Another way to improve the performance of your code is to avoid using a for loop to iterate over the values in the Ah column. Pandas dataframes are optimized for vectorized operations, so using a for loop can be slow. Instead, you can use vectorized operations to perform the same calculations on the entire Ah column at once, which will be much faster. Here is how you can modify your code to use vectorized operations instead of a for loop:

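summ = df.Ah.cumsum()
zeros = df.Ah.eq(0)

# keep every zero row, plus the last row of each run of non-zero values
# (the row just before a zero, or the final row of the column)
keep = zeros | zeros.shift(-1, fill_value=True)
kept = summ[keep]

# consecutive differences of the running total give each run's own sum and 0 for
# every zero; the first kept row has no predecessor, so it keeps its running total
lst1 = kept.diff().fillna(kept).to_frame('Ah')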
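The same vectorized idea can also be sketched directly in NumPy, which avoids building intermediate pandas objects. This version assumes the goal is one sum per run of non-zero values with every zero kept, and uses np.add.reduceat to sum each segment; the name collapse_runs_np is hypothetical.

```python
import numpy as np

def collapse_runs_np(values) -> np.ndarray:
    """Sum each run of non-zero values and keep every zero, using NumPy only."""
    a = np.asarray(values)
    if a.size == 0:
        return a.copy()
    is_zero = a == 0
    starts = np.empty(a.size, dtype=bool)
    starts[0] = True
    # A segment starts at each zero and at each element that follows a zero.
    starts[1:] = is_zero[1:] | is_zero[:-1]
    # reduceat sums the array between consecutive segment start positions.
    return np.add.reduceat(a, np.flatnonzero(starts))

print(collapse_runs_np([1, 2, 3, 0, 46, 0, 24, 1]).tolist())  # [6, 0, 46, 0, 25]
```

With a DataFrame, the call would be collapse_runs_np(df.Ah.to_numpy()), and the result can be wrapped in a new DataFrame afterwards.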