how con I speed up this loop where is used iloc?-CodePudding

I have a dataframe with columns related to :Timestamp(sample every ten seconds), Current(Ampere) and Volatage of a battery in a train. I know that if the current is <0 the battery is supplying energy to train , >10 : charging mode , else keeping at 100% of charge. The cycle begins when the battery is discharged and starts suppling energy to train and it fineshes when charge gets back to 100%. I want to to know how much charge ( Current* Hour) the battery provides the train within each cycle.

I'm not allowed to share the data but what I did is pretty much this :

#create empty column for storing the energy 
df['Energy_T2']="";

#extracting data only when energy is being supplied or battery is in keeping mode:
df_fullp=df_full[df_full.HMI_IBatt_T2<10];
# where HMI_IBatt_T2 is the current column

Next step would be to loop in the current column and find where the cycle begins and store in df['Energy_T2'] the amount of energy suplied for each delta time and than sum the energy for each cycle.

data preparation:

c=df_fullp.columns.get_loc('HMI_IBatt_T2');
d=df_fullp.columns.get_loc('Energy_T2');
e=df_fullp.columns.get_loc('Timestamp');
Vehiclegp= df_fullp.groupby(['Vehicle']);

since samples refer to different vehicles with different batteries groupby helps me dividing the column in blocks related to each vehicle.

my attempt with the loop was :

df_fullp.iloc[0,df_fullp.columns.get_loc('Energy_T2')]=0; 

for v in df_fullp.Vehicle.unique():
  s1=(Vehiclegp.get_group(v));
  for i in range(2,df_fullp.shape[0]):
      if s1.iloc[i,c]>=0: 
        if s1.iloc[i 1,c]>=0: 
          df_fullp.iloc[i,d]=0; #we are in keeping mode so zero energy stored
        else: #we store energy as Timedelta*Current(A*h)
          df_fullp.iloc[i,d]= (s1.iloc[i,e]-s1.iloc[i-1,e])*s1.iloc[i,c]/3600;  
      else:
        df_fullp.iloc[i,d]= (s1.iloc[i,e]-s1.iloc[i-1,e])*s1.iloc[i,c]/3600;

Which doesn't converge in a reasonable amount of time and I don't even know if the loop makes sense.

I must say I started coding recently with python, so I don't know how to use much of the libraries available, so is there a way to speed the code considering that the amount of rows is 26900725?

CodePudding user response：

This is not an answer but rather a rough sketch. A correct version will require exploration and testing.

Your code will be easier to wrap your head around if you operate on Dataframes as a whole rather than its individual elements. (aka aggregate operations)

For example...

(s1.iloc[i,e]-s1.iloc[i-1,e]) can be written s1['Timestamp'].diff()

The conditional

      if s1.iloc[i,c]>=0: 
        if s1.iloc[i 1,c]>=0:

can be written s1['HMI_IBatt_T2'] >= 0 & s1['HMI_IBatt_T2'].shift(-1) >= 0

The iteration over vehicles can be done

for v, s1 in Vehiclegp:
   ...

Combined you will end up with something like the following...

df_fullp['Energy_T2'] = 0
for v, s1 in Vehiclegp:
   condition = s1['HMI_IBatt_T2'] >= 0 & s1['HMI_IBatt_T2'].shift(-1) >= 0
   delta_time = s1['Timestamp'].diff()
   s1['Energy_T2'] = s1['Energy_T2'].mask(condition, delta_time*s1['HMI_IBatt_T2']/3600

You will see an improvement in performance because under the hood aggregate operations are fast. Pandas has a bit of a learning curve but once you understand how it operates it is easy to use.

CodePudding user response：

I would first rewrite your code like this

# c=df_fullp.columns.get_loc('HMI_IBatt_T2');
# d=df_fullp.columns.get_loc('Energy_T2');
# e=df_fullp.columns.get_loc('Timestamp');

Vehiclegp= df_fullp.groupby(['Vehicle']);

# all Energy_T2 column equal 0
df_fullp.iloc[:,df_fullp.columns.get_loc('Energy_T2')]=0

#for v in df_fullp.Vehicle.unique():
# you already have groups
for _, vehicle in Vehiclegp:
    vehicle['HMI_IBatt_T2_shift'] = vehicle.HMI_IBatt_T2.shift(-1)
    vehicle['Timestamp_shift'] = vehicle.Timestamp.shift()
  #for i in range(2,df_fullp.shape[0]): # maybe you made a mistake 
                                        #and we should run through each group?
    for row in group.iloc[2:-1].itertuples(): 
        if row.HMI_IBatt_T2 >=0 
            if row.HMI_IBatt_T2_shift < 0:
        #if s1.iloc[i 1,c]>=0: 
        #  df_fullp.iloc[i,d]=0; #we are in keeping mode so zero energy stored
        #we store energy as Timedelta*Current(A*h)
                df_fullp.loc[row.Index,'Energy_T2']= (row.Timestamp - row.Timestamp_shift)*row.HMI_IBatt_T2/3600;  
        else:
            df_fullp.loc[row.Index,'Energy_T2']= (row.Timestamp - row.Timestamp_shift)*row.HMI_IBatt_T2/3600;

It can also be vectorized and do without a loop (if I understand correctly what you need to get in the end)

    def foo(vehicle):
        vehicle['HMI_IBatt_T2_shift'] = vehicle.HMI_IBatt_T2.shift(-1)
        vehicle['Timestamp_shift'] = vehicle.Timestamp.shift()
        vehicle = vehicle.dropna()
        mask = (vehicle.HMI_IBatt_T2 < 0)|(vehicle.HMI_IBatt_T2 >= 0 & vehicle.HMI_IBatt_T2_shift<0)
        vehicle['Energy_T2'] = (vehicle.Timestamp - vehicle.Timestamp_shift)*vehicle.HMI_IBatt_T2/3600
        vehicle.loc[~mask, 'Energy_T2'] = 0
        return vehicle
   
result = df_fullp.groupby(['Vehicle']).apply(foo)

Moving further in the direction of improving performance, we can use parallel-pandas to parallelize calculations across all the cores of your CPU

# pip install parallel-pandas
from parallel_pandas import ParallelPandas

ParallelPandas.initialize(n_cpu=8)

#p_apply is parallel analogue of apply method
result = df_fullp.groupby(['Vehicle']).p_apply(foo)