I have a dataframe with columns related to :Timestamp(sample every ten seconds), Current(Ampere) and Volatage of a battery in a train. I know that if the current is <0 the battery is supplying energy to train , >10 : charging mode , else keeping at 100% of charge. The cycle begins when the battery is discharged and starts suppling energy to train and it fineshes when charge gets back to 100%. I want to to know how much charge ( Current* Hour) the battery provides the train within each cycle.
I'm not allowed to share the data but what I did is pretty much this :
#create empty column for storing the energy
df['Energy_T2']="";
#extracting data only when energy is being supplied or battery is in keeping mode:
df_fullp=df_full[df_full.HMI_IBatt_T2<10];
# where HMI_IBatt_T2 is the current column
Next step would be to loop in the current column and find where the cycle begins and store in df['Energy_T2'] the amount of energy suplied for each delta time and than sum the energy for each cycle.
data preparation:
c=df_fullp.columns.get_loc('HMI_IBatt_T2');
d=df_fullp.columns.get_loc('Energy_T2');
e=df_fullp.columns.get_loc('Timestamp');
Vehiclegp= df_fullp.groupby(['Vehicle']);
since samples refer to different vehicles with different batteries groupby helps me dividing the column in blocks related to each vehicle.
my attempt with the loop was :
df_fullp.iloc[0,df_fullp.columns.get_loc('Energy_T2')]=0;
for v in df_fullp.Vehicle.unique():
s1=(Vehiclegp.get_group(v));
for i in range(2,df_fullp.shape[0]):
if s1.iloc[i,c]>=0:
if s1.iloc[i 1,c]>=0:
df_fullp.iloc[i,d]=0; #we are in keeping mode so zero energy stored
else: #we store energy as Timedelta*Current(A*h)
df_fullp.iloc[i,d]= (s1.iloc[i,e]-s1.iloc[i-1,e])*s1.iloc[i,c]/3600;
else:
df_fullp.iloc[i,d]= (s1.iloc[i,e]-s1.iloc[i-1,e])*s1.iloc[i,c]/3600;
Which doesn't converge in a reasonable amount of time and I don't even know if the loop makes sense.
I must say I started coding recently with python, so I don't know how to use much of the libraries available, so is there a way to speed the code considering that the amount of rows is 26900725?
CodePudding user response:
This is not an answer but rather a rough sketch. A correct version will require exploration and testing.
Your code will be easier to wrap your head around if you operate on Dataframes as a whole rather than its individual elements. (aka aggregate operations)
For example...
(s1.iloc[i,e]-s1.iloc[i-1,e])
can be written s1['Timestamp'].diff()
The conditional
if s1.iloc[i,c]>=0:
if s1.iloc[i 1,c]>=0:
can be written s1['HMI_IBatt_T2'] >= 0 & s1['HMI_IBatt_T2'].shift(-1) >= 0
The iteration over vehicles can be done
for v, s1 in Vehiclegp:
...
Combined you will end up with something like the following...
df_fullp['Energy_T2'] = 0
for v, s1 in Vehiclegp:
condition = s1['HMI_IBatt_T2'] >= 0 & s1['HMI_IBatt_T2'].shift(-1) >= 0
delta_time = s1['Timestamp'].diff()
s1['Energy_T2'] = s1['Energy_T2'].mask(condition, delta_time*s1['HMI_IBatt_T2']/3600
You will see an improvement in performance because under the hood aggregate operations are fast. Pandas has a bit of a learning curve but once you understand how it operates it is easy to use.
CodePudding user response:
I would first rewrite your code like this
# c=df_fullp.columns.get_loc('HMI_IBatt_T2');
# d=df_fullp.columns.get_loc('Energy_T2');
# e=df_fullp.columns.get_loc('Timestamp');
Vehiclegp= df_fullp.groupby(['Vehicle']);
# all Energy_T2 column equal 0
df_fullp.iloc[:,df_fullp.columns.get_loc('Energy_T2')]=0
#for v in df_fullp.Vehicle.unique():
# you already have groups
for _, vehicle in Vehiclegp:
vehicle['HMI_IBatt_T2_shift'] = vehicle.HMI_IBatt_T2.shift(-1)
vehicle['Timestamp_shift'] = vehicle.Timestamp.shift()
#for i in range(2,df_fullp.shape[0]): # maybe you made a mistake
#and we should run through each group?
for row in group.iloc[2:-1].itertuples():
if row.HMI_IBatt_T2 >=0
if row.HMI_IBatt_T2_shift < 0:
#if s1.iloc[i 1,c]>=0:
# df_fullp.iloc[i,d]=0; #we are in keeping mode so zero energy stored
#we store energy as Timedelta*Current(A*h)
df_fullp.loc[row.Index,'Energy_T2']= (row.Timestamp - row.Timestamp_shift)*row.HMI_IBatt_T2/3600;
else:
df_fullp.loc[row.Index,'Energy_T2']= (row.Timestamp - row.Timestamp_shift)*row.HMI_IBatt_T2/3600;
It can also be vectorized and do without a loop (if I understand correctly what you need to get in the end)
def foo(vehicle):
vehicle['HMI_IBatt_T2_shift'] = vehicle.HMI_IBatt_T2.shift(-1)
vehicle['Timestamp_shift'] = vehicle.Timestamp.shift()
vehicle = vehicle.dropna()
mask = (vehicle.HMI_IBatt_T2 < 0)|(vehicle.HMI_IBatt_T2 >= 0 & vehicle.HMI_IBatt_T2_shift<0)
vehicle['Energy_T2'] = (vehicle.Timestamp - vehicle.Timestamp_shift)*vehicle.HMI_IBatt_T2/3600
vehicle.loc[~mask, 'Energy_T2'] = 0
return vehicle
result = df_fullp.groupby(['Vehicle']).apply(foo)
Moving further in the direction of improving performance, we can use parallel-pandas to parallelize calculations across all the cores of your CPU
# pip install parallel-pandas
from parallel_pandas import ParallelPandas
ParallelPandas.initialize(n_cpu=8)
#p_apply is parallel analogue of apply method
result = df_fullp.groupby(['Vehicle']).p_apply(foo)