I was wondering if someone could help me find a more efficiency way to run my code.
I have a dataset contains 7 columns, which are country,sector,year,month,week,weekday,value.
the year column have only 3 elements, 2019,2020,2021
What I have to do here is to substract every value in 2020 and 2021 from 2019. But its more complicated that I need to match the weekday columns.
For example,i need to use year 2020, month 1, week 1, weekday 0(monday) value to substract, year 2019, month 1, week 1, weekday 0(monday) value, if cant find it, it will pass, and so on, which means, the weekday(monday,Tuesaday....must be matched)
And here is my code, it can run, but it tooks me hours:(
for i in itertools.product(year_list,country_list, sector_list,month_list,week_list,weekday_list):
try:
data_2 = df_carbon[(df_carbon['country'] == i[1])
& (df_carbon['sector'] == i[2])
& (df_carbon['year'] == i[0])
& (df_carbon['month'] == i[3])
& (df_carbon['week'] == i[4])
& (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
data_1 = df_carbon[(df_carbon['country'] == i[1])
& (df_carbon['sector'] == i[2])
& (df_carbon['year'] == 2019)
& (df_carbon['month'] == i[3])
& (df_carbon['week'] == i[4])
& (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
co2.append(data_2-data_1)
country.append(i[1])
sector.append(i[2])
year.append(i[0])
month.append(i[3])
week.append(i[4])
weekday.append(i[5])
except:
pass
I changed the for loops to itertools, but it still not fast enough, any other ideas? many thanks:)
############################## here is the sample dataset
country co2 sector date week weekday year month
Brazil 108.767782 Power 2019-01-01 0 1 2019 1
China 14251.044482 Power 2019-01-01 0 1 2019 1
EU27 & UK 1886.493814 Power 2019-01-01 0 1 2019 1
France 53.856398 Power 2019-01-01 0 1 2019 1
Germany 378.323440 Power 2019-01-01 0 1 2019 1
Japan 21.898788 IA 2021-11-30 48 1 2021 11
Russia 19.773822 IA 2021-11-30 48 1 2021 11
Spain 42.293944 IA 2021-11-30 48 1 2021 11
UK 56.425121 IA 2021-11-30 48 1 2021 11
US 166.425000 IA 2021-11-30 48 1 2021 11
or this
import pandas as pd
pd.DataFrame({
'year': [2019, 2020, 2021],
'co2': [1,2,3],
'country': ['Brazil', 'Brazil', 'Brazil'],
'sector': ['power', 'power', 'power'],
'month': [1, 1, 1],
'week': [0,0,0],
'weekday': [0,0,0]
})
CodePudding user response:
pandas
can subtract two dataframe index-by-index, so the idea would be to separate your data into a minuend and a subtrahend, set ['country', 'sector', 'month', 'week', 'weekday']
as their indices, just subtract them, and remove rows (by dropna
) where a match in year 2019 is not found.
df_carbon = pd.DataFrame({
'year': [2019, 2020, 2021],
'co2': [1,2,3],
'country': ['ab', 'ab', 'bc']
})
index = ['country']
# index = ['country', 'sector', 'month', 'week', 'weekday']
df_2019 = df_carbon[df_carbon['year']==2019].set_index(index)
df_rest = df_carbon[df_carbon['year']!=2019].set_index(index)
ans = (df_rest - df_2019).reset_index().dropna()
ans['year'] = 2019
Two additional points:
In this subtraction the
year
is also covered, so I need to add2019
back.I created a small example of
df_carbon
to test my code. If you had provided a more realistic version in text form, I would have tested my code using your data.