Python/increase code efficiency about multiple columns filter-CodePudding

I was wondering if someone could help me find a more efficiency way to run my code.

I have a dataset contains 7 columns, which are country,sector,year,month,week,weekday,value.

the year column have only 3 elements, 2019,2020,2021

What I have to do here is to substract every value in 2020 and 2021 from 2019. But its more complicated that I need to match the weekday columns.

For example,i need to use year 2020, month 1, week 1, weekday 0(monday) value to substract, year 2019, month 1, week 1, weekday 0(monday) value, if cant find it, it will pass, and so on, which means, the weekday(monday,Tuesaday....must be matched)

And here is my code, it can run, but it tooks me hours:(

for i in itertools.product(year_list,country_list, sector_list,month_list,week_list,weekday_list):
    try:
        data_2 = df_carbon[(df_carbon['country'] == i[1]) 
                  & (df_carbon['sector'] == i[2]) 
                  & (df_carbon['year'] == i[0]) 
                  & (df_carbon['month'] == i[3]) 
                  & (df_carbon['week'] == i[4]) 
                  & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        data_1 = df_carbon[(df_carbon['country'] == i[1]) 
                  & (df_carbon['sector'] == i[2]) 
                  & (df_carbon['year'] == 2019) 
                  & (df_carbon['month'] == i[3]) 
                  & (df_carbon['week'] == i[4]) 
                  & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        co2.append(data_2-data_1)
        country.append(i[1])
        sector.append(i[2])
        year.append(i[0])
        month.append(i[3])
        week.append(i[4])
        weekday.append(i[5])
    except:
        pass

I changed the for loops to itertools, but it still not fast enough, any other ideas? many thanks:)

############################## here is the sample dataset


    country  co2  sector  date  week  weekday  year  month
    Brazil  108.767782  Power   2019-01-01  0   1   2019    1
    China   14251.044482    Power   2019-01-01  0   1   2019    1
    EU27 & UK   1886.493814  Power  2019-01-01  0   1   2019    1
    France  53.856398   Power   2019-01-01  0   1   2019    1
    Germany  378.323440  Power   2019-01-01  0   1   2019    1
    Japan   21.898788   IA  2021-11-30  48  1   2021    11
    Russia  19.773822   IA  2021-11-30  48  1   2021    11
    Spain   42.293944   IA  2021-11-30  48  1   2021    11
    UK  56.425121   IA  2021-11-30  48  1   2021    11
    US  166.425000  IA  2021-11-30  48  1   2021    11

or this

import pandas as pd
pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1,2,3],
    'country': ['Brazil', 'Brazil', 'Brazil'],
    'sector': ['power', 'power', 'power'],
    'month': [1, 1, 1],
    'week': [0,0,0],
    'weekday': [0,0,0]
})

CodePudding user response：

pandas can subtract two dataframe index-by-index, so the idea would be to separate your data into a minuend and a subtrahend, set ['country', 'sector', 'month', 'week', 'weekday'] as their indices, just subtract them, and remove rows (by dropna) where a match in year 2019 is not found.

df_carbon = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1,2,3],
    'country': ['ab', 'ab', 'bc']
})

index = ['country']
# index = ['country', 'sector', 'month', 'week', 'weekday']

df_2019 = df_carbon[df_carbon['year']==2019].set_index(index)
df_rest = df_carbon[df_carbon['year']!=2019].set_index(index)

ans = (df_rest - df_2019).reset_index().dropna()
ans['year']  = 2019

Two additional points:

In this subtraction the year is also covered, so I need to add 2019 back.
I created a small example of df_carbon to test my code. If you had provided a more realistic version in text form, I would have tested my code using your data.