Python - Filtering conditions efficiently-CodePudding

I have this data:

df = pd.DataFrame({'date': {0: '2012-03-02',  1: '2012-04-01',  2: '2012-04-12',  3: '2012-04-14',  4: '2012-04-21',  5: '2012-05-12',  6: '2012-06-23',  7: '2012-06-25',  8: '2012-06-26',  9: '2012-06-27'},
                   'type': {0: 'Holiday',  1: 'Other',  2: 'Event',  3: 'Holiday',  4: 'Event',  5: 'Holiday',  6: 'Other',  7: 'Holiday',  8: 'Holiday',  9: 'Event'},
                   'locale': {0: 'Local',  1: 'Regional',  2: 'National',  3: 'Local',  4: 'National',  5: 'National',  6: 'National',  7: 'Regional',  8: 'Regional',  9: 'Local'}})

And this function:

def preprocess(data_input):
    data = data_input.copy()
    
    conditions = [
        (data['type'] ==  'Holiday') & (data.locale == 'National'),
        (data['type'] ==  'Holiday') & (data.locale == 'Regional'),
        (data['type'] ==  'Holiday') & (data.locale == 'Local')
        ]
    values = [3,2,1]

    data.loc[:,'Holiday'] = np.select(conditions, values, default=0)
    
    
    conditions = [
        (data['type'] ==  'Event') & (data.locale == 'National'),
        (data['type'] ==  'Event') & (data.locale == 'Regional'),
        (data['type'] ==  'Event') & (data.locale == 'Local')
        ]
    values = [3,2,1]

    data.loc[:,'Events'] = np.select(conditions, values, default=0)

    return(data)

When I run this function over df I will get:

preprocess(df)

Which is the expected result. However, I'm wondering about efficiency and best practices. I feel that both conditions use too many lines of code and that they could be written more elegant and efficiently, but I can't figure out how. I tried with np.where() but I struggled when I found out I would have to filter the data frames in each argument.

Any suggestions?

CodePudding user response：

Overall I think your code looks fine, it's readable and easily understandable. In terms of efficiency, you're using boolean masks for your conditions and not committing any sort of pandas performance anti-pattern that I can see.

One major-ish pandas issue is that you have a redundant call to copy(). This isn't cheap, it results in Python creating a deep copy of your whole dataframe. It's also not necessary, you can just return the modified df rather than making a new copy the whole time and returning it. This is pretty standard style in pandas, it's why pandas operations usually look like df = df.do_a_thing(). The new value of df is returned via do_a_thing() rather than modified in place. (The in_place flag exists for a lot of methods, but it isn't the default for a reason).

One nitpick I'd say is to be explicit when using categorical variables. Your values variable is a categorical variable that's derived conditionally from the dataframe, but you are treating it as a series of integers.

In terms of general software engineering, there is one clear area of improvement that I can see. Your preprocess method is duplicating code for a singular purpose. I have refactored it such that the preprocess method takes the type_column (e.g. 'Holiday' and 'Event') as a parameter, and applies the transformation after that. Now if you have an issue with your preprocess logic, or want to make a change, you'll only need to do it in one area rather than two.

df = pd.DataFrame({'date': {0: '2012-03-02',  1: '2012-04-01',  2: '2012-04-12',  3: '2012-04-14',  4: '2012-04-21',  5: '2012-05-12',  6: '2012-06-23',  7: '2012-06-25',  8: '2012-06-26',  9: '2012-06-27'},
                   'type': {0: 'Holiday',  1: 'Other',  2: 'Event',  3: 'Holiday',  4: 'Event',  5: 'Holiday',  6: 'Other',  7: 'Holiday',  8: 'Holiday',  9: 'Event'},
                   'locale': {0: 'Local',  1: 'Regional',  2: 'National',  3: 'Local',  4: 'National',  5: 'National',  6: 'National',  7: 'Regional',  8: 'Regional',  9: 'Local'}})


def preprocess(data, type_column):
    conditions = [
        (data['type'] == type_column) & (data.locale == 'National'),
        (data['type'] == type_column) & (data.locale == 'Regional'),
        (data['type'] == type_column) & (data.locale == 'Local')
    ]
    # These values are categorical variables, (i.e. it's dependent on the combination of two columns).
    # Better to be explicit about what type they are rather than leaving them as ints.
    values = pd.Categorical([3, 2, 1])

    data.loc[:, type_column] = np.select(conditions, values, default=0)

    return data


df = preprocess(df, 'Holiday')
df = preprocess(df, 'Event')
print(df)

CodePudding user response：

You can try Series.map with Series.where

locale_map = {'National': '3',
              'Regional': '2',
              'Local': '1'}

df['Holiday'] = df['locale'].map(locale_map).where(df['type'].eq('Holiday'), 0)
df['Events'] = df['locale'].map(locale_map).where(df['type'].eq('Event'), 0)

print(df)

         date     type    locale Holiday Events
0  2012-03-02  Holiday     Local       1      0
1  2012-04-01    Other  Regional       0      0
2  2012-04-12    Event  National       0      3
3  2012-04-14  Holiday     Local       1      0
4  2012-04-21    Event  National       0      3
5  2012-05-12  Holiday  National       3      0
6  2012-06-23    Other  National       0      0
7  2012-06-25  Holiday  Regional       2      0
8  2012-06-26  Holiday  Regional       2      0
9  2012-06-27    Event     Local       0      1