Create a new column with date ranges

Time:10-24

I am trying to create a new column in a data frame based on dates and date ranges. I am a beginner, and have tried several of the answers here but I cannot get them to work. Here is my original code:

    df_accident["bank_holidays_2010"] = df_accident["date"].map(
                 {'Easter': [d.strftime('%d/%m/%Y') for d in pd.date_range('02/04/2010', periods=4)],
                  'Mayday': [d.strftime('%d/%m/%Y') for d in pd.date_range('03/05/2010', periods=1)],
                  'Spring Bank Holiday': [d.strftime('%d/%m/%Y') for d in pd.date_range('31/05/2010', periods=1)],
                  'Summer Bank Holiday': [d.strftime('%d/%m/%Y') for d in pd.date_range('30/08/2010', periods=1)],
                  'Christmas and New Year': [d.strftime('%d/%m/%Y') for d in pd.date_range('25/12/2010', periods=9)]
                  }
                                             )

and repeated for other years. I get an error:

"None of [Index(['Easter', 'Mayday', 'Spring Bank Holiday', 'Summer Bank Holiday', 'Christmas and New Year'], dtype='object')] are in the [columns]"

I have tried:

hols = {'Easter': [d.strftime('%d/%m/%Y') for d in pd.date_range('02/04/2010', periods=4)],
                  'Mayday': [d.strftime('%d/%m/%Y') for d in pd.date_range('03/05/2010', periods=1)],
                  'Spring Bank Holiday': [d.strftime('%d/%m/%Y') for d in pd.date_range('31/05/2010', periods=1)],
                  'Summer Bank Holiday': [d.strftime('%d/%m/%Y') for d in pd.date_range('30/08/2010', periods=1)],
                  'Christmas and New Year': [d.strftime('%d/%m/%Y') for d in pd.date_range('25/12/2010', periods=9)]
                  }

and:

bank_holidays_2010 = {'Easter': ('02/04/2010', '03/04/2010', '05/04/2010', '06/04/2010'),
                  'Mayday': ('03/05/2010'),
                  'Spring Bank Holiday': ('31/05/2010'),
                  'Summer Bank Holiday': ('30/08/2010'),
                  'Christmas and New Year': ('25/12/2010', '26/12/2010', '27/12/2010', '28/12/2010', '29/12/2010', '30/12/2010', '31/12/2010', '01/01/2011', '02/01/2011')
                  }

Both return the same error.

What I would like to achieve is values for the dates and date ranges:

df_accident['bank_holidays_2010'].value_counts()

Easter                  466921
Mayday                  301039
Spring Bank Holiday     132195
Christmas and New Year  92931

CodePudding user response:

You need to invert your dictionary so that each date string becomes a key and the holiday name becomes its value. The map can then look up each value in the date column:

d = {
    'Easter': [d.strftime('%d/%m/%Y') for d in pd.date_range(pd.to_datetime('02/04/2010', dayfirst=True), periods=4)],
    'Mayday': [d.strftime('%d/%m/%Y') for d in pd.date_range(pd.to_datetime('03/05/2010', dayfirst=True), periods=1)],
    'Spring Bank Holiday': [d.strftime('%d/%m/%Y') for d in pd.date_range(pd.to_datetime('31/05/2010', dayfirst=True), periods=1)],
    'Summer Bank Holiday': [d.strftime('%d/%m/%Y') for d in pd.date_range(pd.to_datetime('30/08/2010', dayfirst=True), periods=1)],
    'Christmas and New Year': [d.strftime('%d/%m/%Y') for d in pd.date_range(pd.to_datetime('25/12/2010', dayfirst=True), periods=9)]
}

d = {v:k for k, lst in d.items() for v in lst}
print(d)

{'02/04/2010': 'Easter', '03/04/2010': 'Easter', '04/04/2010': 'Easter', '05/04/2010': 'Easter', '03/05/2010': 'Mayday', '31/05/2010': 'Spring Bank Holiday', '30/08/2010': 'Summer Bank Holiday', '25/12/2010': 'Christmas and New Year', '26/12/2010': 'Christmas and New Year', '27/12/2010': 'Christmas and New Year', '28/12/2010': 'Christmas and New Year', '29/12/2010': 'Christmas and New Year', '30/12/2010': 'Christmas and New Year', '31/12/2010': 'Christmas and New Year', '01/01/2011': 'Christmas and New Year', '02/01/2011': 'Christmas and New Year'}
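The inverted dictionary can then be passed to map. A minimal sketch, assuming the date column holds dd/mm/YYYY strings (the sample rows here are made up for illustration, and only a subset of the dictionary is shown):

```python
import pandas as pd

# Hypothetical sample data: dates stored as dd/mm/YYYY strings.
df_accident = pd.DataFrame({"date": ["02/04/2010", "03/05/2010", "15/06/2010"]})

# A small subset of the inverted dictionary printed above.
d = {"02/04/2010": "Easter", "03/05/2010": "Mayday"}

# Series.map looks up each value in the dict; dates with no entry become NaN.
df_accident["bank_holidays_2010"] = df_accident["date"].map(d)
print(df_accident)
```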

CodePudding user response:

Your first problem is that you're using dates in dd/mm/yyyy format, which will not necessarily give you the correct results from pd.date_range - for example

pd.date_range('02/04/2010', periods=4)

gives

['2010-02-04', '2010-02-05', '2010-02-06', '2010-02-07']

which is not what you want...

You should always use dates in ISO format (YYYY-MM-DD); you can format them on output if required.
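As a small sketch of that advice, the Easter range can be generated from an ISO string and only converted to dd/mm/YYYY text at the end (the variable names are just for illustration):

```python
import pandas as pd

# ISO 8601 (YYYY-MM-DD) strings are parsed unambiguously as 2 April 2010.
easter = pd.date_range("2010-04-02", periods=4)

# Format as dd/mm/YYYY only at the point where strings are needed.
easter_strings = [ts.strftime("%d/%m/%Y") for ts in easter]
print(easter_strings)
# ['02/04/2010', '03/04/2010', '04/04/2010', '05/04/2010']
```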

Your second issue is that the map is the wrong way around: it is supposed to be of the form { from : to }, so the dates must be the keys and the holiday names the values. You can generate a new dict which meets that requirement like this:

date_hols = { date : holiday for holiday, dates in hols.items() for date in dates }

With hols built from correctly parsed dates, this gives:

{'02/04/2010': 'Easter', '03/04/2010': 'Easter', '04/04/2010': 'Easter', '05/04/2010': 'Easter', '03/05/2010': 'Mayday', '31/05/2010': 'Spring Bank Holiday', '30/08/2010': 'Summer Bank Holiday', '25/12/2010': 'Christmas and New Year', '26/12/2010': 'Christmas and New Year', '27/12/2010': 'Christmas and New Year', '28/12/2010': 'Christmas and New Year', '29/12/2010': 'Christmas and New Year', '30/12/2010': 'Christmas and New Year', '31/12/2010': 'Christmas and New Year', '01/01/2011': 'Christmas and New Year', '02/01/2011': 'Christmas and New Year'}

You can then apply this map, filling entries which don't map with an empty string:

df_accident['bank_holidays_2010'] = df_accident['date'].map(date_hols).fillna('')
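One thing to watch for: the keys of date_hols are strings, so the lookup only matches if the date column also holds strings in the same format. If the column is actually a datetime dtype (an assumption, since the question doesn't show it), a sketch like the following formats it first; the sample data is invented:

```python
import pandas as pd

# Hypothetical sample data with a real datetime column.
df_accident = pd.DataFrame({"date": pd.to_datetime(["2010-04-02", "2010-06-15"])})

# Subset of the date -> holiday dictionary, keyed by dd/mm/YYYY strings.
date_hols = {"02/04/2010": "Easter"}

# String keys only match string values, so format the datetimes before mapping.
as_text = df_accident["date"].dt.strftime("%d/%m/%Y")
df_accident["bank_holidays_2010"] = as_text.map(date_hols).fillna("")
print(df_accident)
```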

Now you can use groupby to aggregate per holiday (I've assumed we're summing a column called num):

df_accident.groupby('bank_holidays_2010')['num'].sum().reset_index()

Output for random num data:

       bank_holidays_2010   num
0                          3489
1  Christmas and New Year    51
2                  Easter    42
3                  Mayday    12
4     Spring Bank Holiday     0
5     Summer Bank Holiday    10
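If what you want is the number of rows per holiday, as in the value_counts() call from the question, something along these lines may be closer. It is only a sketch on invented sample rows, and it filters out the empty-string filler before counting:

```python
import pandas as pd

# Hypothetical sample data: dates stored as dd/mm/YYYY strings.
df_accident = pd.DataFrame(
    {"date": ["02/04/2010", "03/04/2010", "03/05/2010", "15/06/2010"]}
)

# Subset of the date -> holiday dictionary.
date_hols = {"02/04/2010": "Easter", "03/04/2010": "Easter", "03/05/2010": "Mayday"}

df_accident["bank_holidays_2010"] = df_accident["date"].map(date_hols).fillna("")

# Keep only rows that fall on a holiday, then count rows per holiday name.
is_holiday = df_accident["bank_holidays_2010"] != ""
counts = df_accident.loc[is_holiday, "bank_holidays_2010"].value_counts()
print(counts)
```

Skipping the fillna('') step would have a similar effect, because value_counts() ignores NaN by default.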