Home > Software design >  How to extract component from a pandas datetime column and assign them
How to extract component from a pandas datetime column and assign them

Time:11-30

The following code for getting the week number and year works:

import pandas as pd

df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])

df['weekNo'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year

        date  weekNo  year
0 2021-12-05      48  2021
1 2021-12-12      49  2021
2 2021-12-19      50  2021
3 2021-12-26      51  2021
4 2022-01-02      52  2022
5 2022-01-09       1  2022
6 2022-01-16       2  2022
7 2022-01-23       3  2022
8 2022-01-30       4  2022
9 2022-02-06       5  2022

but,

df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)

Gives the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26440/999845293.py in <module>
----> 1 df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)

TypeError: %d format: a number is required, not Series

I am accessing the week and year in a way that accesses the series of values, as shown by the first code snippet. Why doesn't that work when I want a formatted string? How do I re-write the code in snippet 2, to make it work? I don't want to make intermediate columns.

CodePudding user response:

  • Why doesn't that work when I want a formatted string? The error is clear, because '%d' expects a single decimal value, not a pandas.Series
  • Providing there is a format code for the value to be extracted, dt.strftime can be used.
    • This requires the 'date' column to be a datetime dtype, which can be done with pd.to_datetime. The column in the following example is already the correct dtype.
    • '%V': ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
    • '%Y': Year with century as a decimal number.
import pandas as pd

# sample data
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])

# add week number and year
df['weekYear'] = df.date.dt.strftime('%V/%Y')

# display(df)
        date weekYear
0 2021-12-05  48/2021
1 2021-12-12  49/2021
2 2021-12-19  50/2021
3 2021-12-26  51/2021
4 2022-01-02  52/2022
5 2022-01-09  01/2022
6 2022-01-16  02/2022
7 2022-01-23  03/2022
8 2022-01-30  04/2022
9 2022-02-06  05/2022

Timing for 1M rows

df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])

%%timeit
df.date.dt.strftime('%V/%Y')

[out]: 3.74 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

You can just use:

df['weekYear'] = df['date'].dt.isocalendar().week.astype(str)   '/'    df['date'].dt.year.astype(str)

Or using pandas.Series.str.cat

df['weekYear'] = df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')

Or using list comprehension

df['weekYear'] = [f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]

Timing for 1M rows

df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])

%%timeit
df['date'].dt.isocalendar().week.astype(str)   '/'    df['date'].dt.year.astype(str)

[out]: 886 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')

[out]: 965 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
[f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]

[out]: 587 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

If you want to use the formatting, can use map to get that map or apply the formatting to every road, the .dt is not needed since you will be working with date itself, not Series of dates. Also isocalendar() returns a tuple where second element is the week number:

df["date"] = pd.to_datetime(df["date"]) 
df['weekYear'] =  df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))

Timing for 1M rows

df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])

%%timeit
df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))

[out]: 2.03 s ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

  • There are clearly a number of ways this can be solved, so a timing comparison is the best way to determine which is the "best" answer.
  • Here's a single implementation for anyone to run a timing analysis in Jupyter of all the current answers.
  • See this answer to modify the code to create a timing analysis plot with a varying number of rows.
  • See IPython: %timeit for the option descriptions.
import pandas as pd

# sample data with 60M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='s', periods=60000000), columns=['date'])

# functions
def test1(d):
    return d.date.dt.strftime('%V/%Y')


def test2(d):
    return d['date'].dt.isocalendar().week.astype(str)   '/'    d['date'].dt.year.astype(str)


def test3(d):
    return d['date'].dt.isocalendar().week.astype(str).str.cat(d['date'].dt.year.astype(str), sep='/')


def test4(d):
    return [f"{week}/{year}" for week, year in zip(d['date'].dt.isocalendar().week, d['date'].dt.year)]


def test5(d):
    return d['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
t1 = %timeit -r2 -n1 -q -o test1(df)
t2 = %timeit -r2 -n1 -q -o test2(df)
t3 = %timeit -r2 -n1 -q -o test3(df)
t4 = %timeit -r2 -n1 -q -o test4(df)
t5 = %timeit -r2 -n1 -q -o test5(df)

print(f'test1 result: {t1}')
print(f'test2 result: {t2}')
print(f'test3 result: {t3}')
print(f'test4 result: {t4}')
print(f'test5 result: {t5}')

test1 result: 3min 45s ± 653 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test2 result: 53.4 s ± 459 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test3 result: 59.7 s ± 164 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test4 result: 35.5 s ± 409 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test5 result: 2min 2s ± 29.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
  • Related