How to extract year and month from string in a dataframe

Time:10-11

1. Question

I have a dataframe, and the Year-Month column contains the year and month which I want to extract.

For example, an element in this column is "2022-10", and I want to extract year=2022 and month=10 from it.

My current solution uses apply with a lambda function:

df['xx_month'] = df['Year-Month'].apply(lambda x: int(x.split('-')[1]))

But it's very slow on a huge dataframe.

How can I do it more efficiently?

2. Solutions

Thanks for your wisdom. I have summarized each person's solution with its code:

  • (1) split the string and join the new columns

import pandas as pd
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})

df = df.join(
    df['Year-Month']
    .str.split('-', expand=True)
    .set_axis(['year','month'], axis='columns')
)
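One detail worth noting about the split approach: `str.split` returns strings, so the new columns hold text such as '10' rather than the integer 10 that the question asks for. A minimal sketch of the same idea with an integer cast added (the intermediate name `parts` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Year-Month': ['2022-10', '2022-11', '2022-12']})

# expand=True yields one column per piece; the pieces are still strings
parts = df['Year-Month'].str.split('-', expand=True)

# Cast to integers and name the columns before joining them back
parts = parts.astype(int)
parts.columns = ['year', 'month']
df = df.join(parts)

print(df.dtypes)
```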
  • (2) convert the datatype from object (str) into datetime format @Neele22

import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
  • (3) use regex or datetime to extract year and month @mozway

df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
# If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)

Or use datetime:

date = pd.to_datetime(df['Year-Month'])

df['year'] = date.dt.year
df['month'] = date.dt.month
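Since the original question is about speed, it may help to time the candidates on your own data rather than assume which one wins. The sketch below is only a rough harness: the helper names (`with_apply`, `with_split`, `with_regex`, `with_datetime`) and the row count are made up for illustration, and the numbers will vary with the pandas version and the machine.

```python
import timeit

import pandas as pd

# Hypothetical test data: the three sample values repeated many times
series = pd.Series(['2022-10', '2022-11', '2022-12'] * 30_000, name='Year-Month')


def with_apply(s):
    # The approach from the question
    return s.apply(lambda x: int(x.split('-')[1]))


def with_split(s):
    return s.str.split('-', expand=True)[1].astype(int)


def with_regex(s):
    # expand=False returns a Series because there is a single capture group
    return s.str.extract(r'\d+-(\d+)', expand=False).astype(int)


def with_datetime(s):
    return pd.to_datetime(s, format='%Y-%m').dt.month


for fn in (with_apply, with_split, with_regex, with_datetime):
    seconds = timeit.timeit(lambda: fn(series), number=3)
    print(f'{fn.__name__}: {seconds:.3f}s for 3 runs')
```

All four helpers extract only the month, so their results can be compared directly before trusting the timings.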

3. Follow up question

But there is a problem if I want to subtract the 'Year-Month' column from other datetime columns after converting the incomplete 'Year-Month' column from string to datetime.

For example, say I want the records whose timestamp is no later than 2 months after their Year-Month.

import dateutil # dateutil is a better package than datetime package according to my experience
df[(df['timestamp'] - df['Year-Month'])>= dateutil.relativedelta.relativedelta(months=0) and (df['timestamp'] - df['Year-Month'])<= datetime.timedelta(months=2)]

This code raises a TypeError when subtracting the converted Year-Month column from the actual datetime column:

TypeError: Cannot subtract tz-naive and tz-aware datetime-like objects

The types for these two columns are:

  • Year-Month is datetime64[ns]
  • timestamp is datetime64[ns, UTC]

Then, I tried to specify utc=True when changing Year-Month to datetime type:

df[["Year-Month"]] = pd.to_datetime(df[["Year-Month"]],utc=True,format="%Y-%m")

But I got a ValueError:

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
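A possible explanation for this ValueError: `df[["Year-Month"]]` (double brackets) is a DataFrame, and when `pd.to_datetime` receives a DataFrame it tries to assemble dates from columns named year, month and day. Passing the column as a Series (single brackets) avoids that. The sketch below uses made-up sample data and hypothetical names (`start`, `end`, `within`), and uses `pd.DateOffset(months=2)` for the two-month window, which is my substitution and not the code from the question.

```python
import pandas as pd

# Hypothetical sample data; 'timestamp' is tz-aware (UTC) as in the question
df = pd.DataFrame({
    'Year-Month': ['2022-10', '2022-11'],
    'timestamp': pd.to_datetime(['2022-11-15 08:00', '2023-03-01 00:00'], utc=True),
})

# Single brackets pass a Series, so the format string is applied to each value.
# A missing day defaults to the first of the month.
start = pd.to_datetime(df['Year-Month'], format='%Y-%m', utc=True)

# Calendar-aware upper bound: two months after the start of each Year-Month
end = start + pd.DateOffset(months=2)

# Keep rows whose timestamp falls inside [start, end]
within = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]
print(within)
```

Both sides of each comparison are now tz-aware, so the tz-naive versus tz-aware TypeError should not occur. Another option would be to drop the timezone from `timestamp` with `dt.tz_localize(None)` and compare naive values instead.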

4. Take away

  • If the elements of a column do not contain a complete [day, month, year] (in my case, only year and month), converting the column from string to datetime for calculations against other datetime columns did not work for me. I used the extracted year and month for the calculations instead.

  • If you don't need calculations between the incomplete datetime column and other datetime columns, you can convert the incomplete datetime string to datetime type and extract [year, month] from it. It's easier than using regex, split and join.
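As an illustration of the first point, here is one way the extracted integers could be used for the two-month filter without converting Year-Month to datetime at all. It is a sketch on made-up data; `ym`, `months_apart` and `recent` are hypothetical names. Note that it counts whole calendar months, so any day in December counts as 2 months after 2022-10.

```python
import pandas as pd

# Hypothetical sample data; 'timestamp' is tz-aware (UTC) as in the question
df = pd.DataFrame({
    'Year-Month': ['2022-10', '2022-11'],
    'timestamp': pd.to_datetime(['2022-11-15', '2023-03-01'], utc=True),
})

# Column 0 is the year, column 1 is the month, both as integers
ym = df['Year-Month'].str.split('-', expand=True).astype(int)

# Difference in calendar months between timestamp and Year-Month
months_apart = (
    (df['timestamp'].dt.year - ym[0]) * 12
    + (df['timestamp'].dt.month - ym[1])
)

# Keep rows where the timestamp is 0 to 2 calendar months after Year-Month
recent = df[months_apart.between(0, 2)]
print(recent)
```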

CodePudding user response:

df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})

df = df.join(
    df['Year-Month']
    .str.split('-', expand=True)
    .set_axis(['year','month'], axis='columns')
)

CodePudding user response:

You can use a regex for that.

Creating a new DataFrame:

df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)

If you want to assign the output to the same DataFrame while removing the original Year-Month:

df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)

Example input:

  Year-Month
0    2022-10

output:

   year month
0  2022    10

Alternative using datetime:

You can also use a datetime intermediate:

date = pd.to_datetime(df['Year-Month'])

df['year'] = date.dt.year
df['month'] = date.dt.month

output:

  Year-Month  year  month
0    2022-10  2022     10

CodePudding user response:

You can also convert the datatype from object (str) into datetime format. This will make it easier to work with the dates.

import pandas as pd

df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")