Home > database >  how to extract digits from column names in a pandas dataframe?
how to extract digits from column names in a pandas dataframe?

Time:03-03

I have

df.columns
Index(['location', 'date', 
       'deaths_2020_all_ages', 'average_deaths_2015_2019_all_ages',
       'deaths_2015_all_ages', 'deaths_2016_all_ages', 'deaths_2017_all_ages',
       'deaths_2018_all_ages', 'deaths_2019_all_ages', 'deaths_2010_all_ages',
       'deaths_2011_all_ages', 'deaths_2012_all_ages', 'deaths_2013_all_ages',
       'deaths_2014_all_ages', 'Week', 'deaths_2021_all_ages', 'w'],
      dtype='object')

I want to rename the deaths columns such as 'deaths_2014_all_ages' to '2014'.

I have tried this

cols =[x for x in df.columns[df.columns.str.extract(r'([ab])?(\d)')]]

# KeyError: 2

and

cols =[x for x in df.columns[df.columns.str.replace('[a-z] _\d{4}_[a-z] _[a-z] ', '\d{4}', regex=True)]]

# KeyError: '\\d'
# error: bad escape \d at position 0

any advice? thank you.

CodePudding user response:

Assuming you only want to extract the dates from the columns starting in "death", you could use a regex (here: first number or full string if not starting with "death"):

df.columns = df.columns.str.extract('(\d |^(?!death).*$)', expand=False)

columns before:

Index(['location', 'date', 'deaths_2020_all_ages',
       'average_deaths_2015_2019_all_ages', 'deaths_2015_all_ages',
       'deaths_2016_all_ages', 'deaths_2017_all_ages', 'deaths_2018_all_ages',
       'deaths_2019_all_ages', 'deaths_2010_all_ages', 'deaths_2011_all_ages',
       'deaths_2012_all_ages', 'deaths_2013_all_ages', 'deaths_2014_all_ages',
       'Week', 'deaths_2021_all_ages', 'w'],
      dtype='object')

columns after:

Index(['location', 'date', '2020', 'average_deaths_2015_2019_all_ages', '2015',
       '2016', '2017', '2018', '2019', '2010', '2011', '2012', '2013', '2014',
       'Week', '2021', 'w'],
      dtype='object')

CodePudding user response:

You can do it in the following way:

df.columns = df.columns.str.extract('(\d )', expand=False)

CodePudding user response:

If all columns that you want to rename have this exact format and you don't want to rename the columns containing averages you could do this (as an option that doesn't require regex):

cols = [ c.split("_")[1] if "deaths_" in c else c for c in df.columns ]

Which on a smaller examples would look like:

>>> df.columns = [ "deaths_123_all", "deaths_456_all", "w" ]
>>> [ c.split("_")[1] if "deaths_" in c else c for c in df.columns ]
[ "123", "456", "w" ]

CodePudding user response:

You can use

import re
df = df.rename(columns = lambda x: re.sub('^\D*(\d ).*', r'\1',x))

If deaths_ must appear before the digits, just add it:

df = df.rename(columns = lambda x: re.sub('^\D*deaths_(\d ).*', r'\1',x))

See the regex demo. The ^\D*(\d ).* regex matches

  • ^ - start of string
  • \D* - zero or more non-digit chars
  • (\d ) - Group 1: one or more digits
  • .* - the rest of the line.

The \1 in the replacement is a backreference to Group 1 value.

Your regex approach can be fixed with

df = df.rename(columns = lambda x: re.sub('^.*?_(\d{4})_.*', r'\1',x))
# or
df = df.rename(columns = lambda x: re.sub('^.*?deaths_(\d{4})_.*', r'\1',x))

where

  • ^.*?_ - matches the start of string, then any zero or more chars other than line break chars as few as possible and then a _
  • (\d{4}) - Group 1: four digits
  • _ - an underscore
  • .* - the rest of the line.
  • Related