I have
df.columns
Index(['location', 'date',
'deaths_2020_all_ages', 'average_deaths_2015_2019_all_ages',
'deaths_2015_all_ages', 'deaths_2016_all_ages', 'deaths_2017_all_ages',
'deaths_2018_all_ages', 'deaths_2019_all_ages', 'deaths_2010_all_ages',
'deaths_2011_all_ages', 'deaths_2012_all_ages', 'deaths_2013_all_ages',
'deaths_2014_all_ages', 'Week', 'deaths_2021_all_ages', 'w'],
dtype='object')
I want to rename the deaths columns such as 'deaths_2014_all_ages' to '2014'.
I have tried this
cols =[x for x in df.columns[df.columns.str.extract(r'([ab])?(\d)')]]
# KeyError: 2
and
cols =[x for x in df.columns[df.columns.str.replace('[a-z] _\d{4}_[a-z] _[a-z] ', '\d{4}', regex=True)]]
# KeyError: '\\d'
# error: bad escape \d at position 0
any advice? thank you.
CodePudding user response:
Assuming you only want to extract the dates from the columns starting in "death", you could use a regex (here: first number or full string if not starting with "death"):
df.columns = df.columns.str.extract('(\d |^(?!death).*$)', expand=False)
columns before:
Index(['location', 'date', 'deaths_2020_all_ages',
'average_deaths_2015_2019_all_ages', 'deaths_2015_all_ages',
'deaths_2016_all_ages', 'deaths_2017_all_ages', 'deaths_2018_all_ages',
'deaths_2019_all_ages', 'deaths_2010_all_ages', 'deaths_2011_all_ages',
'deaths_2012_all_ages', 'deaths_2013_all_ages', 'deaths_2014_all_ages',
'Week', 'deaths_2021_all_ages', 'w'],
dtype='object')
columns after:
Index(['location', 'date', '2020', 'average_deaths_2015_2019_all_ages', '2015',
'2016', '2017', '2018', '2019', '2010', '2011', '2012', '2013', '2014',
'Week', '2021', 'w'],
dtype='object')
CodePudding user response:
You can do it in the following way:
df.columns = df.columns.str.extract('(\d )', expand=False)
CodePudding user response:
If all columns that you want to rename have this exact format and you don't want to rename the columns containing averages you could do this (as an option that doesn't require regex):
cols = [ c.split("_")[1] if "deaths_" in c else c for c in df.columns ]
Which on a smaller examples would look like:
>>> df.columns = [ "deaths_123_all", "deaths_456_all", "w" ]
>>> [ c.split("_")[1] if "deaths_" in c else c for c in df.columns ]
[ "123", "456", "w" ]
CodePudding user response:
You can use
import re
df = df.rename(columns = lambda x: re.sub('^\D*(\d ).*', r'\1',x))
If deaths_
must appear before the digits, just add it:
df = df.rename(columns = lambda x: re.sub('^\D*deaths_(\d ).*', r'\1',x))
See the regex demo. The ^\D*(\d ).*
regex matches
^
- start of string\D*
- zero or more non-digit chars(\d )
- Group 1: one or more digits.*
- the rest of the line.
The \1
in the replacement is a backreference to Group 1 value.
Your regex approach can be fixed with
df = df.rename(columns = lambda x: re.sub('^.*?_(\d{4})_.*', r'\1',x))
# or
df = df.rename(columns = lambda x: re.sub('^.*?deaths_(\d{4})_.*', r'\1',x))
where
^.*?_
- matches the start of string, then any zero or more chars other than line break chars as few as possible and then a_
(\d{4})
- Group 1: four digits_
- an underscore.*
- the rest of the line.