QUESTION: Is there a function to simplify this whole process?
So I was trying to clean data. Data source: UN Energy Table (will automatically download xls file) And the data in question is the 'Country' column after this xls is turned into a dataframe.
The task was to remove the parentheses and numbers attached on the Country name. What I did was, find all country names containing numbers or parentheses, turning it into a list, finding the clean name and replacing them one by one through a loop.
# Finding all dirty country names
dirtyNames = df1[df1['Country'].str.contains('[A-Za-z ][0-9/(/)]')==True]
# Changing them to list
dirtyNames = dirtyNames['Country'].tolist()
for name in dirtyNames:
clean = re.split('[0-9/(/)]', name)[0]
df1.replace(name,clean, inplace=True)
but is there a function for this? I feel like there must be a function for it if I have to make a loop. I tried examples from the internet, fixing my dataset into these,
df['first_five_Letter']=df['Country (region)'].str.extract(r'(^w{5})')
and other similar methods, but I keep getting the AttributeError: Can only use .str accessor with string values! error.
CodePudding user response:
There is a very straightforward way of doing it by chaining regexes.
s=[Some list of Countries with numbers and parentheses]
for i,x in enumerate(s):
s[i]=re.sub("[0-9]", "", (re.sub("\)","",(re.sub("\(", "", s[i])))))
or
if 's' is a column in dataframe 'df',
for i,each in df['s'].iteritems():
df.loc[i,'s'] = re.sub("[0-9]", "", (re.sub("\)","",(re.sub("\(", "", df.loc[i,'s'])))))
CodePudding user response:
You can use
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country':['XXX(12)', 'YYYY5000', '(ZZZ)15', np.nan]})
df.loc[pd.isna(df['Country']), 'Country'] = ""
df['Country'] = df['Country'].astype(str).str.replace(r'[0-9()] ', '', regex=True)
df.loc[df['Country'] == '', 'Country'] = np.nan
Here,
df.loc[pd.isna(df['Country']), 'Country'] = ""
- converts allNaN
values to empty strings.astype(str)
- converts data to string type.str.replace(r'[0-9()] ', '', regex=True)
- removes all digits,(
and)
charsdf.loc[df['Country'] == '', 'Country'] = np.nan
- converts empty strings back toNaN
.
CodePudding user response:
Since you want to remove both numbers or text in parentheses (or possibly both, but with the numbers always at the end) at the end of the string with nothing, you can do that with one regex replacement:
df = pd.read_excel('Energy Indicators.xls', skiprows=17, usecols='C:F',names=['Country', 'Supply', 'per Capita', 'Renewable'], skipfooter=38)
df['Country'] = df['Country'].str.replace(r'(?:\s*\(.*\))?\d*$', '', regex=True)
Tested on the actual dataset, this gives for df['Country']
:
[
'Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia',
'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
'Bermuda', 'Bhutan', 'Bolivia', 'Bonaire, Sint Eustatius and Saba',
'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'British Virgin Islands',
'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cabo Verde',
'Cambodia', 'Cameroon', 'Canada', 'Cayman Islands', 'Central African Republic',
'Chad', 'Chile', 'China', 'China, Hong Kong Special Administrative Region',
'China, Macao Special Administrative Region', 'Colombia', 'Comoros', 'Congo',
'Cook Islands', 'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Curaçao',
'Cyprus', 'Czech Republic', "Democratic People's Republic of Korea",
'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica',
'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea',
'Eritrea', 'Estonia', 'Ethiopia', 'Faeroe Islands', 'Falkland Islands',
'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'Gabon',
'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland',
'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau',
'Guyana', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia',
'Iran', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica',
'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait',
'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon',
'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg',
'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands',
'Martinique', 'Mauritania', 'Mauritius', 'Mexico', 'Micronesia', 'Mongolia',
'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua',
'Niger', 'Nigeria', 'Niue', 'Northern Mariana Islands', 'Norway', 'Oman',
'Pakistan', 'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru',
'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Korea',
'Republic of Moldova', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda',
'Saint Helena', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Pierre and Miquelon',
'Saint Vincent and the Grenadines', 'Samoa', 'Sao Tome and Principe',
'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone',
'Singapore', 'Sint Maarten', 'Slovakia', 'Slovenia', 'Solomon Islands',
'Somalia', 'South Africa', 'South Sudan', 'Spain', 'Sri Lanka', 'State of Palestine',
'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic',
'Tajikistan', 'Thailand', 'The former Yugoslav Republic of Macedonia',
'Timor-Leste', 'Togo', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine',
'United Arab Emirates', 'United Kingdom of Great Britain and Northern Ireland',
'United Republic of Tanzania', 'United States of America', 'United States Virgin Islands',
'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Viet Nam',
'Wallis and Futuna Islands', 'Yemen', 'Zambia', 'Zimbabwe'
]