I’m cleaning and exploring (EDA) a time series dataset of revenues. Some entries are prefixed with ‘(R) ‘, meaning the value has been revised, so they are shown like (R) 1000. Example:
df = pd.DataFrame({
'year': ['2005', '2006', '2007'],
'revenue': [500, (R) 1000, 2200]})
Strangely, the data type for this column still shows as float64, and the column works when drawing a line plot. In the original Excel spreadsheet, when I select one of these cells, the (R) disappears and only the numerical value is displayed.
I tried the following code:
df['revenue'] = df['revenue'].replace('(R) ','', regex=True)
This code does not return any errors, but it is unsuccessful in removing the (R) values from this column when looking at the dataframe. This (R) seems to work as some kind of placeholder, but I cannot figure out how to remove it, and it conflicts with my other data.
Basically, I just want to change values such as (R) 1000 to 1000
CodePudding user response:
Assuming:
df = pd.DataFrame({
'year': ['2005', '2006', '2007'],
'revenue': [500, '(R) 1000', 2200]})
You can use:
df['revenue'] = (df['revenue'].str.extract(r'(\d+)$', expand=False)
                 .fillna(df['revenue'])
                 .astype(int)
                )
Output:
   year  revenue
0  2005      500
1  2006     1000
2  2007     2200
previous answer
Use pandas.to_numeric:
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
To replace the resulting NaN with a given value, combine with fillna:
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce').fillna(1000)
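To make the behaviour of errors='coerce' concrete, here is a minimal sketch on the same sample frame. It assumes the revised value was read in as the string '(R) 1000'; the names coerced and cleaned are only illustrative. Note that coercing discards the revised number, so if you want to keep it, one option is to strip the literal prefix first and convert afterwards:

```python
import pandas as pd

# Assumption: the revised entry arrives as the string '(R) 1000'
df = pd.DataFrame({
    'year': ['2005', '2006', '2007'],
    'revenue': [500, '(R) 1000', 2200]})

# errors='coerce' turns anything that cannot be parsed into NaN,
# so the revised value itself is lost at this point
coerced = pd.to_numeric(df['revenue'], errors='coerce')
print(coerced.isna().tolist())

# Alternative: remove the literal prefix (regex=False, so the
# parentheses are not treated as a regex group), then convert
cleaned = df['revenue'].astype(str).str.replace('(R) ', '', regex=False)
df['revenue'] = pd.to_numeric(cleaned)
print(df)
```

With regex=False the parentheses need no escaping, which avoids the pitfall in the pattern from the question.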
CodePudding user response:
This should remove all letters and parentheses from your strings:
df['revenue'] = df['revenue'].replace('[A-Za-z)(]', '', regex=True).astype(int)
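For comparison, a likely reason the pattern in the question ran without errors but changed nothing: with regex=True, the unescaped '(R) ' is a capture group matching the letter R followed by a space, and in '(R) 1000' the R is followed by a closing parenthesis. A small sketch, again assuming the value is stored as a string (the Series s and the name fixed are illustrative only):

```python
import re
import pandas as pd

# Assumption: the revised entry is stored as the string '(R) 1000'
s = pd.Series([500, '(R) 1000', 2200])

# Unescaped parentheses form a group, so this looks for "R " and
# finds nothing, because here R is followed by ")"
print(re.search('(R) ', '(R) 1000'))

# Escaping the parentheses makes them literal characters
print(re.sub(r'\(R\) ', '', '(R) 1000'))

# The same escaped pattern with Series.replace; non-string entries
# are left untouched, then everything is cast to int
fixed = s.replace(r'\(R\) ', '', regex=True).astype(int)
print(fixed.tolist())
```

This targets only the exact prefix, whereas the character class above removes every letter and parenthesis it finds.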