Unable to strip non-numeric values from column in Pandas dataframe

I'm cleaning and doing EDA on a time series dataset of revenues. Some of the values are prefixed with '(R) ', meaning the value has been revised, so they are shown like (R) 1000. Example:

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'], 
    'revenue': [500, (R) 1000, 2200]})

Strangely, the data type for this column still shows as float64, and the column plots fine in a line plot. In the original Excel spreadsheet, when I select one of these cells, the (R) disappears and only the numerical value is displayed.

I tried the following code:

df['revenue'] = df['revenue'].replace('(R) ','', regex=True)

This code does not raise any errors, but it does not remove the (R) from this column when I look at the dataframe. The (R) seems to act as some kind of placeholder, but I cannot figure out how to remove it, and it conflicts with my other data.

Basically, I just want to change values such as (R) 1000 to 1000.

CodePudding user response：

Assuming:

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'], 
    'revenue': [500, '(R) 1000', 2200]})

You can use:

df['revenue'] = (df['revenue'].str.extract(r'(\d+)$', expand=False)
                 .fillna(df['revenue'])
                 .astype(int)
                 )

Output:

   year  revenue
0  2005      500
1  2006     1000
2  2007     2200
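
To see why the fillna step is needed, here is a minimal sketch that splits the chain into separate steps. It assumes the same mixed column as above (integers plus one '(R) 1000' string); the names digits and cleaned are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'],
    'revenue': [500, '(R) 1000', 2200],
})

# .str methods only act on string entries; the integers 500 and 2200
# come back as NaN, while '(R) 1000' yields the trailing digits '1000'.
digits = df['revenue'].str.extract(r'(\d+)$', expand=False)

# Put the original values back where nothing was extracted,
# then convert the whole column to integers.
cleaned = digits.fillna(df['revenue']).astype(int)
```

Without the fillna step, the rows that were already numeric would be lost as NaN.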

Previous answer:

Use pandas.to_numeric:

df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')

To replace with a given value, combine with fillna:

df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce').fillna(1000)
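
As a sketch of what errors='coerce' does here, assuming the same example frame as above: the '(R) 1000' string cannot be parsed as a number, so it becomes NaN, and fillna then writes one constant into every such gap. This is probably only suitable when the replacement value is known in advance. The names coerced and filled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'],
    'revenue': [500, '(R) 1000', 2200],
})

# Entries that cannot be parsed, such as '(R) 1000', are turned into NaN
coerced = pd.to_numeric(df['revenue'], errors='coerce')

# Every NaN receives the same constant, here 1000
filled = coerced.fillna(1000)
```

Note that the revised figure itself is discarded by the coercion, so this does not recover 1000 from '(R) 1000'; it only fills the gap.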

CodePudding user response：

This should remove all letters and parentheses from your strings:

df['revenue'] = df['revenue'].replace('[A-Za-z)(]','',regex=True).astype(int)
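
A narrower variant, sketched under the assumption that the marker is always the literal text '(R)' at the start of the value: escape the parentheses so the regex engine treats them as characters rather than as a capture group. Unescaped, the pattern '(R) ' from the question looks for an 'R' followed by a space, which never occurs in '(R) 1000', so nothing is replaced. The names unchanged, as_text and cleaned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'],
    'revenue': [500, '(R) 1000', 2200],
})

# The question's pattern: the parentheses form a group, so it matches "R " only
unchanged = df['revenue'].replace('(R) ', '', regex=True)

# Escaped parentheses match the literal "(R)" prefix plus any following spaces
as_text = df['revenue'].astype(str).str.replace(r'^\(R\)\s*', '', regex=True)
cleaned = pd.to_numeric(as_text)
```

Converting to text first keeps the already-numeric rows intact, so no fillna step is needed afterwards.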