Home > Blockchain >  Remove Character Values from Specific Columns Using R
Remove Character Values from Specific Columns Using R

Time:01-05

I have the following dataframe, and I need to perform several functions on 4 specific columns. Here they are names as 'adjusted_cost','adjusted_cost_2','market_value','market_value_2'. However, the names keep changing with different dataframes, like 'cost', 'cost_2', 'value', 'value_2'.

structure(list(company = c("Investments", "Debt", 
"Cash", "", ""), interest_rate = c("", 
"", "7.46", "8.44", ""), adjusted_cost = c("", "", "$", 
"", ""), adjusted_cost_2 = c("", "", "3,586 (EUR 3,381)", "3,564 (GBP 2,955)", "7,150"
), market_value = c("", "", "$3,505 (EUR 3,578)", "3,450(GBP 3,090)", 
""), market_value_2 = c("", "", "$3,505 (EUR 3,578)", "3,450(GBP 3,090)", 
"6,955")), row.names = c(NA, 5L), class = "data.frame")
                                                                  ​               ​        ​

I would like to remove the values in parentheses, ie (EUR 3,381) and (GBP 2,955) and then use the parse_number method to keep only the numbers from these specific columns.

Desired Output

company  interest_rate   adjusted_cost  adjusted_cost_2 market_value  market_value_2
    <chr>   <chr>             <dbl>        <dbl>          <dbl>           <dbl>
1   Investments                 
2   Debt                    
3   Cash    7.46                            3586           3505          3505
4           8.44                            3564           3450          3450
5                                                          7150          6955

Any suggestions would be appreciated. Thanks!

CodePudding user response:

You could do this entirely in base R with a combination of lapply and gsub, or alternatively if you want to stick to readr::parse_number(), using lapply and readr::parse_number:

# columns to change
changecols <- c('adjusted_cost', 'adjusted_cost_2', 'market_value', 'market_value_2')

#' Alternative if your column names of interest change but 
#' ALWAYS have "cost" and "value" and no other columns have 
#' those words

changecols <- names(df)[grep("cost|value", names(df))]
# [1] "adjusted_cost"   "adjusted_cost_2" "market_value"    "market_value_2"

Base R

df2[changecols] <- lapply(df2[changecols], function(x)
    as.numeric(gsub("\\s*\\([^\\)] \\)|,|\\$", "", x))
  )

If you wanted the space instead of NA for reporting purposes, just remove the as.numeric() wrapper.

Using readr::parse_number()

df[changecols] <- lapply(df[changecols], readr::parse_number)

In both cases the output is identical:

#       company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
# 1 Investments                          NA              NA           NA             NA
# 2        Debt                          NA              NA           NA             NA
# 3        Cash          7.46            NA            3586         3505           3505
# 4                      8.44            NA            3564         3450           3450
# 5                                      NA            7150           NA           6955

If you want to replace NA with a blank for reporting purposes with the parse_number approach, simply:

df[is.na(df)] <- ""

#       company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
# 1 Investments                                                                        
# 2        Debt                                                                        
# 3        Cash          7.46                          3586         3505           3505
# 4                      8.44                          3564         3450           3450
# 5                                                    7150                        6955

CodePudding user response:

We can use parse_number after looping across the columns of interest

library(dplyr)
df1 %>% 
   mutate(across(adjusted_cost:market_value_2, ~ readr::parse_number(.x)))

-output

  company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
1 Investments                          NA              NA           NA             NA
2        Debt                          NA              NA           NA             NA
3        Cash          7.46            NA            3586         3505           3505
4                      8.44            NA            3564         3450           3450
5                                      NA            7150           NA           6955

Or using base R

df1[4:ncol(df1)] <- lapply(df1[4:ncol(df1)], \(x) 
    as.numeric(sub("\\s*\\(.*", "", gsub("[$,]", "", x))))

-output

> df1
      company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
1 Investments                                          NA           NA             NA
2        Debt                                          NA           NA             NA
3        Cash          7.46             $            3586           NA             NA
4                      8.44                          3564         3450           3450
5                                                    7150           NA           6955

NOTE: We just need to use only parse_number to get the output desired. In addition, blanks ("") are considered as character, so it is better to return NA for numeric columns

  • Related