I have the following dataframe, and I need to perform several functions on 4 specific columns. Here they are names as 'adjusted_cost','adjusted_cost_2','market_value','market_value_2'
. However, the names keep changing with different dataframes, like 'cost', 'cost_2', 'value', 'value_2'
.
structure(list(company = c("Investments", "Debt",
"Cash", "", ""), interest_rate = c("",
"", "7.46", "8.44", ""), adjusted_cost = c("", "", "$",
"", ""), adjusted_cost_2 = c("", "", "3,586 (EUR 3,381)", "3,564 (GBP 2,955)", "7,150"
), market_value = c("", "", "$3,505 (EUR 3,578)", "3,450(GBP 3,090)",
""), market_value_2 = c("", "", "$3,505 (EUR 3,578)", "3,450(GBP 3,090)",
"6,955")), row.names = c(NA, 5L), class = "data.frame")
I would like to remove the values in parentheses, ie (EUR 3,381) and (GBP 2,955)
and then use the parse_number
method to keep only the numbers from these specific columns.
Desired Output
company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Investments
2 Debt
3 Cash 7.46 3586 3505 3505
4 8.44 3564 3450 3450
5 7150 6955
Any suggestions would be appreciated. Thanks!
CodePudding user response:
You could do this entirely in base R with a combination of lapply
and gsub
, or alternatively if you want to stick to readr::parse_number()
, using lapply
and readr::parse_number
:
# columns to change
changecols <- c('adjusted_cost', 'adjusted_cost_2', 'market_value', 'market_value_2')
#' Alternative if your column names of interest change but
#' ALWAYS have "cost" and "value" and no other columns have
#' those words
changecols <- names(df)[grep("cost|value", names(df))]
# [1] "adjusted_cost" "adjusted_cost_2" "market_value" "market_value_2"
Base R
df2[changecols] <- lapply(df2[changecols], function(x)
as.numeric(gsub("\\s*\\([^\\)] \\)|,|\\$", "", x))
)
If you wanted the space instead of NA
for reporting purposes, just remove the as.numeric()
wrapper.
Using readr::parse_number()
df[changecols] <- lapply(df[changecols], readr::parse_number)
In both cases the output is identical:
# company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
# 1 Investments NA NA NA NA
# 2 Debt NA NA NA NA
# 3 Cash 7.46 NA 3586 3505 3505
# 4 8.44 NA 3564 3450 3450
# 5 NA 7150 NA 6955
If you want to replace NA
with a blank for reporting purposes with the parse_number
approach, simply:
df[is.na(df)] <- ""
# company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
# 1 Investments
# 2 Debt
# 3 Cash 7.46 3586 3505 3505
# 4 8.44 3564 3450 3450
# 5 7150 6955
CodePudding user response:
We can use parse_number
after looping across
the columns of interest
library(dplyr)
df1 %>%
mutate(across(adjusted_cost:market_value_2, ~ readr::parse_number(.x)))
-output
company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
1 Investments NA NA NA NA
2 Debt NA NA NA NA
3 Cash 7.46 NA 3586 3505 3505
4 8.44 NA 3564 3450 3450
5 NA 7150 NA 6955
Or using base R
df1[4:ncol(df1)] <- lapply(df1[4:ncol(df1)], \(x)
as.numeric(sub("\\s*\\(.*", "", gsub("[$,]", "", x))))
-output
> df1
company interest_rate adjusted_cost adjusted_cost_2 market_value market_value_2
1 Investments NA NA NA
2 Debt NA NA NA
3 Cash 7.46 $ 3586 NA NA
4 8.44 3564 3450 3450
5 7150 NA 6955
NOTE: We just need to use only parse_number
to get the output desired. In addition, blanks (""
) are considered as character, so it is better to return NA
for numeric columns