I am trying to combine multiple spreadsheets (about 20) using a functional programming approach. Each spreadsheet contains an individual year of data. They are messy, with columns not named or name of same column changing across the spreadsheets.
I originally did all the cleaning up individually for each spreadsheet but want to learn how to do it with a functional programming to make it more reproducible.
My approach was to build a regex to match all the different names of the specified column, then rename the column using a custom function/regex. I thought I could then use map_dfr
to apply this function to all the different spreadsheets to produce a final dataframe to work with.
However I have encountered 2 problems:
the regex engine in R seems to have the global parameter on and no way to switch it off. I want to match the the different possibilities in the regex expression in sequence and stop when it finds the first match, not all matches. For example, after I import the spreadsheets sometimes there are mulitple unamed columns which get given names
...1
etc. I only want to match the first instance. I cannot seem to work out if it possible to disable the global parameter, or a cleverer way of writing the regex to stop after the first match. Also is there another, perhaps better, way of approaching this?When I pass my custom function, which seems to work well enough on individual dataframes, I get an error from
map_df
which I am not quite sure why.
I have produced a minimal reprex below, which I think highlights the issues.
All thoughts greatly received, including alternative approaches to this, as this must be a very common problem people come across. Thanks.
library(tidyverse)
year_1 <- tribble(
~`...1`, ~admissions,
"Hospital 1", 10,
"Hospital 2", 100,
"hospital 3", 200
)
year_2 <- tribble(
~provider_code, ~`...2`, ~admissions,
"H1", "Hospital 1", 20,
"H2", "Hospital 2", 400,
"H3", "hospital 3", 500
)
year_3 <- tribble(
~"Hospital provider code", ~"Commissioning region/Provider", ~admissions,
"H1", "Hospital 1", 350,
"H2", "Hospital 2", 350,
"H3", "hospital 3", 550
)
clean_up_area_column_name <- function(x){
rename({{x}}, area = matches("\\.{3}[0-9]|commissioning region|hospital provider", ignore.case = TRUE))
}
clean_up_area_column_name(year_1)
#> # A tibble: 3 × 2
#> area admissions
#> <chr> <dbl>
#> 1 Hospital 1 10
#> 2 Hospital 2 100
#> 3 hospital 3 200
clean_up_area_column_name(year_2)
#> # A tibble: 3 × 3
#> provider_code area admissions
#> <chr> <chr> <dbl>
#> 1 H1 Hospital 1 20
#> 2 H2 Hospital 2 400
#> 3 H3 hospital 3 500
clean_up_area_column_name(year_3)
#> # A tibble: 3 × 3
#> area1 area2 admissions
#> <chr> <chr> <dbl>
#> 1 H1 Hospital 1 350
#> 2 H2 Hospital 2 350
#> 3 H3 hospital 3 550
test_df <- map_dfr(c(year_1, year_2, year_3), clean_up_area_column_name)
#> Error in UseMethod("rename"): no applicable method for 'rename' applied to an object of class "character"
Created on 2022-08-08 by the reprex package (v2.0.1)
CodePudding user response:
Passing multiple data.frames to map
requires a list
test_df <- map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)
# A tibble: 9 x 5
area admissions provider_code area1 area2
<chr> <dbl> <chr> <chr> <chr>
1 Hospital 1 10 NA NA NA
2 Hospital 2 100 NA NA NA
3 hospital 3 200 NA NA NA
4 Hospital 1 20 H1 NA NA
5 Hospital 2 400 H2 NA NA
6 hospital 3 500 H3 NA NA
7 NA 350 NA H1 Hospital 1
8 NA 350 NA H2 Hospital 2
9 NA 550 NA H3 hospital 3
CodePudding user response:
If you only want to grab the first instances, as you say, then the following tweak to your function should work. Rename any "area1" to "area", then de-select the remaining "area" columns names with trailing digits (area2, area3 etc).
clean_up_area_column_name <- function(x){
rename({{x}},
area = matches("\\.{3}[0-9]|commissioning region|hospital provider")) %>%
rename(., area = matches("area1")) %>%
select(-any_of(matches("area\\d")))
}
I'm not sure what you expect year_3 to return as it seems your regex is matching the provider_code
as well as area
:
map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)
# A tibble: 9 × 3
area admissions provider_code
<chr> <dbl> <chr>
1 Hospital 1 10 NA
2 Hospital 2 100 NA
3 hospital 3 200 NA
4 Hospital 1 20 H1
5 Hospital 2 400 H2
6 hospital 3 500 H3
7 H1 350 NA
8 H2 350 NA
9 H3 550 NA