functional programming problems -- map

I am trying to combine multiple spreadsheets (about 20) using a functional programming approach. Each spreadsheet contains an individual year of data. They are messy, with columns not named or name of same column changing across the spreadsheets.

I originally did all the cleaning up individually for each spreadsheet but want to learn how to do it with a functional programming to make it more reproducible.

My approach was to build a regex to match all the different names of the specified column, then rename the column using a custom function/regex. I thought I could then use map_dfr to apply this function to all the different spreadsheets to produce a final dataframe to work with.

However I have encountered 2 problems:

the regex engine in R seems to have the global parameter on and no way to switch it off. I want to match the the different possibilities in the regex expression in sequence and stop when it finds the first match, not all matches. For example, after I import the spreadsheets sometimes there are mulitple unamed columns which get given names ...1 etc. I only want to match the first instance. I cannot seem to work out if it possible to disable the global parameter, or a cleverer way of writing the regex to stop after the first match. Also is there another, perhaps better, way of approaching this?
When I pass my custom function, which seems to work well enough on individual dataframes, I get an error from map_df which I am not quite sure why.

I have produced a minimal reprex below, which I think highlights the issues.

All thoughts greatly received, including alternative approaches to this, as this must be a very common problem people come across. Thanks.

library(tidyverse)

year_1 <- tribble(
  ~`...1`, ~admissions,
  "Hospital 1", 10,
  "Hospital 2", 100,
  "hospital 3", 200
)

year_2 <- tribble(
  ~provider_code, ~`...2`, ~admissions,
  "H1", "Hospital 1", 20,
  "H2", "Hospital 2", 400,
  "H3", "hospital 3", 500
)

year_3 <- tribble(
  ~"Hospital provider code", ~"Commissioning region/Provider", ~admissions,
  "H1", "Hospital 1", 350,
  "H2", "Hospital 2", 350,
  "H3", "hospital 3", 550
)


clean_up_area_column_name <- function(x){
  rename({{x}}, area = matches("\\.{3}[0-9]|commissioning region|hospital provider", ignore.case = TRUE))
  }

clean_up_area_column_name(year_1)
#> # A tibble: 3 × 2
#>   area       admissions
#>   <chr>           <dbl>
#> 1 Hospital 1         10
#> 2 Hospital 2        100
#> 3 hospital 3        200

clean_up_area_column_name(year_2)
#> # A tibble: 3 × 3
#>   provider_code area       admissions
#>   <chr>         <chr>           <dbl>
#> 1 H1            Hospital 1         20
#> 2 H2            Hospital 2        400
#> 3 H3            hospital 3        500

clean_up_area_column_name(year_3)
#> # A tibble: 3 × 3
#>   area1 area2      admissions
#>   <chr> <chr>           <dbl>
#> 1 H1    Hospital 1        350
#> 2 H2    Hospital 2        350
#> 3 H3    hospital 3        550

test_df <- map_dfr(c(year_1, year_2, year_3), clean_up_area_column_name)
#> Error in UseMethod("rename"): no applicable method for 'rename' applied to an object of class "character"

^{Created on 2022-08-08 by the reprex package (v2.0.1)}

CodePudding user response：

Passing multiple data.frames to map requires a list

test_df <- map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)

# A tibble: 9 x 5
  area       admissions provider_code area1 area2     
  <chr>           <dbl> <chr>         <chr> <chr>     
1 Hospital 1         10 NA            NA    NA        
2 Hospital 2        100 NA            NA    NA        
3 hospital 3        200 NA            NA    NA        
4 Hospital 1         20 H1            NA    NA        
5 Hospital 2        400 H2            NA    NA        
6 hospital 3        500 H3            NA    NA        
7 NA                350 NA            H1    Hospital 1
8 NA                350 NA            H2    Hospital 2
9 NA                550 NA            H3    hospital 3

CodePudding user response：

If you only want to grab the first instances, as you say, then the following tweak to your function should work. Rename any "area1" to "area", then de-select the remaining "area" columns names with trailing digits (area2, area3 etc).

clean_up_area_column_name <- function(x){
rename({{x}}, 
       area = matches("\\.{3}[0-9]|commissioning region|hospital provider")) %>% 
  rename(., area = matches("area1")) %>% 
   select(-any_of(matches("area\\d")))
}

I'm not sure what you expect year_3 to return as it seems your regex is matching the provider_code as well as area:

map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)

# A tibble: 9 × 3
  area       admissions provider_code
  <chr>           <dbl> <chr>        
1 Hospital 1         10 NA           
2 Hospital 2        100 NA           
3 hospital 3        200 NA           
4 Hospital 1         20 H1           
5 Hospital 2        400 H2           
6 hospital 3        500 H3           
7 H1                350 NA           
8 H2                350 NA           
9 H3                550 NA