Home > front end >  R: Tidyverse - Merging only the dublicate columns in a tibble
R: Tidyverse - Merging only the dublicate columns in a tibble

Time:07-06

Some columns of my tibble are torn into two columns. I would like to merge them back together. The dublicate columns have the same name and read_delim() adds "...2" and "...3" to have identical column names. There shouldn't be two numerical values in a dublicate column, but it would be nice, if the code could handle this exception (there the mean of both would be nice). It frequently occurs, that both dublicate columns contain NAs. Some columns occur only once (like Date&Time, PYRANO#1, ...). "Date&Time" is the only consistent column without NAs.

The data looks like this:

head(df)

A tibble: 6 × 10

  | ------------------- | ------------------- | ------------------ | -----------------
  | `Date&Time`         |`SNOWDEPTH#1#HS...2` |`SNOWDEPTH#1#HS...3`| `PYRANO#1#RSWR…`
  | <dttm>              |               <dbl> |           <dbl>    |        <dbl>
  |:-------------------:|:-------------------:|:------------------:|:----------------- 
1 | 1997-11-19 16:30:00 |                   0 |               NA   |            NA
2 | 1997-11-19 17:00:00 |                  NA |               10   |            NA
3 | 1997-11-19 17:30:00 |                   9 |               NA   |            NA
4 | 1997-11-19 18:00:00 |                  NA |               NA   |            NA
5 | 1997-11-19 18:30:00 |                   9 |               NA   |            NA
6 | 1997-11-19 19:00:00 |                   9 |               NA   |            NA

# with 6 more variables: `MODEL_SNOWPACK#1#SWE` <dbl>,
#   `THERMO_HYGRO#1#TA_30MIN_MEAN...6` <dbl>,
#   `THERMO_HYGRO#1#TA_30MIN_MEAN...7` <dbl>,
#   `IRTHERMO#1#TSS_30MIN_MEAN...8` <dbl>,
#   `IRTHERMO#1#TSS_30MIN_MEAN...9` <dbl>,
#   `SNOWTHERMO#1#TS0_30MIN_MEAN` <dbl>

I would like to use a for-loop to loop through many of these files, but unfortunately the dublicate columns aren't always the same. Ideally the code should find dublicate columns and merge them automatically.

What I have tried so far:

substr(colnames(df), 1, 7)

[1] "Date&Ti" "SNOWDEP" "SNOWDEP" "PYRANO#" "MODEL_S" "THERMO_" "THERMO_" [8] "IRTHERM" "IRTHERM" "SNOWTHE"

df %>% 
      group_by(., substr(colnames(.), 1, 7), na.rm=TRUE) %>% 
      summarise_all()

Error in group_by(): ! Problem adding computed columns. Caused by error in mutate(): ! Problem while computing ..1 = substr(colnames(.), 1, 7). ✖ ..1 must be size 407400 or 1, not 10. Run rlang::last_error() to see where the error occurred.

Thanks a lot for your help!

CodePudding user response:

A possible solution is to turn it into a long format, remove the "..." (etc.) and then transform it back to wide format with a function:

library(tidyverse)

df |>
  pivot_longer(-date) |>
  mutate(name = str_remove(name, "\\.\\.\\.\\d")) |>
  pivot_wider(values_fn = ~ mean(., na.rm = TRUE))

Output:

# A tibble: 6 × 2
   date `SNOWDEPTH#1#HS`
  <int>            <dbl>
1     1                0
2     2               10
3     3                9
4     4              NaN
5     5                9
6     6                9

And some data:

df <- tibble(date = 1:6,
             `SNOWDEPTH#1#HS...2` = c(0, NA, 9, NA, 9, 9),
             `SNOWDEPTH#1#HS...3` = c(NA, 10, NA, NA, NA, NA)) 
  • Related