Home > database >  Combine csv files with redundant columns R
Combine csv files with redundant columns R

Time:10-06

I have a series of .csv files that look like this :

a.csv contains

id, a, b, c
1, 10, 0, 0
2, 3, 0 , 0
3, 20, 0, 0

b.csv contains

id, a, b, c
1, 0, 7, 0
2, 0, 9, 0
3, 0, 14, 0

c.csv contains

id, a, b, c
1, 0, 0, 12
2, 0, 0, 8
3, 0, 0, 22

I'm trying to figure out the most efficient way to read them in and create a dataframe that looks like this

id, a, b, c
1, 10, 7, 12
2, 3, 9, 8
3, 20, 14, 22

What would be the best way to do this if there are many more files with many more columns and rows? tidyverse is preferred.

CodePudding user response:

How about this. If all redundant columns have zeros, then you can go long, filter out the zeros, bind the rows, and then go wide.

library(tidyverse)

df_a <- read_table("id a b c
1 10 0 0
2 3 0 0
3 20 0 0")

df_b <- read_table("id a b c
1 0 7 0
2 0 9 0
3 0 14 0")

df_c <- read_table("id a b c
1 0 0 12
2 0 0 8
3 0 0 22")

list(df_a, df_b, df_c)|>
  map(\(d) pivot_longer(d, cols = -id) |>
        filter(value >0)) |>
  bind_rows() |>
  pivot_wider(names_from = name, values_from = value)
#> # A tibble: 3 x 4
#>      id     a     b     c
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    10     7    12
#> 2     2     3     9     8
#> 3     3    20    14    22

Or better yet, read in the data marking 0 as NA and then coalesce the data frames.


df_a <- read_table("id a b c
1 10 0 0
2 3 0 0
3 20 0 0", na = "0")

df_b <- read_table("id a b c
1 0 7 0
2 0 9 0
3 0 14 0", na = "0")

df_c <- read_table("id a b c
1 0 0 12
2 0 0 8
3 0 0 22", na = "0")


coalesce(df_a, df_b, df_c)
#> # A tibble: 3 x 4
#>      id     a     b     c
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    10     7    12
#> 2     2     3     9     8
#> 3     3    20    14    22

Or if you can read the data in with NA, you can define 0 as NA:

list(df_a, df_b, df_c) |>
  map(\(d) mutate(d, across(everything(), \(x) ifelse(x == 0, NA, x)))) |>
  reduce(coalesce)
#> # A tibble: 3 x 4
#>      id     a     b     c
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    10     7    12
#> 2     2     3     9     8
#> 3     3    20    14    22

CodePudding user response:

A base R solution, given the symmetry of the files.

Read the files

file_names <- list.files(pattern="^[abc]\\.csv")

lis <- sapply(file_names, function(x) list(read.csv(x, header=T)))
lis
$a.csv
  id  a b c
1  1 10 0 0
2  2  3 0 0
3  3 20 0 0

$b.csv
  id a  b c
1  1 0  7 0
2  2 0  9 0
3  3 0 14 0

$c.csv
  id a b  c
1  1 0 0 12
2  2 0 0  8
3  3 0 0 22

Combine the columns

column_names <- c("a","b","c")

cbind( lis[["a.csv"]]["id"], sapply(lis, function(x) rowSums(x[column_names])) )
  id a.csv b.csv c.csv
1  1    10     7    12
2  2     3     9     8
3  3    20    14    22
  • Related