I have a series of .csv
files that look like this :
a.csv
contains
id, a, b, c
1, 10, 0, 0
2, 3, 0 , 0
3, 20, 0, 0
b.csv
contains
id, a, b, c
1, 0, 7, 0
2, 0, 9, 0
3, 0, 14, 0
c.csv
contains
id, a, b, c
1, 0, 0, 12
2, 0, 0, 8
3, 0, 0, 22
I'm trying to figure out the most efficient way to read them in and create a dataframe that looks like this
id, a, b, c
1, 10, 7, 12
2, 3, 9, 8
3, 20, 14, 22
What would be the best way to do this if there are many more files with many more columns and rows? tidyverse
is preferred.
CodePudding user response:
How about this. If all redundant columns have zeros, then you can go long, filter out the zeros, bind the rows, and then go wide.
library(tidyverse)
df_a <- read_table("id a b c
1 10 0 0
2 3 0 0
3 20 0 0")
df_b <- read_table("id a b c
1 0 7 0
2 0 9 0
3 0 14 0")
df_c <- read_table("id a b c
1 0 0 12
2 0 0 8
3 0 0 22")
list(df_a, df_b, df_c)|>
map(\(d) pivot_longer(d, cols = -id) |>
filter(value >0)) |>
bind_rows() |>
pivot_wider(names_from = name, values_from = value)
#> # A tibble: 3 x 4
#> id a b c
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 7 12
#> 2 2 3 9 8
#> 3 3 20 14 22
Or better yet, read in the data marking 0 as NA and then coalesce the data frames.
df_a <- read_table("id a b c
1 10 0 0
2 3 0 0
3 20 0 0", na = "0")
df_b <- read_table("id a b c
1 0 7 0
2 0 9 0
3 0 14 0", na = "0")
df_c <- read_table("id a b c
1 0 0 12
2 0 0 8
3 0 0 22", na = "0")
coalesce(df_a, df_b, df_c)
#> # A tibble: 3 x 4
#> id a b c
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 7 12
#> 2 2 3 9 8
#> 3 3 20 14 22
Or if you can read the data in with NA, you can define 0 as NA:
list(df_a, df_b, df_c) |>
map(\(d) mutate(d, across(everything(), \(x) ifelse(x == 0, NA, x)))) |>
reduce(coalesce)
#> # A tibble: 3 x 4
#> id a b c
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 7 12
#> 2 2 3 9 8
#> 3 3 20 14 22
CodePudding user response:
A base R
solution, given the symmetry of the files.
Read the files
file_names <- list.files(pattern="^[abc]\\.csv")
lis <- sapply(file_names, function(x) list(read.csv(x, header=T)))
lis
$a.csv
id a b c
1 1 10 0 0
2 2 3 0 0
3 3 20 0 0
$b.csv
id a b c
1 1 0 7 0
2 2 0 9 0
3 3 0 14 0
$c.csv
id a b c
1 1 0 0 12
2 2 0 0 8
3 3 0 0 22
Combine the columns
column_names <- c("a","b","c")
cbind( lis[["a.csv"]]["id"], sapply(lis, function(x) rowSums(x[column_names])) )
id a.csv b.csv c.csv
1 1 10 7 12
2 2 3 9 8
3 3 20 14 22