I would like to know if there is a good practice for updating an arrow
dataset. Imagine I have data that I first write as follows:
suppressMessages(library(dplyr))
suppressMessages(library(arrow))
td <- tempdir()
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Create an arrow dataset partitioned by cyl:
write_dataset(mtcars, td, partitioning = "cyl")
We can see that there are 3 folders, one for each value of cyl.
dir(td)
#> [1] "cyl=4" "cyl=6" "cyl=8"
Now, let's open the dataset, filter it to keep only cyl == 6, and re-write it to the same folder:
open_dataset(td) |>
filter(cyl == 6) |>
write_dataset(td, partitioning = "cyl")
There are still 3 sub-folders:
dir(td)
#> [1] "cyl=4" "cyl=6" "cyl=8"
All the original data is still there, because re-writing cyl == 6 did not remove cyl == 4 and cyl == 8:
open_dataset(td) |>
distinct(cyl) |>
collect()
#> # A tibble: 3 × 1
#> cyl
#> <int>
#> 1 6
#> 2 4
#> 3 8
My question is: how would one proceed to update an existing dataset?
Created on 2022-08-31 with reprex v2.0.2
CodePudding user response:
Depends on what you mean by "update". For one understanding of "update", that's what you did: you overwrote the cyl=6 values and didn't touch any others.
write_dataset() has an existing_data_behavior argument that governs this. The default ("overwrite") replaces individual files that have the same names and leaves everything else in place; "error" fails if the destination directory already contains any data; and "delete_matching" would, in this example, delete cyl=6/* first and then write new files for that partition.
See https://arrow.apache.org/docs/r/reference/write_dataset.html for details.
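A minimal sketch of the "delete_matching" approach, re-creating the dataset from the question. The hp change is just an illustrative update, and collect() is used so we are not still scanning the very files the write is about to delete:

```r
suppressMessages(library(dplyr))
suppressMessages(library(arrow))

td <- file.path(tempdir(), "mtcars_ds")
write_dataset(mtcars, td, partitioning = "cyl")

# Pull the cyl == 6 partition into memory, apply the update, ...
updated <- open_dataset(td) |>
  filter(cyl == 6) |>
  mutate(hp = hp + 10) |>   # illustrative change; any transformation works
  collect()

# ... then re-write it; "delete_matching" removes the old cyl=6/* files first.
write_dataset(updated, td, partitioning = "cyl",
              existing_data_behavior = "delete_matching")

# The cyl=4 and cyl=8 partitions are untouched; only cyl=6/* was replaced.
dir(td)
#> [1] "cyl=4" "cyl=6" "cyl=8"
```

With the default "overwrite" behavior, the same write would only replace files whose names collide, so a stale file from an earlier, larger write of cyl=6 could survive; "delete_matching" guarantees that partition contains exactly the new data.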