I need to subset an arrow dataset according to some columns.
I am using the write_dataset function in the R arrow package, with partitioning set on the columns I want to use to subset the dataset.
If I set min_rows_per_group to, say, 10, do all the groups that have fewer than ten rows get discarded?
(Filtering out the groups with fewer than ten entries is something I'd like to do.)
Thanks!
CodePudding user response:
write_dataset should not drop rows regardless of configuration.
Once the end of the input is reached, any queued writes that were waiting for more data (because of min_rows_per_group or min_rows_per_file) are launched with whatever they have.
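A minimal sketch to check this on your own setup, assuming an arrow release recent enough for write_dataset to accept min_rows_per_group. The output directory is just a temporary path picked for illustration, and mtcars stands in for the real data:

```r
library(arrow)
library(dplyr)

# Hypothetical output location for this sketch
out_dir <- file.path(tempdir(), "mtcars_by_cyl")

# Partition on cyl. The cyl == 6 group has only 7 rows,
# which is fewer than the min_rows_per_group requested here.
write_dataset(mtcars, out_dir, partitioning = "cyl", min_rows_per_group = 10L)

# Read the dataset back and count the rows in each partition
written <- open_dataset(out_dir) %>%
  count(cyl) %>%
  collect() %>%
  arrange(cyl)
print(written)

# No rows were dropped, and the small group was still written
stopifnot(sum(written$n) == nrow(mtcars))
stopifnot(6 %in% written$cyl)
```

The small cyl == 6 group should still show up in the result, so min_rows_per_group only affects how rows are batched into row groups, not which rows are kept.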
If your goal is to filter out groups with fewer than 10 entries, then you can probably accomplish this with dplyr. Something like this should give reasonable performance:
> library(dplyr)
> arrow::write_parquet(mtcars, '/tmp/mtcars.parquet')
> ds = arrow::open_dataset('/tmp/mtcars.parquet')
> # Keep only the cyl groups that have at least 10 rows
> counts = ds %>% count(cyl) %>% filter(n >= 10)
> ds %>% inner_join(counts, by="cyl") %>% collect()
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb  n
1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 11
2  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 14
3  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 14
4  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 11
5  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 11
6  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3 14
7  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3 14
8  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 14
9  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 14
10 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 14
11 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4 14
12 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 11
13 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 11
14 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 11
15 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1 11
16 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2 14
17 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2 14
18 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4 14
19 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2 14
20 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 11
21 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2 11
22 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 11
23 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4 14
24 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8 14
25 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 11
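If the end goal is a partitioned dataset on disk that leaves out the small groups, the filtered query can probably be piped straight into write_dataset. A rough sketch, assuming a reasonably recent arrow release with dplyr join support on datasets. The paths are hypothetical, and semi_join is swapped in for inner_join so that the helper n column is not written out:

```r
library(arrow)
library(dplyr)

# Hypothetical paths for this sketch
src <- file.path(tempdir(), "mtcars.parquet")
out_dir <- file.path(tempdir(), "mtcars_filtered")

write_parquet(mtcars, src)
ds <- open_dataset(src)

# Groups (by cyl) that have at least 10 rows
big_groups <- ds %>% count(cyl) %>% filter(n >= 10)

# semi_join keeps the rows of ds whose cyl appears in big_groups
# without adding any columns; the result is written partitioned by cyl
ds %>%
  semi_join(big_groups, by = "cyl") %>%
  write_dataset(out_dir, partitioning = "cyl")

# Read back to confirm which partitions were written
kept <- open_dataset(out_dir) %>%
  count(cyl) %>%
  collect() %>%
  arrange(cyl)
print(kept)
```

With mtcars, only the cyl = 4 and cyl = 8 partitions should be written, since the cyl = 6 group has just 7 rows.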