I'm working with a large .txt
file and I'm trying to use arrow
to process it. I opened the dataset, inspected it and it looks ok.
Next, I tried to partition it based on a column called COUNTY and convert it to parquet format. However, I'm getting this error:
pq_path <- "ebird/files"
data |>
group_by(COUNTY) |>
write_dataset(path = pq_path, format = "parquet")
Error in env_bind0(env, data) : attempt to use zero-length variable name
What am I doing wrong?
Full code:
data <- open_dataset(
sources = "ebd_PT_prv_relDec-2022/ebd_PT_prv_relDec-2022.txt",
format = "tsv"
)
data |> glimpse()
FileSystemDataset with 1 csv file
8,896,216 rows x 50 columns
$ `GLOBAL UNIQUE IDENTIFIER` <string> "URN:CornellLabOfOrnithology:EBIRD:OBS1533374605", "URN:CornellLabOfOrnithology:EB…
$ `LAST EDITED DATE` <timestamp[ns]> 2022-10-12 19:35:43, 2021-07-18 11:50:07, 2021-07-18 11:50:07, 2021-07-18 11:50:07…
$ `TAXONOMIC ORDER` <int64> 664, 6386, 6623, 6449, 444, 5689, 5433, 23729, 6469, 6563, 6973, 2033, 23151, 7125…
$ CATEGORY <string> "species", "species", "species", "species", "species", "species", "species", "spec…
$ `TAXON CONCEPT ID` <string> "avibase-B77377EE", "avibase-FB02DD96", "avibase-4D2FF6F1", "avibase-E074D706", "a…
$ `COMMON NAME` <string> "Common Eider", "Black-headed Gull", "Common Tern", "Yellow-legged Gull", "Gargane…
$ `SCIENTIFIC NAME` <string> "Somateria mollissima", "Chroicocephalus ridibundus", "Sterna hirundo", "Larus mic…
$ `SUBSPECIES COMMON NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "Cory's Sh…
$ `SUBSPECIES SCIENTIFIC NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "Calonectr…
$ `EXOTIC CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
$ `OBSERVATION COUNT` <string> "6", "X", "X", "X", "X", "3", "1", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X…
$ `BREEDING CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
$ `BREEDING CATEGORY` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
$ `BEHAVIOR CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
$ `AGE/SEX` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
$ COUNTRY <string> "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portugal"…
$ `COUNTRY CODE` <string> "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT"…
$ STATE <string> "Lisboa", "Região Autónoma dos Açores", "Região Autónoma dos Açores", "Região Autó…
$ `STATE CODE` <string> "PT-11", "PT-20", "PT-20", "PT-20", "PT-14", "PT-14", "PT-20", "PT-11", "PT-11", "…
$ COUNTY <string> "Cascais", "Lagoa", "Lagoa", "Lagoa", "Abrantes", "Abrantes", "Ribeira Grande", "C…
$ `COUNTY CODE` <string> "PT-11-CS", "PT-20-LG", "PT-20-LG", "PT-20-LG", "PT-14-AB", "PT-14-AB", "PT-20-RG"…
$ `IBA CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `BCR CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `USFWS CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `ATLAS BLOCK` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ LOCALITY <string> "PN Sintra-Cascais--Cabo Raso", "St. Michaels, Azores", "St. Michaels, Azores", "S…
$ `LOCALITY ID` <string> "L930224", "L15809663", "L15809663", "L15809663", "L22107968", "L2168609", "L29264…
$ `LOCALITY TYPE` <string> "H", "P", "P", "P", "P", "H", "P", "P", "P", "P", "H", "H", "H", "H", "H", "P", "P…
$ LATITUDE <double> 38.70946, 37.75265, 37.75265, 37.75265, 39.40783, 39.31732, 37.77506, 38.70212, 38…
$ LONGITUDE <double> -9.485836, -25.530767, -25.530767, -25.530767, -8.254034, -8.222752, -25.459442, -…
$ `OBSERVATION DATE` <date32[day]> 1891-10-01, 1900-10-01, 1900-10-01, 1900-10-01, 1939-01-03, 1944-04-09, 1957-11-26…
$ `TIME OBSERVATIONS STARTED` <time32[s]> 09:00:00, NA, NA, NA, NA, NA, NA, NA, …
$ `OBSERVER ID` <string> "obsr3419517", "obsr237338", "obsr237338", "obsr237338", "obsr3529974", "obsr35299…
$ `SAMPLING EVENT IDENTIFIER` <string> "S120076980", "S91932209", "S91932209", "S91932209", "S125612599", "S125610727", "…
$ `PROTOCOL TYPE` <string> "Incidental", "Historical", "Historical", "Historical", "Incidental", "Historical"…
$ `PROTOCOL CODE` <string> "P20", "P62", "P62", "P62", "P20", "P62", "P20", "P62", "P62", "P62", "P20", "P20"…
$ `PROJECT CODE` <string> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_POR", "EBIRD_POR", "EBIRD", "EBIRD", "E…
$ `DURATION MINUTES` <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `EFFORT DISTANCE KM` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `EFFORT AREA HA` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `NUMBER OBSERVERS` <int64> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3,…
$ `ALL SPECIES REPORTED` <int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `GROUP IDENTIFIER` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
$ `HAS MEDIA` <int64> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ APPROVED <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ REVIEWED <int64> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ REASON <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `TRIP COMMENTS` <string> "", "", "", "", "", "", "Purple Gallinule records uploaded by Marshall J. Iliff on…
$ `SPECIES COMMENTS` <string> "In Rei D. Carlos de Bragança. 2002. Inéditos - 1 das aves encontra-se taxidermada…
$ `` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
>
pq_path <- "ebird/files"
data |>
group_by(COUNTY) |>
write_dataset(path = pq_path, format = "parquet")
Error in env_bind0(env, data) : attempt to use zero-length variable name
CodePudding user response:
You don't need to use group_by
: if you want to partition the data, use the partitioning=
(and optionally hive_style=
) arguments:
pq_path <- "mtpq"
arrow::write_dataset(mtcars, pq_path, partitioning="cyl", format="parquet")
list.files("mtpq", recursive = TRUE, full.names = TRUE)
# [1] "mtpq/cyl=4/part-0.parquet" "mtpq/cyl=6/part-0.parquet" "mtpq/cyl=8/part-0.parquet"
If you want the subdirs to be just "4"
, etc, then add hive_style=FALSE
.
I think for your data, you can simply use:
write_dataset(data, pq_path, partitioning = "COUNTY", format = "parquet")