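This function writes a dataset to disk. Writing to a more efficient binary storage format and specifying relevant partitioning can make the data much faster to read and query.
Usage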
write_dataset(
  dataset,
  path,
  format = c("parquet", "feather", "arrow", "ipc", "csv", "tsv", "txt", "text"),
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  existing_data_behavior = c("overwrite", "error", "delete_matching"),
  max_partitions = 1024L,
  max_open_files = 900L,
  max_rows_per_file = 0L,
  min_rows_per_group = 0L,
  max_rows_per_group = bitwShiftL(1, 20),
  create_directory = TRUE,
  preserve_order = FALSE,
  ...
)
Arguments
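- dataset
A Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame. If it is an arrow_dplyr_query, the query is evaluated and its result is written. This means you can use select(), filter(), mutate(), and similar functions to transform the data before it is written.
- path
A string path, URI, or SubTreeFileSystem referencing the directory to write to (the directory is created if it does not exist).
- format
A string identifier of the file format. Defaults to "parquet" (see FileFormat).
- partitioning
A Partitioning or a character vector of columns to use as partition keys (written as path segments). Defaults to the current group_by() columns.
- basename_template
A string template for the names of the files to be written. Must contain "{i}", which is replaced with an auto-incremented integer to generate the base names of the data files. For example, "part-{i}.arrow" yields "part-0.arrow", .... If not specified, it defaults to "part-{i}.<default extension>".
- hive_style
Logical: write partition segments in Hive style (key1=value1/key2=value2/file.ext) or as bare values only. Defaults to TRUE.
- existing_data_behavior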
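The behavior to use when data already exists in the destination directory. Must be one of "overwrite", "error", or "delete_matching".
  - "overwrite" (the default): any newly created files overwrite existing files.
  - "error": the operation fails if the destination directory is not empty.
  - "delete_matching": the writer deletes any existing partition that is about to receive data and leaves partitions that receive no data untouched.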
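- max_partitions
The maximum number of partitions any batch may be written into. Defaults to 1024L.
- max_open_files
The maximum number of files that can be left open during a write operation. If greater than 0, this limits the number of files that can be left open. If an attempt is made to open too many files, the least recently used file is closed. Setting this too low may fragment the data into many small files. The default is 900, which also leaves room for the scanner to keep some files open before hitting the default Linux limit of 1024.
- max_rows_per_file
The maximum number of rows per file. If greater than 0, this limits how many rows are placed in any single file. Defaults to 0L.
- min_rows_per_group
Write a row group to disk once this number of rows has accumulated. Defaults to 0L.
- max_rows_per_group
The maximum number of rows allowed in a single group. When this number is exceeded, the data is split and the next set of rows is written to the next group. This value must be greater than min_rows_per_group. Defaults to 1024 * 1024.
- create_directory
Whether to create the directories being written into. Requires appropriate permissions on the storage backend. If set to FALSE, directories are assumed to already exist when writing to a classic hierarchical filesystem. Defaults to TRUE.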
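- preserve_order
Preserve the order of rows.
- ...
Additional format-specific arguments. For available Parquet options, see write_parquet(). The available Feather options are:
  - use_legacy_format: logical. Write data formatted so that Arrow libraries version 0.14 and lower can read it. Defaults to FALSE. You can also enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
  - metadata_version: a string like "V5" or the equivalent integer indicating the Arrow IPC MetadataVersion. The default (NULL) uses the latest version, unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 is set, in which case V4 is used.
  - codec: a Codec used to compress the body buffers of written files. The default (NULL) does not compress body buffers.
  - null_fallback: the character to use in place of missing values (NA or NULL) when using Hive-style partitioning. See hive_partition().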
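The three existing_data_behavior settings described above are easiest to compare by writing to the same directory twice. The following is a minimal sketch, not part of the original reference: it assumes the arrow and dplyr packages are installed, and the variable names (target, second_write, two_rows, rows_per_cyl) are chosen only for illustration.

```r
library(arrow)
library(dplyr)

target <- tempfile()

# Initial write: one directory per value of `cyl`.
write_dataset(mtcars, target, partitioning = "cyl")

# "error": the second write is refused because `target` is no longer empty.
second_write <- tryCatch(
  {
    write_dataset(mtcars, target, partitioning = "cyl",
                  existing_data_behavior = "error")
    "succeeded"
  },
  error = function(e) "failed"
)
second_write

# "delete_matching": only the cyl=6 partition receives data, so only that
# partition is cleared and rewritten; cyl=4 and cyl=8 are left alone.
two_rows <- head(mtcars[mtcars$cyl == 6, ], 2)
write_dataset(two_rows, target, partitioning = "cyl",
              existing_data_behavior = "delete_matching")

# Count the rows now stored under each partition value.
rows_per_cyl <- open_dataset(target) |>
  count(cyl) |>
  collect() |>
  arrange(cyl)
rows_per_cyl
```

With the default "overwrite", files from an earlier write that are not overwritten by a new file of the same name stay in place, so "delete_matching" is usually the safer choice when a partition is meant to be fully replaced.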
Examples
# You can write datasets partitioned by the values in a column (here: "cyl").
# This creates a structure of the form cyl=X/part-Z.parquet.
one_level_tree <- tempfile()
write_dataset(mtcars, one_level_tree, partitioning = "cyl")
list.files(one_level_tree, recursive = TRUE)
#> [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"
# You can also partition by the values in multiple columns
# (here: "cyl" and "gear").
# This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
two_levels_tree <- tempfile()
write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
list.files(two_levels_tree, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.parquet" "cyl=4/gear=4/part-0.parquet"
#> [3] "cyl=4/gear=5/part-0.parquet" "cyl=6/gear=3/part-0.parquet"
#> [5] "cyl=6/gear=4/part-0.parquet" "cyl=6/gear=5/part-0.parquet"
#> [7] "cyl=8/gear=3/part-0.parquet" "cyl=8/gear=5/part-0.parquet"
# In the two previous examples we would have:
# X = {4,6,8}, the number of cylinders.
# Y = {3,4,5}, the number of forward gears.
# Z = {0,1,2}, the number of saved parts, starting from 0.
# You can obtain the same result as the previous examples using arrow with
# a dplyr pipeline. This will be the same as two_levels_tree above, but the
# output directory will be different.
library(dplyr)
two_levels_tree_2 <- tempfile()
mtcars |>
  group_by(cyl, gear) |>
  write_dataset(two_levels_tree_2)
list.files(two_levels_tree_2, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.parquet" "cyl=4/gear=4/part-0.parquet"
#> [3] "cyl=4/gear=5/part-0.parquet" "cyl=6/gear=3/part-0.parquet"
#> [5] "cyl=6/gear=4/part-0.parquet" "cyl=6/gear=5/part-0.parquet"
#> [7] "cyl=8/gear=3/part-0.parquet" "cyl=8/gear=5/part-0.parquet"
# And you can also turn off the Hive-style directory naming where the column
# name is included with the values by using `hive_style = FALSE`.
# Write a structure X/Y/part-Z.parquet.
two_levels_tree_no_hive <- tempfile()
mtcars |>
  group_by(cyl, gear) |>
  write_dataset(two_levels_tree_no_hive, hive_style = FALSE)
list.files(two_levels_tree_no_hive, recursive = TRUE)
#> [1] "4/3/part-0.parquet" "4/4/part-0.parquet" "4/5/part-0.parquet"
#> [4] "6/3/part-0.parquet" "6/4/part-0.parquet" "6/5/part-0.parquet"
#> [7] "8/3/part-0.parquet" "8/5/part-0.parquet"
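The examples above all rely on the default file naming and file size. As a further sketch, under the assumption that the arrow and dplyr packages are installed, max_rows_per_file and basename_template can be combined to split an unpartitioned dataset into several small files. The template "chunk-{i}.parquet" and the variable names are hypothetical, chosen only for this illustration.

```r
library(arrow)
library(dplyr)

chunked_tree <- tempfile()

# Cap each file at 10 rows and name the files with a custom template.
# max_rows_per_group is set explicitly here so that a row group is never
# larger than the per-file cap.
write_dataset(
  mtcars,
  chunked_tree,
  max_rows_per_file = 10L,
  max_rows_per_group = 10L,
  basename_template = "chunk-{i}.parquet"
)
chunk_files <- list.files(chunked_tree)
chunk_files

# Reading the directory back as a dataset returns all rows again.
n_rows <- open_dataset(chunked_tree) |>
  collect() |>
  nrow()
n_rows
```

Because "{i}" is replaced by an auto-incremented integer, every file written in one call gets a distinct name.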