3 Reading and Writing Data - Multiple Files

3.1 Introduction

When reading files into R using Apache Arrow, you can read:

  • a single file into memory as a data frame or an Arrow Table
  • a single file that is too large to fit in memory, as an Arrow Dataset
  • multiple and partitioned files as an Arrow Dataset

This chapter contains recipes related to using Apache Arrow to read and write files too large for memory, and multiple or partitioned files, as an Arrow Dataset. There are a number of circumstances in which you may want to read in data as an Arrow Dataset:

  • your single data file is too large to load into memory
  • your data is partitioned among numerous files
  • you want faster performance from your dplyr queries
  • you want to be able to take advantage of Arrow's compute functions

It is possible to read in partitioned data in Parquet, Feather (also known as Arrow IPC), and CSV or other text-delimited formats. If you are choosing a partitioned multi-file format, we recommend Parquet or Feather (Arrow IPC), both of which can have improved performance when compared to CSVs due to their capabilities around metadata and compression.
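
The recipes in this chapter use functions from the arrow package alongside dplyr verbs; they assume both packages have been loaded first:

# Load the packages used throughout this chapter
library(arrow)
library(dplyr)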

3.2 Write data to disk - Parquet

You want to write data to disk as a single Parquet file.

3.2.1 Solution

write_dataset(dataset = airquality, path = "airquality_data")

3.2.2 Discussion

The default format for open_dataset() and write_dataset() is Parquet.
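
As a quick check, the file written in the solution above can be opened again without specifying a format; a minimal sketch:

# No format argument is needed: open_dataset() defaults to Parquet
open_dataset("airquality_data")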

3.3 Write partitioned data - Parquet

You want to save multiple Parquet data files to disk in partitions based on columns in the data.

3.3.1 Solution

write_dataset(airquality, "airquality_partitioned", partitioning = c("Month"))

As you can see below, this creates folders based on the supplied partition variable, Month.

list.files("airquality_partitioned")
## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"

3.3.2 Discussion

The data is written to separate folders based on the values in the Month column. The default behaviour is to use Hive-style (i.e. "col_name=value" folder names) partitions.

# Take a look at the files in this directory
list.files("airquality_partitioned", recursive = TRUE)
## [1] "Month=5/part-0.parquet" "Month=6/part-0.parquet" "Month=7/part-0.parquet"
## [4] "Month=8/part-0.parquet" "Month=9/part-0.parquet"
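
If you prefer folder names without the column-name prefix, write_dataset() also accepts a hive_style argument; a sketch (the "airquality_plain" path is illustrative):

# Use bare values ("5", "6", ...) rather than "Month=5", "Month=6", ... as folder names
write_dataset(airquality, "airquality_plain",
  partitioning = "Month", hive_style = FALSE)

When reading such data back, the partition names are no longer encoded in the folder names, so you would pass them to open_dataset() via its partitioning argument.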

You can specify multiple partitioning variables to add extra levels of partitioning.

write_dataset(airquality, "airquality_partitioned_deeper", partitioning = c("Month", "Day"))
list.files("airquality_partitioned_deeper")
## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"

If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, Day.

# Take a look at the files in this directory
list.files("airquality_partitioned_deeper/Month=5", recursive = TRUE)
##  [1] "Day=1/part-0.parquet"  "Day=10/part-0.parquet" "Day=11/part-0.parquet"
##  [4] "Day=12/part-0.parquet" "Day=13/part-0.parquet" "Day=14/part-0.parquet"
##  [7] "Day=15/part-0.parquet" "Day=16/part-0.parquet" "Day=17/part-0.parquet"
## [10] "Day=18/part-0.parquet" "Day=19/part-0.parquet" "Day=2/part-0.parquet" 
## [13] "Day=20/part-0.parquet" "Day=21/part-0.parquet" "Day=22/part-0.parquet"
## [16] "Day=23/part-0.parquet" "Day=24/part-0.parquet" "Day=25/part-0.parquet"
## [19] "Day=26/part-0.parquet" "Day=27/part-0.parquet" "Day=28/part-0.parquet"
## [22] "Day=29/part-0.parquet" "Day=3/part-0.parquet"  "Day=30/part-0.parquet"
## [25] "Day=31/part-0.parquet" "Day=4/part-0.parquet"  "Day=5/part-0.parquet" 
## [28] "Day=6/part-0.parquet"  "Day=7/part-0.parquet"  "Day=8/part-0.parquet" 
## [31] "Day=9/part-0.parquet"

There are two different ways to specify the variables used for partitioning - either via the partitioning argument shown above, or by using dplyr::group_by() on your data - the grouping variables will form the partitions.

write_dataset(dataset = group_by(airquality, Month, Day),
  path = "airquality_groupby")
# Take a look at the files in this directory
list.files("airquality_groupby", recursive = TRUE)
##   [1] "Month=5/Day=1/part-0.parquet"  "Month=5/Day=10/part-0.parquet"
##   [3] "Month=5/Day=11/part-0.parquet" "Month=5/Day=12/part-0.parquet"
##   [5] "Month=5/Day=13/part-0.parquet" "Month=5/Day=14/part-0.parquet"
##   [7] "Month=5/Day=15/part-0.parquet" "Month=5/Day=16/part-0.parquet"
##   [9] "Month=5/Day=17/part-0.parquet" "Month=5/Day=18/part-0.parquet"
##  [11] "Month=5/Day=19/part-0.parquet" "Month=5/Day=2/part-0.parquet" 
##  [13] "Month=5/Day=20/part-0.parquet" "Month=5/Day=21/part-0.parquet"
##  [15] "Month=5/Day=22/part-0.parquet" "Month=5/Day=23/part-0.parquet"
##  [17] "Month=5/Day=24/part-0.parquet" "Month=5/Day=25/part-0.parquet"
##  [19] "Month=5/Day=26/part-0.parquet" "Month=5/Day=27/part-0.parquet"
##  [21] "Month=5/Day=28/part-0.parquet" "Month=5/Day=29/part-0.parquet"
##  [23] "Month=5/Day=3/part-0.parquet"  "Month=5/Day=30/part-0.parquet"
##  [25] "Month=5/Day=31/part-0.parquet" "Month=5/Day=4/part-0.parquet" 
##  [27] "Month=5/Day=5/part-0.parquet"  "Month=5/Day=6/part-0.parquet" 
##  [29] "Month=5/Day=7/part-0.parquet"  "Month=5/Day=8/part-0.parquet" 
##  [31] "Month=5/Day=9/part-0.parquet"  "Month=6/Day=1/part-0.parquet" 
##  [33] "Month=6/Day=10/part-0.parquet" "Month=6/Day=11/part-0.parquet"
##  [35] "Month=6/Day=12/part-0.parquet" "Month=6/Day=13/part-0.parquet"
##  [37] "Month=6/Day=14/part-0.parquet" "Month=6/Day=15/part-0.parquet"
##  [39] "Month=6/Day=16/part-0.parquet" "Month=6/Day=17/part-0.parquet"
##  [41] "Month=6/Day=18/part-0.parquet" "Month=6/Day=19/part-0.parquet"
##  [43] "Month=6/Day=2/part-0.parquet"  "Month=6/Day=20/part-0.parquet"
##  [45] "Month=6/Day=21/part-0.parquet" "Month=6/Day=22/part-0.parquet"
##  [47] "Month=6/Day=23/part-0.parquet" "Month=6/Day=24/part-0.parquet"
##  [49] "Month=6/Day=25/part-0.parquet" "Month=6/Day=26/part-0.parquet"
##  [51] "Month=6/Day=27/part-0.parquet" "Month=6/Day=28/part-0.parquet"
##  [53] "Month=6/Day=29/part-0.parquet" "Month=6/Day=3/part-0.parquet" 
##  [55] "Month=6/Day=30/part-0.parquet" "Month=6/Day=4/part-0.parquet" 
##  [57] "Month=6/Day=5/part-0.parquet"  "Month=6/Day=6/part-0.parquet" 
##  [59] "Month=6/Day=7/part-0.parquet"  "Month=6/Day=8/part-0.parquet" 
##  [61] "Month=6/Day=9/part-0.parquet"  "Month=7/Day=1/part-0.parquet" 
##  [63] "Month=7/Day=10/part-0.parquet" "Month=7/Day=11/part-0.parquet"
##  [65] "Month=7/Day=12/part-0.parquet" "Month=7/Day=13/part-0.parquet"
##  [67] "Month=7/Day=14/part-0.parquet" "Month=7/Day=15/part-0.parquet"
##  [69] "Month=7/Day=16/part-0.parquet" "Month=7/Day=17/part-0.parquet"
##  [71] "Month=7/Day=18/part-0.parquet" "Month=7/Day=19/part-0.parquet"
##  [73] "Month=7/Day=2/part-0.parquet"  "Month=7/Day=20/part-0.parquet"
##  [75] "Month=7/Day=21/part-0.parquet" "Month=7/Day=22/part-0.parquet"
##  [77] "Month=7/Day=23/part-0.parquet" "Month=7/Day=24/part-0.parquet"
##  [79] "Month=7/Day=25/part-0.parquet" "Month=7/Day=26/part-0.parquet"
##  [81] "Month=7/Day=27/part-0.parquet" "Month=7/Day=28/part-0.parquet"
##  [83] "Month=7/Day=29/part-0.parquet" "Month=7/Day=3/part-0.parquet" 
##  [85] "Month=7/Day=30/part-0.parquet" "Month=7/Day=31/part-0.parquet"
##  [87] "Month=7/Day=4/part-0.parquet"  "Month=7/Day=5/part-0.parquet" 
##  [89] "Month=7/Day=6/part-0.parquet"  "Month=7/Day=7/part-0.parquet" 
##  [91] "Month=7/Day=8/part-0.parquet"  "Month=7/Day=9/part-0.parquet" 
##  [93] "Month=8/Day=1/part-0.parquet"  "Month=8/Day=10/part-0.parquet"
##  [95] "Month=8/Day=11/part-0.parquet" "Month=8/Day=12/part-0.parquet"
##  [97] "Month=8/Day=13/part-0.parquet" "Month=8/Day=14/part-0.parquet"
##  [99] "Month=8/Day=15/part-0.parquet" "Month=8/Day=16/part-0.parquet"
## [101] "Month=8/Day=17/part-0.parquet" "Month=8/Day=18/part-0.parquet"
## [103] "Month=8/Day=19/part-0.parquet" "Month=8/Day=2/part-0.parquet" 
## [105] "Month=8/Day=20/part-0.parquet" "Month=8/Day=21/part-0.parquet"
## [107] "Month=8/Day=22/part-0.parquet" "Month=8/Day=23/part-0.parquet"
## [109] "Month=8/Day=24/part-0.parquet" "Month=8/Day=25/part-0.parquet"
## [111] "Month=8/Day=26/part-0.parquet" "Month=8/Day=27/part-0.parquet"
## [113] "Month=8/Day=28/part-0.parquet" "Month=8/Day=29/part-0.parquet"
## [115] "Month=8/Day=3/part-0.parquet"  "Month=8/Day=30/part-0.parquet"
## [117] "Month=8/Day=31/part-0.parquet" "Month=8/Day=4/part-0.parquet" 
## [119] "Month=8/Day=5/part-0.parquet"  "Month=8/Day=6/part-0.parquet" 
## [121] "Month=8/Day=7/part-0.parquet"  "Month=8/Day=8/part-0.parquet" 
## [123] "Month=8/Day=9/part-0.parquet"  "Month=9/Day=1/part-0.parquet" 
## [125] "Month=9/Day=10/part-0.parquet" "Month=9/Day=11/part-0.parquet"
## [127] "Month=9/Day=12/part-0.parquet" "Month=9/Day=13/part-0.parquet"
## [129] "Month=9/Day=14/part-0.parquet" "Month=9/Day=15/part-0.parquet"
## [131] "Month=9/Day=16/part-0.parquet" "Month=9/Day=17/part-0.parquet"
## [133] "Month=9/Day=18/part-0.parquet" "Month=9/Day=19/part-0.parquet"
## [135] "Month=9/Day=2/part-0.parquet"  "Month=9/Day=20/part-0.parquet"
## [137] "Month=9/Day=21/part-0.parquet" "Month=9/Day=22/part-0.parquet"
## [139] "Month=9/Day=23/part-0.parquet" "Month=9/Day=24/part-0.parquet"
## [141] "Month=9/Day=25/part-0.parquet" "Month=9/Day=26/part-0.parquet"
## [143] "Month=9/Day=27/part-0.parquet" "Month=9/Day=28/part-0.parquet"
## [145] "Month=9/Day=29/part-0.parquet" "Month=9/Day=3/part-0.parquet" 
## [147] "Month=9/Day=30/part-0.parquet" "Month=9/Day=4/part-0.parquet" 
## [149] "Month=9/Day=5/part-0.parquet"  "Month=9/Day=6/part-0.parquet" 
## [151] "Month=9/Day=7/part-0.parquet"  "Month=9/Day=8/part-0.parquet" 
## [153] "Month=9/Day=9/part-0.parquet"

Each of these folders contains one or more Parquet files containing the relevant partition of the data.

list.files("airquality_groupby/Month=5/Day=10")
## [1] "part-0.parquet"

Note that when there is an NA value in the partition column, these values are written to the col_name=__HIVE_DEFAULT_PARTITION__ directory.
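
A minimal sketch of this behaviour (the airquality_na object and "airquality_with_na" path are illustrative):

# Introduce an NA into the partitioning column, then write as before
airquality_na <- airquality
airquality_na$Month[1] <- NA
write_dataset(airquality_na, "airquality_with_na", partitioning = "Month")

# Expect a "Month=__HIVE_DEFAULT_PARTITION__" folder alongside "Month=5" etc.
list.files("airquality_with_na")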

3.4 Reading partitioned data

You want to read partitioned data files as an Arrow Dataset.

3.4.1 Solution

# Read data from directory
air_data <- open_dataset("airquality_partitioned_deeper")

# View data
air_data
## FileSystemDataset with 153 Parquet files
## Ozone: int32
## Solar.R: int32
## Wind: double
## Temp: int32
## Month: int32
## Day: int32
## 
## See $metadata for additional Schema metadata

3.4.2 Discussion

Partitioning allows you to split data across multiple files and folders, avoiding problems associated with storing all your data in a single file. This can provide further advantages when using Arrow, as Arrow will only read in the necessary partitioned files needed for any given analysis.
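
For example, a dplyr query that filters on a partition variable only needs to read the matching files; a sketch using the air_data dataset opened above:

# Only the files under the Month=5 folders are read to answer this query
air_data %>%
  filter(Month == 5) %>%
  collect()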

3.5 Write data to disk - Feather/Arrow IPC format

You want to write data to disk as a single Feather/Arrow IPC file.

3.5.1 Solution

write_dataset(dataset = airquality,
  path = "airquality_data_feather",
  format = "feather")

3.6 Read in Feather/Arrow IPC data as an Arrow Dataset

You want to read in Feather/Arrow IPC data as an Arrow Dataset.

3.6.1 Solution

# write Arrow file to use in this example
write_dataset(dataset = airquality,
  path = "airquality_data_arrow",
  format = "arrow")

# read into R
open_dataset("airquality_data_arrow", format = "arrow")
## FileSystemDataset with 1 Feather file
## Ozone: int32
## Solar.R: int32
## Wind: double
## Temp: int32
## Month: int32
## Day: int32
## 
## See $metadata for additional Schema metadata

3.7 Write data to disk - CSV format

You want to write data to disk as a single CSV file.

3.7.1 Solution

write_dataset(dataset = airquality,
  path = "airquality_data_csv",
  format = "csv")

3.8 Read in CSV data as an Arrow Dataset

You want to read in CSV data as an Arrow Dataset.

3.8.1 Solution

# write CSV file to use in this example
write_dataset(dataset = airquality,
  path = "airquality_data_csv",
  format = "csv")

# read into R
open_dataset("airquality_data_csv", format = "csv")
## FileSystemDataset with 1 csv file
## Ozone: int64
## Solar.R: int64
## Wind: double
## Temp: int64
## Month: int64
## Day: int64

3.9 Read in a CSV dataset (no headers)

You want to read in a dataset containing CSVs with no headers.

3.9.1 Solution

# write CSV file to use in this example
dataset_1 <- airquality[1:40, c("Month", "Day", "Temp")]
dataset_2 <- airquality[41:80, c("Month", "Day", "Temp")]

dir.create("airquality")
write.table(dataset_1, "airquality/part-1.csv", sep = ",", row.names = FALSE, col.names = FALSE)
write.table(dataset_2, "airquality/part-2.csv", sep = ",", row.names = FALSE, col.names = FALSE)

# read into R
open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))
## FileSystemDataset with 2 csv files
## Month: int64
## Day: int64
## Temp: int64

3.9.2 Discussion

If your dataset is made up of headerless CSV files, you must supply the name of each column. You can do this in multiple ways - either via the column_names parameter (as shown above) or via a schema.

open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
## FileSystemDataset with 2 csv files
## Month: int32
## Day: int32
## Temp: int32

One additional advantage of using a schema is that you also have control over the data types of the columns. If you provide both column names and a schema, the values in column_names must match the schema field names.
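
As a sketch of controlling the data types, the schema below reads Temp as a double rather than an integer:

# Temp is now parsed as float64 instead of an integer type
open_dataset("airquality", format = "csv",
  schema = schema("Month" = int32(), "Day" = int32(), "Temp" = float64()))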

3.10 Write compressed partitioned data

You want to save partitioned files, compressed with a specified compression algorithm.

3.10.1 Solution

# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write dataset to file
write_dataset(iris, path = td, compression = "gzip")
# View files in the directory
list.files(td, recursive = TRUE)
## [1] "part-0.parquet"

3.10.2 Discussion

You can supply the compression argument to write_dataset() as long as the compression algorithm is compatible with the chosen format. See ?write_dataset for more information on supported compression algorithms and default settings.
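
For instance, Parquet also supports other codecs; a sketch assuming your build of arrow includes Zstandard support (the "iris_zstd" path is illustrative):

# Write the same data compressed with zstd instead of gzip
write_dataset(iris, path = "iris_zstd", compression = "zstd")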

3.11 Read compressed data

You want to read in data which has been compressed.

3.11.1 Solution

# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write dataset to file
write_dataset(iris, path = td, compression = "gzip")

# Read in data
ds <- open_dataset(td) %>%
  collect()

ds
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

3.11.2 Discussion

Note that Arrow automatically detects the compression, and you do not have to supply it in the call to open_dataset() or the read_*() functions.
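
The same applies to the single-file readers; a sketch reading one of the compressed files written above:

# read_parquet() also detects the gzip compression automatically
read_parquet(file.path(td, "part-0.parquet"))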