3 Reading and Writing Data - Multiple Files
3.1 Introduction
When reading files into R using Apache Arrow, you can read:
- a single file into memory as a data frame or an Arrow Table
- a single file that is too large to fit in memory, as an Arrow Dataset
- multiple and partitioned files, as an Arrow Dataset
This chapter contains recipes related to using Apache Arrow to read and write files too large for memory, and multiple or partitioned files, as Arrow Datasets. There are a number of circumstances in which you may want to read in the data as an Arrow Dataset:
- your single data file is too large to load into memory
- your data is partitioned among numerous files
- you want faster performance from your dplyr queries
- you want to be able to take advantage of Arrow's compute functions
It is possible to read in partitioned data in Parquet, Feather (also known as Arrow IPC), and CSV or other text-delimited formats. If you are choosing a partitioned multiple file format, we recommend Parquet or Feather (Arrow IPC), both of which can have improved performance compared to CSVs due to their capabilities around metadata and compression.
3.3 Write partitioned data - Parquet
You want to save multiple Parquet data files to disk in partitions based on columns in the data.
3.3.1 Solution
write_dataset(airquality, "airquality_partitioned", partitioning = c("Month"))
As you can see, this has created folders based on the supplied partition variable Month.
list.files("airquality_partitioned")
## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"
3.3.2 Discussion
The data is written to separate folders based on the values in the Month column. The default behaviour is to use Hive-style (i.e. "col_name=value" folder names) partitions.
# Take a look at the files in this directory
list.files("airquality_partitioned", recursive = TRUE)
## [1] "Month=5/part-0.parquet" "Month=6/part-0.parquet" "Month=7/part-0.parquet"
## [4] "Month=8/part-0.parquet" "Month=9/part-0.parquet"
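If you prefer plain value folder names instead of the Hive-style default, write_dataset() also accepts a hive_style argument. A minimal sketch (assuming the hive_style option is available in your version of the arrow package):

```r
library(arrow)

# Write the same partitions, but without Hive-style "col_name=value"
# folder names; folders are named after the bare values instead
write_dataset(airquality, "airquality_plain",
              partitioning = "Month",
              hive_style = FALSE)

list.files("airquality_plain")
```

Note that when reading such a dataset back in, Arrow cannot recover the partition column names from the folder names alone, so you would need to describe the partitioning yourself.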
You can specify multiple partitioning variables to add extra levels of partitioning.
write_dataset(airquality, "airquality_partitioned_deeper", partitioning = c("Month", "Day"))
list.files("airquality_partitioned_deeper")
## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"
If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, Day.
# Take a look at the files in this directory
list.files("airquality_partitioned_deeper/Month=5", recursive = TRUE)
## [1] "Day=1/part-0.parquet" "Day=10/part-0.parquet" "Day=11/part-0.parquet"
## [4] "Day=12/part-0.parquet" "Day=13/part-0.parquet" "Day=14/part-0.parquet"
## [7] "Day=15/part-0.parquet" "Day=16/part-0.parquet" "Day=17/part-0.parquet"
## [10] "Day=18/part-0.parquet" "Day=19/part-0.parquet" "Day=2/part-0.parquet"
## [13] "Day=20/part-0.parquet" "Day=21/part-0.parquet" "Day=22/part-0.parquet"
## [16] "Day=23/part-0.parquet" "Day=24/part-0.parquet" "Day=25/part-0.parquet"
## [19] "Day=26/part-0.parquet" "Day=27/part-0.parquet" "Day=28/part-0.parquet"
## [22] "Day=29/part-0.parquet" "Day=3/part-0.parquet" "Day=30/part-0.parquet"
## [25] "Day=31/part-0.parquet" "Day=4/part-0.parquet" "Day=5/part-0.parquet"
## [28] "Day=6/part-0.parquet" "Day=7/part-0.parquet" "Day=8/part-0.parquet"
## [31] "Day=9/part-0.parquet"
There are two different ways to specify the variables used for partitioning - either via the partitioning argument as shown above, or by using dplyr::group_by() on your data - the group variables will form the partitions.
write_dataset(dataset = group_by(airquality, Month, Day),
              path = "airquality_groupby")
# Take a look at the files in this directory
list.files("airquality_groupby", recursive = TRUE)
## [1] "Month=5/Day=1/part-0.parquet" "Month=5/Day=10/part-0.parquet"
## [3] "Month=5/Day=11/part-0.parquet" "Month=5/Day=12/part-0.parquet"
## [5] "Month=5/Day=13/part-0.parquet" "Month=5/Day=14/part-0.parquet"
## [7] "Month=5/Day=15/part-0.parquet" "Month=5/Day=16/part-0.parquet"
## [9] "Month=5/Day=17/part-0.parquet" "Month=5/Day=18/part-0.parquet"
## [11] "Month=5/Day=19/part-0.parquet" "Month=5/Day=2/part-0.parquet"
## [13] "Month=5/Day=20/part-0.parquet" "Month=5/Day=21/part-0.parquet"
## [15] "Month=5/Day=22/part-0.parquet" "Month=5/Day=23/part-0.parquet"
## [17] "Month=5/Day=24/part-0.parquet" "Month=5/Day=25/part-0.parquet"
## [19] "Month=5/Day=26/part-0.parquet" "Month=5/Day=27/part-0.parquet"
## [21] "Month=5/Day=28/part-0.parquet" "Month=5/Day=29/part-0.parquet"
## [23] "Month=5/Day=3/part-0.parquet" "Month=5/Day=30/part-0.parquet"
## [25] "Month=5/Day=31/part-0.parquet" "Month=5/Day=4/part-0.parquet"
## [27] "Month=5/Day=5/part-0.parquet" "Month=5/Day=6/part-0.parquet"
## [29] "Month=5/Day=7/part-0.parquet" "Month=5/Day=8/part-0.parquet"
## [31] "Month=5/Day=9/part-0.parquet" "Month=6/Day=1/part-0.parquet"
## [33] "Month=6/Day=10/part-0.parquet" "Month=6/Day=11/part-0.parquet"
## [35] "Month=6/Day=12/part-0.parquet" "Month=6/Day=13/part-0.parquet"
## [37] "Month=6/Day=14/part-0.parquet" "Month=6/Day=15/part-0.parquet"
## [39] "Month=6/Day=16/part-0.parquet" "Month=6/Day=17/part-0.parquet"
## [41] "Month=6/Day=18/part-0.parquet" "Month=6/Day=19/part-0.parquet"
## [43] "Month=6/Day=2/part-0.parquet" "Month=6/Day=20/part-0.parquet"
## [45] "Month=6/Day=21/part-0.parquet" "Month=6/Day=22/part-0.parquet"
## [47] "Month=6/Day=23/part-0.parquet" "Month=6/Day=24/part-0.parquet"
## [49] "Month=6/Day=25/part-0.parquet" "Month=6/Day=26/part-0.parquet"
## [51] "Month=6/Day=27/part-0.parquet" "Month=6/Day=28/part-0.parquet"
## [53] "Month=6/Day=29/part-0.parquet" "Month=6/Day=3/part-0.parquet"
## [55] "Month=6/Day=30/part-0.parquet" "Month=6/Day=4/part-0.parquet"
## [57] "Month=6/Day=5/part-0.parquet" "Month=6/Day=6/part-0.parquet"
## [59] "Month=6/Day=7/part-0.parquet" "Month=6/Day=8/part-0.parquet"
## [61] "Month=6/Day=9/part-0.parquet" "Month=7/Day=1/part-0.parquet"
## [63] "Month=7/Day=10/part-0.parquet" "Month=7/Day=11/part-0.parquet"
## [65] "Month=7/Day=12/part-0.parquet" "Month=7/Day=13/part-0.parquet"
## [67] "Month=7/Day=14/part-0.parquet" "Month=7/Day=15/part-0.parquet"
## [69] "Month=7/Day=16/part-0.parquet" "Month=7/Day=17/part-0.parquet"
## [71] "Month=7/Day=18/part-0.parquet" "Month=7/Day=19/part-0.parquet"
## [73] "Month=7/Day=2/part-0.parquet" "Month=7/Day=20/part-0.parquet"
## [75] "Month=7/Day=21/part-0.parquet" "Month=7/Day=22/part-0.parquet"
## [77] "Month=7/Day=23/part-0.parquet" "Month=7/Day=24/part-0.parquet"
## [79] "Month=7/Day=25/part-0.parquet" "Month=7/Day=26/part-0.parquet"
## [81] "Month=7/Day=27/part-0.parquet" "Month=7/Day=28/part-0.parquet"
## [83] "Month=7/Day=29/part-0.parquet" "Month=7/Day=3/part-0.parquet"
## [85] "Month=7/Day=30/part-0.parquet" "Month=7/Day=31/part-0.parquet"
## [87] "Month=7/Day=4/part-0.parquet" "Month=7/Day=5/part-0.parquet"
## [89] "Month=7/Day=6/part-0.parquet" "Month=7/Day=7/part-0.parquet"
## [91] "Month=7/Day=8/part-0.parquet" "Month=7/Day=9/part-0.parquet"
## [93] "Month=8/Day=1/part-0.parquet" "Month=8/Day=10/part-0.parquet"
## [95] "Month=8/Day=11/part-0.parquet" "Month=8/Day=12/part-0.parquet"
## [97] "Month=8/Day=13/part-0.parquet" "Month=8/Day=14/part-0.parquet"
## [99] "Month=8/Day=15/part-0.parquet" "Month=8/Day=16/part-0.parquet"
## [101] "Month=8/Day=17/part-0.parquet" "Month=8/Day=18/part-0.parquet"
## [103] "Month=8/Day=19/part-0.parquet" "Month=8/Day=2/part-0.parquet"
## [105] "Month=8/Day=20/part-0.parquet" "Month=8/Day=21/part-0.parquet"
## [107] "Month=8/Day=22/part-0.parquet" "Month=8/Day=23/part-0.parquet"
## [109] "Month=8/Day=24/part-0.parquet" "Month=8/Day=25/part-0.parquet"
## [111] "Month=8/Day=26/part-0.parquet" "Month=8/Day=27/part-0.parquet"
## [113] "Month=8/Day=28/part-0.parquet" "Month=8/Day=29/part-0.parquet"
## [115] "Month=8/Day=3/part-0.parquet" "Month=8/Day=30/part-0.parquet"
## [117] "Month=8/Day=31/part-0.parquet" "Month=8/Day=4/part-0.parquet"
## [119] "Month=8/Day=5/part-0.parquet" "Month=8/Day=6/part-0.parquet"
## [121] "Month=8/Day=7/part-0.parquet" "Month=8/Day=8/part-0.parquet"
## [123] "Month=8/Day=9/part-0.parquet" "Month=9/Day=1/part-0.parquet"
## [125] "Month=9/Day=10/part-0.parquet" "Month=9/Day=11/part-0.parquet"
## [127] "Month=9/Day=12/part-0.parquet" "Month=9/Day=13/part-0.parquet"
## [129] "Month=9/Day=14/part-0.parquet" "Month=9/Day=15/part-0.parquet"
## [131] "Month=9/Day=16/part-0.parquet" "Month=9/Day=17/part-0.parquet"
## [133] "Month=9/Day=18/part-0.parquet" "Month=9/Day=19/part-0.parquet"
## [135] "Month=9/Day=2/part-0.parquet" "Month=9/Day=20/part-0.parquet"
## [137] "Month=9/Day=21/part-0.parquet" "Month=9/Day=22/part-0.parquet"
## [139] "Month=9/Day=23/part-0.parquet" "Month=9/Day=24/part-0.parquet"
## [141] "Month=9/Day=25/part-0.parquet" "Month=9/Day=26/part-0.parquet"
## [143] "Month=9/Day=27/part-0.parquet" "Month=9/Day=28/part-0.parquet"
## [145] "Month=9/Day=29/part-0.parquet" "Month=9/Day=3/part-0.parquet"
## [147] "Month=9/Day=30/part-0.parquet" "Month=9/Day=4/part-0.parquet"
## [149] "Month=9/Day=5/part-0.parquet" "Month=9/Day=6/part-0.parquet"
## [151] "Month=9/Day=7/part-0.parquet" "Month=9/Day=8/part-0.parquet"
## [153] "Month=9/Day=9/part-0.parquet"
Each of these folders contains 1 or more Parquet files containing the relevant partition of the data.
list.files("airquality_groupby/Month=5/Day=10")
## [1] "part-0.parquet"
Note that when there is an NA value in the column you are partitioning by, these values are written to the col_name=__HIVE_DEFAULT_PARTITION__ directory.
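To see this behaviour, here is a minimal sketch that introduces an NA into a hypothetical copy of airquality before partitioning on Month:

```r
library(arrow)

# Hypothetical copy of airquality with a missing value in the
# partition column, to show where those rows end up
airquality_with_na <- airquality
airquality_with_na$Month[1] <- NA

write_dataset(airquality_with_na, "airquality_partitioned_na",
              partitioning = "Month")

# The rows with a missing Month land in the Hive default partition folder
list.files("airquality_partitioned_na")
```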
3.4 Reading partitioned data
You want to read partitioned data files as an Arrow Dataset.
3.4.1 Solution
# Read data from directory
air_data <- open_dataset("airquality_partitioned_deeper")

# View data
air_data
## FileSystemDataset with 153 Parquet files
## Ozone: int32
## Solar.R: int32
## Wind: double
## Temp: int32
## Month: int32
## Day: int32
##
## See $metadata for additional Schema metadata
3.6 Read in Feather/Arrow IPC data as an Arrow Dataset
You want to read in Feather/Arrow IPC data as an Arrow Dataset.
3.6.1 Solution
# write Arrow file to use in this example
write_dataset(dataset = airquality,
path = "airquality_data_arrow",
format = "arrow")
# read into R
open_dataset("airquality_data_arrow", format = "arrow")
## FileSystemDataset with 1 Feather file
## Ozone: int32
## Solar.R: int32
## Wind: double
## Temp: int32
## Month: int32
## Day: int32
##
## See $metadata for additional Schema metadata
3.8 Read in CSV data as an Arrow Dataset
You want to read in CSV data as an Arrow Dataset.
3.8.1 Solution
# write CSV file to use in this example
write_dataset(dataset = airquality,
path = "airquality_data_csv",
format = "csv")
# read into R
open_dataset("airquality_data_csv", format = "csv")
## FileSystemDataset with 1 csv file
## Ozone: int64
## Solar.R: int64
## Wind: double
## Temp: int64
## Month: int64
## Day: int64
3.9 Read in a CSV dataset (no headers)
You want to read in a dataset containing CSVs with no headers.
3.9.1 Solution
# write CSV files to use in this example
dataset_1 <- airquality[1:40, c("Month", "Day", "Temp")]
dataset_2 <- airquality[41:80, c("Month", "Day", "Temp")]
dir.create("airquality")
write.table(dataset_1, "airquality/part-1.csv", sep = ",", row.names = FALSE, col.names = FALSE)
write.table(dataset_2, "airquality/part-2.csv", sep = ",", row.names = FALSE, col.names = FALSE)
# read into R
open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))
## FileSystemDataset with 2 csv files
## Month: int64
## Day: int64
## Temp: int64
3.9.2 Discussion
If your dataset is composed of headerless CSV files, you must supply the names of each of the columns. You can do this in multiple ways - either via the column_names parameter (as shown above) or via a schema.
open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
## FileSystemDataset with 2 csv files
## Month: int32
## Day: int32
## Temp: int32
An additional advantage of using a schema is that you also have control over the data types of the columns. If you provide both column names and a schema, the values in column_names must match the schema field names.
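As a sketch of supplying both together (assuming, as the text implies, that open_dataset() accepts both arguments at once; the headerless CSV is recreated here so the example is self-contained):

```r
library(arrow)

# Recreate a small headerless CSV, as in the recipe above
dir.create("airquality_headerless", showWarnings = FALSE)
write.table(airquality[1:40, c("Month", "Day", "Temp")],
            "airquality_headerless/part-1.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)

# Supply both column_names and a schema; the names must match the
# schema's field names
ds <- open_dataset(
  "airquality_headerless",
  format = "csv",
  column_names = c("Month", "Day", "Temp"),
  schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32())
)
ds
```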
3.10 Write compressed partitioned data
You want to save partitioned files, compressed with a specified compression algorithm.
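No solution code appears for this recipe in this excerpt. Mirroring the read example in the next recipe, a minimal sketch (using gzip as the specified algorithm, and partitioning iris by Species for illustration) might look like:

```r
library(arrow)

# Create a temporary directory to hold the partitioned files
td <- tempfile()
dir.create(td)

# Write a partitioned dataset, compressed with gzip
write_dataset(iris,
              path = td,
              partitioning = "Species",
              compression = "gzip")

# One partition folder per Species value, each holding a Parquet file
list.files(td, recursive = TRUE)
```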
3.11 Read compressed data
You want to read in data which has been compressed.
3.11.1 Solution
# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write dataset to file
write_dataset(iris, path = td, compression = "gzip")

# Read in data
ds <- open_dataset(td) %>%
  collect()

ds
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows