Package index • Arrow R Package

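Read datasets

Open multi-file datasets as Arrow Dataset objects.

open_dataset(): Open a multi-file dataset

open_delim_dataset() open_csv_dataset() open_tsv_dataset(): Open a multi-file dataset of CSV or other delimiter-separated files

csv_read_options(): CSV reading options

csv_parse_options(): CSV parsing options

csv_convert_options(): CSV conversion options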

Write datasets

Write multi-file datasets to disk.

write_dataset(): Write a dataset

write_delim_dataset() write_csv_dataset() write_tsv_dataset(): Write a dataset into partitioned flat files

csv_write_options(): CSV writing options
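
As a rough illustration of how the dataset writers and readers listed above fit together, the sketch below writes a built-in data frame as a partitioned Parquet dataset and reopens it lazily. The directory name `mtcars_ds` and the choice of `cyl` as the partition column are assumptions made for this example only.

```r
library(arrow)

# Hypothetical scratch directory used only for this sketch
ds_dir <- file.path(tempdir(), "mtcars_ds")

# Write one subdirectory per distinct value of `cyl` (Hive-style layout)
write_dataset(mtcars, ds_dir, format = "parquet", partitioning = "cyl")

# Re-open the directory as a Dataset; the files are not read into memory here
ds <- open_dataset(ds_dir)

# The partition column is reconstructed from the directory names
n_rows <- nrow(ds)
col_names <- names(ds)
```

Partitioning on a low-cardinality column keeps the number of files small; a high-cardinality column would produce many tiny files.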

Read files

Read files in a variety of formats as tibbles or Arrow Tables.

read_delim_arrow() read_csv_arrow() read_csv2_arrow() read_tsv_arrow(): Read a CSV or other delimited file with Arrow

read_parquet(): Read a Parquet file

read_feather() read_ipc_file(): Read a Feather file (an Arrow IPC file)

read_ipc_stream(): Read the Arrow IPC stream format

read_json_arrow(): Read a JSON file

Write files

Write files in a variety of formats.

write_csv_arrow(): Write a CSV file to disk

write_parquet(): Write a Parquet file to disk

write_feather() write_ipc_file(): Write a Feather file (an Arrow IPC file)

write_ipc_stream(): Write the Arrow IPC stream format

write_to_raw(): Write Arrow data to a raw vector
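
A minimal round-trip sketch using some of the file readers and writers listed above. The data frame and the temporary file path are made up for the example; `as_data_frame = FALSE` is used to get an Arrow Table back instead of a data frame.

```r
library(arrow)

# Small illustrative data frame (not from the package documentation)
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Parquet: write to a temporary file, then read it back two ways
pq_path <- tempfile(fileext = ".parquet")
write_parquet(df, pq_path)
as_df <- read_parquet(pq_path)                          # data frame / tibble
as_tbl <- read_parquet(pq_path, as_data_frame = FALSE)  # Arrow Table

# IPC stream: serialize to an in-memory raw vector and read it back
raw_bytes <- write_to_raw(df, format = "stream")
from_raw <- read_ipc_stream(raw_bytes)
```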

Create Arrow data containers

Classes and functions for creating Arrow data containers.

scalar(): Create an Arrow Scalar

arrow_array(): Create an Arrow Array

chunked_array(): Create a Chunked Array

record_batch(): Create a RecordBatch

arrow_table(): Create an Arrow Table

buffer(): Create a Buffer

vctrs_extension_array() vctrs_extension_type(): Extension type for generic typed vectors
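
The constructors above can be sketched as follows. All values are arbitrary and chosen only to show the shape of each container.

```r
library(arrow)

# A single value
s <- scalar(42L)

# A contiguous Array; NA becomes an Arrow null
a <- arrow_array(c(1.5, 2.5, NA))

# A ChunkedArray built from two separate chunks
ca <- chunked_array(1:3, 4:6)

# Tabular containers: a RecordBatch holds one Array per column,
# a Table holds one ChunkedArray per column
rb <- record_batch(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(x = 1:3, y = c("a", "b", "c"))
```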

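Work with Arrow data containers

Functions for converting R objects to Arrow data containers and for combining Arrow data containers.

as_arrow_array(): Convert an object to an Arrow Array

as_chunked_array(): Convert an object to an Arrow ChunkedArray

as_record_batch(): Convert an object to an Arrow RecordBatch

as_arrow_table(): Convert an object to an Arrow Table

concat_arrays() c(<Array>): Concatenate zero or more Arrays

concat_tables(): Concatenate one or more Tables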
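
A short sketch of converting R objects and then combining the results, using the functions listed above with arbitrary example data.

```r
library(arrow)

# Convert R vectors to Arrays, then concatenate into one contiguous Array
a1 <- as_arrow_array(1:3)
a2 <- as_arrow_array(4:6)
combined <- concat_arrays(a1, a2)

# Convert data frames to Tables with the same schema, then stack them
t1 <- as_arrow_table(data.frame(x = 1:2))
t2 <- as_arrow_table(data.frame(x = 3:4))
both <- concat_tables(t1, t2)
```

`concat_arrays()` copies the data into a new Array; when a copy is not wanted, a ChunkedArray is the usual alternative.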

Arrow data types

int8() int16() int32() int64() uint8() uint16() uint32() uint64() float16() halffloat() float32() float() float64() boolean() bool() utf8() large_utf8() binary() large_binary() fixed_size_binary() string() date32() date64() time32() time64() duration() null() timestamp() decimal() decimal128() decimal256() struct() list_of() large_list_of() fixed_size_list_of() map_of(): Create Arrow data types

dictionary(): Create a dictionary type

new_extension_type() new_extension_array() register_extension_type() reregister_extension_type() unregister_extension_type(): Extension types

vctrs_extension_array() vctrs_extension_type(): Extension type for generic typed vectors

as_data_type(): Convert an object to an Arrow DataType

infer_type() type(): Infer the Arrow Array type from an R object
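
The type constructors above return DataType objects that can be passed wherever a type is expected. The following sketch builds a few types and checks what `infer_type()` reports for plain R vectors; the specific types chosen are only examples.

```r
library(arrow)

# Parameterized types
ts_type <- timestamp(unit = "ms", timezone = "UTC")
dict_type <- dictionary(index_type = int8(), value_type = utf8())

# Types that Arrow would pick for plain R vectors
dbl_type <- infer_type(c(1.5, 2.5))  # R doubles
int_type <- infer_type(1:3)          # R integers

# Override the default by requesting a specific type at creation time
wide <- arrow_array(1:3, type = int64())
```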

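Fields and schemas

field(): Create a Field

schema(): Create a schema or extract one from an object

unify_schemas(): Combine and harmonize schemas

as_schema(): Convert an object to an Arrow Schema

infer_schema(): Extract a schema from an object

read_schema(): Read a Schema from a stream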
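
A minimal sketch of building schemas and merging them with the functions above. The field names (`id`, `name`, `score`) are invented for the example.

```r
library(arrow)

# A schema from name = type pairs
sch_a <- schema(id = int32(), name = utf8())

# A schema from explicit Field objects
sch_b <- schema(field("id", int32()), field("score", float64()))

# Merge: the shared field `id` appears once, the others are added
merged <- unify_schemas(sch_a, sch_b)
merged_names <- names(merged)

# Apply a schema when creating a Table
tb <- arrow_table(id = 1:2, name = c("x", "y"), schema = sch_a)
```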

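Computation

Functions for computing values on Arrow data objects.

acero arrow-functions arrow-verbs arrow-dplyr: Functions available in Arrow dplyr queries

call_function(): Call an Arrow compute function

match_arrow() is_in(): Value matching for Arrow objects

value_counts(): table() for Arrow objects

list_compute_functions(): List available Arrow C++ compute functions

register_scalar_function(): Register user-defined functions

show_exec_plan(): Show the details of an Arrow execution plan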
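
The compute helpers above can be called directly on Arrow objects, outside of a dplyr pipeline. A small sketch with made-up values, assuming the `"sum"` kernel's default of skipping nulls:

```r
library(arrow)

a <- arrow_array(c(1, 2, 3, NA))

# Call a C++ compute kernel by name; the result is an Arrow Scalar
total <- call_function("sum", a)
total_r <- as.vector(total)

# Value matching, analogous to %in% in base R
hits <- is_in(arrow_array(c("a", "z")), c("a", "b"))
hits_r <- as.vector(hits)

# The kernel name used above should appear in the registry listing
has_sum <- "sum" %in% list_compute_functions()
```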

DuckDB

Pass data to and from DuckDB.

to_arrow(): Create an Arrow object from a DuckDB connection

to_duckdb(): Create a (virtual) DuckDB table from an Arrow object

File systems

Functions for working with files on S3 and GCS.

s3_bucket(): Connect to an AWS S3 bucket

gs_bucket(): Connect to a Google Cloud Storage (GCS) bucket

copy_files(): Copy files between FileSystems

Flight

load_flight_server(): Load a Python Flight server

flight_connect(): Connect to a Flight server

flight_disconnect(): Explicitly close a Flight client

flight_get(): Get data from a Flight server

flight_put(): Send data to a Flight server

list_flights() flight_path_exists(): See available resources on a Flight server

Arrow package configuration

arrow_info() arrow_available() arrow_with_acero() arrow_with_dataset() arrow_with_substrait() arrow_with_parquet() arrow_with_s3() arrow_with_gcs() arrow_with_json(): Report information on the package's capabilities

cpu_count() set_cpu_count(): Manage the global CPU thread pool in libarrow

io_thread_count() set_io_thread_count(): Manage the global I/O thread pool in libarrow

install_arrow(): Install or upgrade the Arrow library

install_pyarrow(): Install pyarrow for use with reticulate

create_package_with_all_dependencies(): Create a source bundle that includes all third-party dependencies
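
Because optional features depend on how libarrow was built, it can help to check capabilities before relying on them. The sketch below queries two feature flags and temporarily changes the CPU thread pool size; the value `2` is arbitrary.

```r
library(arrow)

# Feature checks return TRUE or FALSE depending on the installed build
has_parquet <- arrow_with_parquet()
has_s3 <- arrow_with_s3()

# Inspect and adjust the global CPU thread pool, then restore it
old_threads <- cpu_count()
set_cpu_count(2)
now_threads <- cpu_count()
set_cpu_count(old_threads)
```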

Input/Output

InputStream RandomAccessFile MemoryMappedFile ReadableFile BufferReader: InputStream classes

read_message(): Read a Message from a stream

mmap_open(): Open a memory-mapped file

mmap_create(): Create a new read/write memory-mapped file of a given size

OutputStream FileOutputStream BufferOutputStream: OutputStream classes

Message: Message class

MessageReader: MessageReader class

compression CompressedOutputStream CompressedInputStream: Compressed stream classes

Codec: Compression Codec class

codec_is_available(): Check whether a compression codec is available
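
Codec availability depends on the libarrow build, so a guarded write is one way to stay portable. In this sketch, gzip is an arbitrary choice and the fallback to an uncompressed file is an assumption about what the caller would want.

```r
library(arrow)

# Pick gzip when this build supports it, otherwise write uncompressed
codec_name <- if (codec_is_available("gzip")) "gzip" else "uncompressed"

path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path, compression = codec_name)

# Decompression is handled transparently on read
back <- read_parquet(path)
```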

File reader/writer interfaces

ParquetFileReader: ParquetFileReader class

ParquetReaderProperties: ParquetReaderProperties class

ParquetArrowReaderProperties: ParquetArrowReaderProperties class

ParquetFileWriter: ParquetFileWriter class

ParquetWriterProperties: ParquetWriterProperties class

FeatherReader: FeatherReader class

CsvTableReader JsonTableReader: Arrow CSV and JSON table reader classes

CsvReadOptions CsvWriteOptions CsvParseOptions TimestampParser CsvConvertOptions JsonReadOptions JsonParseOptions: File reader options

RecordBatchReader RecordBatchStreamReader RecordBatchFileReader: RecordBatchReader classes

RecordBatchWriter RecordBatchStreamWriter RecordBatchFileWriter: RecordBatchWriter classes

as_record_batch_reader(): Convert an object to an Arrow RecordBatchReader
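
A RecordBatchReader yields data one batch at a time, which is useful when a result should not be materialized all at once. A minimal sketch with a tiny single-chunk Table as the source:

```r
library(arrow)

# Wrap an in-memory Table in a streaming reader
reader <- as_record_batch_reader(arrow_table(x = 1:5))

# Pull batches until the reader is exhausted; it then returns NULL
first_batch <- reader$read_next_batch()
after_end <- reader$read_next_batch()
```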

Low-level C++ wrappers

Low-level R6 class representations of Arrow C++ objects, intended for advanced users.

Buffer: Buffer class

Scalar: Arrow scalars

Array DictionaryArray StructArray ListArray LargeListArray FixedSizeListArray MapArray: Array classes

ChunkedArray: ChunkedArray class

RecordBatch: RecordBatch class

Schema: Schema class

Field: Field class

Table: Table class

DataType: DataType class

ArrayData: ArrayData class

DictionaryType: DictionaryType class

FixedWidthType: FixedWidthType class

ExtensionType: ExtensionType class

ExtensionArray: ExtensionArray class

Dataset and Filesystem R6 classes and helper functions

R6 classes and helper functions that are useful when working with multi-file datasets in Arrow.

Dataset FileSystemDataset UnionDataset InMemoryDataset DatasetFactory FileSystemDatasetFactory: Multi-file datasets

dataset_factory(): Create a DatasetFactory

Partitioning DirectoryPartitioning HivePartitioning DirectoryPartitioningFactory HivePartitioningFactory: Define partitioning for a dataset

Expression: Arrow expressions

Scanner ScannerBuilder: Scan the contents of a dataset

FileFormat ParquetFileFormat IpcFileFormat: Dataset file formats

CsvFileFormat: CSV dataset file format

JsonFileFormat: JSON dataset file format

FileWriteOptions: Format-specific write options

FragmentScanOptions CsvFragmentScanOptions ParquetFragmentScanOptions JsonFragmentScanOptions: Format-specific scan options

hive_partition(): Construct Hive partitioning

map_batches(): Apply a function to a stream of RecordBatches

FileSystem LocalFileSystem S3FileSystem GcsFileSystem SubTreeFileSystem: FileSystem classes

FileInfo: FileSystem entry info

FileSelector: File selector
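
The lower-level helpers above can be combined to control how partition values are typed and how a dataset is scanned. The sketch below is one possible arrangement: the directory name `hive_ds` and the decision to read `cyl` as `int32` are assumptions for the example.

```r
library(arrow)

# Hypothetical scratch directory with Hive-style subdirectories (cyl=4, ...)
ds_dir <- file.path(tempdir(), "hive_ds")
write_dataset(mtcars, ds_dir, format = "parquet", partitioning = "cyl")

# Declare the partition column's type explicitly instead of inferring it
part <- hive_partition(cyl = int32())
ds <- open_dataset(ds_dir, partitioning = part)
cyl_type <- ds$schema$GetFieldByName("cyl")$type

# Scan the whole dataset into an in-memory Table
tbl <- Scanner$create(ds)$ToTable()
```

For most analyses the dplyr interface is more convenient; the Scanner route is mainly useful when direct control over the scan is needed.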