Reading and writing Parquet files#

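The Parquet format is a space-efficient columnar storage format for complex data. The Parquet C++ implementation is part of the Apache Arrow project and benefits from tight integration with the Arrow C++ classes and facilities.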

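Reading Parquet files#

The parquet::arrow::FileReader class reads data into Arrow Tables and Record Batches.

The StreamReader class reads data through a C++ input stream interface, field by field and row by row. It is offered for ease of use and type safety. It is also useful when data must be streamed as files are read and written incrementally.

Note that StreamReader is slower than FileReader, because it checks types at run time and processes column values one at a time.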

FileReader#

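To read Parquet data into Arrow structures, use parquet::arrow::FileReader. Constructing it requires an ::arrow::io::RandomAccessFile instance representing the input file. To read the whole file at once, use parquet::arrow::FileReader::ReadTable():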

// #include "arrow/io/api.h"
// #include "parquet/arrow/reader.h"

arrow::MemoryPool* pool = arrow::default_memory_pool();
std::shared_ptr<arrow::io::RandomAccessFile> input;
ARROW_ASSIGN_OR_RAISE(input, arrow::io::ReadableFile::Open(path_to_file));

// Open Parquet file reader
std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
ARROW_ASSIGN_OR_RAISE(arrow_reader, parquet::arrow::OpenFile(input, pool));

// Read entire file as a single Arrow table
std::shared_ptr<arrow::Table> table;
ARROW_RETURN_NOT_OK(arrow_reader->ReadTable(&table));

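Finer-grained options are available through the parquet::arrow::FileReaderBuilder helper class, which accepts the ReaderProperties and ArrowReaderProperties classes.

To read the file as a stream of batches, use the parquet::arrow::FileReader::GetRecordBatchReader() method to retrieve an arrow::RecordBatchReader. It uses the batch size set in ArrowReaderProperties.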

// #include "arrow/io/api.h"
// #include "parquet/arrow/reader.h"

arrow::MemoryPool* pool = arrow::default_memory_pool();

// Configure general Parquet reader settings
auto reader_properties = parquet::ReaderProperties(pool);
reader_properties.set_buffer_size(4096 * 4);
reader_properties.enable_buffered_stream();

// Configure Arrow-specific Parquet reader settings
auto arrow_reader_props = parquet::ArrowReaderProperties();
arrow_reader_props.set_batch_size(128 * 1024);  // default 64 * 1024

parquet::arrow::FileReaderBuilder reader_builder;
ARROW_RETURN_NOT_OK(
    reader_builder.OpenFile(path_to_file, /*memory_map=*/false, reader_properties));
reader_builder.memory_pool(pool);
reader_builder.properties(arrow_reader_props);

std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
ARROW_ASSIGN_OR_RAISE(arrow_reader, reader_builder.Build());

std::shared_ptr<::arrow::RecordBatchReader> rb_reader;
ARROW_RETURN_NOT_OK(arrow_reader->GetRecordBatchReader(&rb_reader));

for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *rb_reader) {
  // Operate on each batch...
}

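See also

For reading multi-file datasets or pushing down filters to prune row groups, see Tabular Datasets.

Performance and memory efficiency#

For remote filesystems, use read coalescing (pre-buffering) to reduce the number of API calls: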

auto arrow_reader_props = parquet::ArrowReaderProperties();
arrow_reader_props.set_pre_buffer(true);

The defaults are generally tuned for good performance, but parallel column decoding is off by default. Enable it in the constructor of ArrowReaderProperties:

auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);

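If memory efficiency is more important than performance, then:

Do not turn on read coalescing (pre-buffering) in parquet::ArrowReaderProperties.
Read data in batches using parquet::arrow::FileReader::GetRecordBatchReader().
Turn on enable_buffered_stream in parquet::ReaderProperties.

In addition, if you know that certain columns contain many repeated values, you can read them as dictionary-encoded columns. This is enabled with the set_read_dictionary setting on ArrowReaderProperties. If the files were written with Arrow C++ and store_schema was activated, the original Arrow schema is read automatically and overrides this setting.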

StreamReader#

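StreamReader allows Parquet files to be read using standard C++ input operators, which ensures type safety.

Note that types must match the schema exactly: if the schema field is an unsigned 16-bit integer, you must supply a uint16_t.

Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances:

Attempting to read a field by supplying an incorrect type.
Attempting to read beyond the end of a row.
Attempting to read beyond the end of the file.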

#include "arrow/io/file.h"
#include "parquet/stream_reader.h"

{
   std::shared_ptr<arrow::io::ReadableFile> infile;

   PARQUET_ASSIGN_OR_THROW(
      infile,
      arrow::io::ReadableFile::Open("test.parquet"));

   parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};

   std::string article;
   float price;
   uint32_t quantity;

   while ( !stream.eof() )
   {
      stream >> article >> price >> quantity >> parquet::EndRow;
      // ...
   }
}

Writing Parquet files#

WriteTable#

The parquet::arrow::WriteTable() function writes an entire ::arrow::Table to an output file.

// #include "parquet/arrow/writer.h"
// #include "arrow/util/type_fwd.h"
using parquet::ArrowWriterProperties;
using parquet::WriterProperties;

ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, GetTable());

// Choose compression
std::shared_ptr<WriterProperties> props =
    WriterProperties::Builder().compression(arrow::Compression::SNAPPY)->build();

// Opt to store Arrow schema for easier reads back into Arrow
std::shared_ptr<ArrowWriterProperties> arrow_props =
    ArrowWriterProperties::Builder().store_schema()->build();

std::shared_ptr<arrow::io::FileOutputStream> outfile;
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(path_to_file));

ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*table.get(),
                                               arrow::default_memory_pool(), outfile,
                                               /*chunk_size=*/3, props, arrow_props));

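Note

Column compression is off by default in C++. See below for how to choose a compression codec in the writer properties.

To write out data batch by batch, use parquet::arrow::FileWriter.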

// #include "parquet/arrow/writer.h"
// #include "arrow/util/type_fwd.h"
using parquet::ArrowWriterProperties;
using parquet::WriterProperties;

// Data is in RBR
std::shared_ptr<arrow::RecordBatchReader> batch_stream;
ARROW_ASSIGN_OR_RAISE(batch_stream, GetRBR());

// Choose compression
std::shared_ptr<WriterProperties> props =
    WriterProperties::Builder().compression(arrow::Compression::SNAPPY)->build();

// Opt to store Arrow schema for easier reads back into Arrow
std::shared_ptr<ArrowWriterProperties> arrow_props =
    ArrowWriterProperties::Builder().store_schema()->build();

// Create a writer
std::shared_ptr<arrow::io::FileOutputStream> outfile;
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(path_to_file));
std::unique_ptr<parquet::arrow::FileWriter> writer;
ARROW_ASSIGN_OR_RAISE(
    writer, parquet::arrow::FileWriter::Open(*batch_stream->schema().get(),
                                             arrow::default_memory_pool(), outfile,
                                             props, arrow_props));

// Write each batch as a row_group
for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *batch_stream) {
  ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
  ARROW_ASSIGN_OR_RAISE(auto table,
                        arrow::Table::FromRecordBatches(batch->schema(), {batch}));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table.get(), batch->num_rows()));
}

// Write file footer and close
ARROW_RETURN_NOT_OK(writer->Close());

StreamWriter#

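StreamWriter allows Parquet files to be written using standard C++ output operators, similar to reading with the StreamReader class. This type-safe approach also ensures that rows are written without omitting fields, and allows new row groups to be created automatically (after a certain volume of data) or explicitly by using the EndRowGroup stream modifier.

Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances:

Attempting to write a field using an incorrect type.
Attempting to write too many fields in a row.
Attempting to skip a required field.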

#include "arrow/io/file.h"
#include "parquet/stream_writer.h"

{
   std::shared_ptr<arrow::io::FileOutputStream> outfile;

   PARQUET_ASSIGN_OR_THROW(
      outfile,
      arrow::io::FileOutputStream::Open("test.parquet"));

   parquet::WriterProperties::Builder builder;
   std::shared_ptr<parquet::schema::GroupNode> schema;

   // Set up builder with required compression type etc.
   // Define schema.
   // ...

   parquet::StreamWriter os{
      parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

   // Loop over some data structure which provides the required
   // fields to be written and write each row.
   for (const auto& a : getArticles())
   {
      os << a.name() << a.price() << a.quantity() << parquet::EndRow;
   }
}

Writer properties#

To configure how Parquet files are written, use the WriterProperties::Builder:

#include "parquet/arrow/writer.h"
#include "arrow/util/type_fwd.h"

using parquet::WriterProperties;
using parquet::ParquetVersion;
using parquet::ParquetDataPageVersion;
using arrow::Compression;

// Builder setters return a Builder*, so every call after the first uses ->
std::shared_ptr<WriterProperties> props = WriterProperties::Builder()
   .max_row_group_length(64 * 1024)
   ->created_by("My Application")
   ->version(ParquetVersion::PARQUET_2_6)
   ->data_page_version(ParquetDataPageVersion::V2)
   ->compression(Compression::SNAPPY)
   ->build();

max_row_group_length sets an upper bound on the number of rows per row group, and takes precedence over the chunk_size passed to the write methods.
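
To make the interaction between chunk_size and max_row_group_length concrete, here is a minimal standalone sketch. It is a model of the rule stated above, not the writer's implementation: the function name RowGroupSizes is hypothetical, and it assumes the writer simply clamps the requested chunk size to the configured maximum and then splits the rows evenly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Parquet C++ API): returns the number of rows in
// each row group when `num_rows` rows are written with the given `chunk_size`
// and the writer property `max_row_group_length`.
std::vector<int64_t> RowGroupSizes(int64_t num_rows, int64_t chunk_size,
                                   int64_t max_row_group_length) {
  assert(num_rows >= 0 && chunk_size > 0 && max_row_group_length > 0);
  // Assumption: the writer property caps the requested chunk size.
  const int64_t effective = std::min(chunk_size, max_row_group_length);
  std::vector<int64_t> sizes;
  for (int64_t offset = 0; offset < num_rows; offset += effective) {
    // The last row group holds whatever rows remain.
    sizes.push_back(std::min(effective, num_rows - offset));
  }
  return sizes;
}
```

For example, under this model 10 rows written with chunk_size 100 and max_row_group_length 4 end up in row groups of 4, 4 and 2 rows.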

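You can set the Parquet version to write with version, which determines which logical types are available. In addition, you can set the data page version with data_page_version. It is V1 by default; setting it to V2 allows more optimal compression (skipping compression of pages where there is no space benefit), but not all readers support this data page version.

Compression is off by default, but to get the most out of Parquet you should also choose a compression codec. You can choose one for the whole file or one for individual columns. If you mix the two, the file-level option applies to columns that do not have a specific compression codec. See ::arrow::Compression for the available options.

Column data encodings can likewise be applied at the file level or at the column level. By default, the writer attempts to dictionary-encode all supported columns, unless the dictionary grows too large. This behavior can be changed at the file level or at the column level with disable_dictionary(). When dictionary encoding is not used, the writer falls back to the encoding set for the column or for the whole file; this is Encoding::PLAIN by default, but can be changed with encoding().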

#include "parquet/arrow/writer.h"
#include "arrow/util/type_fwd.h"

using parquet::WriterProperties;
using arrow::Compression;
using parquet::Encoding;

std::shared_ptr<WriterProperties> props = WriterProperties::Builder()
  .compression(Compression::SNAPPY)        // Fallback
  ->compression("colA", Compression::ZSTD) // Only applies to column "colA"
  ->encoding(Encoding::BIT_PACKED)         // Fallback
  ->encoding("colB", Encoding::RLE)        // Only applies to column "colB"
  ->disable_dictionary("colB")             // Never dictionary-encode column "colB"
  ->build();
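
The "column-level setting, else file-level fallback" rule used above for both compression and encoding can be modeled in a few lines. This is only an illustrative sketch: the Codec enum and the CodecSettings struct are hypothetical stand-ins and are not part of Parquet C++.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for arrow::Compression::type, for illustration only.
enum class Codec { UNCOMPRESSED, SNAPPY, ZSTD };

// Standalone model of how a per-column option falls back to the file level.
struct CodecSettings {
  Codec file_level = Codec::UNCOMPRESSED;  // compression is off by default
  std::map<std::string, Codec> per_column;

  // Returns the column-specific codec if one was set, else the file-level one.
  Codec Resolve(const std::string& column) const {
    auto it = per_column.find(column);
    return it != per_column.end() ? it->second : file_level;
  }
};
```

With file_level set to SNAPPY and per_column["colA"] set to ZSTD, Resolve("colA") yields ZSTD while any other column resolves to SNAPPY, mirroring the builder calls shown above.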

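Statistics are enabled for all columns by default. You can disable statistics for all columns or for specific columns using disable_statistics on the builder. max_statistics_size limits the maximum number of bytes that may be used for min and max values, which is useful for types such as strings or binary blobs. If a column has the page index enabled with enable_write_page_index, statistics are not written to the page header, because they would duplicate the ColumnIndex.

There are also Arrow-specific settings that can be configured with parquet::ArrowWriterProperties: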

#include "parquet/arrow/writer.h"

using parquet::ArrowWriterProperties;

std::shared_ptr<ArrowWriterProperties> arrow_props = ArrowWriterProperties::Builder()
   .enable_deprecated_int96_timestamps() // default False
   ->store_schema() // default False
   ->build();

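These options mainly dictate how Arrow types are converted to Parquet types. Turning on store_schema causes the writer to store the serialized Arrow schema within the file metadata. Since there is no bijection between Parquet schemas and Arrow schemas, storing the Arrow schema allows the Arrow reader to recreate the original data more faithfully. This mapping from Parquet types back to the original Arrow types includes:

Reading timestamps with the original timezone information (Parquet does not support time zones);
Reading Arrow types from their storage types (such as Duration from int64 columns);
Reading string and binary columns back into large variants with 64-bit offsets;
Reading columns back as dictionary-encoded (whether an Arrow column is dictionary-encoded is independent of whether its serialized Parquet version is).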

Supported Parquet features#

The Parquet format has many features, and Parquet C++ supports a subset of them.

Page types#

Page type	Notes
DATA_PAGE
DATA_PAGE_V2
DICTIONARY_PAGE

Unsupported page type: INDEX_PAGE. When reading a Parquet file, pages of this type are ignored.

Compression#

Compression codec	Notes
SNAPPY
GZIP
BROTLI
LZ4	(1)
ZSTD

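(1) On the read side, Parquet C++ can decompress both the regular LZ4 block format and the ad-hoc Hadoop LZ4 format used by the reference Parquet implementation. On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.

Unsupported compression codec: LZO.

Encodings#

Encoding	Reading	Writing	Notes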
PLAIN	✓	✓
PLAIN_DICTIONARY	✓	✓
BIT_PACKED	✓	✓	(1)
RLE	✓	✓	(1)
RLE_DICTIONARY	✓	✓	(2)
BYTE_STREAM_SPLIT	✓	✓
DELTA_BINARY_PACKED	✓	✓
DELTA_BYTE_ARRAY	✓	✓
DELTA_LENGTH_BYTE_ARRAY	✓	✓

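(1) Only supported for encoding definition and repetition levels, and boolean values.
(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version 2.4 or greater is selected in WriterProperties::version().

Types#

Physical types#

Physical type	Mapped Arrow type	Notes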
BOOLEAN	Boolean
INT32	Int32 / other	(1)
INT64	Int64 / other	(1)
INT96	Timestamp (nanoseconds)	(2)
FLOAT	Float32
DOUBLE	Float64
BYTE_ARRAY	Binary / other	(1) (3)
FIXED_LENGTH_BYTE_ARRAY	FixedSizeBinary / other	(1)

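(1) Can be mapped to other Arrow types, depending on the logical type (see below).
(2) On the write side, INT96 timestamps must be enabled with ArrowWriterProperties::Builder::enable_deprecated_int96_timestamps(), so that ArrowWriterProperties::support_deprecated_int96_timestamps() returns true.
(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.

Logical types#

Specific logical types can override the default Arrow type mapping for a given physical type.

Logical type	Physical type	Mapped Arrow type	Notes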
NULL	Any	Null	(1)
INT	INT32	Int8 / UInt8 / Int16 / UInt16 / Int32 / UInt32
INT	INT64	Int64 / UInt64
DECIMAL	INT32 / INT64 / BYTE_ARRAY / FIXED_LENGTH_BYTE_ARRAY	Decimal128 / Decimal256	(2)
DATE	INT32	Date32	(3)
TIME	INT32	Time32 (milliseconds)
TIME	INT64	Time64 (micro- or nanoseconds)
TIMESTAMP	INT64	Timestamp (milli-, micro- or nanoseconds)
STRING	BYTE_ARRAY	Utf8	(4)
LIST	Any	List	(5)
MAP	Any	Map	(6)
FLOAT16	FIXED_LENGTH_BYTE_ARRAY	HalfFloat

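(1) On the write side, the Parquet physical type INT32 is generated.
(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.
(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.
(4) On the write side, an Arrow LargeUtf8 is also mapped to a Parquet STRING.
(5) On the write side, an Arrow LargeList or FixedSizeList is also mapped to a Parquet LIST.
(6) On the read side, a key with multiple values is not deduplicated, in contradiction with the Parquet specification.

Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary).

Converted types#

Although converted types are deprecated in the Parquet format (they are superseded by logical types), the Parquet C++ implementation recognizes and emits them to maximize compatibility with other Parquet implementations.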

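Special cases#

An Arrow Extension type is written out as its storage type. It can still be recreated at read time using Parquet metadata (see "Roundtripping Arrow types and schema" below).

An Arrow Dictionary type is written out as its value type. It can still be recreated at read time using Parquet metadata (see "Roundtripping Arrow types and schema" below).

Roundtripping Arrow types and schema#

While there is no bijection between Arrow types and Parquet types, it is possible to serialize the Arrow schema as part of the Parquet file metadata. This is enabled using ArrowWriterProperties::store_schema().

On the read path, the serialized schema is recognized automatically and is used to recreate the original Arrow data, converting the Parquet data as required.

As an example, when serializing an Arrow LargeList to Parquet:

The data is written out as a Parquet LIST.
When read back, the Parquet LIST data is decoded as an Arrow LargeList if ArrowWriterProperties::store_schema() was enabled when writing the file; otherwise, it is decoded as an Arrow List.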

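Parquet field id#

The Parquet format supports an optional integer field id that can be assigned to a given field. This is used, for example, in the Apache Iceberg specification.

On the writer side, if PARQUET:field_id is present as a metadata key on an Arrow field, its value is parsed as a non-negative integer and used as the field id of the corresponding Parquet field.

On the reader side, Arrow converts such a field id to a metadata key named PARQUET:field_id on the corresponding Arrow field.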
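
The writer-side rule ("parsed as a non-negative integer") can be illustrated with a small standalone sketch. It is not the Arrow implementation: the function name FieldIdFromMetadata is hypothetical, the metadata is modeled as a plain std::map, and returning -1 for an absent or invalid value is an assumption made for this example.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <string>

// Hypothetical helper (not an Arrow API): looks up "PARQUET:field_id" in a
// field's key/value metadata and parses it as a non-negative 32-bit integer.
// Returns -1 when the key is absent or the value is not a valid field id.
int32_t FieldIdFromMetadata(const std::map<std::string, std::string>& metadata) {
  auto it = metadata.find("PARQUET:field_id");
  if (it == metadata.end() || it->second.empty()) return -1;
  int64_t value = 0;
  for (char c : it->second) {
    // Reject signs, whitespace and any other non-digit character.
    if (c < '0' || c > '9') return -1;
    value = value * 10 + (c - '0');
    // Reject values that do not fit in a signed 32-bit integer.
    if (value > std::numeric_limits<int32_t>::max()) return -1;
  }
  return static_cast<int32_t>(value);
}
```

In this model, a field whose metadata contains PARQUET:field_id = "42" gets field id 42, while values such as "-1" or "abc" are treated as if no field id had been supplied.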

Serialization details#

The Arrow schema is serialized as an Arrow IPC schema message, then base64-encoded and stored under the ARROW:schema metadata key in the Parquet file metadata.
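
To show what the base64 step does to the serialized bytes, here is a minimal standalone encoder using the standard alphabet with '=' padding. It is a sketch for illustration only: Base64Encode is a hypothetical helper, not the routine Arrow uses internally, and producing the IPC schema message itself is outside its scope.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes arbitrary bytes with standard base64 and '='
// padding, as a stand-in for the encoding step described above.
std::string Base64Encode(const std::string& bytes) {
  static const char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  out.reserve((bytes.size() + 2) / 3 * 4);
  std::size_t i = 0;
  // Encode each full 3-byte group as 4 output characters.
  for (; i + 2 < bytes.size(); i += 3) {
    const uint32_t n = (static_cast<uint8_t>(bytes[i]) << 16) |
                       (static_cast<uint8_t>(bytes[i + 1]) << 8) |
                       static_cast<uint8_t>(bytes[i + 2]);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += kAlphabet[(n >> 6) & 63];
    out += kAlphabet[n & 63];
  }
  // Encode the 1 or 2 trailing bytes, padding the output with '='.
  const std::size_t rest = bytes.size() - i;
  if (rest == 1) {
    const uint32_t n = static_cast<uint8_t>(bytes[i]) << 16;
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += "==";
  } else if (rest == 2) {
    const uint32_t n = (static_cast<uint8_t>(bytes[i]) << 16) |
                       (static_cast<uint8_t>(bytes[i + 1]) << 8);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += kAlphabet[(n >> 6) & 63];
    out += '=';
  }
  return out;
}
```

The encoded text is what ends up as the value of the ARROW:schema key; a reader decodes it back to bytes before interpreting the IPC schema message.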

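Limitations#

Writing or reading back FixedSizeList data with null entries is not supported.

Encryption#

Parquet C++ implements all features specified in the encryption specification, except for encryption of the column index and bloom filter modules.

More specifically, Parquet C++ supports:

AES_GCM_V1 and AES_GCM_CTR_V1 encryption algorithms.
AAD suffix for the Footer, ColumnMetaData, Data Page, Dictionary Page, Data PageHeader and Dictionary PageHeader module types. Other module types (ColumnIndex, OffsetIndex, BloomFilter Header, BloomFilter Bitset) are not supported.
EncryptionWithFooterKey and EncryptionWithColumnKey modes.
Encrypted Footer and Plaintext Footer modes.

Configuration#

Parquet encryption uses a parquet::encryption::CryptoFactory that has access to a Key Management System (KMS), which stores the actual encryption keys, referenced by key IDs. The Parquet encryption configuration only uses key IDs, never the actual keys.

Parquet metadata encryption is configured via parquet::encryption::EncryptionConfiguration: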

// Set write options with encryption configuration.
auto encryption_config = std::make_shared<parquet::encryption::EncryptionConfiguration>(
    std::string("footerKeyId"));

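If encryption_config->uniform_encryption is set to true, all columns are encrypted with the same key as the Parquet metadata. Otherwise, individual columns are encrypted with individual keys, as configured via encryption_config->column_keys. This field expects a string of the format "columnKeyID1:colName1,colName2;columnKeyID3:colName3...".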
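
To make the column_keys format concrete, the following standalone sketch splits such a string into a column-to-key-ID map. It is a model of the format described above and not the library's own parser: ParseColumnKeys and Trim are hypothetical helpers, and trimming spaces around names and skipping groups without a colon are assumptions of this example.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Removes leading and trailing spaces and tabs.
std::string Trim(const std::string& s) {
  const std::size_t begin = s.find_first_not_of(" \t");
  if (begin == std::string::npos) return "";
  const std::size_t end = s.find_last_not_of(" \t");
  return s.substr(begin, end - begin + 1);
}

// Hypothetical helper: parses "keyId1:colA,colB;keyId2:colC" into a map from
// column name to key ID.
std::map<std::string, std::string> ParseColumnKeys(const std::string& spec) {
  std::map<std::string, std::string> column_to_key;
  std::istringstream groups(spec);
  std::string group;
  // Groups are separated by ';', each of the form "keyId:col1,col2,...".
  while (std::getline(groups, group, ';')) {
    const std::size_t colon = group.find(':');
    if (colon == std::string::npos) continue;  // assumption: skip malformed groups
    const std::string key_id = Trim(group.substr(0, colon));
    std::istringstream columns(group.substr(colon + 1));
    std::string column;
    while (std::getline(columns, column, ',')) {
      column = Trim(column);
      if (!column.empty()) column_to_key[column] = key_id;
    }
  }
  return column_to_key;
}
```

For instance, "columnKeyID1:colName1,colName2;columnKeyID3:colName3" maps colName1 and colName2 to columnKeyID1, and colName3 to columnKeyID3.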

// Set write options with encryption configuration.
auto encryption_config = std::make_shared<parquet::encryption::EncryptionConfiguration>(
    std::string("footerKeyId"));
encryption_config->column_keys =
    "columnKeyId: i, s.a, s.b, m.key_value.key, m.key_value.value, l.list.element";

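See the full Parquet column encryption example.

Note

Encrypting columns that have nested fields (struct, map or list data types) requires column keys for the inner fields, not for the outer column itself. Configuring a column key for the outer column causes this error (here the column name is col):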

OSError: Encrypted column col not in file schema

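Conventionally, the key and value fields of a map column m are named m.key_value.key and m.key_value.value, respectively. The inner field of a list column l is named l.list.element. An inner field f of a struct column s is named s.f.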
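
The naming convention above can be captured in a few small helpers, which may be handy when assembling a column_keys string for nested columns. These functions are hypothetical and not part of Parquet C++; they only spell out the conventional paths and assume the file was written with those default inner field names.

```cpp
#include <string>

// Hypothetical helpers that build the conventional inner-field paths.

// Key field of a map column, e.g. "m" -> "m.key_value.key".
std::string MapKeyPath(const std::string& map_column) {
  return map_column + ".key_value.key";
}

// Value field of a map column, e.g. "m" -> "m.key_value.value".
std::string MapValuePath(const std::string& map_column) {
  return map_column + ".key_value.value";
}

// Element field of a list column, e.g. "l" -> "l.list.element".
std::string ListElementPath(const std::string& list_column) {
  return list_column + ".list.element";
}

// Inner field of a struct column, e.g. ("s", "f") -> "s.f".
std::string StructFieldPath(const std::string& struct_column,
                            const std::string& field) {
  return struct_column + "." + field;
}
```

These are the names to list under a key ID in column_keys, rather than the outer column names m, l or s.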

Miscellaneous#

Feature	Reading	Writing	Notes
Column Index	✓	✓	(1)
Offset Index	✓	✓	(1)
Bloom Filter	✓	✓	(2)
CRC checksums	✓	✓

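(1) Access to the Column Index and Offset Index structures is provided, but the data read APIs do not currently make use of them.
(2) APIs are provided for creating, serializing and deserializing Bloom Filters, but they are not integrated into the data read APIs.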