Tabular Datasets#

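The Arrow Datasets library provides functionality to efficiently work with tabular, potentially larger-than-memory, and multi-file datasets. This includes:

A unified interface that supports different sources and file formats and different file systems (local, cloud).
Discovery of sources (crawling directories, handling partitioned datasets with various partitioning schemes, basic schema normalization, …)
Optimized reading with predicate pushdown (filtering rows), projection (selecting and deriving columns), and optionally parallel reading.

The currently supported file formats are Parquet, Feather / Arrow IPC, CSV, and ORC (note that ORC datasets can currently only be read, not yet written). The goal is to expand support to other file formats and data sources (e.g. database connections) in the future.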

Reading Datasets#

For the examples below, let's create a small dataset consisting of a directory with two Parquet files:

// Generate some data for the rest of this example.
arrow::Result<std::shared_ptr<arrow::Table>> CreateTable() {
  auto schema =
      arrow::schema({arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
                     arrow::field("c", arrow::int64())});
  std::shared_ptr<arrow::Array> array_a;
  std::shared_ptr<arrow::Array> array_b;
  std::shared_ptr<arrow::Array> array_c;
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_a));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_b));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_c));
  return arrow::Table::Make(schema, {array_a, array_b, array_c});
}

// Set up a dataset by writing two Parquet files.
arrow::Result<std::string> CreateExampleParquetDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Parquet files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(0, 5), arrow::default_memory_pool(), output, /*chunk_size=*/2048));
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(5), arrow::default_memory_pool(), output, /*chunk_size=*/2048));
  return base_path;
}

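(See the full example at the bottom of this page: Full example.)

Dataset discovery#

An arrow::dataset::Dataset object can be created using the various arrow::dataset::DatasetFactory objects. Here, we'll use the arrow::dataset::FileSystemDatasetFactory, which can create a dataset given a base directory path: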

// Read the whole dataset with the given format, without partitioning.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWholeDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  // Create a dataset by scanning the filesystem for files
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
  }
  // Read the entire dataset as a Table
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

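We're also passing the filesystem to use and the file format to use for reading. This lets us choose between (for example) reading local files or files in Amazon S3, or between Parquet and CSV.

In addition to searching a base directory, we can list file paths manually.

Creating an arrow::dataset::Dataset does not begin reading the data itself. It only crawls the directory to find all the files (if needed), which can be retrieved with arrow::dataset::FileSystemDataset::files():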

// Print out the files crawled (only for FileSystemDataset)
for (const auto& filename : dataset->files()) {
  std::cout << filename << std::endl;
}

…and infers the dataset's schema (by default from the first file):

std::cout << dataset->schema()->ToString() << std::endl;

Using the arrow::dataset::Dataset::NewScan() method, we can build an arrow::dataset::Scanner and read the dataset (or a portion of it) into an arrow::Table with the arrow::dataset::Scanner::ToTable() method:

// Read the whole dataset with the given format, without partitioning.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWholeDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  // Create a dataset by scanning the filesystem for files
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
  }
  // Read the entire dataset as a Table
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

Note

Depending on the size of your dataset, this can require a lot of memory; see Filtering data below on filtering and projecting.

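Reading different file formats#

The above examples use Parquet files on local disk, but the Dataset API provides a consistent interface across multiple file formats and filesystems. (See Reading from cloud storage for more information on the latter.) Currently, the Parquet, ORC, Feather / Arrow IPC, and CSV file formats are supported; more formats are planned for the future.

If we save the table as Feather files instead of Parquet files: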

// Set up a dataset by writing two Feather files.
arrow::Result<std::string> CreateExampleFeatherDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/feather_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Feather files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.feather"));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(output.get(), table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table->Slice(0, 5)));
  ARROW_RETURN_NOT_OK(writer->Close());
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.feather"));
  ARROW_ASSIGN_OR_RAISE(writer,
                        arrow::ipc::MakeFileWriter(output.get(), table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table->Slice(5)));
  ARROW_RETURN_NOT_OK(writer->Close());
  return base_path;
}

…then we can read the Feather files by passing an arrow::dataset::IpcFileFormat:

auto format = std::make_shared<ds::IpcFileFormat>();
// ...
auto factory = ds::FileSystemDatasetFactory::Make(filesystem, selector, format, options)
                   .ValueOrDie();

Customizing file formats#

arrow::dataset::FileFormat objects have properties that control how files are read. For example:

auto format = std::make_shared<ds::ParquetFileFormat>();
format->reader_options.dict_columns.insert("a");

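will configure column "a" to be dictionary-encoded when read. Similarly, setting arrow::dataset::CsvFileFormat::parse_options lets us change things like whether we read comma-separated or tab-separated data.

Additionally, passing an arrow::dataset::FragmentScanOptions to arrow::dataset::ScannerBuilder::FragmentScanOptions() offers fine-grained control over data scanning. For example, for CSV files, we can change which values are converted into Boolean true and false at scan time.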

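Filtering data#

So far, we've been reading the entire dataset, but if we need only a subset of the data, this can waste time or memory reading data we don't need. The arrow::dataset::Scanner offers control over what data to read.

In this snippet, we use arrow::dataset::ScannerBuilder::Project() to select which columns to read: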

// Read a dataset, but select only column "b" and only rows where b < 4.
//
// This is useful when you only want a few columns from a dataset. Where possible,
// Datasets will push down the column selection such that less work is done.
arrow::Result<std::shared_ptr<arrow::Table>> FilterAndSelectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
  ARROW_RETURN_NOT_OK(scan_builder->Filter(cp::less(cp::field_ref("b"), cp::literal(4))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

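Some formats, such as Parquet, can reduce I/O costs here by reading only the specified columns from the filesystem.

A filter can be provided with arrow::dataset::ScannerBuilder::Filter(), so that rows which do not match the filter predicate are not included in the returned table. Again, some formats, such as Parquet, can use this filter to reduce the amount of I/O needed.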

// Read a dataset, but select only column "b" and only rows where b < 4.
//
// This is useful when you only want a few columns from a dataset. Where possible,
// Datasets will push down the column selection such that less work is done.
arrow::Result<std::shared_ptr<arrow::Table>> FilterAndSelectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
  ARROW_RETURN_NOT_OK(scan_builder->Filter(cp::less(cp::field_ref("b"), cp::literal(4))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

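Projecting columns#

In addition to selecting columns, arrow::dataset::ScannerBuilder::Project() can also be used for more complex projections, such as renaming columns, casting them to other types, and even deriving new columns by evaluating expressions.

In this case, we pass a vector of expressions used to construct the column values and a vector of names for the columns: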

// Read a dataset, but with column projection.
//
// This is useful to derive new columns from existing data. For example, here we
// demonstrate casting a column to a different type, and turning a numeric column into a
// boolean column based on a predicate. You could also rename columns or perform
// computations involving multiple columns.
arrow::Result<std::shared_ptr<arrow::Table>> ProjectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Project columns using expressions (no row filter is applied here)
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project(
      {
          // Leave column "a" as-is.
          cp::field_ref("a"),
          // Cast column "b" to float32.
          cp::call("cast", {cp::field_ref("b")},
                   arrow::compute::CastOptions::Safe(arrow::float32())),
          // Derive a boolean column from "c".
          cp::equal(cp::field_ref("c"), cp::literal(1)),
      },
      {"a_renamed", "b_as_float32", "c_1"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

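This also determines the column selection; only the given columns will be present in the resulting table. If you want to include a derived column in addition to the existing columns, you can build up the expressions from the dataset schema: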

// Read a dataset, but with column projection.
//
// This time, we read all original columns plus one derived column. This simply combines
// the previous two examples: selecting a subset of columns by name, and deriving new
// columns with an expression.
arrow::Result<std::shared_ptr<arrow::Table>> SelectAndProjectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Project all original columns plus one derived column (no row filter is applied here)
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  std::vector<std::string> names;
  std::vector<cp::Expression> exprs;
  // Read all the original columns.
  for (const auto& field : dataset->schema()->fields()) {
    names.push_back(field->name());
    exprs.push_back(cp::field_ref(field->name()));
  }
  // Also derive a new column.
  names.emplace_back("b_large");
  exprs.push_back(cp::greater(cp::field_ref("b"), cp::literal(1)));
  ARROW_RETURN_NOT_OK(scan_builder->Project(exprs, names));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

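Note

When combining filters and projections, Arrow will determine all the columns that need to be read. For example, if you filter on a column that isn't ultimately selected, Arrow will still read that column to evaluate the filter.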
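
As a rough illustration of the note above, the set of columns a scan has to read can be thought of as the union of the projected columns and the columns the filter refers to. The sketch below uses only the standard library; ColumnsToRead is a hypothetical helper and does not reflect how Arrow computes this internally.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow API): returns the sorted union of the
// columns named in the projection and the columns referenced by the filter.
std::vector<std::string> ColumnsToRead(const std::vector<std::string>& projected,
                                       const std::vector<std::string>& filtered) {
  std::set<std::string> needed(projected.begin(), projected.end());
  needed.insert(filtered.begin(), filtered.end());
  return std::vector<std::string>(needed.begin(), needed.end());
}
```

For instance, projecting only "a" while filtering on "b" means both "a" and "b" have to be read, even though "b" does not appear in the result.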

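Reading and writing partitioned data#

So far, we've been working with datasets consisting of flat directories with files. Oftentimes, a dataset will have one or more columns that are frequently filtered on. Instead of having to read and then filter the data, we can organize the files into a nested directory structure and thereby define a partitioned dataset, where sub-directory names hold information about which subset of the data is stored in that directory. We can then filter data more efficiently by using that information to avoid loading files that don't match the filter.

For example, a dataset partitioned by year and month may have the following layout: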

dataset_name/
  year=2007/
    month=01/
       data0.parquet
       data1.parquet
       ...
    month=02/
       data0.parquet
       data1.parquet
       ...
    month=03/
    ...
  year=2008/
    month=01/
    ...
  ...

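The above partitioning scheme uses "/key=value/" directory names, as found in Apache Hive. Under this convention, the file at dataset_name/year=2007/month=01/data0.parquet is expected to contain only data for which year == 2007 and month == 01.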
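
To make the convention concrete, here is a minimal standard-library sketch that extracts the "key=value" pairs from such a path. ParseHiveSegments is a hypothetical helper written for illustration, not an Arrow function; it keeps values as strings, whereas Arrow can infer or be told the key types.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): splits a path on '/' and collects
// every "key=value" segment in order. Segments without '=' (for example the
// dataset root and the file name) are skipped.
std::vector<std::pair<std::string, std::string>> ParseHiveSegments(
    const std::string& path) {
  std::vector<std::pair<std::string, std::string>> result;
  std::size_t start = 0;
  while (start <= path.size()) {
    std::size_t end = path.find('/', start);
    if (end == std::string::npos) end = path.size();
    const std::string segment = path.substr(start, end - start);
    const std::size_t eq = segment.find('=');
    if (eq != std::string::npos && eq > 0) {
      result.emplace_back(segment.substr(0, eq), segment.substr(eq + 1));
    }
    start = end + 1;
  }
  return result;
}
```

Applied to dataset_name/year=2007/month=01/data0.parquet, this yields the pairs (year, 2007) and (month, 01).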

Let's create a small partitioned dataset. For this, we'll use Arrow's dataset writing functionality.

// Set up a dataset by writing files with partitioning
arrow::Result<std::string> CreateExampleParquetHivePartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  auto schema = arrow::schema(
      {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
       arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
  std::vector<std::shared_ptr<arrow::Array>> arrays(4);
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[0]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[1]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[2]));
  arrow::StringBuilder string_builder;
  ARROW_RETURN_NOT_OK(
      string_builder.AppendValues({"a", "a", "a", "a", "a", "b", "b", "b", "b", "b"}));
  ARROW_RETURN_NOT_OK(string_builder.Finish(&arrays[3]));
  auto table = arrow::Table::Make(schema, arrays);
  // Write it using Datasets
  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
  // We'll use Hive-style partitioning, which creates directories with "key=value" pairs.
  auto partitioning = std::make_shared<ds::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  ARROW_RETURN_NOT_OK(ds::FileSystemDataset::Write(write_options, scanner));
  return base_path;
}

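The above created a directory with two subdirectories ("part=a" and "part=b"), and the Parquet files written in those directories no longer include the "part" column.

Reading this dataset, we now specify that the dataset should use a Hive-like partitioning scheme: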

// Read an entire dataset, but with partitioning information.
arrow::Result<std::shared_ptr<arrow::Table>> ScanPartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;  // Make sure to search subdirectories
  ds::FileSystemFactoryOptions options;
  // We'll use Hive-style partitioning. We'll let Arrow Datasets infer the partition
  // schema.
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
    std::cout << "Partition expression: "
              << (*fragment)->partition_expression().ToString() << std::endl;
  }
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

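Although the partition fields are not included in the actual Parquet files, they will be added back to the resulting table when scanning this dataset: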

$ ./debug/dataset_documentation_example file:///tmp parquet_hive partitioned
Found fragment: /tmp/parquet_dataset/part=a/part0.parquet
Partition expression: (part == "a")
Found fragment: /tmp/parquet_dataset/part=b/part1.parquet
Partition expression: (part == "b")
Read 20 rows
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
b: double
  -- field metadata --
  PARQUET:field_id: '2'
c: int64
  -- field metadata --
  PARQUET:field_id: '3'
part: string
----
# snip...

We can now filter on the partition keys, which avoids loading files altogether if they do not match the filter:

// Read an entire dataset, but with partitioning information. Also, filter the dataset on
// the partition values.
arrow::Result<std::shared_ptr<arrow::Table>> FilterPartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;
  ds::FileSystemFactoryOptions options;
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  // Filter based on the partition values. This will mean that we won't even read the
  // files whose partition expressions don't match the filter.
  ARROW_RETURN_NOT_OK(
      scan_builder->Filter(cp::equal(cp::field_ref("part"), cp::literal("b"))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
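
Conceptually, the pruning step only needs the file paths: a file whose partition directory contradicts the filter can be skipped without being opened. The following standard-library sketch mimics that idea for a single equality filter on a Hive-style layout; HasPartitionValue and PruneByPartition are hypothetical names, and Arrow itself works with partition expressions rather than string matching.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true if `path` contains the directory segment
// "<key>=<value>".
bool HasPartitionValue(const std::string& path, const std::string& key,
                       const std::string& value) {
  const std::string wanted = "/" + key + "=" + value + "/";
  // Prepend '/' so that a relative path starting with the segment also matches.
  return ("/" + path).find(wanted) != std::string::npos;
}

// Hypothetical helper: keeps only the files whose partition directory matches
// key == value; all other files are dropped without being read.
std::vector<std::string> PruneByPartition(const std::vector<std::string>& paths,
                                          const std::string& key,
                                          const std::string& value) {
  std::vector<std::string> kept;
  for (const auto& path : paths) {
    if (HasPartitionValue(path, key, value)) kept.push_back(path);
  }
  return kept;
}
```

With the two fragments printed earlier, filtering on part == "b" leaves only /tmp/parquet_dataset/part=b/part1.parquet to be read.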

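Different partitioning schemes#

The above example uses a Hive-like directory scheme, such as "/year=2009/month=11/day=15". We specified this by passing the Hive partitioning factory. In this case, the types of the partition keys are inferred from the file paths.

It is also possible to construct the partitioning directly and explicitly define the schema of the partition keys. For example: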

auto part = std::make_shared<ds::HivePartitioning>(arrow::schema({
    arrow::field("year", arrow::int16()),
    arrow::field("month", arrow::int8()),
    arrow::field("day", arrow::int32())
}));

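Arrow supports another partitioning scheme, "directory partitioning", where the segments in the file path represent the values of the partition keys without including the names (the field names are implicit in the segment's index). For example, given the field names "year", "month", and "day", one path might be "/2019/11/15".

Since the names are not included in the file paths, they must be specified when constructing a directory partitioning: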

auto part = ds::DirectoryPartitioning::MakeFactory({"year", "month", "day"});

Directory partitioning also supports providing a full schema rather than inferring types from file paths.
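
The positional mapping used by directory partitioning can be sketched with the standard library as follows. ParseDirectorySegments is a hypothetical helper, the path is assumed to be relative to the partitioning base directory, and values are kept as strings.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): pairs the i-th non-empty path
// segment with the i-th field name. Segments beyond the number of field
// names (for example the file name) are ignored.
std::vector<std::pair<std::string, std::string>> ParseDirectorySegments(
    const std::string& path, const std::vector<std::string>& field_names) {
  std::vector<std::pair<std::string, std::string>> result;
  std::size_t start = 0;
  while (start < path.size() && result.size() < field_names.size()) {
    std::size_t end = path.find('/', start);
    if (end == std::string::npos) end = path.size();
    if (end > start) {  // skip empty segments, such as before a leading '/'
      result.emplace_back(field_names[result.size()],
                          path.substr(start, end - start));
    }
    start = end + 1;
  }
  return result;
}
```

Given the field names "year", "month", and "day", the path "/2019/11/15" maps to (year, 2019), (month, 11), and (day, 15).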

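Partitioning performance considerations#

Partitioning datasets has two aspects that affect performance: it increases the number of files, and it creates a directory structure around the files. Both of these have benefits as well as costs. Depending on the configuration and the size of your dataset, the costs can outweigh the benefits.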

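Because partitions split up the dataset into multiple files, partitioned datasets can be read and written in parallel. However, each additional file adds a little processing overhead for filesystem interaction. It also increases the overall dataset size, since each file carries some shared metadata. For example, each Parquet file contains the schema and group-level statistics. The number of partitions is a floor for the number of files. If you partition a dataset by date with a year of data, you will have at least 365 files. If you further partition by another dimension with 1,000 unique values, you will have up to 365,000 partitions, each holding at least one file. Partitioning this finely often leads to small files that mostly consist of metadata.

Partitioned datasets create nested folder structures, and those allow us to prune which files are loaded in a scan. However, this adds overhead to discovering files in the dataset, as we need to recursively "list directory" to find the data files. Partitions that are too fine can cause problems here: partitioning a dataset by date for a year's worth of data requires 365 list calls to find all the files; adding another column with cardinality 1,000 makes that 365,365 calls.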
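
The figures above follow from simple arithmetic, sketched below with the standard library. PartitionCost and EstimatePartitionCost are hypothetical names; the sketch assumes that every combination of partition values occurs and, like the numbers in the text, does not count the listing of the dataset root.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result type for the estimate below.
struct PartitionCost {
  std::uint64_t leaf_directories;  // partitions, if every value combination occurs
  std::uint64_t list_calls;        // directory listings below the dataset root
};

// Each partitioning level multiplies the number of directories by the
// cardinality of its column, and every directory has to be listed once.
PartitionCost EstimatePartitionCost(const std::vector<std::uint64_t>& cardinalities) {
  PartitionCost cost{1, 0};
  for (std::uint64_t cardinality : cardinalities) {
    cost.leaf_directories *= cardinality;
    cost.list_calls += cost.leaf_directories;
  }
  return cost;
}
```

For cardinalities {365, 1000} this gives 365,000 leaf directories and 365 + 365,000 = 365,365 list calls.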

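The optimal partitioning layout will depend on your data, access patterns, and which systems will be reading the data. Most systems, including Arrow, should work across a range of file sizes and partitioning layouts, but there are extremes you should avoid. These guidelines can help avoid some known worst cases:

Avoid files smaller than 20 MB or larger than 2 GB.
Avoid partitioning layouts with more than 10,000 distinct partitions.

For file formats that have a notion of groups within a file, such as Parquet, similar guidelines apply. Row groups can provide parallelism when reading and allow data skipping based on statistics, but very small groups can cause metadata to become a significant portion of the file size. Arrow's file writer provides sensible defaults for group sizing in most cases.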
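
The file-size and partition-count guidelines above could be checked against an existing layout with a small helper like the one below. CheckLayout and the constant names are hypothetical, and interpreting "MB" and "GB" as binary units is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Thresholds taken from the guidelines above. Treating "MB" and "GB" as
// binary units (MiB, GiB) is an assumption of this sketch.
constexpr std::uint64_t kMinFileBytes = 20ULL * 1024 * 1024;
constexpr std::uint64_t kMaxFileBytes = 2ULL * 1024 * 1024 * 1024;
constexpr std::uint64_t kMaxPartitions = 10000;

// Hypothetical helper: returns one warning per violated guideline.
std::vector<std::string> CheckLayout(const std::vector<std::uint64_t>& file_sizes,
                                     std::uint64_t num_partitions) {
  std::size_t too_small = 0;
  std::size_t too_large = 0;
  for (std::uint64_t size : file_sizes) {
    if (size < kMinFileBytes) ++too_small;
    if (size > kMaxFileBytes) ++too_large;
  }
  std::vector<std::string> warnings;
  if (too_small > 0) {
    warnings.push_back(std::to_string(too_small) + " file(s) smaller than 20 MB");
  }
  if (too_large > 0) {
    warnings.push_back(std::to_string(too_large) + " file(s) larger than 2 GB");
  }
  if (num_partitions > kMaxPartitions) {
    warnings.push_back("more than 10,000 distinct partitions");
  }
  return warnings;
}
```

An empty result means the layout stays within all three guidelines; it does not imply that the layout is optimal for a given workload.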

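Reading from other data sources#

Reading in-memory data#

If you already have data in memory that you'd like to use with the Datasets API (e.g. to filter/project data, or to write it out to a filesystem), you can wrap it in an arrow::dataset::InMemoryDataset: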

auto table = arrow::Table::FromRecordBatches(...);
auto dataset = std::make_shared<arrow::dataset::InMemoryDataset>(std::move(table));
// Scan the dataset, filter it, etc.
auto scanner_builder = dataset->NewScan();

In the example, we used the InMemoryDataset to write our example data to local disk, which was then used in the rest of the example:

// Set up a dataset by writing files with partitioning
arrow::Result<std::string> CreateExampleParquetHivePartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  auto schema = arrow::schema(
      {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
       arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
  std::vector<std::shared_ptr<arrow::Array>> arrays(4);
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[0]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[1]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[2]));
  arrow::StringBuilder string_builder;
  ARROW_RETURN_NOT_OK(
      string_builder.AppendValues({"a", "a", "a", "a", "a", "b", "b", "b", "b", "b"}));
  ARROW_RETURN_NOT_OK(string_builder.Finish(&arrays[3]));
  auto table = arrow::Table::Make(schema, arrays);
  // Write it using Datasets
  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
  // We'll use Hive-style partitioning, which creates directories with "key=value" pairs.
  auto partitioning = std::make_shared<ds::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  ARROW_RETURN_NOT_OK(ds::FileSystemDataset::Write(write_options, scanner));
  return base_path;
}

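Reading from cloud storage#

In addition to local files, Arrow Datasets also support reading from cloud storage systems, such as Amazon S3, by passing a different filesystem.

See the filesystem docs for more details on the available filesystems.

A note on transactions & ACID guarantees#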

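The dataset API offers no transaction support or any ACID guarantees. This affects both reading and writing. Concurrent reads are fine. Concurrent writes, or writes that happen concurrently with reads, may have unexpected behavior. Various approaches can be used to avoid operating on the same files, such as using a unique basename template for each writer, a temporary directory for new files, or separate storage of the file list instead of relying on directory discovery.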
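
As one possible way to give each writer its own basename template, the sketch below embeds a writer-specific identifier in the template while keeping the "{i}" placeholder used in the writing example earlier. MakeBasenameTemplate is a hypothetical helper; how the identifier is made unique (a UUID, host name plus process ID, and so on) is left to the caller.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds a basename template that embeds a
// writer-specific identifier. The "{i}" placeholder is kept verbatim so the
// dataset writer can still number the files it produces.
std::string MakeBasenameTemplate(const std::string& writer_id,
                                 const std::string& extension) {
  assert(!writer_id.empty());  // an empty id would not separate writers
  return "part-" + writer_id + "-{i}." + extension;
}
```

The result could then be assigned to write_options.basename_template, so that two writers targeting the same directory never produce the same file name.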

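Unexpectedly killing the process while a write is in progress can leave the system in an inconsistent state. Write calls generally return as soon as the bytes to be written have been completely delivered to the OS page cache. Even though a write operation has completed, part of the file can still be lost if there is a sudden power loss immediately after the write call.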

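Most file formats have magic numbers that are written at the end of the file. This means a partial file write can safely be detected and discarded. The CSV file format has no such concept, so a partially written CSV file may be detected as valid.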
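
A minimal sketch of such a check, using only the standard library: Parquet files, for instance, end with the four bytes "PAR1", so a file that was cut off before its end will normally not end with them. EndsWithMagic is a hypothetical helper operating on bytes already read into memory; a matching trailer is a necessary condition only, and real readers also validate the footer that precedes it.

```cpp
#include <string>

// Hypothetical helper: true if `bytes` ends with the given magic string.
bool EndsWithMagic(const std::string& bytes, const std::string& magic) {
  return bytes.size() >= magic.size() &&
         bytes.compare(bytes.size() - magic.size(), magic.size(), magic) == 0;
}
```

A writer that stopped halfway leaves data for which EndsWithMagic(data, "PAR1") is false, which lets a cleanup step discard the file rather than treat it as valid.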

Full example#

// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// https://apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

// This example showcases various ways to work with Datasets. It's
// intended to be paired with the documentation.

#include <arrow/api.h>
#include <arrow/compute/cast.h>
#include <arrow/dataset/dataset.h>
#include <arrow/dataset/discovery.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_ipc.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/scanner.h>
#include <arrow/filesystem/filesystem.h>
#include <arrow/ipc/writer.h>
#include <arrow/util/iterator.h>
#include <parquet/arrow/writer.h>
#include "arrow/compute/expression.h"

#include <iostream>
#include <vector>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;
namespace cp = arrow::compute;

/**
 * \brief Run Example
 *
 * ./debug/dataset-documentation-example file:///<some_path>/<some_directory> parquet
 */

// (Doc section: Reading Datasets)
// Generate some data for the rest of this example.
arrow::Result<std::shared_ptr<arrow::Table>> CreateTable() {
  auto schema =
      arrow::schema({arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
                     arrow::field("c", arrow::int64())});
  std::shared_ptr<arrow::Array> array_a;
  std::shared_ptr<arrow::Array> array_b;
  std::shared_ptr<arrow::Array> array_c;
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_a));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_b));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_c));
  return arrow::Table::Make(schema, {array_a, array_b, array_c});
}

// Set up a dataset by writing two Parquet files.
arrow::Result<std::string> CreateExampleParquetDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Parquet files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(0, 5), arrow::default_memory_pool(), output, /*chunk_size=*/2048));
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(5), arrow::default_memory_pool(), output, /*chunk_size=*/2048));
  return base_path;
}
// (Doc section: Reading Datasets)

// (Doc section: Reading different file formats)
// Set up a dataset by writing two Feather files.
arrow::Result<std::string> CreateExampleFeatherDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/feather_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Feather files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.feather"));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(output.get(), table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table->Slice(0, 5)));
  ARROW_RETURN_NOT_OK(writer->Close());
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.feather"));
  ARROW_ASSIGN_OR_RAISE(writer,
                        arrow::ipc::MakeFileWriter(output.get(), table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table->Slice(5)));
  ARROW_RETURN_NOT_OK(writer->Close());
  return base_path;
}
// (Doc section: Reading different file formats)

// (Doc section: Reading and writing partitioned data)
// Set up a dataset by writing files with partitioning
arrow::Result<std::string> CreateExampleParquetHivePartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  auto schema = arrow::schema(
      {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
       arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
  std::vector<std::shared_ptr<arrow::Array>> arrays(4);
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[0]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[1]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[2]));
  arrow::StringBuilder string_builder;
  ARROW_RETURN_NOT_OK(
      string_builder.AppendValues({"a", "a", "a", "a", "a", "b", "b", "b", "b", "b"}));
  ARROW_RETURN_NOT_OK(string_builder.Finish(&arrays[3]));
  auto table = arrow::Table::Make(schema, arrays);
  // Write it using Datasets
  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
  // We'll use Hive-style partitioning, which creates directories with "key=value" pairs.
  auto partitioning = std::make_shared<ds::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  ARROW_RETURN_NOT_OK(ds::FileSystemDataset::Write(write_options, scanner));
  return base_path;
}
// (Doc section: Reading and writing partitioned data)

// (Doc section: Dataset discovery)
// Read the whole dataset with the given format, without partitioning.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWholeDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  // Create a dataset by scanning the filesystem for files
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
  }
  // Read the entire dataset as a Table
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Dataset discovery)

// (Doc section: Filtering data)
// Read a dataset, but select only column "b" and only rows where b < 4.
//
// This is useful when you only want a few columns from a dataset. Where possible,
// Datasets will push down the column selection such that less work is done.
arrow::Result<std::shared_ptr<arrow::Table>> FilterAndSelectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
  ARROW_RETURN_NOT_OK(scan_builder->Filter(cp::less(cp::field_ref("b"), cp::literal(4))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Filtering data)

// (Doc section: Projecting columns)
// Read a dataset, but with column projection.
//
// This is useful to derive new columns from existing data. For example, here we
// demonstrate casting a column to a different type, and turning a numeric column into a
// boolean column based on a predicate. You could also rename columns or perform
// computations involving multiple columns.
arrow::Result<std::shared_ptr<arrow::Table>> ProjectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Project columns using expressions, giving each output column an explicit name
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project(
      {
          // Pass column "a" through unchanged (only its output name differs).
          cp::field_ref("a"),
          // Cast column "b" to float32.
          cp::call("cast", {cp::field_ref("b")},
                   arrow::compute::CastOptions::Safe(arrow::float32())),
          // Derive a boolean column from "c".
          cp::equal(cp::field_ref("c"), cp::literal(1)),
      },
      {"a_renamed", "b_as_float32", "c_1"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Projecting columns)

// (Doc section: Projecting columns #2)
// Read a dataset, but with column projection.
//
// This time, we read all original columns plus one derived column. This simply combines
// the previous two examples: selecting a subset of columns by name, and deriving new
// columns with an expression.
arrow::Result<std::shared_ptr<arrow::Table>> SelectAndProjectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Project all original columns plus one derived column
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  std::vector<std::string> names;
  std::vector<cp::Expression> exprs;
  // Read all the original columns.
  for (const auto& field : dataset->schema()->fields()) {
    names.push_back(field->name());
    exprs.push_back(cp::field_ref(field->name()));
  }
  // Also derive a new column.
  names.emplace_back("b_large");
  exprs.push_back(cp::greater(cp::field_ref("b"), cp::literal(1)));
  ARROW_RETURN_NOT_OK(scan_builder->Project(exprs, names));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Projecting columns #2)

// (Doc section: Reading and writing partitioned data #2)
// Read an entire dataset, but with partitioning information.
arrow::Result<std::shared_ptr<arrow::Table>> ScanPartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;  // Make sure to search subdirectories
  ds::FileSystemFactoryOptions options;
  // We'll use Hive-style partitioning. We'll let Arrow Datasets infer the partition
  // schema.
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
    std::cout << "Partition expression: "
              << (*fragment)->partition_expression().ToString() << std::endl;
  }
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Reading and writing partitioned data #2)

// (Doc section: Reading and writing partitioned data #3)
// Read an entire dataset, but with partitioning information. Also, filter the dataset on
// the partition values.
arrow::Result<std::shared_ptr<arrow::Table>> FilterPartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;
  ds::FileSystemFactoryOptions options;
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  // Filter based on the partition values. This will mean that we won't even read the
  // files whose partition expressions don't match the filter.
  ARROW_RETURN_NOT_OK(
      scan_builder->Filter(cp::equal(cp::field_ref("part"), cp::literal("b"))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Reading and writing partitioned data #3)

arrow::Status RunDatasetDocumentation(const std::string& format_name,
                                      const std::string& uri, const std::string& mode) {
  std::string base_path;
  std::shared_ptr<ds::FileFormat> format;
  std::string root_path;
  ARROW_ASSIGN_OR_RAISE(auto fs, fs::FileSystemFromUri(uri, &root_path));

  if (format_name == "feather") {
    format = std::make_shared<ds::IpcFileFormat>();
    ARROW_ASSIGN_OR_RAISE(base_path, CreateExampleFeatherDataset(fs, root_path));
  } else if (format_name == "parquet") {
    format = std::make_shared<ds::ParquetFileFormat>();
    ARROW_ASSIGN_OR_RAISE(base_path, CreateExampleParquetDataset(fs, root_path));
  } else if (format_name == "parquet_hive") {
    format = std::make_shared<ds::ParquetFileFormat>();
    ARROW_ASSIGN_OR_RAISE(base_path,
                          CreateExampleParquetHivePartitionedDataset(fs, root_path));
  } else {
    std::cerr << "Unknown format: " << format_name << std::endl;
    std::cerr << "Supported formats: feather, parquet, parquet_hive" << std::endl;
    return arrow::Status::ExecutionError("Dataset creation failed.");
  }

  std::shared_ptr<arrow::Table> table;
  if (mode == "no_filter") {
    ARROW_ASSIGN_OR_RAISE(table, ScanWholeDataset(fs, format, base_path));
  } else if (mode == "filter") {
    ARROW_ASSIGN_OR_RAISE(table, FilterAndSelectDataset(fs, format, base_path));
  } else if (mode == "project") {
    ARROW_ASSIGN_OR_RAISE(table, ProjectDataset(fs, format, base_path));
  } else if (mode == "select_project") {
    ARROW_ASSIGN_OR_RAISE(table, SelectAndProjectDataset(fs, format, base_path));
  } else if (mode == "partitioned") {
    ARROW_ASSIGN_OR_RAISE(table, ScanPartitionedDataset(fs, format, base_path));
  } else if (mode == "filter_partitioned") {
    ARROW_ASSIGN_OR_RAISE(table, FilterPartitionedDataset(fs, format, base_path));
  } else {
    std::cerr << "Unknown mode: " << mode << std::endl;
    std::cerr << "Supported modes: no_filter, filter, project, select_project, "
                 "partitioned, filter_partitioned"
              << std::endl;
    return arrow::Status::ExecutionError("Dataset reading failed.");
  }
  std::cout << "Read " << table->num_rows() << " rows" << std::endl;
  std::cout << table->ToString() << std::endl;
  return arrow::Status::OK();
}

int main(int argc, char** argv) {
  if (argc < 3) {
    // Fake success for CI purposes.
    return EXIT_SUCCESS;
  }

  std::string uri = argv[1];
  std::string format_name = argv[2];
  std::string mode = argc > 3 ? argv[3] : "no_filter";

  auto status = RunDatasetDocumentation(format_name, uri, mode);
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
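
The Hive-style layout used in the listing above encodes partition values in directory names as "key=value" pairs, which is what allows a filter on `part` to rule out whole files without opening them. The sketch below is a conceptual illustration only, written against the C++ standard library. `ParseHiveSegments` and `MayMatch` are hypothetical helpers, not Arrow APIs, and the example paths are assumed rather than taken from a real run. Arrow's actual implementation works on typed partition expressions and covers cases this sketch ignores.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Arrow): collect the "key=value" pairs found
// in the directory portion of a relative path such as "part=a/part0.parquet".
std::map<std::string, std::string> ParseHiveSegments(const std::string& path) {
  std::map<std::string, std::string> values;
  // Only directories carry partition values, so drop the trailing file name.
  const auto last_slash = path.rfind('/');
  if (last_slash == std::string::npos) return values;  // no directories at all
  std::istringstream stream(path.substr(0, last_slash));
  std::string segment;
  while (std::getline(stream, segment, '/')) {
    const auto eq = segment.find('=');
    // Skip segments that are not of the form "key=value".
    if (eq == std::string::npos || eq == 0) continue;
    values[segment.substr(0, eq)] = segment.substr(eq + 1);
  }
  return values;
}

// Hypothetical helper: returns false only when the path encodes a value for
// `key` that differs from `wanted`. Files without that key are kept, because
// nothing in the path proves they cannot match.
bool MayMatch(const std::string& path, const std::string& key,
              const std::string& wanted) {
  assert(!key.empty());  // precondition: a partition field name is required
  const auto values = ParseHiveSegments(path);
  const auto it = values.find(key);
  return it == values.end() || it->second == wanted;
}
```

Under these assumptions, `MayMatch("part=a/part0.parquet", "part", "b")` is false, so such a file could be skipped, while a path with no `part` segment is conservatively kept. Values are compared as plain strings here, which is a simplification chosen to keep the sketch short.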