Acero User's Guide#

This page describes how to use Acero. It is recommended that you read the overview first and familiarize yourself with the basic concepts.

Using Acero#

The basic workflow for Acero is as follows (a minimal sketch follows the list):

  1. First, create a graph of Declaration objects describing the plan

  2. Call one of the DeclarationToXyz methods to execute the Declaration.

    1. A new ExecPlan is created from the graph of Declarations. Each Declaration will correspond to one ExecNode in the plan. In addition, a sink node will be added, depending on which DeclarationToXyz method was used.

    2. The ExecPlan is executed. Typically this happens as part of the DeclarationToXyz call, but with DeclarationToReader the reader is returned before the plan has finished executing.

    3. Once the plan has finished, it is destroyed.
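
To make these steps concrete, here is a minimal sketch of the workflow. It assumes the namespace aliases (ac for arrow::acero, cp for arrow::compute) and the kind of in-memory table input used by the examples later in this guide; the RunSimplePlan wrapper itself is illustrative, not part of the Acero API.

namespace ac = ::arrow::acero;
namespace cp = ::arrow::compute;

arrow::Result<std::shared_ptr<arrow::Table>> RunSimplePlan(
    std::shared_ptr<arrow::Table> input) {
  // Step 1: create a graph of Declarations describing the plan
  ac::Declaration source{"table_source", ac::TableSourceNodeOptions{std::move(input)}};
  ac::Declaration filter{
      "filter", {std::move(source)},
      ac::FilterNodeOptions(cp::greater(cp::field_ref("a"), cp::literal(3)))};
  // Step 2: execute it.  DeclarationToTable adds a sink node, creates and runs
  // the ExecPlan, and destroys the plan once it is finished.
  return ac::DeclarationToTable(std::move(filter));
}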

Creating a Plan#

Using Substrait#

Substrait is the preferred mechanism for creating a plan (a graph of Declarations). There are a number of reasons for this:

  • Substrait producers spend a lot of time and effort creating user-friendly ways to build complex execution plans in a simple way. For example, a pivot_wider operation might be achieved through a complex series of aggregate nodes. Rather than have you create all of those aggregate nodes by hand, a producer will provide a much simpler API for you to use.

  • If you use Substrait then, should you someday find another engine that better meets your needs, you can easily switch to any other engine that consumes Substrait.

  • We hope that tools will eventually emerge for Substrait-based optimizers and planners. By using Substrait, it will be much easier for you to use those tools in the future.

You could create the Substrait plan yourself, but you will probably find it much easier to use an existing Substrait producer. For example, you can use ibis-substrait to easily create Substrait plans from Python expressions. There are a number of tools for creating Substrait plans from SQL. Eventually, we hope that C++-based Substrait producers will emerge; however, we are not aware of any at this time.

For detailed instructions on creating an execution plan from Substrait, see the Substrait page.

Programmatic Plan Creation#

Creating an execution plan programmatically is simpler than creating a plan from Substrait, though some flexibility and future-proofing is lost. The simplest way to create a Declaration is to simply instantiate one. You will need the name of the declaration, a vector of inputs, and an options object. For example:

/// \brief An example showing a project node
///
/// Scan-Project-Table
/// This example shows how a Scan operation can be used to load the data
/// into the execution plan, how a project operation can be applied on the
/// data stream and how the output is collected into a table
arrow::Status ScanProjectSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // projection
  cp::Expression a_times_2 = cp::call("multiply", {cp::field_ref("a"), cp::literal(2)});
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};
  ac::Declaration project{
      "project", {std::move(scan)}, ac::ProjectNodeOptions({a_times_2})};

  return ExecutePlanAndCollectAsTable(std::move(project));
}

The above code creates a scan declaration (which has no inputs) and a project declaration (which uses the scan as its input). This is simple enough, but we can make it slightly easier: if you are creating a linear sequence of declarations (as in the example above) then you can also use the Declaration::Sequence() function.

  // Inputs do not have to be passed to the project node when using Sequence
  ac::Declaration plan =
      ac::Declaration::Sequence({{"scan", std::move(scan_node_options)},
                                 {"project", ac::ProjectNodeOptions({a_times_2})}});

There are many more examples of programmatic plan creation later in this document.

Executing a Plan#

There are a number of different methods that can be used to execute a declaration. Each one provides the data in a slightly different form. Since all of these methods begin with DeclarationTo..., this guide will often refer to them as the DeclarationToXyz methods.

DeclarationToTable#

The DeclarationToTable() method will accumulate all of the results into a single arrow::Table. This is perhaps the simplest way to collect results from Acero. The main disadvantage of this approach is that it requires accumulating all of the results in memory.
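
For instance, a minimal sketch (not taken from the example file; it assumes the ac namespace alias used throughout this guide and a previously constructed plan declaration):

arrow::Status CollectIntoTableExample(ac::Declaration plan) {
  // accumulate all of the results into a single in-memory table
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table,
                        ac::DeclarationToTable(std::move(plan)));
  std::cout << "Collected " << table->num_rows() << " rows" << std::endl;
  return arrow::Status::OK();
}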

Note

Acero processes large datasets in small chunks. This is described in more detail in the developer's guide. As a result, you may be surprised to find that a table collected with DeclarationToTable is chunked differently than your input. For example, your input might be a large table with a single chunk of 2 million rows. Your output table might then have 64 chunks with 32Ki rows each. There is an open request in GH-15155 to allow specifying the chunk size for the output.

DeclarationToReader#

The DeclarationToReader() method allows you to consume the results iteratively. It will create an arrow::RecordBatchReader which you can read from at your leisure. If you do not read from the reader quickly enough then backpressure will be applied and the execution plan will pause. Closing the reader will cancel the running execution plan, and the reader's destructor will wait for the execution plan to finish whatever it is doing, so it may block.
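
A minimal sketch of consuming a reader, under the same assumptions as above (the ConsumeReaderExample wrapper is illustrative only):

arrow::Status ConsumeReaderExample(ac::Declaration plan) {
  ARROW_ASSIGN_OR_RAISE(std::unique_ptr<arrow::RecordBatchReader> reader,
                        ac::DeclarationToReader(std::move(plan)));
  while (true) {
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::RecordBatch> batch, reader->Next());
    if (!batch) break;  // the stream is exhausted
    // ... process batch; reading too slowly applies backpressure to the plan ...
  }
  // closing (or destroying) the reader cancels a still-running plan and may
  // block until the plan has stopped
  return reader->Close();
}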

DeclarationToStatus#

The DeclarationToStatus() method is useful if you want to run the plan but do not actually want to consume the results. For example, this is useful when benchmarking, or when the plan has side effects such as a dataset write node. If the plan generates any results then they are immediately discarded.

Running a Plan Directly#

If, for some reason, the DeclarationToXyz methods are not sufficient, then it is possible to run a plan directly. This should only be needed if you are doing something unique. For example, if you have created a custom sink node, or if you need a plan that has multiple outputs.

Note

In academic literature and many existing systems there is a general assumption that an execution plan has at most one output. Some things in Acero, such as the DeclarationToXyz methods, will expect this. However, nothing in the design strictly prevents having multiple sink nodes.

Detailed instructions for doing this are not given in this guide, but the rough steps are as follows (a sketch follows the list):

  1. Create a new ExecPlan object.

  2. Add sink nodes to your graph of Declaration objects (this is the only case where you will need to create declarations for sink nodes).

  3. Use Declaration::AddToPlan() to add your declaration to the plan (if you have more than one output then you will not be able to use this method and will need to add your nodes one at a time).

  4. Validate the plan with ExecPlan::Validate().

  5. Start the plan with ExecPlan::StartProducing().

  6. Wait for the future returned by ExecPlan::finished() to complete.
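
A rough sketch of these steps, assuming the declaration graph has already been terminated with a sink node (error handling abbreviated; the RunPlanDirectly wrapper is illustrative only):

arrow::Status RunPlanDirectly(ac::Declaration declaration_with_sink) {
  // 1. create a new ExecPlan object
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan, ac::ExecPlan::Make());
  // 2-3. add the declaration graph (already ending in a sink node) to the plan
  ARROW_RETURN_NOT_OK(declaration_with_sink.AddToPlan(plan.get()));
  // 4. validate the plan
  ARROW_RETURN_NOT_OK(plan->Validate());
  // 5. start the plan
  plan->StartProducing();
  // 6. wait for the future returned by finished() to complete
  return plan->finished().status();
}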

Providing Input#

The input data for an ExecPlan can come from a variety of sources. It is often read from files stored on some kind of filesystem. Input also frequently comes from in-memory data; this is common, for example, in a pandas-like frontend. Input can also come from network streams, such as a Flight request. Acero can support all of these cases, and can even support unique and custom situations not mentioned here.

There are a number of predefined source nodes that cover the most common input scenarios; they are listed below. If your source data is unique, however, you will need to use the generic source node. This node expects you to provide an asynchronous stream of batches, and is covered in more detail in the source section below.

Available ExecNode Implementations#

The following tables briefly summarize the available operators.

Sources#

These nodes can be used as sources of data.

Source Nodes#

Factory Name               | Options                            | Brief Description
---------------------------|------------------------------------|-------------------
source                     | SourceNodeOptions                  | A generic source node that wraps an asynchronous stream of data (example)
table_source               | TableSourceNodeOptions             | Generates data from an arrow::Table (example)
record_batch_source        | RecordBatchSourceNodeOptions       | Generates data from an iterator of arrow::RecordBatch
record_batch_reader_source | RecordBatchReaderSourceNodeOptions | Generates data from an arrow::RecordBatchReader
exec_batch_source          | ExecBatchSourceNodeOptions         | Generates data from an iterator of arrow::compute::ExecBatch
array_vector_source        | ArrayVectorSourceNodeOptions       | Generates data from an iterator of vectors of arrow::Array
scan                       | arrow::dataset::ScanNodeOptions    | Generates data from an arrow::dataset::Dataset (requires the datasets module) (example)

Compute Nodes#

These nodes apply computations to the data; they can transform or reshape it.

Compute Nodes#

Factory Name | Options                | Brief Description
-------------|------------------------|-------------------
filter       | FilterNodeOptions      | Removes rows that do not match a given filter expression (example)
project      | ProjectNodeOptions     | Creates new columns by evaluating expressions; can also drop and reorder columns (example)
aggregate    | AggregateNodeOptions   | Computes summary statistics on the entire input stream or on groups of data (example)
pivot_longer | PivotLongerNodeOptions | Reshapes data by converting some columns into additional rows

Arrangement Nodes#

These nodes reorder, combine, or slice streams of data.

Arrangement Nodes#

Factory Name | Options             | Brief Description
-------------|---------------------|-------------------
hash_join    | HashJoinNodeOptions | Joins two inputs based on common columns (example)
asofjoin     | AsofJoinNodeOptions | Joins multiple inputs to the first input based on a common ordered column (often time)
union        | N/A                 | Merges two inputs with identical schemas (example)
order_by     | OrderByNodeOptions  | Reorders a stream
fetch        | FetchNodeOptions    | Slices a range of rows from a stream

Sink Nodes#

These nodes terminate a plan. Users do not typically create sink nodes, since they are selected based on the DeclarationToXyz method used to consume the plan. However, this list may be useful to those developing new sink nodes or using Acero in advanced ways.

Sink Nodes#

Factory Name   | Options                          | Brief Description
---------------|----------------------------------|-------------------
sink           | SinkNodeOptions                  | Collects batches into a FIFO queue, with optional backpressure
write          | arrow::dataset::WriteNodeOptions | Writes batches to a filesystem (example)
consuming_sink | ConsumingSinkNodeOptions         | Consumes batches with a user-provided callback
table_sink     | TableSinkNodeOptions             | Collects batches into an arrow::Table
order_by_sink  | OrderBySinkNodeOptions           | Deprecated
select_k_sink  | SelectKSinkNodeOptions           | Deprecated

Examples#

The rest of this document contains example execution plans. Each example highlights the behavior of a specific execution node.

source#

A source operation can be considered as an entry point to create a streaming execution plan. SourceNodeOptions are used to create the source operation. The source operation is the most generic and flexible type of source, but it can be quite tricky to configure. First you should review the other source node types to ensure there is not a simpler choice.

The source node requires some kind of function that can be called to poll for more data. This function should take no arguments and should return an arrow::Future<std::optional<cp::ExecBatch>>. This function might be reading a file, iterating through an in-memory structure, or receiving data from a network connection. The Arrow library refers to these functions as arrow::AsyncGenerator and there are a number of utilities for working with these functions. For this example we use a vector of record batches that have already been stored in memory. In addition, the schema of the data must be known up front. Acero must know the schema of the data at each stage of the execution graph before any processing has begun. This means we must supply the schema for a source node separately from the data itself.

Here we define a struct to hold the data generator definition. This includes the in-memory batches, the schema, and the function that serves as the data generator:

struct BatchesWithSchema {
  std::vector<cp::ExecBatch> batches;
  std::shared_ptr<arrow::Schema> schema;
  // This method uses internal arrow utilities to
  // convert a vector of record batches to an AsyncGenerator of optional batches
  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> gen() const {
    auto opt_batches = ::arrow::internal::MapVector(
        [](cp::ExecBatch batch) { return std::make_optional(std::move(batch)); },
        batches);
    arrow::AsyncGenerator<std::optional<cp::ExecBatch>> gen;
    gen = arrow::MakeVectorGenerator(std::move(opt_batches));
    return gen;
  }
};

Generating sample batches for the computation:

arrow::Result<BatchesWithSchema> MakeBasicBatches() {
  BatchesWithSchema out;
  auto field_vector = {arrow::field("a", arrow::int32()),
                       arrow::field("b", arrow::boolean())};
  ARROW_ASSIGN_OR_RAISE(auto b1_int, GetArrayDataSample<arrow::Int32Type>({0, 4}));
  ARROW_ASSIGN_OR_RAISE(auto b2_int, GetArrayDataSample<arrow::Int32Type>({5, 6, 7}));
  ARROW_ASSIGN_OR_RAISE(auto b3_int, GetArrayDataSample<arrow::Int32Type>({8, 9, 10}));

  ARROW_ASSIGN_OR_RAISE(auto b1_bool,
                        GetArrayDataSample<arrow::BooleanType>({false, true}));
  ARROW_ASSIGN_OR_RAISE(auto b2_bool,
                        GetArrayDataSample<arrow::BooleanType>({true, false, true}));
  ARROW_ASSIGN_OR_RAISE(auto b3_bool,
                        GetArrayDataSample<arrow::BooleanType>({false, true, false}));

  ARROW_ASSIGN_OR_RAISE(auto b1,
                        GetExecBatchFromVectors(field_vector, {b1_int, b1_bool}));
  ARROW_ASSIGN_OR_RAISE(auto b2,
                        GetExecBatchFromVectors(field_vector, {b2_int, b2_bool}));
  ARROW_ASSIGN_OR_RAISE(auto b3,
                        GetExecBatchFromVectors(field_vector, {b3_int, b3_bool}));

  out.batches = {b1, b2, b3};
  out.schema = arrow::schema(field_vector);
  return out;
}

Example of using source (the usage of sink is explained in detail in the sink section).

/// \brief An example demonstrating a source and sink node
///
/// Source-Table Example
/// This example shows how a custom source can be used
/// in an execution plan. This includes source node using pregenerated
/// data and collecting it into a table.
///
/// This sort of custom source is often not needed.  In most cases you can
/// use a scan (for a dataset source) or a source like table_source, array_vector_source,
/// exec_batch_source, or record_batch_source (for in-memory data)
arrow::Status SourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}

table_source#

In the previous example, the source node was used to input the data. But when developing an application, if the data is already in memory as a table, it is much easier and more performant to use TableSourceNodeOptions. Here the input data can be passed as a std::shared_ptr<arrow::Table> along with a max_batch_size. The max_batch_size is used to break up large record batches so that they can be processed in parallel. It is important to note that the table batches will not be merged to form larger batches when the source table has a smaller batch size.

Example of using table_source.

/// \brief An example showing a table source node
///
/// TableSource-Table Example
/// This example shows how a table_source can be used
/// in an execution plan. This includes a table source node
/// receiving data from a table.  This plan simply collects the
/// data back into a table but nodes could be added that modify
/// or transform the data as well (as is shown in later examples)
arrow::Status TableSourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto table, GetTable());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;
  int max_batch_size = 2;
  auto table_source_options = ac::TableSourceNodeOptions{table, max_batch_size};

  ac::Declaration source{"table_source", std::move(table_source_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}

filter#

The filter operation, as the name suggests, provides an option to define data filtering criteria. It selects rows where the given expression evaluates to true. Filters can be written using arrow::compute::Expression, and the expression should have a boolean return type. For example, if we wish to keep rows where the value of column a is greater than 3, then we can use the following expression.

Filter example.

/// \brief An example showing a filter node
///
/// Source-Filter-Table
/// This example shows how a filter can be used in an execution plan,
/// to filter data from a source. The output from the execution plan
/// is collected into a table.
arrow::Status ScanFilterSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // specify the filter.  This filter keeps only the rows where the
  // value of the "a" column is greater than 3.
  cp::Expression filter_expr = cp::greater(cp::field_ref("a"), cp::literal(3));
  // set filter for scanner : on-disk / push-down filtering.
  // This step can be skipped if you are not reading from disk.
  options->filter = filter_expr;
  // empty projection
  options->projection = cp::project({}, {});

  // construct the scan node
  std::cout << "Initialized Scanning Options" << std::endl;

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};
  std::cout << "Scan node options created" << std::endl;

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  // pipe the scan node into the filter node
  // Need to set the filter in scan node options and filter node options.
  // At scan node it is used for on-disk / push-down filtering.
  // At filter node it is used for in-memory filtering.
  ac::Declaration filter{
      "filter", {std::move(scan)}, ac::FilterNodeOptions(std::move(filter_expr))};

  return ExecutePlanAndCollectAsTable(std::move(filter));
}

project#

The project operation rearranges, deletes, transforms, and creates columns. Each output column is computed by evaluating an expression against the source record batch. These must be scalar expressions (expressions consisting of scalar literals, field references, and scalar functions, i.e., elementwise functions that return one value for each input row independent of the values of all other rows). This is exposed via ProjectNodeOptions, which requires an arrow::compute::Expression and a name for each of the output columns (if names are not provided, the string representations of the expressions will be used).

Project example.

/// \brief An example showing a project node
///
/// Scan-Project-Table
/// This example shows how a Scan operation can be used to load the data
/// into the execution plan, how a project operation can be applied on the
/// data stream and how the output is collected into a table
arrow::Status ScanProjectSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // projection
  cp::Expression a_times_2 = cp::call("multiply", {cp::field_ref("a"), cp::literal(2)});
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};
  ac::Declaration project{
      "project", {std::move(scan)}, ac::ProjectNodeOptions({a_times_2})};

  return ExecutePlanAndCollectAsTable(std::move(project));
}

aggregate#

The aggregate node computes various types of aggregations over data.

Arrow supports two types of aggregations: "scalar" aggregations and "hash" aggregations. Scalar aggregations reduce an array or scalar input to a single scalar output (e.g. computing the mean of a column). Hash aggregations act like a GROUP BY in SQL: they first partition the data based on one or more key columns, then reduce the data in each partition. The aggregate node supports both types of computation, and can compute any number of aggregations at once.

AggregateNodeOptions is used to define the aggregation criteria. It takes a list of aggregation functions and their options; a list of target fields to aggregate, one per function; and a list of names for the output fields, one per function. Optionally, it takes a list of columns that are used to partition the data, in the case of a hash aggregation. The aggregation functions can be selected from this list of aggregation functions.

Note

This node is a "pipeline breaker" and will fully materialize the dataset in memory. In the future, spillover mechanisms will be added which should alleviate this constraint.

The aggregation can provide results as a group or a scalar. For instance, an operation like hash_count provides the counts per each unique record as a grouped result, while an operation like sum provides a single record.

Scalar aggregation example.

/// \brief An example showing an aggregation node to aggregate an entire table
///
/// Source-Aggregation-Table
/// This example shows how an aggregation operation can be applied on a
/// execution plan resulting in a scalar output. The source node loads the
/// data and the aggregation (the sum of the values in column 'a')
/// is applied on this data. The output is collected into a table (that will
/// have exactly one row)
arrow::Status SourceScalarAggregateSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};
  auto aggregate_options =
      ac::AggregateNodeOptions{/*aggregates=*/{{"sum", nullptr, "a", "sum(a)"}}};
  ac::Declaration aggregate{
      "aggregate", {std::move(source)}, std::move(aggregate_options)};

  return ExecutePlanAndCollectAsTable(std::move(aggregate));
}

Group aggregation example.

/// \brief An example showing an aggregation node to perform a group-by operation
///
/// Source-Aggregation-Table
/// This example shows how an aggregation operation can be applied on a
/// execution plan resulting in grouped output. The source node loads the
/// data and the aggregation (counting the non-null values in column 'a') is
/// applied on this data. The output is collected into a table that will contain
/// one row for each unique combination of group keys.
arrow::Status SourceGroupAggregateSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};
  auto options = std::make_shared<cp::CountOptions>(cp::CountOptions::ONLY_VALID);
  auto aggregate_options =
      ac::AggregateNodeOptions{/*aggregates=*/{{"hash_count", options, "a", "count(a)"}},
                               /*keys=*/{"b"}};
  ac::Declaration aggregate{
      "aggregate", {std::move(source)}, std::move(aggregate_options)};

  return ExecutePlanAndCollectAsTable(std::move(aggregate));
}

sink#

The sink operation provides output and is the final node of a streaming execution definition. The SinkNodeOptions interface is used to pass the required options. Similar to the source operator, the sink operator exposes the output with a function that returns a record batch future each time it is called. The caller is expected to call this function repeatedly until the generator function is exhausted (returns std::optional::nullopt). If this function is not called often enough then record batches will accumulate in memory. An execution plan should have only one "terminal" node (one sink node). An ExecPlan can terminate early due to cancellation or an error, before the output has been fully consumed. However, the plan can be safely destroyed independently of the sink, which will hold the unconsumed batches by exec_plan->finished().

As part of the source example, the sink operation is also included:

/// \brief An example demonstrating a source and sink node
///
/// Source-Table Example
/// This example shows how a custom source can be used
/// in an execution plan. This includes source node using pregenerated
/// data and collecting it into a table.
///
/// This sort of custom source is often not needed.  In most cases you can
/// use a scan (for a dataset source) or a source like table_source, array_vector_source,
/// exec_batch_source, or record_batch_source (for in-memory data)
arrow::Status SourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}

consuming_sink#

The consuming_sink operation is a sink operation that contains the consuming operation within the execution plan (i.e. the execution plan should not complete until the consumption has finished). Unlike the sink node, this node takes in a callback function that is expected to consume the batch. Once this callback has finished, the execution plan will no longer hold any reference to the batch. The consuming function may be called before a previous invocation has completed. If the consuming function does not run quickly enough then many concurrent executions could pile up, blocking the CPU thread pool. The execution plan will not be marked finished until all consuming function callbacks have been completed. Once all batches have been delivered, the execution plan will wait for the finish future to complete before marking the execution plan finished. This allows for workflows where the consumption function converts batches into async tasks (this is currently done internally for the dataset write node).

Example:

// define a Custom SinkNodeConsumer
std::atomic<uint32_t> batches_seen{0};
arrow::Future<> finish = arrow::Future<>::Make();
struct CustomSinkNodeConsumer : public cp::SinkNodeConsumer {
  CustomSinkNodeConsumer(std::atomic<uint32_t>* batches_seen, arrow::Future<> finish)
      : batches_seen(batches_seen), finish(std::move(finish)) {}

  arrow::Status Consume(cp::ExecBatch batch) override {
    // Consumption logic can be written here: the data can be consumed in the
    // expected way, e.g. transferred to another system, or some work can be
    // done and the results written to disk.
    (*batches_seen)++;
    return arrow::Status::OK();
  }

  arrow::Future<> Finish() override { return finish; }

  std::atomic<uint32_t>* batches_seen;
  arrow::Future<> finish;
};

std::shared_ptr<CustomSinkNodeConsumer> consumer =
    std::make_shared<CustomSinkNodeConsumer>(&batches_seen, finish);

arrow::acero::ExecNode* consuming_sink;

ARROW_ASSIGN_OR_RAISE(consuming_sink,
                      MakeExecNode("consuming_sink", plan.get(), {source},
                                   cp::ConsumingSinkNodeOptions(consumer)));

Consuming-Sink example.

/// \brief An example showing a consuming sink node
///
/// Source-Consuming-Sink
/// This example shows how the data can be consumed within the execution plan
/// by using a ConsumingSink node. There is no data output from this execution plan.
arrow::Status SourceConsumingSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  std::atomic<uint32_t> batches_seen{0};
  arrow::Future<> finish = arrow::Future<>::Make();
  struct CustomSinkNodeConsumer : public ac::SinkNodeConsumer {
    CustomSinkNodeConsumer(std::atomic<uint32_t>* batches_seen, arrow::Future<> finish)
        : batches_seen(batches_seen), finish(std::move(finish)) {}

    arrow::Status Init(const std::shared_ptr<arrow::Schema>& schema,
                       ac::BackpressureControl* backpressure_control,
                       ac::ExecPlan* plan) override {
      // This will be called as the plan is started (before the first call to Consume)
      // and provides the schema of the data coming into the node, controls for pausing /
      // resuming input, and a pointer to the plan itself which can be used to access
      // other utilities such as the thread indexer or async task scheduler.
      return arrow::Status::OK();
    }

    arrow::Status Consume(cp::ExecBatch batch) override {
      (*batches_seen)++;
      return arrow::Status::OK();
    }

    arrow::Future<> Finish() override {
      // Here you can perform whatever (possibly async) cleanup is needed, e.g. closing
      // output file handles and flushing remaining work
      return arrow::Future<>::MakeFinished();
    }

    std::atomic<uint32_t>* batches_seen;
    arrow::Future<> finish;
  };
  std::shared_ptr<CustomSinkNodeConsumer> consumer =
      std::make_shared<CustomSinkNodeConsumer>(&batches_seen, finish);

  ac::Declaration consuming_sink{"consuming_sink",
                                 {std::move(source)},
                                 ac::ConsumingSinkNodeOptions(std::move(consumer))};

  // Since we are consuming the data within the plan there is no output and we simply
  // run the plan to completion instead of collecting into a table.
  ARROW_RETURN_NOT_OK(ac::DeclarationToStatus(std::move(consuming_sink)));

  std::cout << "The consuming sink node saw " << batches_seen.load() << " batches"
            << std::endl;
  return arrow::Status::OK();
}

order_by_sink#

The order_by_sink operation is an extension of the sink operation. This operation provides the ability to guarantee the ordering of the stream by providing OrderBySinkNodeOptions. Here, arrow::compute::SortOptions are provided to define which columns are used for sorting and whether to sort by ascending or descending values.

Note

This node is a "pipeline breaker" and will fully materialize the dataset in memory. In the future, spillover mechanisms will be added which should alleviate this constraint.

Order-By-Sink example.

arrow::Status ExecutePlanAndCollectAsTableWithCustomSink(
    std::shared_ptr<ac::ExecPlan> plan, std::shared_ptr<arrow::Schema> schema,
    arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen) {
  // translate sink_gen (async) to sink_reader (sync)
  std::shared_ptr<arrow::RecordBatchReader> sink_reader =
      ac::MakeGeneratorReader(schema, std::move(sink_gen), arrow::default_memory_pool());

  // validate the ExecPlan
  ARROW_RETURN_NOT_OK(plan->Validate());
  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
  // start the ExecPlan
  plan->StartProducing();

  // collect sink_reader into a Table
  std::shared_ptr<arrow::Table> response_table;

  ARROW_ASSIGN_OR_RAISE(response_table,
                        arrow::Table::FromRecordBatchReader(sink_reader.get()));

  std::cout << "Results : " << response_table->ToString() << std::endl;

  // stop producing
  plan->StopProducing();
  // plan mark finished
  auto future = plan->finished();
  return future.status();
}

/// \brief An example showing an order-by node
///
/// Source-OrderBy-Sink
/// In this example, the data enters through the source node
/// and the data is ordered in the sink node. The order can be
/// ASCENDING or DESCENDING and it is configurable. The output
/// is obtained as a table from the sink node.
arrow::Status SourceOrderBySinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));

  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeSortTestBasicBatches());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};
  ARROW_ASSIGN_OR_RAISE(ac::ExecNode * source,
                        ac::MakeExecNode("source", plan.get(), {}, source_node_options));

  ARROW_RETURN_NOT_OK(ac::MakeExecNode(
      "order_by_sink", plan.get(), {source},
      ac::OrderBySinkNodeOptions{
          cp::SortOptions{{cp::SortKey{"a", cp::SortOrder::Descending}}}, &sink_gen}));

  return ExecutePlanAndCollectAsTableWithCustomSink(plan, basic_data.schema, sink_gen);
}

select_k_sink#

The select_k_sink option enables selecting the top/bottom K elements, similar to a SQL ORDER BY ... LIMIT K clause. SelectKOptions is defined using the OrderBySinkNode definition. This option returns a sink node that receives the input and then computes the top K / bottom K elements.

Note

This node is a "pipeline breaker" and will fully materialize the input in memory. In the future, spillover mechanisms will be added which should alleviate this constraint.

SelectK example.

/// \brief An example showing a select-k node
///
/// Source-KSelect
/// This example shows how K number of elements can be selected
/// either from the top or bottom. The output node is a modified
/// sink node where output can be obtained as a table.
arrow::Status SourceKSelectExample() {
  ARROW_ASSIGN_OR_RAISE(auto input, MakeGroupableBatches());
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));
  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  ARROW_ASSIGN_OR_RAISE(
      ac::ExecNode * source,
      ac::MakeExecNode("source", plan.get(), {},
                       ac::SourceNodeOptions{input.schema, input.gen()}));

  cp::SelectKOptions options = cp::SelectKOptions::TopKDefault(/*k=*/2, {"i32"});

  ARROW_RETURN_NOT_OK(ac::MakeExecNode("select_k_sink", plan.get(), {source},
                                       ac::SelectKSinkNodeOptions{options, &sink_gen}));

  auto schema = arrow::schema(
      {arrow::field("i32", arrow::int32()), arrow::field("str", arrow::utf8())});

  return ExecutePlanAndCollectAsTableWithCustomSink(plan, schema, sink_gen);
}

table_sink#

The table_sink node provides the ability to receive the output as an in-memory table. This is simpler to use than the other sink nodes provided by the streaming execution engine, but it only makes sense when the output fits comfortably in memory. The node is created using TableSinkNodeOptions.

Example of using table_sink.

/// \brief An example showing a table sink node
///
/// TableSink Example
/// This example shows how a table_sink can be used
/// in an execution plan. This includes a source node
/// receiving data as batches and the table sink node
/// which emits the output as a table.
arrow::Status TableSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));

  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ARROW_ASSIGN_OR_RAISE(ac::ExecNode * source,
                        ac::MakeExecNode("source", plan.get(), {}, source_node_options));

  std::shared_ptr<arrow::Table> output_table;
  auto table_sink_options = ac::TableSinkNodeOptions{&output_table};

  ARROW_RETURN_NOT_OK(
      ac::MakeExecNode("table_sink", plan.get(), {source}, table_sink_options));
  // validate the ExecPlan
  ARROW_RETURN_NOT_OK(plan->Validate());
  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
  // start the ExecPlan
  plan->StartProducing();

  // Wait for the plan to finish
  auto finished = plan->finished();
  ARROW_RETURN_NOT_OK(finished.status());
  std::cout << "Results : " << output_table->ToString() << std::endl;
  return arrow::Status::OK();
}

scan#

scan is an operation used for loading and processing datasets. It should be preferred over the more generic source node when your input is a dataset. Its behavior is defined using arrow::dataset::ScanNodeOptions. More information on datasets and the various scan options can be found in Tabular Datasets.

This node is able to apply pushdown filters to the file readers, which reduces the amount of data that needs to be read. This means you may supply the same filter expression to the scan node that you supply to the FilterNode, because the filtering is done in two different places.

Scan example.

/// \brief An example demonstrating a scan and sink node
///
/// Scan-Table
/// This example shows how scan operation can be applied on a dataset.
/// There are operations that can be applied on the scan (project, filter)
/// and the input data can be processed. The output is obtained as a table
arrow::Status ScanSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  options->projection = cp::project({}, {});  // create empty projection

  // construct the scan node
  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(scan));
}

write#

The write node saves query results as a dataset of files in a format such as Parquet, Feather, or CSV, using the Tabular Datasets functionality in Arrow. The write options are provided via arrow::dataset::WriteNodeOptions, which in turn contains arrow::dataset::FileSystemDatasetWriteOptions. arrow::dataset::FileSystemDatasetWriteOptions provides control over the written dataset, including options such as the output directory and the file naming scheme.

Write example.

/// \brief An example showing a write node
/// \param file_path The destination to write to
///
/// Scan-Filter-Write
/// This example shows how scan node can be used to load the data
/// and after processing how it can be written to disk.
arrow::Status ScanFilterWriteExample(const std::string& file_path) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // empty projection
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  std::string root_path = "";
  std::string uri = "file://" + file_path;
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::fs::FileSystem> filesystem,
                        arrow::fs::FileSystemFromUri(uri, &root_path));

  auto base_path = root_path + "/parquet_dataset";
  // Uncomment the following line, if run repeatedly
  // ARROW_RETURN_NOT_OK(filesystem->DeleteDirContents(base_path));
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("a", arrow::int32())});
  // We'll use Hive-style partitioning,
  // which creates directories with "key=value" pairs.

  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";

  arrow::dataset::WriteNodeOptions write_node_options{write_options};

  ac::Declaration write{"write", {std::move(scan)}, std::move(write_node_options)};

  // Since the write node has no output we simply run the plan to completion and the
  // data should be written
  ARROW_RETURN_NOT_OK(ac::DeclarationToStatus(std::move(write)));

  std::cout << "Dataset written to " << base_path << std::endl;
  return arrow::Status::OK();
}

union#

union merges multiple data streams that have the same schema into one, similar to a SQL UNION ALL clause.

The following example demonstrates how this can be achieved using two data sources.

Union example.

/// \brief An example showing a union node
///
/// Source-Union-Table
/// This example shows how a union operation can be applied on two
/// data sources. The output is collected into a table.
arrow::Status SourceUnionSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  ac::Declaration lhs{"source",
                      ac::SourceNodeOptions{basic_data.schema, basic_data.gen()}};
  lhs.label = "lhs";
  ac::Declaration rhs{"source",
                      ac::SourceNodeOptions{basic_data.schema, basic_data.gen()}};
  rhs.label = "rhs";
  ac::Declaration union_plan{
      "union", {std::move(lhs), std::move(rhs)}, ac::ExecNodeOptions{}};

  return ExecutePlanAndCollectAsTable(std::move(union_plan));
}

hash_join#

The hash_join operation provides the relational-algebra join operation, implemented using a hash-based algorithm. HashJoinNodeOptions contains the options required to define a join. The hash_join supports left/right/full semi/anti/outer joins. Also, the join key (i.e. the columns to join on) and the suffixes (i.e. suffix terms such as "_x" that can be appended to column names duplicated in both the left and right relations) can be set via the join options. Read more on hash joins.

Hash-Join example.

/// \brief An example showing a hash join node
///
/// Source-HashJoin-Table
/// This example shows how source node gets the data and how a self-join
/// is applied on the data. The join options are configurable. The output
/// is collected into a table.
arrow::Status SourceHashJoinSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto input, MakeGroupableBatches());

  ac::Declaration left{"source", ac::SourceNodeOptions{input.schema, input.gen()}};
  ac::Declaration right{"source", ac::SourceNodeOptions{input.schema, input.gen()}};

  ac::HashJoinNodeOptions join_opts{
      ac::JoinType::INNER,
      /*left_keys=*/{"str"},
      /*right_keys=*/{"str"}, cp::literal(true), "l_", "r_"};

  ac::Declaration hashjoin{
      "hashjoin", {std::move(left), std::move(right)}, std::move(join_opts)};

  return ExecutePlanAndCollectAsTable(std::move(hashjoin));
}

Summary#

Examples of these nodes can be found in cpp/examples/arrow/execution_plan_documentation_examples.cc in the Arrow source.

Complete Example:

#include <arrow/array.h>
#include <arrow/builder.h>

#include <arrow/acero/exec_plan.h>
#include <arrow/compute/api.h>
#include <arrow/compute/api_vector.h>
#include <arrow/compute/cast.h>

#include <arrow/csv/api.h>

#include <arrow/dataset/dataset.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/plan.h>
#include <arrow/dataset/scanner.h>

#include <arrow/io/interfaces.h>
#include <arrow/io/memory.h>

#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>

#include <arrow/ipc/api.h>

#include <arrow/util/future.h>
#include <arrow/util/range.h>
#include <arrow/util/thread_pool.h>
#include <arrow/util/vector.h>

#include <iostream>
#include <memory>
#include <utility>

// Demonstrate various operators in Arrow Streaming Execution Engine

namespace cp = ::arrow::compute;
namespace ac = ::arrow::acero;

constexpr char kSep[] = "******";

void PrintBlock(const std::string& msg) {
  std::cout << "\n\t" << kSep << " " << msg << " " << kSep << "\n" << std::endl;
}

template <typename TYPE,
          typename = typename std::enable_if<arrow::is_number_type<TYPE>::value |
                                             arrow::is_boolean_type<TYPE>::value |
                                             arrow::is_temporal_type<TYPE>::value>::type>
arrow::Result<std::shared_ptr<arrow::Array>> GetArrayDataSample(
    const std::vector<typename TYPE::c_type>& values) {
  using ArrowBuilderType = typename arrow::TypeTraits<TYPE>::BuilderType;
  ArrowBuilderType builder;
  ARROW_RETURN_NOT_OK(builder.Reserve(values.size()));
  ARROW_RETURN_NOT_OK(builder.AppendValues(values));
  return builder.Finish();
}

template <class TYPE>
arrow::Result<std::shared_ptr<arrow::Array>> GetBinaryArrayDataSample(
    const std::vector<std::string>& values) {
  using ArrowBuilderType = typename arrow::TypeTraits<TYPE>::BuilderType;
  ArrowBuilderType builder;
  ARROW_RETURN_NOT_OK(builder.Reserve(values.size()));
  ARROW_RETURN_NOT_OK(builder.AppendValues(values));
  return builder.Finish();
}

arrow::Result<std::shared_ptr<arrow::RecordBatch>> GetSampleRecordBatch(
    const arrow::ArrayVector array_vector, const arrow::FieldVector& field_vector) {
  std::shared_ptr<arrow::RecordBatch> record_batch;
  ARROW_ASSIGN_OR_RAISE(auto struct_result,
                        arrow::StructArray::Make(array_vector, field_vector));
  return record_batch->FromStructArray(struct_result);
}

/// \brief Create a sample table
/// The table's contents will be:
/// a,b
/// 1,null
/// 2,true
/// null,true
/// 3,false
/// null,true
/// 4,false
/// 5,null
/// 6,false
/// 7,false
/// 8,true
/// \return The created table

arrow::Result<std::shared_ptr<arrow::Table>> GetTable() {
  auto null_long = std::numeric_limits<int64_t>::quiet_NaN();
  ARROW_ASSIGN_OR_RAISE(auto int64_array,
                        GetArrayDataSample<arrow::Int64Type>(
                            {1, 2, null_long, 3, null_long, 4, 5, 6, 7, 8}));

  arrow::BooleanBuilder boolean_builder;
  std::shared_ptr<arrow::BooleanArray> bool_array;

  std::vector<uint8_t> bool_values = {false, true,  true,  false, true,
                                      false, false, false, false, true};
  std::vector<bool> is_valid = {false, true,  true, true, true,
                                true,  false, true, true, true};

  ARROW_RETURN_NOT_OK(boolean_builder.Reserve(10));

  ARROW_RETURN_NOT_OK(boolean_builder.AppendValues(bool_values, is_valid));

  ARROW_RETURN_NOT_OK(boolean_builder.Finish(&bool_array));

  auto record_batch =
      arrow::RecordBatch::Make(arrow::schema({arrow::field("a", arrow::int64()),
                                              arrow::field("b", arrow::boolean())}),
                               10, {int64_array, bool_array});
  ARROW_ASSIGN_OR_RAISE(auto table, arrow::Table::FromRecordBatches({record_batch}));
  return table;
}

/// \brief Create a sample dataset
/// \return An in-memory dataset based on GetTable()
arrow::Result<std::shared_ptr<arrow::dataset::Dataset>> GetDataset() {
  ARROW_ASSIGN_OR_RAISE(auto table, GetTable());
  auto ds = std::make_shared<arrow::dataset::InMemoryDataset>(table);
  return ds;
}

arrow::Result<cp::ExecBatch> GetExecBatchFromVectors(
    const arrow::FieldVector& field_vector, const arrow::ArrayVector& array_vector) {
  std::shared_ptr<arrow::RecordBatch> record_batch;
  ARROW_ASSIGN_OR_RAISE(auto res_batch, GetSampleRecordBatch(array_vector, field_vector));
  cp::ExecBatch batch{*res_batch};
  return batch;
}

// (Doc section: BatchesWithSchema Definition)
struct BatchesWithSchema {
  std::vector<cp::ExecBatch> batches;
  std::shared_ptr<arrow::Schema> schema;
  // This method uses internal arrow utilities to
  // convert a vector of record batches to an AsyncGenerator of optional batches
  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> gen() const {
    auto opt_batches = ::arrow::internal::MapVector(
        [](cp::ExecBatch batch) { return std::make_optional(std::move(batch)); },
        batches);
    arrow::AsyncGenerator<std::optional<cp::ExecBatch>> gen;
    gen = arrow::MakeVectorGenerator(std::move(opt_batches));
    return gen;
  }
};
// (Doc section: BatchesWithSchema Definition)

// (Doc section: MakeBasicBatches Definition)
arrow::Result<BatchesWithSchema> MakeBasicBatches() {
  BatchesWithSchema out;
  auto field_vector = {arrow::field("a", arrow::int32()),
                       arrow::field("b", arrow::boolean())};
  ARROW_ASSIGN_OR_RAISE(auto b1_int, GetArrayDataSample<arrow::Int32Type>({0, 4}));
  ARROW_ASSIGN_OR_RAISE(auto b2_int, GetArrayDataSample<arrow::Int32Type>({5, 6, 7}));
  ARROW_ASSIGN_OR_RAISE(auto b3_int, GetArrayDataSample<arrow::Int32Type>({8, 9, 10}));

  ARROW_ASSIGN_OR_RAISE(auto b1_bool,
                        GetArrayDataSample<arrow::BooleanType>({false, true}));
  ARROW_ASSIGN_OR_RAISE(auto b2_bool,
                        GetArrayDataSample<arrow::BooleanType>({true, false, true}));
  ARROW_ASSIGN_OR_RAISE(auto b3_bool,
                        GetArrayDataSample<arrow::BooleanType>({false, true, false}));

  ARROW_ASSIGN_OR_RAISE(auto b1,
                        GetExecBatchFromVectors(field_vector, {b1_int, b1_bool}));
  ARROW_ASSIGN_OR_RAISE(auto b2,
                        GetExecBatchFromVectors(field_vector, {b2_int, b2_bool}));
  ARROW_ASSIGN_OR_RAISE(auto b3,
                        GetExecBatchFromVectors(field_vector, {b3_int, b3_bool}));

  out.batches = {b1, b2, b3};
  out.schema = arrow::schema(field_vector);
  return out;
}
// (Doc section: MakeBasicBatches Definition)

arrow::Result<BatchesWithSchema> MakeSortTestBasicBatches() {
  BatchesWithSchema out;
  auto field = arrow::field("a", arrow::int32());
  ARROW_ASSIGN_OR_RAISE(auto b1_int, GetArrayDataSample<arrow::Int32Type>({1, 3, 0, 2}));
  ARROW_ASSIGN_OR_RAISE(auto b2_int,
                        GetArrayDataSample<arrow::Int32Type>({121, 101, 120, 12}));
  ARROW_ASSIGN_OR_RAISE(auto b3_int,
                        GetArrayDataSample<arrow::Int32Type>({10, 110, 210, 121}));
  ARROW_ASSIGN_OR_RAISE(auto b4_int,
                        GetArrayDataSample<arrow::Int32Type>({51, 101, 2, 34}));
  ARROW_ASSIGN_OR_RAISE(auto b5_int,
                        GetArrayDataSample<arrow::Int32Type>({11, 31, 1, 12}));
  ARROW_ASSIGN_OR_RAISE(auto b6_int,
                        GetArrayDataSample<arrow::Int32Type>({12, 101, 120, 12}));
  ARROW_ASSIGN_OR_RAISE(auto b7_int,
                        GetArrayDataSample<arrow::Int32Type>({0, 110, 210, 11}));
  ARROW_ASSIGN_OR_RAISE(auto b8_int,
                        GetArrayDataSample<arrow::Int32Type>({51, 10, 2, 3}));

  ARROW_ASSIGN_OR_RAISE(auto b1, GetExecBatchFromVectors({field}, {b1_int}));
  ARROW_ASSIGN_OR_RAISE(auto b2, GetExecBatchFromVectors({field}, {b2_int}));
  ARROW_ASSIGN_OR_RAISE(auto b3,
                        GetExecBatchFromVectors({field, field}, {b3_int, b8_int}));
  ARROW_ASSIGN_OR_RAISE(auto b4,
                        GetExecBatchFromVectors({field, field, field, field},
                                                {b4_int, b5_int, b6_int, b7_int}));
  out.batches = {b1, b2, b3, b4};
  out.schema = arrow::schema({field});
  return out;
}

arrow::Result<BatchesWithSchema> MakeGroupableBatches(int multiplicity = 1) {
  BatchesWithSchema out;
  auto fields = {arrow::field("i32", arrow::int32()), arrow::field("str", arrow::utf8())};
  ARROW_ASSIGN_OR_RAISE(auto b1_int, GetArrayDataSample<arrow::Int32Type>({12, 7, 3}));
  ARROW_ASSIGN_OR_RAISE(auto b2_int, GetArrayDataSample<arrow::Int32Type>({-2, -1, 3}));
  ARROW_ASSIGN_OR_RAISE(auto b3_int, GetArrayDataSample<arrow::Int32Type>({5, 3, -8}));
  ARROW_ASSIGN_OR_RAISE(auto b1_str, GetBinaryArrayDataSample<arrow::StringType>(
                                         {"alpha", "beta", "alpha"}));
  ARROW_ASSIGN_OR_RAISE(auto b2_str, GetBinaryArrayDataSample<arrow::StringType>(
                                         {"alpha", "gamma", "alpha"}));
  ARROW_ASSIGN_OR_RAISE(auto b3_str, GetBinaryArrayDataSample<arrow::StringType>(
                                         {"gamma", "beta", "alpha"}));
  ARROW_ASSIGN_OR_RAISE(auto b1, GetExecBatchFromVectors(fields, {b1_int, b1_str}));
  ARROW_ASSIGN_OR_RAISE(auto b2, GetExecBatchFromVectors(fields, {b2_int, b2_str}));
  ARROW_ASSIGN_OR_RAISE(auto b3, GetExecBatchFromVectors(fields, {b3_int, b3_str}));
  out.batches = {b1, b2, b3};

  size_t batch_count = out.batches.size();
  for (int repeat = 1; repeat < multiplicity; ++repeat) {
    for (size_t i = 0; i < batch_count; ++i) {
      out.batches.push_back(out.batches[i]);
    }
  }

  out.schema = arrow::schema(fields);
  return out;
}

arrow::Status ExecutePlanAndCollectAsTable(ac::Declaration plan) {
  // collect sink_reader into a Table
  std::shared_ptr<arrow::Table> response_table;
  ARROW_ASSIGN_OR_RAISE(response_table, ac::DeclarationToTable(std::move(plan)));

  std::cout << "Results : " << response_table->ToString() << std::endl;

  return arrow::Status::OK();
}

// (Doc section: Scan Example)

/// \brief An example demonstrating a scan and sink node
///
/// Scan-Table
/// This example shows how scan operation can be applied on a dataset.
/// There are operations that can be applied on the scan (project, filter)
/// and the input data can be processed. The output is obtained as a table
arrow::Status ScanSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  options->projection = cp::project({}, {});  // create empty projection

  // construct the scan node
  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(scan));
}
// (Doc section: Scan Example)

// (Doc section: Source Example)

/// \brief An example demonstrating a source and sink node
///
/// Source-Table Example
/// This example shows how a custom source can be used
/// in an execution plan. This includes source node using pregenerated
/// data and collecting it into a table.
///
/// This sort of custom source is often not needed.  In most cases you can
/// use a scan (for a dataset source) or a source like table_source, array_vector_source,
/// exec_batch_source, or record_batch_source (for in-memory data)
arrow::Status SourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}
// (Doc section: Source Example)

// (Doc section: Table Source Example)

/// \brief An example showing a table source node
///
/// TableSource-Table Example
/// This example shows how a table_source can be used
/// in an execution plan. This includes a table source node
/// receiving data from a table.  This plan simply collects the
/// data back into a table but nodes could be added that modify
/// or transform the data as well (as is shown in later examples)
arrow::Status TableSourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto table, GetTable());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;
  int max_batch_size = 2;
  auto table_source_options = ac::TableSourceNodeOptions{table, max_batch_size};

  ac::Declaration source{"table_source", std::move(table_source_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}
// (Doc section: Table Source Example)

// (Doc section: Filter Example)

/// \brief An example showing a filter node
///
/// Source-Filter-Table
/// This example shows how a filter can be used in an execution plan,
/// to filter data from a source. The output from the execution plan
/// is collected into a table.
arrow::Status ScanFilterSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // specify the filter.  This filter keeps only the rows where the
  // value of the "a" column is greater than 3.
  cp::Expression filter_expr = cp::greater(cp::field_ref("a"), cp::literal(3));
  // set filter for scanner : on-disk / push-down filtering.
  // This step can be skipped if you are not reading from disk.
  options->filter = filter_expr;
  // empty projection
  options->projection = cp::project({}, {});

  // construct the scan node
  std::cout << "Initialized Scanning Options" << std::endl;

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};
  std::cout << "Scan node options created" << std::endl;

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  // pipe the scan node into the filter node
  // Need to set the filter in scan node options and filter node options.
  // At scan node it is used for on-disk / push-down filtering.
  // At filter node it is used for in-memory filtering.
  ac::Declaration filter{
      "filter", {std::move(scan)}, ac::FilterNodeOptions(std::move(filter_expr))};

  return ExecutePlanAndCollectAsTable(std::move(filter));
}

// (Doc section: Filter Example)

// (Doc section: Project Example)

/// \brief An example showing a project node
///
/// Scan-Project-Table
/// This example shows how a Scan operation can be used to load the data
/// into the execution plan, how a project operation can be applied on the
/// data stream and how the output is collected into a table
arrow::Status ScanProjectSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // projection
  cp::Expression a_times_2 = cp::call("multiply", {cp::field_ref("a"), cp::literal(2)});
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};
  ac::Declaration project{
      "project", {std::move(scan)}, ac::ProjectNodeOptions({a_times_2})};

  return ExecutePlanAndCollectAsTable(std::move(project));
}

// (Doc section: Project Example)

// This is a variation of ScanProjectSinkExample introducing how to use the
// Declaration::Sequence function
arrow::Status ScanProjectSequenceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // projection
  cp::Expression a_times_2 = cp::call("multiply", {cp::field_ref("a"), cp::literal(2)});
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  // (Doc section: Project Sequence Example)
  // Inputs do not have to be passed to the project node when using Sequence
  ac::Declaration plan =
      ac::Declaration::Sequence({{"scan", std::move(scan_node_options)},
                                 {"project", ac::ProjectNodeOptions({a_times_2})}});
  // (Doc section: Project Sequence Example)

  return ExecutePlanAndCollectAsTable(std::move(plan));
}

// (Doc section: Scalar Aggregate Example)

/// \brief An example showing an aggregation node to aggregate an entire table
///
/// Source-Aggregation-Table
/// This example shows how an aggregation operation can be applied on a
/// execution plan resulting in a scalar output. The source node loads the
/// data and the aggregation (the sum of the values in column 'a')
/// is applied on this data. The output is collected into a table (that will
/// have exactly one row)
arrow::Status SourceScalarAggregateSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};
  auto aggregate_options =
      ac::AggregateNodeOptions{/*aggregates=*/{{"sum", nullptr, "a", "sum(a)"}}};
  ac::Declaration aggregate{
      "aggregate", {std::move(source)}, std::move(aggregate_options)};

  return ExecutePlanAndCollectAsTable(std::move(aggregate));
}
// (Doc section: Scalar Aggregate Example)

// (Doc section: Group Aggregate Example)

/// \brief An example showing an aggregation node to perform a group-by operation
///
/// Source-Aggregation-Table
/// This example shows how an aggregation operation can be applied on a
/// execution plan resulting in grouped output. The source node loads the
/// data and the aggregation (counting the non-null values in column 'a') is
461/// applied on this data. The output is collected into a table that will contain
462/// one row for each unique combination of group keys.
463arrow::Status SourceGroupAggregateSinkExample() {
464  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());
465
466  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;
467
468  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};
469
470  ac::Declaration source{"source", std::move(source_node_options)};
471  auto options = std::make_shared<cp::CountOptions>(cp::CountOptions::ONLY_VALID);
472  auto aggregate_options =
473      ac::AggregateNodeOptions{/*aggregates=*/{{"hash_count", options, "a", "count(a)"}},
474                               /*keys=*/{"b"}};
475  ac::Declaration aggregate{
476      "aggregate", {std::move(source)}, std::move(aggregate_options)};
477
478  return ExecutePlanAndCollectAsTable(std::move(aggregate));
479}
480// (Doc section: Group Aggregate Example)
481
482// (Doc section: ConsumingSink Example)
483
484/// \brief An example showing a consuming sink node
485///
486/// Source-Consuming-Sink
487/// This example shows how the data can be consumed within the execution plan
488/// by using a ConsumingSink node. There is no data output from this execution plan.
489arrow::Status SourceConsumingSinkExample() {
490  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());
491
492  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};
493
494  ac::Declaration source{"source", std::move(source_node_options)};
495
496  std::atomic<uint32_t> batches_seen{0};
497  arrow::Future<> finish = arrow::Future<>::Make();
498  struct CustomSinkNodeConsumer : public ac::SinkNodeConsumer {
499    CustomSinkNodeConsumer(std::atomic<uint32_t>* batches_seen, arrow::Future<> finish)
500        : batches_seen(batches_seen), finish(std::move(finish)) {}
501
502    arrow::Status Init(const std::shared_ptr<arrow::Schema>& schema,
503                       ac::BackpressureControl* backpressure_control,
504                       ac::ExecPlan* plan) override {
505      // This will be called as the plan is started (before the first call to Consume)
506      // and provides the schema of the data coming into the node, controls for pausing /
507      // resuming input, and a pointer to the plan itself which can be used to access
508      // other utilities such as the thread indexer or async task scheduler.
509      return arrow::Status::OK();
510    }
511
512    arrow::Status Consume(cp::ExecBatch batch) override {
513      (*batches_seen)++;
514      return arrow::Status::OK();
515    }
516
517    arrow::Future<> Finish() override {
518      // Here you can perform whatever (possibly async) cleanup is needed, e.g. closing
519      // output file handles and flushing remaining work
520      return arrow::Future<>::MakeFinished();
521    }
522
523    std::atomic<uint32_t>* batches_seen;
524    arrow::Future<> finish;
525  };
526  std::shared_ptr<CustomSinkNodeConsumer> consumer =
527      std::make_shared<CustomSinkNodeConsumer>(&batches_seen, finish);
528
529  ac::Declaration consuming_sink{"consuming_sink",
530                                 {std::move(source)},
531                                 ac::ConsumingSinkNodeOptions(std::move(consumer))};
532
533  // Since we are consuming the data within the plan there is no output and we simply
534  // run the plan to completion instead of collecting into a table.
535  ARROW_RETURN_NOT_OK(ac::DeclarationToStatus(std::move(consuming_sink)));
536
537  std::cout << "The consuming sink node saw " << batches_seen.load() << " batches"
538            << std::endl;
539  return arrow::Status::OK();
540}
541// (Doc section: ConsumingSink Example)
542
543// (Doc section: OrderBySink Example)
544
545arrow::Status ExecutePlanAndCollectAsTableWithCustomSink(
546    std::shared_ptr<ac::ExecPlan> plan, std::shared_ptr<arrow::Schema> schema,
547    arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen) {
548  // translate sink_gen (async) to sink_reader (sync)
549  std::shared_ptr<arrow::RecordBatchReader> sink_reader =
550      ac::MakeGeneratorReader(schema, std::move(sink_gen), arrow::default_memory_pool());
551
552  // validate the ExecPlan
553  ARROW_RETURN_NOT_OK(plan->Validate());
554  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
555  // start the ExecPlan
556  plan->StartProducing();
557
558  // collect sink_reader into a Table
559  std::shared_ptr<arrow::Table> response_table;
560
561  ARROW_ASSIGN_OR_RAISE(response_table,
562                        arrow::Table::FromRecordBatchReader(sink_reader.get()));
563
564  std::cout << "Results : " << response_table->ToString() << std::endl;
565
566  // stop producing
567  plan->StopProducing();
568  // plan mark finished
569  auto future = plan->finished();
570  return future.status();
571}

/// \brief An example showing an order-by node
///
/// Source-OrderBy-Sink
/// In this example, the data enters through the source node
/// and is ordered in the sink node. The sort order is
/// configurable and can be ASCENDING or DESCENDING. The output
/// is obtained as a table from the sink node.
arrow::Status SourceOrderBySinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));

  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeSortTestBasicBatches());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};
  ARROW_ASSIGN_OR_RAISE(ac::ExecNode * source,
                        ac::MakeExecNode("source", plan.get(), {}, source_node_options));

  ARROW_RETURN_NOT_OK(ac::MakeExecNode(
      "order_by_sink", plan.get(), {source},
      ac::OrderBySinkNodeOptions{
          cp::SortOptions{{cp::SortKey{"a", cp::SortOrder::Descending}}}, &sink_gen}));

  return ExecutePlanAndCollectAsTableWithCustomSink(plan, basic_data.schema, sink_gen);
}

HashJoin#

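The hashjoin node takes two inputs (left and right) and joins them on one or more key columns. In this example the same source feeds both sides, producing a self-join on the "str" column. Because hashjoin is a regular node with a single output, the result can simply be collected into a table: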
/// \brief An example showing a hash join node
///
/// Source-HashJoin-Table
/// This example shows how the source node gets the data and how a self-join
/// is applied on the data. The join options are configurable. The output
/// is collected into a table.
arrow::Status SourceHashJoinSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto input, MakeGroupableBatches());

  ac::Declaration left{"source", ac::SourceNodeOptions{input.schema, input.gen()}};
  ac::Declaration right{"source", ac::SourceNodeOptions{input.schema, input.gen()}};

  ac::HashJoinNodeOptions join_opts{
      ac::JoinType::INNER,
      /*left_keys=*/{"str"},
      /*right_keys=*/{"str"}, cp::literal(true), "l_", "r_"};

  ac::Declaration hashjoin{
      "hashjoin", {std::move(left), std::move(right)}, std::move(join_opts)};

  return ExecutePlanAndCollectAsTable(std::move(hashjoin));
}

KSelect#

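The select_k_sink node keeps only the K rows that rank highest (or lowest) on a sort key, here the top two rows by the "i32" column. Like order_by_sink, it is a sink node, so the plan is run directly and the output is collected with the custom-sink helper defined above: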
/// \brief An example showing a select-k node
///
/// Source-KSelect
/// This example shows how K elements can be selected
/// from either the top or the bottom. The output node is a modified
/// sink node where output can be obtained as a table.
arrow::Status SourceKSelectExample() {
  ARROW_ASSIGN_OR_RAISE(auto input, MakeGroupableBatches());
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));
  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  ARROW_ASSIGN_OR_RAISE(
      ac::ExecNode * source,
      ac::MakeExecNode("source", plan.get(), {},
                       ac::SourceNodeOptions{input.schema, input.gen()}));

  cp::SelectKOptions options = cp::SelectKOptions::TopKDefault(/*k=*/2, {"i32"});

  ARROW_RETURN_NOT_OK(ac::MakeExecNode("select_k_sink", plan.get(), {source},
                                       ac::SelectKSinkNodeOptions{options, &sink_gen}));

  auto schema = arrow::schema(
      {arrow::field("i32", arrow::int32()), arrow::field("str", arrow::utf8())});

  return ExecutePlanAndCollectAsTableWithCustomSink(plan, schema, sink_gen);
}

Write#

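The write node persists the stream to disk as a dataset. The example below scans a dataset and writes it back out as Parquet files, Hive-partitioned on the "a" column. Because the write node produces no output, the plan is run to completion with DeclarationToStatus: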
/// \brief An example showing a write node
/// \param file_path The destination to write to
///
/// Scan-Filter-Write
/// This example shows how a scan node can be used to load the data
/// and how, after processing, it can be written to disk.
arrow::Status ScanFilterWriteExample(const std::string& file_path) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // empty projection
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  std::string root_path = "";
  std::string uri = "file://" + file_path;
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::fs::FileSystem> filesystem,
                        arrow::fs::FileSystemFromUri(uri, &root_path));

  auto base_path = root_path + "/parquet_dataset";
  // Uncomment the following line if running this example repeatedly
  // ARROW_RETURN_NOT_OK(filesystem->DeleteDirContents(base_path));
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("a", arrow::int32())});
  // We'll use Hive-style partitioning,
  // which creates directories with "key=value" pairs.

  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";

  arrow::dataset::WriteNodeOptions write_node_options{write_options};

  ac::Declaration write{"write", {std::move(scan)}, std::move(write_node_options)};

  // Since the write node has no output we simply run the plan to completion and the
  // data should be written
  ARROW_RETURN_NOT_OK(ac::DeclarationToStatus(std::move(write)));

  std::cout << "Dataset written to " << base_path << std::endl;
  return arrow::Status::OK();
}

Union#

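The union node merges the batches from two or more inputs into a single stream. Here two copies of the same source are combined and collected into a table; the labels on the declarations simply make the two source nodes easier to tell apart: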
/// \brief An example showing a union node
///
/// Source-Union-Table
/// This example shows how a union operation can be applied on two
/// data sources. The output is collected into a table.
arrow::Status SourceUnionSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  ac::Declaration lhs{"source",
                      ac::SourceNodeOptions{basic_data.schema, basic_data.gen()}};
  lhs.label = "lhs";
  ac::Declaration rhs{"source",
                      ac::SourceNodeOptions{basic_data.schema, basic_data.gen()}};
  rhs.label = "rhs";
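  // The union node does not need any options beyond the defaults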
  ac::Declaration union_plan{
      "union", {std::move(lhs), std::move(rhs)}, ac::ExecNodeOptions{}};

  return ExecutePlanAndCollectAsTable(std::move(union_plan));
}

TableSink#

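The table_sink node offers a simpler way to collect results when a table is the desired output: it accumulates the plan's output directly into an arrow::Table supplied through TableSinkNodeOptions. The example still builds the ExecPlan by hand so that it can pass the pointer to the output table: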
/// \brief An example showing a table sink node
///
/// TableSink Example
/// This example shows how a table_sink can be used
/// in an execution plan. This includes a source node
/// receiving data as batches and the table sink node
/// which emits the output as a table.
arrow::Status TableSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));

  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ARROW_ASSIGN_OR_RAISE(ac::ExecNode * source,
                        ac::MakeExecNode("source", plan.get(), {}, source_node_options));

  std::shared_ptr<arrow::Table> output_table;
  auto table_sink_options = ac::TableSinkNodeOptions{&output_table};

  ARROW_RETURN_NOT_OK(
      ac::MakeExecNode("table_sink", plan.get(), {source}, table_sink_options));
  // validate the ExecPlan
  ARROW_RETURN_NOT_OK(plan->Validate());
  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
  // start the ExecPlan
  plan->StartProducing();

  // Wait for the plan to finish
  auto finished = plan->finished();
  ARROW_RETURN_NOT_OK(finished.status());
  std::cout << "Results : " << output_table->ToString() << std::endl;
  return arrow::Status::OK();
}

RecordBatchReaderSource#

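The record_batch_reader_source node is useful when the input only exists as an arrow::RecordBatchReader, such as the TableBatchReader used below: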
/// \brief An example showing the usage of a RecordBatchReader as the data source.
///
/// RecordBatchReaderSourceSink Example
/// This example shows how a record_batch_reader_source can be used
/// in an execution plan. This includes the source node
/// receiving data from a TableBatchReader.
arrow::Status RecordBatchReaderSourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto table, GetTable());
  std::shared_ptr<arrow::RecordBatchReader> reader =
      std::make_shared<arrow::TableBatchReader>(table);
  ac::Declaration reader_source{"record_batch_reader_source",
                                ac::RecordBatchReaderSourceNodeOptions{reader}};
  return ExecutePlanAndCollectAsTable(std::move(reader_source));
}
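
Running the examples#

All of the examples above are driven by the small command-line program below. The first argument is a base path used by the write example and the second selects which example to run: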
enum ExampleMode {
  SOURCE_SINK = 0,
  TABLE_SOURCE_SINK = 1,
  SCAN = 2,
  FILTER = 3,
  PROJECT = 4,
  SCALAR_AGGREGATION = 5,
  GROUP_AGGREGATION = 6,
  CONSUMING_SINK = 7,
  ORDER_BY_SINK = 8,
  HASHJOIN = 9,
  KSELECT = 10,
  WRITE = 11,
  UNION = 12,
  TABLE_SOURCE_TABLE_SINK = 13,
  RECORD_BATCH_READER_SOURCE = 14,
  PROJECT_SEQUENCE = 15
};

int main(int argc, char** argv) {
  if (argc < 3) {
    // Fake success for CI purposes.
    return EXIT_SUCCESS;
  }

  std::string base_save_path = argv[1];
  int mode = std::atoi(argv[2]);
  arrow::Status status;
  // ensure arrow::dataset node factories are in the registry
  arrow::dataset::internal::Initialize();
  switch (mode) {
    case SOURCE_SINK:
      PrintBlock("Source Sink Example");
      status = SourceSinkExample();
      break;
    case TABLE_SOURCE_SINK:
      PrintBlock("Table Source Sink Example");
      status = TableSourceSinkExample();
      break;
    case SCAN:
      PrintBlock("Scan Example");
      status = ScanSinkExample();
      break;
    case FILTER:
      PrintBlock("Filter Example");
      status = ScanFilterSinkExample();
      break;
    case PROJECT:
      PrintBlock("Project Example");
      status = ScanProjectSinkExample();
      break;
    case PROJECT_SEQUENCE:
      PrintBlock("Project Example (using Declaration::Sequence)");
      status = ScanProjectSequenceSinkExample();
      break;
    case GROUP_AGGREGATION:
      PrintBlock("Aggregate Example");
      status = SourceGroupAggregateSinkExample();
      break;
    case SCALAR_AGGREGATION:
      PrintBlock("Aggregate Example");
      status = SourceScalarAggregateSinkExample();
      break;
    case CONSUMING_SINK:
      PrintBlock("Consuming-Sink Example");
      status = SourceConsumingSinkExample();
      break;
    case ORDER_BY_SINK:
      PrintBlock("OrderBy Example");
      status = SourceOrderBySinkExample();
      break;
    case HASHJOIN:
      PrintBlock("HashJoin Example");
      status = SourceHashJoinSinkExample();
      break;
    case KSELECT:
      PrintBlock("KSelect Example");
      status = SourceKSelectExample();
      break;
    case WRITE:
      PrintBlock("Write Example");
      status = ScanFilterWriteExample(base_save_path);
      break;
    case UNION:
      PrintBlock("Union Example");
      status = SourceUnionSinkExample();
      break;
    case TABLE_SOURCE_TABLE_SINK:
      PrintBlock("TableSink Example");
      status = TableSinkExample();
      break;
    case RECORD_BATCH_READER_SOURCE:
      PrintBlock("RecordBatchReaderSource Example");
      status = RecordBatchReaderSourceSinkExample();
      break;
    default:
      break;
  }

  if (status.ok()) {
    return EXIT_SUCCESS;
  } else {
    std::cout << "Error occurred: " << status.message() << std::endl;
    return EXIT_FAILURE;
  }
}