Arrow Compute#

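Apache Arrow provides compute functions to facilitate efficient and portable data processing. In this article, you will use Arrow's compute functionality to:

Calculate the sum of a column
Calculate the element-wise sum of two columns
Search for a value in a column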

Prerequisites#

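Before continuing, make sure you have:

An Arrow installation, which you can set up here: Using Arrow C++ in your own project. If you are compiling Arrow yourself, make sure you compile with the compute module enabled (i.e., -DARROW_COMPUTE=ON); see Optional Components.
An understanding of the basic Arrow data structures covered in Basic Arrow Data Structures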

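Setup#

Before running any computations, we need to fill in a few gaps:

We need to include the necessary headers.
A main() is needed to glue everything together.
We need some data to work with.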

Includes#

Before writing C++ code, we need some includes. We'll get iostream for output, then import Arrow's compute functionality:

#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>

Main()#

For our glue, we'll use the main() pattern from the previous tutorial on data structures:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

It is paired with a RunMain(), just as we used it before:

arrow::Status RunMain() {

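Generating Tables for Computation#

Before we begin, we'll initialize a Table with two columns to work with. We'll use the method from Basic Arrow Data Structures, so look back there if anything is confusing: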

  // Create a couple 32-bit integer arrays.
  arrow::Int32Builder int32builder;
  int32_t some_nums_raw[5] = {34, 624, 2223, 5654, 4356};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw, 5));
  std::shared_ptr<arrow::Array> some_nums;
  ARROW_ASSIGN_OR_RAISE(some_nums, int32builder.Finish());

  int32_t more_nums_raw[5] = {75342, 23, 64, 17, 736};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw, 5));
  std::shared_ptr<arrow::Array> more_nums;
  ARROW_ASSIGN_OR_RAISE(more_nums, int32builder.Finish());

  // Make a table out of our pair of arrays.
  std::shared_ptr<arrow::Field> field_a, field_b;
  std::shared_ptr<arrow::Schema> schema;

  field_a = arrow::field("A", arrow::int32());
  field_b = arrow::field("B", arrow::int32());

  schema = arrow::schema({field_a, field_b});

  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {some_nums, more_nums}, 5);

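Calculating a Sum over an Array#

Using a compute function involves three general steps, which we separate here:

Preparing a Datum for output
Calling compute::Sum(), a convenience function for summation over an Array
Retrieving and printing output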

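Prepare Memory for Output with Datum#

Once a computation is done, we need somewhere for the results to go. In Arrow, the object for such output is called Datum. This object is used to pass inputs and outputs around in compute functions, and it can contain Arrow data structures of many different shapes. We'll need it to retrieve the output from compute functions.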

  // The Datum class is what all compute functions output to, and they can take Datums
  // as inputs, as well.
  arrow::Datum sum;
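To make the idea of one object holding several shapes of data more concrete, here is a minimal sketch in plain C++. It assumes nothing about how Arrow implements Datum: ToyDatum, Kind(), AsScalar() and AsArray() are hypothetical names invented for illustration, and the sketch only models the behaviour described above, where the caller checks what is stored and then asks for that shape explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for the role Datum plays; none of these names are Arrow's.
// It can be empty, hold one 64-bit value, or hold a whole column of 32-bit values.
struct ToyDatum {
  std::variant<std::monostate, std::int64_t, std::vector<std::int32_t>> value;

  // Report which shape of data is currently stored.
  std::string Kind() const {
    if (std::holds_alternative<std::int64_t>(value)) return "Scalar";
    if (std::holds_alternative<std::vector<std::int32_t>>(value)) return "Array";
    return "None";
  }

  // The caller must name the shape it expects; asking for the wrong one is a bug.
  std::int64_t AsScalar() const {
    assert(std::holds_alternative<std::int64_t>(value));
    return std::get<std::int64_t>(value);
  }

  const std::vector<std::int32_t>& AsArray() const {
    assert(std::holds_alternative<std::vector<std::int32_t>>(value));
    return std::get<std::vector<std::int32_t>>(value);
  }
};
```

A default-constructed ToyDatum reports "None", much as the tutorial's Datum starts out empty until a compute function assigns a result to it.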

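Call Sum()#

Here, we'll take our Table, which has columns "A" and "B", and sum over column "A". For summation, there is a convenience function, compute::Sum(), which reduces the complexity of the compute interface. We'll look at the more complex version for the next computation. For a given function, refer to Compute Functions to see whether a convenience function exists. compute::Sum() takes in a given Array or ChunkedArray; here, we use Table::GetColumnByName() to pass in column A. It then outputs to a Datum. Putting that all together, we get this: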

  // Here, we can use arrow::compute::Sum. This is a convenience function, and the next
  // computation won't be so simple. However, using these where possible helps
  // readability.
  ARROW_ASSIGN_OR_RAISE(sum, arrow::compute::Sum({table->GetColumnByName("A")}));

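Get Results from Datum#

The previous step leaves us with a Datum that contains our sum. However, we cannot print it directly: its flexibility in holding arbitrary Arrow data structures means we have to retrieve our data carefully. First, to understand what it contains, we can check which kind of data structure it is, then which primitive type it holds: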

  // Get the kind of Datum and what it holds -- this is a Scalar, with int64.
  std::cout << "Datum kind: " << sum.ToString()
            << " content type: " << sum.type()->ToString() << std::endl;

This should report that the Datum stores a Scalar with a 64-bit integer. Just to see what the value is, we can print it out like so, which yields 12891:

  // Note that we explicitly request a scalar -- the Datum cannot simply give what it is,
  // you must ask for the correct type.
  std::cout << sum.scalar_as<arrow::Int64Scalar>().value << std::endl;

Now we've used compute::Sum() and gotten what we wanted out of it!
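The figure 12891 can be checked by hand: 34 + 624 + 2223 + 5654 + 4356 = 12891. The tutorial reports a 64-bit result for a 32-bit column. One plausible reading, offered here as an assumption rather than a statement about Arrow's internals, is that accumulating in a wider type leaves headroom against overflow. The sketch below reproduces the arithmetic in plain C++; SumWidened is a hypothetical helper, not an Arrow function.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Arrow: add up 32-bit inputs in a 64-bit
// accumulator so that large totals do not overflow the input type.
std::int64_t SumWidened(const std::vector<std::int32_t>& values) {
  std::int64_t total = 0;
  for (std::int32_t v : values) {
    total += v;  // each int32 value is promoted to int64 before the addition
  }
  return total;
}
```

Calling SumWidened({34, 624, 2223, 5654, 4356}) returns 12891, matching the value printed above. A total such as 2147483647 + 1 also fits in the 64-bit result, whereas it would not fit in a 32-bit one.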

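Calculating Element-Wise Array Addition with CallFunction()#

The next layer of complexity uses what compute::Sum() was helpfully hiding: compute::CallFunction(). For this example, we will explore how to use the more robust compute::CallFunction() with the "add" compute function. The pattern remains similar:

Preparing a Datum for output
Calling compute::CallFunction() with "add"
Retrieving and printing output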

Prepare Memory for Output with Datum#

Once more, we'll need a Datum for any output we get:

  arrow::Datum element_wise_sum;

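Use CallFunction() with "add"#

compute::CallFunction() takes the name of the desired function as its first argument, then the data inputs for that function as a vector in its second argument. Right now, we want an element-wise addition between columns "A" and "B". So, we'll ask for "add", pass in columns "A" and "B", and output to our Datum. Putting this all together, we get: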

  // Get element-wise sum of both columns A and B in our Table. Note that here we use
  // CallFunction(), which takes the name of the function as the first argument.
  ARROW_ASSIGN_OR_RAISE(element_wise_sum, arrow::compute::CallFunction(
                                              "add", {table->GetColumnByName("A"),
                                                      table->GetColumnByName("B")}));

See also

Available functions, for a list of other functions that can be used with compute::CallFunction()
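The call pattern above, a function name followed by a vector of inputs, can be pictured as a lookup by name followed by an ordinary call. The following is a minimal sketch of that idea in plain C++. It does not show how Arrow dispatches functions: Column, ToyAdd, ToyRegistry and ToyCallFunction are all hypothetical names, and throwing an exception for an unknown name is a choice made for this sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical types for illustration; Arrow's own types are not used here.
using Column = std::vector<std::int32_t>;
using ToyFunction = std::function<Column(const std::vector<Column>&)>;

// Element-wise addition of exactly two equal-length columns.
inline Column ToyAdd(const std::vector<Column>& args) {
  assert(args.size() == 2 && args[0].size() == args[1].size());
  Column out(args[0].size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = args[0][i] + args[1][i];
  }
  return out;
}

// A registry mapping function names to implementations.
inline const std::map<std::string, ToyFunction>& ToyRegistry() {
  static const std::map<std::string, ToyFunction> registry = {{"add", ToyAdd}};
  return registry;
}

// Look the function up by name, then hand it the vector of inputs.
inline Column ToyCallFunction(const std::string& name,
                              const std::vector<Column>& args) {
  auto it = ToyRegistry().find(name);
  if (it == ToyRegistry().end()) {
    throw std::invalid_argument("unknown function: " + name);
  }
  return it->second(args);
}
```

With the tutorial's data, ToyCallFunction("add", {a, b}) for a = {34, 624, 2223, 5654, 4356} and b = {75342, 23, 64, 17, 736} yields {75376, 647, 2287, 5671, 5092}. Selecting a function by a string name is what lets a single entry point reach a whole catalogue of functions.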

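Get Results from Datum#

Again, the Datum needs some careful handling. That handling is much easier when we know what it contains. This Datum holds a ChunkedArray with 32-bit integers, but we can print that to confirm: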

  // Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.
  std::cout << "Datum kind: " << element_wise_sum.ToString()
            << " content type: " << element_wise_sum.type()->ToString() << std::endl;

Since it's a ChunkedArray, we request that from the Datum. ChunkedArray has a ChunkedArray::ToString() method, so we'll use that to print out its contents:

  // This time, we get a ChunkedArray, not a scalar.
  std::cout << element_wise_sum.chunked_array()->ToString() << std::endl;

The output looks like this:

Datum kind: ChunkedArray content type: int32
[
  [
    75376,
    647,
    2287,
    5671,
    5092
  ]
]

Now, we've used compute::CallFunction() instead of a convenience function! This enables a much wider range of available computations.

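Searching for a Value with CallFunction() and Options#

One class of computation remains. compute::CallFunction() uses a vector for data inputs, but a computation often needs additional arguments to function. To supply these, compute functions may be associated with structs in which their arguments can be defined. You can check Compute Functions to see which struct a given function uses. For this example, we'll search for a value in column "A" using the "index" compute function. This process has four steps, as opposed to the three from before:

Preparing a Datum for output
Preparing compute::IndexOptions
Calling compute::CallFunction() with "index" and compute::IndexOptions
Retrieving and printing output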

Prepare Memory for Output with Datum#

We'll need a Datum for any output we get:

  // Use an options struct to set up searching for 2223 in column A (the third item).
  arrow::Datum third_item;

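Configure "index" with IndexOptions#

For this exploration, we'll use the "index" function, a search method that returns the index of an input value. To pass this input value, we need a compute::IndexOptions struct. So, let's make that struct: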

  // An options struct is used in lieu of passing an arbitrary amount of arguments.
  arrow::compute::IndexOptions index_options;

A search function requires a target value. Here, we'll use 2223, the third item in column A, and configure our struct accordingly:

  // We need an Arrow Scalar, not a raw value.
  index_options.value = arrow::MakeScalar(2223);

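Use CallFunction() with "index" and IndexOptions#

To actually run the function, we use compute::CallFunction() again, this time passing the address of our IndexOptions struct as the third argument. As before, the first argument is the name of the function, and the second is our data input: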

  ARROW_ASSIGN_OR_RAISE(
      third_item, arrow::compute::CallFunction("index", {table->GetColumnByName("A")},
                                               &index_options));
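The options-struct pattern can be sketched without Arrow: the data goes in one argument, and everything else the function needs travels in a struct whose address is passed alongside. In the sketch below, ToyIndexOptions and ToyIndex are hypothetical names, not Arrow's. Returning -1 when the value is absent is an assumption of this sketch, as is treating a missing options struct as "not found"; consult the compute function reference for how the real "index" function behaves in those cases.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical options struct, loosely modelled on the role IndexOptions plays.
struct ToyIndexOptions {
  std::int32_t value = 0;  // the value to search for
};

// Returns the 0-based position of the first match, or -1 when there is none.
// Options are taken by pointer, mirroring the call shape used in the tutorial.
inline std::int64_t ToyIndex(const std::vector<std::int32_t>& column,
                             const ToyIndexOptions* options) {
  if (options == nullptr) return -1;  // simplification: no options, no match
  for (std::size_t i = 0; i < column.size(); ++i) {
    if (column[i] == options->value) {
      return static_cast<std::int64_t>(i);
    }
  }
  return -1;
}
```

Setting options.value to 2223 and calling ToyIndex({34, 624, 2223, 5654, 4356}, &options) returns 2. Keeping the extra arguments in a struct means a function can gain new parameters without changing the shape of the call.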

Get Results from Datum#

One last time, let's see what our Datum has! This will be a Scalar with a 64-bit integer, and the output will be 2:

  // Get the kind of Datum and what it holds -- this is a Scalar, with int64
  std::cout << "Datum kind: " << third_item.ToString()
            << " content type: " << third_item.type()->ToString() << std::endl;
  // We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.
  std::cout << third_item.scalar_as<arrow::Int64Scalar>().value << std::endl;

Ending Program#

At the end, we just return arrow::Status::OK(), so that main() knows we're done and that everything is okay, just like in the preceding tutorials.

  return arrow::Status::OK();
}

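With that, you've used compute functions of the three main types: with and without a convenience function, and then with an Options struct. Now you can process any Table you need to, and solve any data problem you have that fits in memory!

This means that we now have to see how to work with larger-than-memory datasets, via Arrow Datasets in the next article.

Refer to the below for a copy of the complete code: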

// (Doc section: Includes)
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: RunMain)
arrow::Status RunMain() {
  // (Doc section: RunMain)
  // (Doc section: Create Tables)
  // Create a couple 32-bit integer arrays.
  arrow::Int32Builder int32builder;
  int32_t some_nums_raw[5] = {34, 624, 2223, 5654, 4356};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw, 5));
  std::shared_ptr<arrow::Array> some_nums;
  ARROW_ASSIGN_OR_RAISE(some_nums, int32builder.Finish());

  int32_t more_nums_raw[5] = {75342, 23, 64, 17, 736};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw, 5));
  std::shared_ptr<arrow::Array> more_nums;
  ARROW_ASSIGN_OR_RAISE(more_nums, int32builder.Finish());

  // Make a table out of our pair of arrays.
  std::shared_ptr<arrow::Field> field_a, field_b;
  std::shared_ptr<arrow::Schema> schema;

  field_a = arrow::field("A", arrow::int32());
  field_b = arrow::field("B", arrow::int32());

  schema = arrow::schema({field_a, field_b});

  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {some_nums, more_nums}, 5);
  // (Doc section: Create Tables)

  // (Doc section: Sum Datum Declaration)
  // The Datum class is what all compute functions output to, and they can take Datums
  // as inputs, as well.
  arrow::Datum sum;
  // (Doc section: Sum Datum Declaration)
  // (Doc section: Sum Call)
  // Here, we can use arrow::compute::Sum. This is a convenience function, and the next
  // computation won't be so simple. However, using these where possible helps
  // readability.
  ARROW_ASSIGN_OR_RAISE(sum, arrow::compute::Sum({table->GetColumnByName("A")}));
  // (Doc section: Sum Call)
  // (Doc section: Sum Datum Type)
  // Get the kind of Datum and what it holds -- this is a Scalar, with int64.
  std::cout << "Datum kind: " << sum.ToString()
            << " content type: " << sum.type()->ToString() << std::endl;
  // (Doc section: Sum Datum Type)
  // (Doc section: Sum Contents)
  // Note that we explicitly request a scalar -- the Datum cannot simply give what it is,
  // you must ask for the correct type.
  std::cout << sum.scalar_as<arrow::Int64Scalar>().value << std::endl;
  // (Doc section: Sum Contents)

  // (Doc section: Add Datum Declaration)
  arrow::Datum element_wise_sum;
  // (Doc section: Add Datum Declaration)
  // (Doc section: Add Call)
  // Get element-wise sum of both columns A and B in our Table. Note that here we use
  // CallFunction(), which takes the name of the function as the first argument.
  ARROW_ASSIGN_OR_RAISE(element_wise_sum, arrow::compute::CallFunction(
                                              "add", {table->GetColumnByName("A"),
                                                      table->GetColumnByName("B")}));
  // (Doc section: Add Call)
  // (Doc section: Add Datum Type)
  // Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.
  std::cout << "Datum kind: " << element_wise_sum.ToString()
            << " content type: " << element_wise_sum.type()->ToString() << std::endl;
  // (Doc section: Add Datum Type)
  // (Doc section: Add Contents)
  // This time, we get a ChunkedArray, not a scalar.
  std::cout << element_wise_sum.chunked_array()->ToString() << std::endl;
  // (Doc section: Add Contents)

  // (Doc section: Index Datum Declare)
  // Use an options struct to set up searching for 2223 in column A (the third item).
  arrow::Datum third_item;
  // (Doc section: Index Datum Declare)
  // (Doc section: IndexOptions Declare)
  // An options struct is used in lieu of passing an arbitrary amount of arguments.
  arrow::compute::IndexOptions index_options;
  // (Doc section: IndexOptions Declare)
  // (Doc section: IndexOptions Assign)
  // We need an Arrow Scalar, not a raw value.
  index_options.value = arrow::MakeScalar(2223);
  // (Doc section: IndexOptions Assign)
  // (Doc section: Index Call)
  ARROW_ASSIGN_OR_RAISE(
      third_item, arrow::compute::CallFunction("index", {table->GetColumnByName("A")},
                                               &index_options));
  // (Doc section: Index Call)
  // (Doc section: Index Inspection)
  // Get the kind of Datum and what it holds -- this is a Scalar, with int64
  std::cout << "Datum kind: " << third_item.ToString()
            << " content type: " << third_item.type()->ToString() << std::endl;
  // We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.
  std::cout << third_item.scalar_as<arrow::Int64Scalar>().value << std::endl;
  // (Doc section: Index Inspection)
  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
// (Doc section: Main)