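Basic Arrow Data Structures#

Apache Arrow provides fundamental data structures for representing data: Array, ChunkedArray, RecordBatch, and Table. This article shows how to construct these data structures from primitive data types; specifically, we will work with integers of varying size representing days, months, and years. We will use them to create, in order: Arrow Arrays, a RecordBatch built from Arrays, ChunkedArrays, and a Table built from ChunkedArrays.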

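Prerequisites#

Before continuing, make sure you have:

An Arrow installation, which you can set up here: Using Arrow C++ in your own project
An understanding of how to use basic C++ data structures
An understanding of basic C++ data types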

Setup#

Before trying out Arrow, we need to fill in a couple of gaps:

We need to include the necessary headers.
We need a main() to glue everything together.

Includes#

First, as ever, we need some includes. We'll get iostream for output, then import Arrow's basic functionality from api.h, like so:

#include <arrow/api.h>

#include <iostream>

Main()#

Next, we need a main(). A common pattern with Arrow looks like the following:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

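This lets us easily use Arrow's error-handling macros, which return an arrow::Status object to main() if a failure occurs; this main() then reports the error. Note that this means Arrow never raises exceptions, relying instead on returning Status. For more on that, read here: Conventions.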

To accompany this main(), we have a RunMain() from which any Status object can be returned. This is where we will write the rest of the program:

arrow::Status RunMain() {

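Making an Arrow Array#

Building an int8 Array#

Given that we have some data in standard C++ arrays and want to use Arrow, we need to move the data from those arrays into Arrow arrays. An Array still guarantees contiguous memory, so there is no performance loss to worry about when using an Array rather than a C++ array. The easiest way to construct an Array is to use an ArrayBuilder.

See also

Arrays for more technical details on Array

The following code initializes an ArrayBuilder for an Array that will hold 8-bit integers. Specifically, it uses the AppendValues() method, present in concrete arrow::ArrayBuilder subclasses, to fill the ArrayBuilder with the contents of a standard C++ array. Note the use of ARROW_RETURN_NOT_OK. If AppendValues() fails, this macro returns to main(), which prints out the meaning of the failure.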

  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));

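Given an ArrayBuilder that holds the values we want in our Array, we can use ArrayBuilder::Finish() to output the final structure to an Array; specifically, we output to a std::shared_ptr<arrow::Array>. Note the use of ARROW_ASSIGN_OR_RAISE in the following code. Finish() outputs an arrow::Result object, which ARROW_ASSIGN_OR_RAISE can process. If the method fails, it returns to main() with a Status that explains what went wrong. If it succeeds, it assigns the final output to the left-hand variable.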

  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

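As soon as an ArrayBuilder has had its Finish() method called, its state resets, so it can be used again as if it were fresh. Thus, we repeat the process above for our second array: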

  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

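Building an int16 Array#

An ArrayBuilder has its type specified at declaration time, and that type cannot be changed afterwards. The year data needs at least a 16-bit integer, so when we switch to it we have to make a new ArrayBuilder. Of course, there is one for that. It uses the exact same methods, but with the new data type: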

  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
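As a quick illustration of why the year values call for a 16-bit type while days and months fit in 8 bits, the following standard-library-only sketch checks a value against the range of an integer type. FitsIn() is a hypothetical helper written for this example and is not part of Arrow.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if `value` is representable in integer type T.
template <typename T>
constexpr bool FitsIn(long long value) {
  return value >= static_cast<long long>(std::numeric_limits<T>::min()) &&
         value <= static_cast<long long>(std::numeric_limits<T>::max());
}

// int8_t covers -128..127, which is enough for days and months.
static_assert(FitsIn<int8_t>(28), "a day value fits in int8");
// A year such as 1990 overflows int8_t but fits in int16_t (-32768..32767).
static_assert(!FitsIn<int8_t>(1990), "a year does not fit in int8");
static_assert(FitsIn<int16_t>(1990), "a year fits in int16");
```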

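Now, we have three Arrow Arrays, with some variance in type.

Making a RecordBatch#

A columnar data format only really comes into play when you have a table, so let's make one. The first kind we'll make is the RecordBatch. It uses Arrays internally, which means all data is contiguous within each column, but any appending or concatenating requires copying. Given existing Arrays, making a RecordBatch takes two steps:

Defining a Schema
Loading the Schema and Arrays into the constructor

Defining a Schema#

To get started on making a RecordBatch, we first need to define the characteristics of the columns, each represented by a Field instance. Each Field contains the name and data type of its associated column; a Schema then groups them together and sets the order of the columns, like so: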

  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});

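Building a RecordBatch#

With the data in Arrays from the previous section and the column descriptions in our Schema from the previous step, we can make the RecordBatch. Note that the length of the columns must be supplied, and that the length is shared by all columns.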

  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();

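Now we have our data in a nice tabular form, safely within the RecordBatch. What we can do with this will be discussed in later tutorials.

Making a ChunkedArray#

Suppose we want an array made up of sub-arrays, because that is useful for avoiding data copies when concatenating, for parallelizing work, for fitting each chunk into cache, or for exceeding the 2,147,483,647-row (2^31 - 1) limit of a standard Arrow Array. For this, Arrow offers ChunkedArray, which is made up of individual Arrow Arrays. In this example, we can reuse the Arrays we made earlier as part of our ChunkedArrays, which lets us extend them without copying data. So, let's build a few more Arrays, using the same builders for ease of use: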

  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());

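To support an arbitrary number of Arrays in the construction of a ChunkedArray, Arrow supplies ArrayVector. This provides a vector of Arrays, and we use it here to prepare to make a ChunkedArray: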

  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};

To make use of Arrow, we do need to take that last step and move into a ChunkedArray:

  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);

With a ChunkedArray for our day values, we now just need to repeat the process for the month and year data:

  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);

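With that, we are left with three ChunkedArrays, varying in type.

Making a Table#

One particularly useful thing we can do with the ChunkedArrays from the previous section is to make a Table. Much like a RecordBatch, a Table stores tabular data. However, because a Table is made up of ChunkedArrays, it does not guarantee contiguity. This can be useful for program logic, for parallelizing work, for fitting chunks into cache, or for exceeding the 2,147,483,647-row limit present in Array and, thus, in RecordBatch.

If you read up on RecordBatch, you may notice that the Table constructor in the following code is effectively identical: it just happens to take the length of the columns as its third argument, and it makes a Table. We reuse the Schema from before and make our Table: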

  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();

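Now we have our data in a nice tabular form, safely within the Table. What we can do with this will be discussed in later tutorials.

Ending the Program#

At the end, we just return arrow::Status::OK(), so that main() knows that we're done and that everything is okay.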

  return arrow::Status::OK();
}

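Wrapping Up#

With that, you've created the fundamental data structures in Arrow, and can proceed to getting them in and out of a program with file I/O in the next article.

Refer to the below for a copy of the complete code: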

// (Doc section: Includes)
#include <arrow/api.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: RunMain Start)
arrow::Status RunMain() {
  // (Doc section: RunMain Start)
  // (Doc section: int8builder 1 Append)
  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  // (Doc section: int8builder 1 Append)

  // (Doc section: int8builder 1 Finish)
  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());
  // (Doc section: int8builder 1 Finish)

  // (Doc section: int8builder 2)
  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
  // (Doc section: int8builder 2)

  // (Doc section: int16builder)
  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
  // (Doc section: int16builder)

  // (Doc section: Schema)
  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});
  // (Doc section: Schema)

  // (Doc section: RBatch)
  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();
  // (Doc section: RBatch)

  // (Doc section: More Arrays)
  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());
  // (Doc section: More Arrays)

  // (Doc section: ArrayVector)
  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};
  // (Doc section: ArrayVector)
  // (Doc section: ChunkedArray Day)
  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);
  // (Doc section: ChunkedArray Day)

  // (Doc section: ChunkedArray Month Year)
  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);
  // (Doc section: ChunkedArray Month Year)

  // (Doc section: Table)
  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();
  // (Doc section: Table)

  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

// (Doc section: Main)