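Basic Arrow Data Structures

Apache Arrow provides fundamental data structures for representing data: Array, ChunkedArray, RecordBatch, and Table. This article shows how to construct these data structures from primitive data types; specifically, we will work with integers of varying size representing days, months, and years, and use them to create the following data structures: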

  1. Arrow Arrays

  2. ChunkedArrays

  3. RecordBatch, from Arrays

  4. Table, from ChunkedArrays

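Pre-requisites

Before continuing, make sure you have:

  1. An Arrow installation, which you can set up here: Using Arrow C++ in your own project

  2. An understanding of how to use basic C++ data structures

  3. An understanding of basic C++ data types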

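Setup

Before trying out Arrow, we need to fill in a couple of gaps:

  1. We need to include the necessary headers.

  2. A main() is needed to glue things together.

Includes

First, as ever, we need some includes. We'll get iostream for output, then import Arrow's basic functionality from api.h, like so: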

#include <arrow/api.h>

#include <iostream>

Main()

Next, we need a main(); a common pattern with Arrow looks like the following:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

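This allows us to easily use Arrow's error-handling macros. If a failure occurs, these macros return an arrow::Status object to main(), which then reports the error. Note that this means Arrow never raises exceptions, instead relying upon returning Status. For more on that, read here: Conventions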

To accompany this main(), we have a RunMain() from which any Status object can return; this is where we'll write the rest of the program:

arrow::Status RunMain() {

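Making an Arrow Array

Building int8 Arrays

Given that we have some data in standard C++ arrays and want to use Arrow, we need to move the data from those arrays into Arrow arrays. An Arrow Array still guarantees contiguous memory, so there is no need to worry about a performance loss when using Array instead of C++ arrays. The easiest way to construct an Array is with an ArrayBuilder.

See also

For more technical details on Array, see Arrays

The following code initializes an ArrayBuilder for an Array that will hold 8-bit integers. Specifically, it uses the AppendValues() method, available in concrete arrow::ArrayBuilder subclasses, to fill the ArrayBuilder with the contents of a standard C++ array. Note the use of ARROW_RETURN_NOT_OK. If AppendValues() fails, this macro returns to main(), which prints the reason for the failure.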

  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));

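Once an ArrayBuilder holds the values we want in our Array, we can use ArrayBuilder::Finish() to output the final structure to an Array; specifically, we output to a std::shared_ptr<arrow::Array>. Note the use of ARROW_ASSIGN_OR_RAISE in the following code. Finish() outputs an arrow::Result object, which ARROW_ASSIGN_OR_RAISE can process. If the method fails, it returns to main() with a Status that explains what went wrong. If it succeeds, it assigns the final output to the left-hand variable.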

  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

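As soon as an ArrayBuilder has had its Finish method called, its state resets, so it can be used again as if it were fresh. Thus, we repeat the process above for our second array: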

  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

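Building int16 Arrays

An ArrayBuilder has its type specified at the time of declaration. Once this is done, it cannot have its type changed. We have to make a new one when we switch to year data, which requires a 16-bit integer at the minimum. Of course, there's an ArrayBuilder for that. It uses the exact same methods, but with the new data type: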

  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
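The reason years need a wider type comes down to value ranges: a signed 8-bit integer tops out at 127. The following standard-library-only sketch checks this; FitsIn() is a hypothetical helper written for this illustration and does not use Arrow at all.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if `value` can be stored in the integer type T.
template <typename T>
bool FitsIn(int64_t value) {
  return value >= std::numeric_limits<T>::min() &&
         value <= std::numeric_limits<T>::max();
}
```

Under this check, a day such as 28 fits in int8_t, while a year such as 1990 does not fit in int8_t but does fit in int16_t.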

Now, we have three Arrow Arrays, with some variance in type.

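Making a RecordBatch

A columnar data format only really comes into play when you have a table. So, let's make one. The first kind we'll make is the RecordBatch. It uses Arrays internally, which means all data is contiguous within each column, but any appending or concatenating requires copying. Making a RecordBatch has two steps, given existing Arrays:

  1. Defining a Schema

  2. Loading the Schema and Arrays into the constructor

Defining a Schema

To get started making a RecordBatch, we first need to define characteristics of the columns, each represented by a Field instance. Each Field contains a name and data type for its associated column; then, a Schema groups them together and sets the order of the columns, like so: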

  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});

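Building a RecordBatch

With data in Arrays from the previous section, and column descriptions in our Schema from the previous step, we can make the RecordBatch. Note that the length of the columns is necessary, and the length is shared by all columns.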

  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();

Now, we have our data in a nice tabular form, safely within the RecordBatch. What we can do with this will be discussed in the later tutorials.

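Making a ChunkedArray

Let's say that we want an array made up of sub-arrays, because it can be useful for avoiding data copies when concatenating, for parallelizing work, for fitting each chunk into cache, or for exceeding the 2,147,483,647 row limit in a standard Arrow Array. For this, Arrow offers ChunkedArray, which can be made up of individual Arrow Arrays. In this example, we can reuse the arrays we made earlier in part of our chunked array, allowing us to extend them without having to copy data. So, let's build a few more Arrays, using the same builders for ease of use: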

  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());

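In order to support an arbitrary number of Arrays in the construction of the ChunkedArray, Arrow supplies ArrayVector. This provides a vector for Arrays, and we'll use it here to prepare to make a ChunkedArray: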

  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};

In order to leverage Arrow, we do need to take that last step, and move into a ChunkedArray:

  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);

With a ChunkedArray for our day values, we now just need to repeat the process for the month and year data:

  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);

With that, we are left with three ChunkedArrays, varying in type.

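Making a Table

One particularly useful thing we can do with the ChunkedArrays from the previous section is creating Tables. Much like a RecordBatch, a Table stores tabular data. However, a Table does not guarantee contiguity, due to being made up of ChunkedArrays. This can be useful for logic, for parallelizing work, for fitting chunks into cache, or for exceeding the 2,147,483,647 row limit present in Array and, thus, RecordBatch.

If you read up to RecordBatch, you may note that the Table constructor in the following code is effectively identical; it just happens to put the length of the columns in position 3 and makes a Table. We reuse the Schema from before, and make our Table: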

  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();

Now, we have our data in a nice tabular form, safely within the Table. What we can do with this will be discussed in the later tutorials.

Ending Program

At the end, we just return Status::OK(), so the main() knows that we're done, and that everything's okay.

  return arrow::Status::OK();
}

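Wrapping Up

With that, you've created the fundamental data structures in Arrow, and can proceed to getting them in and out of a program with file I/O in the next article.

Refer to the below for a copy of the complete code: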

// (Doc section: Includes)
#include <arrow/api.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: RunMain Start)
arrow::Status RunMain() {
  // (Doc section: RunMain Start)
  // (Doc section: int8builder 1 Append)
  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  // (Doc section: int8builder 1 Append)

  // (Doc section: int8builder 1 Finish)
  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());
  // (Doc section: int8builder 1 Finish)

  // (Doc section: int8builder 2)
  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
  // (Doc section: int8builder 2)

  // (Doc section: int16builder)
  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
  // (Doc section: int16builder)

  // (Doc section: Schema)
  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});
  // (Doc section: Schema)

  // (Doc section: RBatch)
  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();
  // (Doc section: RBatch)

  // (Doc section: More Arrays)
  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());
  // (Doc section: More Arrays)

  // (Doc section: ArrayVector)
  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};
  // (Doc section: ArrayVector)
  // (Doc section: ChunkedArray Day)
  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);
  // (Doc section: ChunkedArray Day)

  // (Doc section: ChunkedArray Month Year)
  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);
  // (Doc section: ChunkedArray Month Year)

  // (Doc section: Table)
  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();
  // (Doc section: Table)

  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

// (Doc section: Main)
161// (Doc section: Main)