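Basic Arrow Data Structures#
Apache Arrow provides fundamental data structures for representing data: Array, ChunkedArray, RecordBatch, and Table. This article shows how to construct these data structures from primitive data types; specifically, we will work with integers of varying size representing days, months, and years. We will use them to create the following data structures:
Arrow Arrays
RecordBatch, from Arrays
Table, from ChunkedArrays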
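Prerequisites#
Before continuing, make sure you have:
An Arrow installation, which you can set up here: Using Arrow C++ in your own project
An understanding of how to use basic C++ data structures
An understanding of basic C++ data types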
Setup#
Before trying out Arrow, we need to fill in a couple of gaps:
We need to include the necessary headers.
A main() is needed to glue everything together.
Includes#
First, we need some includes. We'll get iostream for output, then import Arrow's basic functionality from api.h, like so:
#include <arrow/api.h>
#include <iostream>
Main()#
Next, we need a main(). A common pattern with Arrow looks like the following:
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
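This allows us to easily use Arrow's error-handling macros. If a failure occurs, these macros return an arrow::Status object to main(), which then reports the error. Note that this means Arrow never raises exceptions, instead relying upon returning Status objects. For more on that, read here: Conventions.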
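The status-returning pattern described above can be illustrated without Arrow at all. The following is a minimal standard-library sketch, not Arrow's implementation: `MiniStatus`, `MINI_RETURN_NOT_OK`, `Step`, and `RunSteps` are hypothetical names invented for this illustration. It shows how a macro can hand a failure back to the caller instead of throwing.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for a status object (this is not arrow::Status).
class MiniStatus {
 public:
  static MiniStatus OK() { return MiniStatus(true, ""); }
  static MiniStatus Invalid(std::string message) {
    return MiniStatus(false, std::move(message));
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  MiniStatus(bool ok, std::string message)
      : ok_(ok), message_(std::move(message)) {}
  bool ok_;
  std::string message_;
};

// Hypothetical macro: evaluate an expression that yields a MiniStatus and
// return early from the enclosing function if it reports a failure.
#define MINI_RETURN_NOT_OK(expr)                 \
  do {                                           \
    MiniStatus mini_status_ = (expr);            \
    if (!mini_status_.ok()) return mini_status_; \
  } while (0)

// A named step that fails on request.
MiniStatus Step(const std::string& name, bool should_fail) {
  if (should_fail) return MiniStatus::Invalid(name + " failed");
  return MiniStatus::OK();
}

// Runs two steps in order; the second never runs if the first one fails.
MiniStatus RunSteps(bool fail_first, bool fail_second) {
  MINI_RETURN_NOT_OK(Step("first", fail_first));
  MINI_RETURN_NOT_OK(Step("second", fail_second));
  return MiniStatus::OK();
}
```

Calling `RunSteps(true, true)` in this sketch yields a failed status whose message names only the first step, because the early return skips everything after the first failure.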
To accompany this main(), we have a RunMain() from which any Status object can be returned. This is where we'll write the rest of the program:
arrow::Status RunMain() {
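Making an Arrow Array#
Building an int8 Array#
Given that we have some data in standard C++ arrays and want to use Arrow, we need to move the data from those arrays into Arrow arrays. An Array still guarantees contiguous memory, so there is no performance penalty to worry about when using an Array instead of a C++ array. The easiest way to construct an Array is with an ArrayBuilder.
The following code initializes an ArrayBuilder for an Array that will hold 8-bit integers. Specifically, it uses the AppendValues() method, present in concrete arrow::ArrayBuilder subclasses, to fill the ArrayBuilder with the contents of a standard C++ array. Note the use of ARROW_RETURN_NOT_OK. If AppendValues() fails, this macro returns the failing Status from RunMain() to main(), which prints out the error.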
// Builders are the main way to create Arrays in Arrow from existing values that are not
// on-disk. In this case, we'll make a simple array, and feed that in.
// Data types are important as ever, and there is a Builder for each compatible type;
// in this case, int8.
arrow::Int8Builder int8builder;
int8_t days_raw[5] = {1, 12, 17, 23, 28};
// AppendValues, as called, puts 5 values from days_raw into our Builder object.
ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
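Given an ArrayBuilder that holds the values we want in our Array, we can use ArrayBuilder::Finish() to output the final structure to an Array; more specifically, to a std::shared_ptr<arrow::Array>. Note the use of ARROW_ASSIGN_OR_RAISE in the following code. Finish() outputs an arrow::Result object, which ARROW_ASSIGN_OR_RAISE can process. If the method fails, it returns to main() with a Status object that explains what went wrong. If it succeeds, it assigns the final output to the left-hand variable.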
// We only have a Builder though, not an Array -- the following code pushes out the
// built up data into a proper Array.
std::shared_ptr<arrow::Array> days;
ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());
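The assign-or-return behavior described above can be sketched with the standard library alone. This is an illustration of the idea, not Arrow's implementation: `MiniStatus`, `MiniResult`, `MINI_ASSIGN_OR_RAISE`, `MakeNumber`, and `UseNumber` are hypothetical names introduced here.

```cpp
#include <string>
#include <utility>

// Hypothetical status type: an ok flag plus an error message.
struct MiniStatus {
  bool ok;
  std::string message;
};

// Hypothetical result type: a status, plus a value that is meaningful only
// when the status is ok.
template <typename T>
struct MiniResult {
  MiniStatus status;
  T value;
};

// Hypothetical macro: on failure, return the status from the enclosing
// function; on success, move the value into the left-hand variable.
#define MINI_ASSIGN_OR_RAISE(lhs, expr)                      \
  do {                                                       \
    auto mini_result_ = (expr);                              \
    if (!mini_result_.status.ok) return mini_result_.status; \
    lhs = std::move(mini_result_.value);                     \
  } while (0)

// Produces a value unless asked to fail.
MiniResult<int> MakeNumber(bool should_fail) {
  if (should_fail) return {{false, "could not make a number"}, 0};
  return {{true, ""}, 42};
}

// Writes the produced number to *out, or returns the failure untouched.
MiniStatus UseNumber(bool should_fail, int* out) {
  int number = 0;
  MINI_ASSIGN_OR_RAISE(number, MakeNumber(should_fail));
  *out = number;
  return {true, ""};
}
```

In this sketch, a failing `MakeNumber` makes `UseNumber` return before `*out` is written, which mirrors how the left-hand variable is assigned only on success.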
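As soon as an ArrayBuilder has had its Finish method called, its state resets, so it can be used again as if it were fresh. Thus, we repeat the process above for our second array: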
// Builders clear their state every time they fill an Array, so if the type is the same,
// we can re-use the builder. We do that here for month values.
int8_t months_raw[5] = {1, 3, 5, 7, 1};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
std::shared_ptr<arrow::Array> months;
ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
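To make the reset-on-finish behavior concrete, here is a small standard-library sketch of a builder that hands off its contents and then starts over. It is only a model of the idea: `MiniInt8Builder` and `Int8Values` are hypothetical names, and Arrow's real builders manage their buffers differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using Int8Values = std::vector<std::int8_t>;

// Hypothetical builder: accumulates values, then hands them off as a shared,
// read-only array and resets itself.
class MiniInt8Builder {
 public:
  // Copies `count` values from a plain C++ array into the builder.
  void AppendValues(const std::int8_t* values, std::size_t count) {
    pending_.insert(pending_.end(), values, values + count);
  }

  // Moves the accumulated values into a shared array, then clears the
  // builder so it can be reused as if newly constructed.
  std::shared_ptr<const Int8Values> Finish() {
    auto out = std::make_shared<Int8Values>(std::move(pending_));
    pending_.clear();
    return out;
  }

  // Number of values appended since the last Finish().
  std::size_t length() const { return pending_.size(); }

 private:
  Int8Values pending_;
};
```

Appending the day values, calling `Finish()`, and then appending the month values to the same `MiniInt8Builder` produces two independent arrays; the first is unaffected by the second round of appends.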
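Building an int16 Array#
An ArrayBuilder has its type specified at the time of declaration. Once this is done, its type cannot be changed. Now that we are switching to year data, which requires at least a 16-bit integer, we have to make a new ArrayBuilder. Of course, Arrow provides an ArrayBuilder for that. It uses the exact same methods, but with the new data type: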
// Now that we change to int16, we use the Builder for that data type instead.
arrow::Int16Builder int16builder;
int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
std::shared_ptr<arrow::Array> years;
ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
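The reason years need a wider type than days and months can be checked directly with the standard library. The helper `FitsIn` below is a hypothetical name introduced for this sketch; it relies only on `std::numeric_limits`.

```cpp
#include <cstdint>
#include <limits>

// Returns true if `value` can be stored in integer type T without overflow.
template <typename T>
constexpr bool FitsIn(long long value) {
  return value >= static_cast<long long>(std::numeric_limits<T>::min()) &&
         value <= static_cast<long long>(std::numeric_limits<T>::max());
}

// Day and month values fit in 8 bits; the years used in this article do not.
static_assert(FitsIn<std::int8_t>(28), "days fit in int8");
static_assert(FitsIn<std::int8_t>(12), "months fit in int8");
static_assert(!FitsIn<std::int8_t>(1990), "years do not fit in int8");
static_assert(FitsIn<std::int16_t>(1990), "years fit in int16");
```

Because an 8-bit signed integer tops out at 127, a value such as 1990 has to move to a 16-bit type, which is why a second builder is declared.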
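Now, we have three Arrow Arrays, with some variance in type.
Making a RecordBatch#
A columnar data format only really comes into play when you have a table. So, let's make one. The first kind we'll make is the RecordBatch. It uses Arrays internally, which means all data is contiguous within each column, but any appending or concatenating requires copying. Given existing Arrays, making a RecordBatch takes two steps, covered in the next two subsections: defining a Schema, then building the RecordBatch from that Schema and the Arrays.
Defining a Schema#
To get started making a RecordBatch, we first need to define the characteristics of the columns, each represented by a Field instance. Each Field contains a name and data type for its associated column; then, a Schema groups them together and sets the order of the columns, like so: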
// Now, we want a RecordBatch, which has columns and labels for said columns.
// This gets us to the 2d data structures we want in Arrow.
// These are defined by schema, which have fields -- here we get both those object types
// ready.
std::shared_ptr<arrow::Field> field_day, field_month, field_year;
std::shared_ptr<arrow::Schema> schema;
// Every field needs its name and data type.
field_day = arrow::field("Day", arrow::int8());
field_month = arrow::field("Month", arrow::int8());
field_year = arrow::field("Year", arrow::int16());
// The schema can be built from a vector of fields, and we do so here.
schema = arrow::schema({field_day, field_month, field_year});
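Building a RecordBatch#
With data in Arrays from the previous section and column descriptions in our Schema from the previous step, we can make the RecordBatch. Note that the length of the columns is required, and that all columns share the same length.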
// With the schema and Arrays full of data, we can make our RecordBatch! Here,
// each column is internally contiguous. This is in opposition to Tables, which we'll
// see next.
std::shared_ptr<arrow::RecordBatch> rbatch;
// The RecordBatch needs the schema, length for columns, which all must match,
// and the actual data itself.
rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});
std::cout << rbatch->ToString();
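Since every column must share one length, it can be handy to verify that before assembling a batch. The following standard-library sketch models columns as plain vectors; `AllColumnsHaveLength` is a hypothetical helper written for this illustration, and whether or when Arrow itself performs such a check is not covered here.

```cpp
#include <cstddef>
#include <vector>

// Returns true if every column holds exactly `num_rows` values.
// Columns are modeled as plain vectors of int for illustration only.
bool AllColumnsHaveLength(const std::vector<std::vector<int>>& columns,
                          std::size_t num_rows) {
  for (const auto& column : columns) {
    if (column.size() != num_rows) return false;
  }
  return true;
}
```

With the day, month, and year values from this article, each column has five entries, so a check against a length of 5 passes; a column with a different number of entries would be flagged.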
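Now, we have our data in a nice tabular form, safely within the RecordBatch. What we can do with it will be discussed in later tutorials.
Making a ChunkedArray#
Let's say that we want an array made up of sub-arrays, because that can be useful for avoiding data copies when concatenating, for parallelizing work, for fitting each chunk into cache, or for exceeding the 2,147,483,647 row limit in a standard Arrow Array. For this, Arrow offers ChunkedArray, which can be made up of individual Arrow Arrays. In this example, we can reuse the arrays we made earlier as part of our chunked arrays, allowing us to extend them without copying data. So, let's build a few more Arrays, using the same builders for ease of use: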
// Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
// Builders.
int8_t days_raw2[5] = {6, 12, 3, 30, 22};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
std::shared_ptr<arrow::Array> days2;
ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());
int8_t months_raw2[5] = {5, 4, 11, 3, 2};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
std::shared_ptr<arrow::Array> months2;
ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());
int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
std::shared_ptr<arrow::Array> years2;
ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());
In order to support an arbitrary number of Arrays in the construction of a ChunkedArray, Arrow supplies ArrayVector. This provides a vector for Arrays, and we'll use it here to prepare to make a ChunkedArray:
// ChunkedArrays let us have a list of arrays, which aren't contiguous
// with each other. First, we get a vector of arrays.
arrow::ArrayVector day_vecs{days, days2};
In order to leverage Arrow, we need to take one last step and turn the vector into a ChunkedArray:
// Then, we use that to initialize a ChunkedArray, which can be used with other
// functions in Arrow! This is good, since having a normal vector of arrays wouldn't
// get us far.
std::shared_ptr<arrow::ChunkedArray> day_chunks =
    std::make_shared<arrow::ChunkedArray>(day_vecs);
With a ChunkedArray for our day values, we now just need to repeat the process for the month and year data:
// Repeat for months.
arrow::ArrayVector month_vecs{months, months2};
std::shared_ptr<arrow::ChunkedArray> month_chunks =
    std::make_shared<arrow::ChunkedArray>(month_vecs);
// Repeat for years.
arrow::ArrayVector year_vecs{years, years2};
std::shared_ptr<arrow::ChunkedArray> year_chunks =
    std::make_shared<arrow::ChunkedArray>(year_vecs);
With that, we are left with three ChunkedArrays, varying in type.
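The idea of one logical column backed by several separately allocated pieces can be modeled in a few lines of standard C++. The sketch below is a conceptual illustration only: `MiniChunkedArray` and `Chunk` are hypothetical names, and Arrow's ChunkedArray is implemented differently. It shows that chunks are shared rather than copied, and how a logical index maps onto a position inside a chunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using Chunk = std::vector<std::int8_t>;

// Hypothetical chunked array: one logical column backed by several
// separately allocated chunks that are shared, not copied.
class MiniChunkedArray {
 public:
  explicit MiniChunkedArray(std::vector<std::shared_ptr<const Chunk>> chunks)
      : chunks_(std::move(chunks)) {}

  std::size_t num_chunks() const { return chunks_.size(); }

  // Total number of values across all chunks.
  std::size_t length() const {
    std::size_t total = 0;
    for (const auto& chunk : chunks_) total += chunk->size();
    return total;
  }

  // Looks up a logical index by walking the chunks in order. Returns false
  // if the index is past the end; otherwise stores the value in *out.
  bool GetValue(std::size_t index, std::int8_t* out) const {
    for (const auto& chunk : chunks_) {
      if (index < chunk->size()) {
        *out = (*chunk)[index];
        return true;
      }
      index -= chunk->size();
    }
    return false;
  }

  // Access to the underlying chunk, which is the same object that was
  // passed in at construction.
  const std::shared_ptr<const Chunk>& chunk(std::size_t i) const {
    return chunks_[i];
  }

 private:
  std::vector<std::shared_ptr<const Chunk>> chunks_;
};
```

Built from the two day arrays used in this article, such an object would report a length of 10 while still pointing at the original two five-element chunks; logical index 7, for instance, resolves to the third value of the second chunk.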
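Making a Table#
One particularly useful thing we can do with the ChunkedArrays from the previous section is create a Table. Much like a RecordBatch, a Table stores tabular data. However, a Table does not guarantee contiguity, because it is made up of ChunkedArrays. This can be useful for logic, for parallelizing work, for fitting chunks into cache, or for exceeding the 2,147,483,647 row limit present in Array and, thus, RecordBatch.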
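The row limit quoted above, 2,147,483,647, is the largest value of a signed 32-bit integer. As a rough illustration of how chunking gets past a per-array cap, the sketch below computes how many chunks a given row count would need. `kMaxRowsPerChunk` and `ChunksNeeded` are hypothetical names for this example, and the calculation is plain ceiling division rather than anything Arrow-specific.

```cpp
#include <cstdint>
#include <limits>

// Per-chunk row cap quoted in this article: 2,147,483,647, the largest
// value of a signed 32-bit integer.
constexpr std::int64_t kMaxRowsPerChunk =
    std::numeric_limits<std::int32_t>::max();

// Smallest number of chunks needed to hold `total_rows` rows when each chunk
// holds at most `max_rows_per_chunk` rows (ceiling division).
constexpr std::int64_t ChunksNeeded(
    std::int64_t total_rows,
    std::int64_t max_rows_per_chunk = kMaxRowsPerChunk) {
  return (total_rows + max_rows_per_chunk - 1) / max_rows_per_chunk;
}
```

Under this assumption, a column of five billion rows would need at least three chunks, while the ten rows used in this article fit comfortably in one (the example simply happens to use two).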
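If you read the RecordBatch section, you may notice that the Table constructor in the following code is effectively identical; it just puts the length of the columns in position 3 and makes a Table. We reuse the Schema from before and make our Table: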
// A Table is the structure we need for these non-contiguous columns, and keeps them
// all in one place for us so we can use them as if they were normal arrays.
std::shared_ptr<arrow::Table> table;
table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);
std::cout << table->ToString();
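Now, we have our data in a nice tabular form, safely within the Table. What we can do with it will be discussed in later tutorials.
Ending Program#
At the end, we just return Status::OK(), so that main() knows that we're done and that everything went fine.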
return arrow::Status::OK();
}
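Wrapping Up#
With that, you've created the fundamental data structures in Arrow, and can proceed to getting them in and out of a program with file I/O in the next article.
Refer to the below for a copy of the complete code: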
// (Doc section: Includes)
#include <arrow/api.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: RunMain Start)
arrow::Status RunMain() {
  // (Doc section: RunMain Start)
  // (Doc section: int8builder 1 Append)
  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  // (Doc section: int8builder 1 Append)

  // (Doc section: int8builder 1 Finish)
  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());
  // (Doc section: int8builder 1 Finish)

  // (Doc section: int8builder 2)
  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
  // (Doc section: int8builder 2)

  // (Doc section: int16builder)
  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
  // (Doc section: int16builder)

  // (Doc section: Schema)
  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});
  // (Doc section: Schema)

  // (Doc section: RBatch)
  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();
  // (Doc section: RBatch)

  // (Doc section: More Arrays)
  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());
  // (Doc section: More Arrays)

  // (Doc section: ArrayVector)
  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};
  // (Doc section: ArrayVector)
  // (Doc section: ChunkedArray Day)
  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);
  // (Doc section: ChunkedArray Day)

  // (Doc section: ChunkedArray Month Year)
  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);
  // (Doc section: ChunkedArray Month Year)

  // (Doc section: Table)
  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();
  // (Doc section: Table)

  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

// (Doc section: Main)