Metadata • Arrow R Package

This article describes the data and metadata object types provided by arrow and documents how those objects are structured.

Arrow metadata classes

The arrow package defines the following classes for representing metadata:

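A Schema is a list of Field objects that describes the structure of a tabular data object; where
A Field specifies a character string name and a DataType; and
A DataType is an attribute that controls how values are represented.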

Consider this example:

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(df)
tb$schema

## Schema
## x: int32
## y: string
## 
## See $metadata for additional Schema metadata
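
The relationship between the three classes can be inspected directly on the Table created above. This is a minimal sketch: the accessors `$names`, `$GetFieldByName()`, `$ToString()`, and `$Equals()` are not used elsewhere in this article and are assumed here to be available on the package's R6 objects.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(df)

# The Schema holds one Field per column
sch <- tb$schema
sch$names

# A Field pairs a name with a DataType
fld <- sch$GetFieldByName("y")
fld$name

# The DataType describes how the values are represented
fld$type$ToString()
fld$type$Equals(utf8())
```

Walking from Schema to Field to DataType in this way is useful when a column's Arrow type needs to be checked programmatically rather than read from printed output.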

The schema that was inferred automatically can also be created manually:

schema(
  field(name = "x", type = int32()),
  field(name = "y", type = utf8())
)

## Schema
## x: int32
## y: string

The schema() function also accepts the following shorthand for defining fields:

schema(x = int32(), y = utf8())

## Schema
## x: int32
## y: string
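
As a quick check that the two forms describe the same structure, the schemas can be compared. This sketch assumes the `$Equals()` method on Schema objects, which is not shown in this article; the variable names `long_form` and `short_form` are illustrative only.

```r
library(arrow)

# Explicit form: one field() call per column
long_form <- schema(
  field(name = "x", type = int32()),
  field(name = "y", type = utf8())
)

# Shorthand form: name = type pairs
short_form <- schema(x = int32(), y = utf8())

# TRUE when field names and types match
long_form$Equals(short_form)
```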

Sometimes it is important to specify the schema manually, particularly when you want fine-grained control over the Arrow data types:

arrow_table(df, schema = schema(x = int64(), y = utf8()))

## Table
## 3 rows x 2 columns
## $x <int64>
## $y <string>
## 
## See $metadata for additional Schema metadata

arrow_table(df, schema = schema(x = float64(), y = utf8()))

## Table
## 3 rows x 2 columns
## $x <double>
## $y <string>
## 
## See $metadata for additional Schema metadata
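
To confirm which type a column ended up with, the inferred and explicitly specified Tables can be compared side by side. This is a sketch under the assumption that `tb$x` returns the column as an Arrow object exposing a `$type` field; `inferred` and `explicit` are illustrative names.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Type inferred from the R integer vector
inferred <- arrow_table(df)

# Type set explicitly through a schema
explicit <- arrow_table(df, schema = schema(x = int64(), y = utf8()))

inferred$x$type$ToString()
explicit$x$type$ToString()
```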

R object attributes

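Arrow supports custom key-value metadata attached to Schemas. When a data.frame is converted to an Arrow Table or RecordBatch, the package stores any attributes() attached to the data.frame and its columns in the Schema of the Arrow object. Attributes added this way are stored under the r key, as shown below: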

# data frame with custom metadata
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df, "df_meta") <- "custom data frame metadata"
attr(df$y, "col_meta") <- "custom column metadata"

# when converted to a Table, the metadata is preserved
tb <- arrow_table(df)
tb$metadata

## $r
## $r$attributes
## $r$attributes$df_meta
## [1] "custom data frame metadata"
## 
## 
## $r$columns
## $r$columns$x
## NULL
## 
## $r$columns$y
## $r$columns$y$attributes
## $r$columns$y$attributes$col_meta
## [1] "custom column metadata"
## 
## 
## $r$columns$y$columns
## NULL

You can also assign additional string metadata under any other key you wish, using a command like this:

tb$metadata$new_key <- "new value"
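
A value assigned this way can be read back through the same accessor. The following is a minimal sketch; the key name `new_key` is the illustrative one used above and carries no special meaning.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(df)

# Store a string value under a custom key
tb$metadata$new_key <- "new value"

# Read it back from the Table's metadata
tb$metadata$new_key
```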

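Metadata attached to a Schema is preserved when the Table is written to Arrow/Feather or Parquet format. When those files are read into R, or when as.data.frame() is called on a Table or RecordBatch, the column attributes are restored to the columns of the resulting data.frame. This means that custom data types, including haven::labelled, vctrs annotations, and others, are preserved on a round trip through Arrow.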
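
The round trip can be sketched with a temporary Feather file. This example assumes the package's `write_feather()` and `read_feather()` functions with their default arguments, and reuses the `col_meta` attribute from the earlier example.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df$y, "col_meta") <- "custom column metadata"

# Write to a temporary Feather file and read it back into R
path <- tempfile(fileext = ".feather")
write_feather(df, path)
df2 <- read_feather(path)

# The column attribute is restored on the result
attr(df2$y, "col_meta")

# The same holds when converting a Table back in memory
df3 <- as.data.frame(arrow_table(df))
attr(df3$y, "col_meta")
```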

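Note that the attributes stored in $metadata$r are understood only by R. If you write a data.frame with haven columns to a Feather file and read it in Pandas, the haven metadata will not be recognized there. Similarly, Pandas writes its own custom metadata, which the R package does not consume. You are free, however, to define custom metadata conventions for your application and to assign any (string) values you want to other metadata keys.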

Further reading

To learn more about arrow metadata, see the documentation for schema().
To learn more about data types, see the data types article.