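5 Defining Data Types
5.1 Introduction
As discussed in previous chapters, Arrow automatically infers the most appropriate data type when reading in data or converting R objects to Arrow objects. However, you might want to tell Arrow manually which data types to use, for example, to ensure interoperability with databases and data warehouse systems. This chapter includes recipes for:
- changing the data types of existing Arrow objects
- defining data types during the process of creating Arrow objects
A table showing the default mappings between R and Arrow data types can be found in R data type to Arrow data type mapping.
A table containing Arrow data types and their R equivalents can be found in Arrow data type to R data type mapping.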
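As a quick illustration of the automatic inference described in the introduction, the following sketch creates Arrow Arrays from a few R vectors and prints the inferred types. The vector values are made up for illustration, and the sketch assumes the arrow package is installed.

```r
library(arrow)

# Illustrative vectors (not taken from the recipes in this chapter)
int_type <- Array$create(1:3)$type
dbl_type <- Array$create(c(1.5, 2.5))$type
chr_type <- Array$create(c("a", "b"))$type

# ToString() returns the type name as a plain character string
int_type$ToString()
## [1] "int32"
dbl_type$ToString()
## [1] "double"
chr_type$ToString()
## [1] "string"
```

These defaults are the starting point that the recipes in this chapter override.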
5.2 Update data type of an existing Arrow Array
You want to change the data type of an existing Arrow Array.
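A minimal sketch of one way to do this, assuming the arrow package is installed: create an Array and call its cast() method with the desired target type. The values 1:5 and the choice of uint8() are illustrative assumptions, not part of the text above.

```r
library(arrow)

# Create an Array to cast; R integers map to int32 by default
integer_arr <- Array$create(1:5)

# Cast to an unsigned 8-bit integer type
uint_arr <- integer_arr$cast(target_type = uint8())

# Inspect the type of the new Array
uint_arr$type$ToString()
```

The cast returns a new Array and leaves integer_arr unchanged, so the original type remains available for comparison.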
5.3 Update data type of a field in an existing Arrow Table
You want to change the type of one or more fields in an existing Arrow Table.
5.3.1 Solution
library(arrow)
# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
# Convert tibble to an Arrow table
oscars_arrow <- arrow_table(oscars)
# The default mapping from numeric column "num_awards" is to a double
oscars_arrow
## Table
## 3 rows x 2 columns
## $actor <string>
## $num_awards <double>
# Set up schema with "num_awards" as integer
oscars_schema <- schema(actor = string(), num_awards = int16())
# Cast to an int16
oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema)
oscars_arrow_int
## Table
## 3 rows x 2 columns
## $actor <string>
## $num_awards <int16>
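5.3.2 Discussion
Some Arrow data types do not have an R equivalent. Attempting to cast to these data types, or using a schema that contains them, results in an error.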
library(arrow)
# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
# Convert tibble to an Arrow table
oscars_arrow <- arrow_table(oscars)
# Set up schema with "num_awards" as float16 which doesn't have an R equivalent
oscars_schema_invalid <- schema(actor = string(), num_awards = float16())
# Casting the double column "num_awards" to float16 fails
oscars_arrow$cast(target_schema = oscars_schema_invalid)
## Error: NotImplemented: Unsupported cast from double to halffloat using function cast_half_float
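Because such a cast can fail at run time, a script may want to catch the error rather than stop. The sketch below wraps the cast in base R's tryCatch(). Falling back to the original table is an assumption made for illustration, and whether the cast actually raises an error may depend on the installed Arrow version.

```r
library(arrow)

oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3),
  stringsAsFactors = FALSE
)
oscars_arrow <- arrow_table(oscars)
oscars_schema_invalid <- schema(actor = string(), num_awards = float16())

# Attempt the cast; on error, report the message and keep the original table
result <- tryCatch(
  oscars_arrow$cast(target_schema = oscars_schema_invalid),
  error = function(e) {
    message("Cast failed: ", conditionMessage(e))
    oscars_arrow
  }
)

# Either the cast table or the untouched original, depending on the outcome
result$schema
```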
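5.4 Specify data types when creating an Arrow table from an R object
You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
5.4.1 Solution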
library(arrow)
# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
# Set up schema with "num_awards" as integer
oscars_schema <- schema(actor = string(), num_awards = int16())
# Create an Arrow Table containing the data and the schema
oscars_data_arrow <- arrow_table(oscars, schema = oscars_schema)
oscars_data_arrow
## Table
## 3 rows x 2 columns
## $actor <string>
## $num_awards <int16>
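To confirm programmatically that the requested type was applied, rather than reading it off the printed table, one option is to look up the field in the table's schema. This is a sketch under the assumption that the arrow package is installed; the variable name num_awards_type is introduced here for illustration.

```r
library(arrow)

oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3),
  stringsAsFactors = FALSE
)
oscars_schema <- schema(actor = string(), num_awards = int16())
oscars_data_arrow <- arrow_table(oscars, schema = oscars_schema)

# Look up the field by name and report its type as a string
num_awards_type <- oscars_data_arrow$schema$GetFieldByName("num_awards")$type
num_awards_type$ToString()
```

A check like this can be placed in a test or pipeline step so that a change in type inference does not go unnoticed.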
5.5 Specify data types when reading in files
You want to manually specify Arrow data types when reading in files.
5.5.1 Solution
library(arrow)
# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
# Write dataset to disk
write_dataset(oscars, path = "oscars_data")
# Set up schema with "num_awards" as integer
oscars_schema <- schema(actor = string(), num_awards = int16())
# Read the dataset in, using the schema instead of inferring the type automatically
oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema)
oscars_dataset_arrow
## FileSystemDataset with 1 Parquet file
## actor: string
## num_awards: int16
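The same idea can be applied when reading a single CSV file. The sketch below is an illustration rather than part of the recipe above: it writes the example data to a temporary CSV file with base R and reads it back with read_csv_arrow(), passing a schema. The temporary path and the variable names csv_path and oscars_csv_arrow are assumptions introduced here.

```r
library(arrow)

oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3),
  stringsAsFactors = FALSE
)

# Write the data to a temporary CSV file using base R
csv_path <- tempfile(fileext = ".csv")
write.csv(oscars, csv_path, row.names = FALSE)

oscars_schema <- schema(actor = string(), num_awards = int16())

# When a schema is supplied it provides the column names as well as the
# types, so the header row in the file is skipped
oscars_csv_arrow <- read_csv_arrow(
  csv_path,
  schema = oscars_schema,
  skip = 1,
  as_data_frame = FALSE
)
oscars_csv_arrow$schema
```

Setting as_data_frame = FALSE keeps the result as an Arrow Table, so the int16 type is preserved instead of being converted back to an R integer vector.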