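5 Defining Data Types

5.1 Introduction

As discussed in previous chapters, Arrow automatically infers the most appropriate data type when reading in data or converting R objects to Arrow objects. However, you might want to tell Arrow manually which data types to use, for example, to ensure interoperability with databases and data warehouse systems. This chapter includes recipes for: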

  • Changing the data types of existing Arrow objects
  • Defining data types while creating Arrow objects

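A table showing the default mappings between R and Arrow data types can be found in R data type to Arrow data type mappings.

A table containing Arrow data types and their R equivalents can be found in Arrow data type to R data type mapping.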
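To see the default inference in action, the short sketch below builds Arrays from a few common R vector types and reads back the type Arrow chose. The variable names are illustrative only and are not part of the arrow package.

```r
library(arrow)

# Build Arrays from common R vector types and read back the inferred Arrow type
int_type <- Array$create(1:3)$type             # from an R integer vector
dbl_type <- Array$create(c(1.5, 2.5))$type     # from an R double vector
chr_type <- Array$create(c("a", "b"))$type     # from an R character vector
lgl_type <- Array$create(c(TRUE, FALSE))$type  # from an R logical vector

# ToString() returns the short type name as a character string
int_type$ToString()
dbl_type$ToString()
chr_type$ToString()
lgl_type$ToString()
```

Checking the inferred type first makes it easier to decide whether a manual override is needed at all.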

5.2 Update data type of an existing Arrow Array

You want to change the data type of an existing Arrow Array.

5.2.1 Solution

# Create an Array to cast
integer_arr <- Array$create(1:5)

# Cast to an unsigned int8 type
uint_arr <- integer_arr$cast(target_type = uint8())

uint_arr
## Array
## <uint8>
## [
##   1,
##   2,
##   3,
##   4,
##   5
## ]

5.2.2 Discussion

Some data types are not compatible with each other. If you try to cast between incompatible data types, Arrow raises an error.

int_arr <- Array$create(1:5)
int_arr$cast(target_type = binary())
## Error: NotImplemented: Unsupported cast from int32 to binary using function cast_binary
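In a script, it may be preferable to handle a failed cast rather than let it stop execution. The following is a minimal sketch using base R's tryCatch(); the helper try_cast() is hypothetical and not part of the arrow package.

```r
library(arrow)

# Hypothetical helper (not part of arrow): attempt a cast and return NULL,
# with a message, instead of stopping when Arrow rejects the conversion
try_cast <- function(arr, type) {
  tryCatch(
    arr$cast(target_type = type),
    error = function(e) {
      message("Cast failed: ", conditionMessage(e))
      NULL
    }
  )
}

int_arr <- Array$create(1:5)

# Widening int32 to int64 is a compatible cast
ok <- try_cast(int_arr, int64())

# The int32 to binary cast is the one reported as unsupported above
bad <- try_cast(int_arr, binary())
```

Returning NULL keeps the sketch simple; a real pipeline might instead rethrow the error with more context.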

5.3 Update data type of a field in an existing Arrow Table

You want to change the type of one or more fields in an existing Arrow Table.

5.3.1 Solution

# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)

# Convert tibble to an Arrow table
oscars_arrow <- arrow_table(oscars)

# The default mapping from numeric column "num_awards" is to a double
oscars_arrow
## Table
## 3 rows x 2 columns
## $actor <string>
## $num_awards <double>

# Set up schema with "num_awards" as integer
oscars_schema <- schema(actor = string(), num_awards = int16())

# Cast to an int16
oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema)

oscars_arrow_int
## Table
## 3 rows x 2 columns
## $actor <string>
## $num_awards <int16>
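To confirm how the cast column maps back to R, one option is to convert the Table to a data frame and inspect the column class. This sketch uses a plain data.frame in place of the tibble (an assumption made only to avoid depending on the tibble package).

```r
library(arrow)

# Plain data.frame stand-in for the tibble used in the recipe (assumption)
oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3),
  stringsAsFactors = FALSE
)

# Cast "num_awards" from the default double to int16
oscars_arrow_int <- arrow_table(oscars)$cast(
  target_schema = schema(actor = string(), num_awards = int16())
)

# Convert back to R and compare the column classes
oscars_r <- as.data.frame(oscars_arrow_int)
class(oscars$num_awards)    # the original column
class(oscars_r$num_awards)  # the column after the int16 round trip
```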

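5.3.2 Discussion

Some Arrow data types do not have an R equivalent. Attempting to cast to these data types, or using a schema that contains them, results in an error.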

# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)

# Convert tibble to an Arrow table
oscars_arrow <- arrow_table(oscars)

# Set up schema with "num_awards" as float16 which doesn't have an R equivalent
oscars_schema_invalid <- schema(actor = string(), num_awards = float16())

# Casting with this schema fails: the cast from double to float16 is not supported
oscars_arrow$cast(target_schema = oscars_schema_invalid)
## Error: NotImplemented: Unsupported cast from double to halffloat using function cast_half_float
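If a narrower floating-point type is the goal, one possible fallback is float32(), which R reads back as an ordinary double. This is a sketch of that alternative, not a recommendation from the original recipe; the name oscars_schema_f32 is illustrative, and a plain data.frame stands in for the tibble.

```r
library(arrow)

# Plain data.frame stand-in for the tibble used in the recipe (assumption)
oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3),
  stringsAsFactors = FALSE
)
oscars_arrow <- arrow_table(oscars)

# Illustrative schema using float32 rather than float16
oscars_schema_f32 <- schema(actor = string(), num_awards = float32())
oscars_arrow_f32 <- oscars_arrow$cast(target_schema = oscars_schema_f32)

# Inspect the Arrow type of the cast column
oscars_arrow_f32$num_awards$type
```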

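5.4 Specify data types when creating an Arrow Table from an R object

You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.

5.4.1 Solution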

# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)

# Set up schema with "num_awards" as integer
oscars_schema <- schema(actor = string(), num_awards = int16())

# create arrow Table containing data and schema
oscars_data_arrow <- arrow_table(oscars, schema = oscars_schema)

oscars_data_arrow
## Table
## 3 rows x 2 columns
## $actor <string>
## $num_awards <int16>
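The same idea applies to a single vector: Array$create() accepts a type argument, so the type can be fixed at creation time instead of cast afterwards. A minimal sketch, with illustrative variable names:

```r
library(arrow)

# Without a type, an R double vector is converted using the default mapping
awards_default <- Array$create(c(4, 3, 3))

# With an explicit type, the Array is created as int16 directly
awards_int16 <- Array$create(c(4, 3, 3), type = int16())

awards_default$type
awards_int16$type
```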

5.5 Specify data types when reading in files

You want to manually specify Arrow data types when reading in files.

5.5.1 Solution

# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)

# write dataset to disk
write_dataset(oscars, path = "oscars_data")

# Set up schema with "num_awards" as integer
oscars_schema <- schema(actor = string(), num_awards = int16())

# read the dataset in, using the schema instead of inferring the type automatically
oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema)

oscars_dataset_arrow
## FileSystemDataset with 1 Parquet file
## actor: string
## num_awards: int16
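For a single CSV file rather than a dataset directory, a comparable approach is to pass a schema through the col_types argument of read_csv_arrow(). The sketch below assumes a temporary file written with base R's write.csv() and a plain data.frame in place of the tibble; the names csv_path and oscars_csv_arrow are illustrative.

```r
library(arrow)

# Plain data.frame stand-in for the tibble used in the recipe (assumption)
oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3),
  stringsAsFactors = FALSE
)

# Write a CSV file with a header row to a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(oscars, csv_path, row.names = FALSE)

# Read it back as an Arrow Table, supplying column types instead of inferring them
oscars_csv_arrow <- read_csv_arrow(
  csv_path,
  col_types = schema(actor = string(), num_awards = int16()),
  as_data_frame = FALSE
)

oscars_csv_arrow$schema
```

Column names are still taken from the header row here; only the types come from the supplied schema.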