Create Arrow data types: data-type

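These functions create type objects corresponding to Arrow types. Use them when defining a schema() or as inputs to other types, such as struct. Most of these functions take no arguments, but a few do.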

Usage

int8()

int16()

int32()

int64()

uint8()

uint16()

uint32()

uint64()

float16()

halffloat()

float32()

float()

float64()

boolean()

bool()

utf8()

large_utf8()

binary()

large_binary()

fixed_size_binary(byte_width)

string()

date32()

date64()

time32(unit = c("ms", "s"))

time64(unit = c("ns", "us"))

duration(unit = c("s", "ms", "us", "ns"))

null()

timestamp(unit = c("s", "ms", "us", "ns"), timezone = "")

decimal(precision, scale)

decimal128(precision, scale)

decimal256(precision, scale)

struct(...)

list_of(type)

large_list_of(type)

fixed_size_list_of(type, list_size)

map_of(key_type, item_type, .keys_sorted = FALSE)

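Arguments

byte_width: The byte width for a FixedSizeBinary type.
unit: For time/timestamp types, the time unit. time32() can take either "s" or "ms", while time64() can take "us" or "ns". timestamp() and duration() can take any of those four values.
timezone: For timestamp(), an optional time zone string.
precision: For decimal(), decimal128(), and decimal256(), the number of significant digits the Arrow decimal type can represent. The maximum precision for decimal128() is 38 significant digits, while for decimal256() it is 76 digits. decimal() uses this value to choose which type of decimal to return.
scale: For decimal(), decimal128(), and decimal256(), the number of digits after the decimal point. It can be negative.
...: For struct(), a named list of types to define the struct columns.
type: For list_of(), a data type to make a list-of-type, that is, the type of the list's elements.
list_size: The list size for a FixedSizeList type.
key_type, item_type: For MapType, the key and item types.
.keys_sorted: Use TRUE to assert that the keys of a MapType are sorted.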
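
As a rough illustration of how these arguments fit together, the sketch below builds a schema that uses several of the parameterized types. The field names (id, hash, created, and so on) are made up for this example and are not part of the arrow API.

```r
library(arrow)

# Field names here are illustrative only.
sch <- schema(
  id      = int64(),
  hash    = fixed_size_binary(byte_width = 16),
  created = timestamp("us", timezone = "UTC"),
  tags    = list_of(utf8()),
  coords  = fixed_size_list_of(float64(), list_size = 2),
  attrs   = map_of(utf8(), int32()),
  price   = decimal128(10, 2),
  meta    = struct(source = utf8(), ok = bool())
)

# Printing the schema lists each field with its Arrow type.
sch
```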

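Value

An Arrow type object inheriting from DataType.

Details

A few functions have aliases:

utf8() and string()
float16() and halffloat()
float32() and float()
bool() and boolean()
When called inside an arrow function, such as schema() or cast(), double() is also supported as a way of creating a float64()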
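
One way to confirm the aliases is to compare the resulting types. The sketch below assumes the `$Equals()` method on DataType objects; the variable names are illustrative.

```r
library(arrow)

# Each alias should return a type equal to the one from its counterpart.
alias_checks <- c(
  utf8_string   = utf8()$Equals(string()),
  float16_half  = float16()$Equals(halffloat()),
  float32_float = float32()$Equals(float()),
  bool_boolean  = bool()$Equals(boolean())
)
alias_checks

# double() is interpreted as float64() only inside arrow functions such as schema().
sch <- schema(x = double())
double_is_float64 <- sch$GetFieldByName("x")$type$Equals(float64())
double_is_float64
```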

date32() creates a datetime type with a "day" unit, like the R Date class. date64() has a "ms" unit.
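
The difference in units can be made visible by casting a date to its underlying integer. This is a minimal sketch that assumes the date-to-integer casts are available in the installed Arrow version and that the default int64 downcast option is in effect.

```r
library(arrow)

# An R Date vector is stored as date32 (days since the Unix epoch).
d <- Array$create(as.Date("1970-01-02"))
d$type

# date32 counts days; date64 counts milliseconds.
days_since_epoch <- as.vector(d$cast(int32()))
ms_since_epoch <- as.numeric(as.vector(d$cast(date64())$cast(int64())))

days_since_epoch
ms_since_epoch
```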

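uint32 (32-bit unsigned integer), uint64 (64-bit unsigned integer), and int64 (64-bit signed integer) types may contain values that exceed the range of R's integer type (32-bit signed integer). When these arrow objects are translated to R objects, uint32 and uint64 are converted to double ("numeric") and int64 is converted to bit64::integer64. By default, int64 data whose values all fit in R's integer type is downcast to integer instead; this downcast can be disabled (so that int64 always yields a bit64::integer64 object) by setting options(arrow.int64_downcast = FALSE).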
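
The effect of the arrow.int64_downcast option can be sketched as follows. The variable names are illustrative, and the example only covers int64 values small enough to fit in an R integer.

```r
library(arrow)

a <- Array$create(1:3, type = int64())

# Default setting: the values fit in a 32-bit integer, so R receives integers.
default_class <- class(as.vector(a))

# With the downcast disabled, the same data comes back as integer64.
old <- options(arrow.int64_downcast = FALSE)
strict_class <- class(as.vector(a))
options(old)  # restore the previous setting

default_class
strict_class
```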

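decimal128() creates a Decimal128Type. Arrow decimals are fixed-point decimal numbers encoded as a scaled integer. The precision is the number of significant digits that the decimal type can represent; the scale is the number of digits after the decimal point. For example, the number 1234.567 has a precision of 7 and a scale of 3. Note that scale can be negative.

As an example, decimal128(7, 3) can exactly represent the numbers 1234.567 and -1234.567 (encoded internally as the 128-bit integers 1234567 and -1234567, respectively), but neither 12345.67 nor 123.4567.

decimal128(5, -3) can exactly represent the number 12345000 (encoded internally as the 128-bit integer 12345), but neither 123450000 nor 1234500. The scale can be thought of as an argument that controls rounding. When negative, scale causes the number to be expressed using scientific notation and a power of 10.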
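
The relationship between the stored integer and the represented value can be sketched in plain R. decode_decimal() below is a hypothetical helper written for this illustration, not an arrow function; it uses ordinary doubles, so it only approximates what Arrow stores exactly. The precision and scale fields on the type objects are assumed to be available as shown.

```r
library(arrow)

# Hypothetical helper: value = unscaled integer * 10^(-scale)
decode_decimal <- function(unscaled, scale) unscaled * 10^(-scale)

pos_value <- decode_decimal(1234567, scale = 3)   # the decimal128(7, 3) case
neg_value <- decode_decimal(12345, scale = -3)    # the decimal128(5, -3) case

# The type objects carry the precision and scale they were created with.
t_pos <- decimal128(7, 3)
t_neg <- decimal128(5, -3)
c(precision = t_pos$precision, scale = t_pos$scale)
c(precision = t_neg$precision, scale = t_neg$scale)
```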

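decimal256() creates a Decimal256Type, which allows for a higher maximum precision. For most use cases, the maximum precision offered by Decimal128Type is sufficient, and it will result in a more compact and more efficient encoding.

decimal() creates either a Decimal128Type or a Decimal256Type depending on the value of precision. If precision is greater than 38, a Decimal256Type is returned; otherwise a Decimal128Type is returned.

Prefer decimal128() or decimal256(), as these names are more informative than decimal().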
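
The dispatch rule for decimal() can be checked by inspecting the class of the returned object. This is a small sketch; the variable names are illustrative.

```r
library(arrow)

small <- decimal(precision = 38, scale = 2)  # at the Decimal128 limit
large <- decimal(precision = 39, scale = 2)  # one digit past the limit

class(small)[1]
class(large)[1]
```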

See also

dictionary() for creating a dictionary (factor-like) type.

Examples

bool()
#> Boolean
#> bool
struct(a = int32(), b = double())
#> StructType
#> struct<a: int32, b: double>
timestamp("ms", timezone = "CEST")
#> Timestamp
#> timestamp[ms, tz=CEST]
time64("ns")
#> Time64
#> time64[ns]

# Use the cast method to change the type of data contained in Arrow objects.
# Please check the documentation of each data object class for details.
my_scalar <- Scalar$create(0L, type = int64()) # int64
my_scalar$cast(timestamp("ns")) # timestamp[ns]
#> Scalar
#> 1970-01-01 00:00:00.000000000

my_array <- Array$create(0L, type = int64()) # int64
my_array$cast(timestamp("s", timezone = "UTC")) # timestamp[s, tz=UTC]
#> Array
#> <timestamp[s, tz=UTC]>
#> [
#>   1970-01-01 00:00:00Z
#> ]

my_chunked_array <- chunked_array(0L, 1L) # int32
my_chunked_array$cast(date32()) # date32[day]
#> ChunkedArray
#> <date32[day]>
#> [
#>   [
#>     1970-01-01
#>   ],
#>   [
#>     1970-01-02
#>   ]
#> ]

# You can also use `cast()` in an Arrow dplyr query.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr, warn.conflicts = FALSE)
  arrow_table(mtcars) %>%
    transmute(
      col1 = cast(cyl, string()),
      col2 = cast(cyl, int8())
    ) %>%
    compute()
}
#> Table
#> 32 rows x 2 columns
#> $col1 <string>
#> $col2 <int8>
#> 
#> See $metadata for additional Schema metadata