Read a CSV or other delimited file with Arrow — read_delim_arrow • Arrow R Package

These functions use the Arrow C++ CSV reader to read into a tibble. Arrow C++ options have been mapped to argument names that follow those of readr::read_delim(), and col_select was inspired by vroom::vroom().

Usage

read_delim_arrow(
  file,
  delim = ",",
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL,
  decimal_point = "."
)

read_csv_arrow(
  file,
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)

read_csv2_arrow(
  file,
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)

read_tsv_arrow(
  file,
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)

Arguments

file

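A character file name or URI, a connection, literal data (either a single string or a raw vector), an Arrow input stream, or a FileSystem with path (SubTreeFileSystem).

If a file name is given, a memory-mapped Arrow InputStream will be opened and then closed when finished; compression will be detected from the file extension and handled automatically. If an input stream is provided, it will be left open.

To be recognised as literal data, the input must be wrapped with I().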

delim

Single character used to separate fields within a record.

quote

Single character used to quote strings.

escape_double

Does the file escape quotes by doubling them? That is, if this option is TRUE, the value """" represents a single quote, \".
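As a minimal sketch of the default escape_double = TRUE behavior, the snippet below reads a quoted field that contains doubled quotes. The literal CSV text and the variable names are made up for illustration.

```r
library(arrow)

# Hypothetical literal input: one column "x" whose only value is the
# quoted field "say ""hi""" (doubled quotes inside a quoted field).
csv_text <- "x\n\"say \"\"hi\"\"\"\n"

# escape_double = TRUE is the default, so each doubled quote should be
# read as one literal quote character.
df <- read_csv_arrow(I(csv_text))
df$x
```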

escape_backslash

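Does the file use backslashes to escape special characters? This is more general than escape_double, because backslashes can be used to escape the delimiter character, the quote character, or to add special characters such as \\n.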

schema

A Schema that describes the table. If provided, it will be used to satisfy both col_names and col_types.

col_names

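If TRUE, the first row of the input will be used as the column names and will not be included in the data frame. If FALSE, column names will be generated by Arrow, starting with "f0", "f1", ..., "fN". Alternatively, you can specify a character vector of column names.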
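For illustration, a small sketch of col_names = FALSE on headerless input; the literal data is invented for the example.

```r
library(arrow)

# Hypothetical headerless input: two rows, two columns.
df <- read_csv_arrow(I("1,2\n3,4"), col_names = FALSE)

# Arrow generates the column names, so the first row stays in the data.
names(df)
nrow(df)
```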

col_types

A compact string representation of the column types, an Arrow Schema, or NULL (the default) to infer types from the data.

col_select

A character vector of column names to keep, as in the "select" argument to data.table::fread(), or a tidy selection specification of columns, as used in dplyr::select().

na

A character vector of strings to interpret as missing values.
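A short sketch of extending the na vector with a custom missing-value marker. The "-" marker and the sample data are assumptions chosen for the example.

```r
library(arrow)

# Hypothetical input in which "-" marks a missing value in column y.
df <- read_csv_arrow(
  I("x,y\n1,-\n2,5"),
  na = c("", "NA", "-")
)

# The "-" entry should come back as NA rather than as a string.
is.na(df$y)
```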

quoted_na

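Should missing values inside quotes be treated as missing values (the default) or as strings? (Note that this differs from the default of the corresponding Arrow C++ convert option, strings_can_be_null.)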
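The difference can be sketched by reading the same text twice. The sample data and variable names are made up, and the behavior described in the comments is what the argument description above implies rather than a guaranteed result for every arrow version.

```r
library(arrow)

# Hypothetical input: the first value of column x is the quoted string "NA".
csv_text <- "x,y\n\"NA\",1\nb,2"

# Default (quoted_na = TRUE): the quoted "NA" is treated as missing.
as_missing <- read_csv_arrow(I(csv_text))

# quoted_na = FALSE: the same field is kept as the literal string "NA".
as_string <- read_csv_arrow(I(csv_text), quoted_na = FALSE)

as_missing$x
as_string$x
```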

skip_empty_rows

Should blank rows be ignored altogether? If TRUE, blank rows will not be represented at all. If FALSE, they will be filled with missing values.

skip

Number of lines to skip before reading data.

parse_options

See CSV parsing options. If given, this overrides any parsing options provided in other arguments (e.g. delim, quote, etc.).

convert_options

See CSV conversion options.

read_options

See CSV reading options.

as_data_frame

Should the function return a tibble (the default) or an Arrow Table?
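A minimal sketch of as_data_frame = FALSE, using invented sample data: the result stays an Arrow Table and can be converted to a data frame later.

```r
library(arrow)

# Keep the result as an Arrow Table instead of converting it to a tibble.
tab <- read_csv_arrow(I("x,y\n1,2\n3,4"), as_data_frame = FALSE)
class(tab)
tab$num_rows

# Convert to an R data frame only when it is actually needed.
df <- as.data.frame(tab)
```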

timestamp_parsers

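User-defined timestamp parsers. If more than one parser is specified, the CSV conversion logic will try parsing values starting from the beginning of this vector. Possible values are:

NULL: the default, which uses the ISO-8601 parser
a character vector of strptime parse strings
a list of TimestampParser objects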
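As an illustration of the strptime form, the sketch below parses a day/month/year date that the default ISO-8601 parser would not recognise. The column name and the date value are made up for the example.

```r
library(arrow)

# Hypothetical input with a non-ISO date format (day/month/year).
df <- read_csv_arrow(
  I("time\n23/09/2020"),
  # Declare the column as a timestamp so the custom parser is applied to it.
  col_types = schema(time = timestamp()),
  # strptime-style format string tried when converting timestamp columns.
  timestamp_parsers = "%d/%m/%Y"
)
df$time
```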

decimal_point

Character to use for the decimal point in floating point numbers.

Value

A tibble, or a Table if as_data_frame = FALSE.

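Details

read_csv_arrow() and read_tsv_arrow() are wrappers around read_delim_arrow() that specify a delimiter. read_csv2_arrow() uses ; as the delimiter and , as the decimal point.

Note that not all readr options are currently implemented here. If you encounter an option that arrow should support, please file an issue.

If you need to control Arrow-specific reader parameters that have no equivalent in readr::read_csv(), you can either provide them in the parse_options, convert_options, or read_options arguments, or use CsvTableReader directly for lower-level access.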
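A brief sketch of read_csv2_arrow() on semicolon-delimited text with comma decimal points; the sample values are invented for illustration.

```r
library(arrow)

# Hypothetical input: ";" separates fields and "," is the decimal point.
df <- read_csv2_arrow(I("x;y\n1,5;2\n3,25;4"))

# Column x should be read as floating point numbers.
df$x
```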
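As a rough sketch of passing an Arrow-specific setting, the example below builds a read options object with CsvReadOptions$create() and sets block_size, a reader parameter with no readr equivalent. The block size and the sample data are arbitrary choices for illustration, and the available constructor arguments may vary between arrow versions.

```r
library(arrow)

# Arrow-specific read options: block_size is the number of bytes the
# reader processes at a time (1 MiB here, chosen arbitrarily).
opts <- CsvReadOptions$create(block_size = 1048576L)

# The header row is still read from the data because no column names
# are set in the options object.
df <- read_csv_arrow(I("x,y\n1,2\n3,4"), read_options = opts)
names(df)
```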

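Specifying column types and names

By default, the CSV reader will infer the column names and data types from the file, but there are a few ways you can specify them directly.

One way is to provide an Arrow Schema in the schema argument, which is an ordered map of column name to type. When provided, it satisfies both the col_names and col_types arguments. This is good if you know all of this information up front.

You can also pass a Schema to the col_types argument. If you do this, column names will still be inferred from the file unless you also specify col_names. In either case, the column names in the Schema must match the data's column names, whether they are explicitly provided or inferred. That said, this Schema does not have to reference all columns: those omitted will have their types inferred.

Alternatively, you can declare column types by providing the compact string representation that readr uses to the col_types argument. This means you provide a single string, one character per column, where the characters map to Arrow types analogously to the readr type mapping: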

"c": utf8()
"i": int32()
"n": float64()
"d": float64()
"l": bool()
"f": dictionary()
"D": date32()
"T": timestamp(unit = "ns")
"t": time32() (the unit argument is set to the default value "ms")
"_": null()
"-": null()
"?": infer the type from the data

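If you use the compact string representation for col_types, you must also specify col_names.

Regardless of how types are specified, all columns with a null() type will be dropped.

Note that if you are specifying column names, whether by schema or col_names, and the CSV file has a header row that would otherwise be used to identify column names, you will need to add skip = 1 to skip that row.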
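To illustrate dropping a column, the sketch below marks the third column with "_" in the compact string, which maps to null(). The column names and values are hypothetical; skip = 1 is needed because the input has a header row and col_names is supplied.

```r
library(arrow)

# Hypothetical three-column input; the third column is not wanted.
df <- read_csv_arrow(
  I("x,y,z\n1,a,2.5\n2,b,3.5"),
  col_types = "ic_",             # int32, utf8, null() (dropped)
  col_names = c("x", "y", "z"),
  skip = 1                       # skip the header row in the data
)
names(df)
```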

Examples

tf <- tempfile()
on.exit(unlink(tf))
write.csv(mtcars, file = tf)
df <- read_csv_arrow(tf)
dim(df)
#> [1] 32 12
# Can select columns
df <- read_csv_arrow(tf, col_select = starts_with("d"))

# Specifying column types and names
write.csv(data.frame(x = c(1, 3), y = c(2, 4)), file = tf, row.names = FALSE)
read_csv_arrow(tf, schema = schema(x = int32(), y = utf8()), skip = 1)
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 2    
#> 2     3 4    
read_csv_arrow(tf, col_types = schema(y = utf8()))
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 2    
#> 2     3 4    
read_csv_arrow(tf, col_types = "ic", col_names = c("x", "y"), skip = 1)
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 2    
#> 2     3 4    

# Note that if a timestamp column contains time zones,
# the string "T" `col_types` specification won't work.
# To parse timestamps with time zones, provide a [Schema] to `col_types`
# and specify the time zone in the type object:
tf <- tempfile()
write.csv(data.frame(x = "1970-01-01T12:00:00+12:00"), file = tf, row.names = FALSE)
read_csv_arrow(
  tf,
  col_types = schema(x = timestamp(unit = "us", timezone = "UTC"))
)
#> # A tibble: 1 x 1
#>   x                  
#>   <dttm>             
#> 1 1970-01-01 00:00:00

# Read directly from strings with `I()`
read_csv_arrow(I("x,y\n1,2\n3,4"))
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1     2
#> 2     3     4
read_delim_arrow(I(c("x y", "1 2", "3 4")), delim = " ")
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1     2
#> 2     3     4