Integrating PyArrow with R

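Arrow supports exchanging data within the same process through the Arrow C Data Interface.

This can be used to exchange data between Python and R functions and methods, so that the two languages can interact without any cost of marshaling and unmarshaling data.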

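Note

This article assumes that you have a Python environment with pyarrow correctly installed and an R environment with the arrow library correctly installed. See the Python installation instructions and the R installation instructions for further details.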

Calling R functions from Python

Suppose we have a simple R function that receives an Arrow Array and adds 3 to all its elements:

library(arrow)

addthree <- function(arr) {
    return(arr + 3L)
}

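We can save this function in an addthree.R file so that it is reusable.

Once the addthree.R file is created, we can call any of its functions from Python using the rpy2 library, which enables an R runtime within the Python interpreter.

rpy2 can be installed with pip like most Python libraries: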

$ pip install rpy2

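The most basic thing we can do with the addthree function is to call it from Python with a number and see how it returns the result.

To do so, we can create an addthree.py file that uses rpy2 to import the addthree function from the addthree.R file and invoke it: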

import rpy2.robjects as robjects

# Load the addthree.R file
r_source = robjects.r["source"]
r_source("addthree.R")

# Get a reference to the addthree function
addthree = robjects.r["addthree"]

# Invoke the function
r = addthree(3)

# Access the returned value
value = r[0]
print(value)

Running the addthree.py file shows that our Python code is able to access the R function and print the expected result:

$ python addthree.py
6

If we want to pass an Arrow Array instead of a basic data type, we can rely on the rpy2-arrow module, which implements rpy2 support for Arrow types.

rpy2-arrow can be installed through pip:

$ pip install rpy2-arrow

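rpy2-arrow implements converters from PyArrow objects to R Arrow objects. The conversion does not incur any data copy cost because it relies on the C Data interface.

To pass a PyArrow array to the addthree function, we need to modify our addthree.py file to enable the rpy2-arrow converters and then pass the PyArrow array: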

import rpy2.robjects as robjects
from rpy2_arrow.pyarrow_rarrow import (rarrow_to_py_array,
                                       converter as arrowconverter)
from rpy2.robjects.conversion import localconverter

r_source = robjects.r["source"]
r_source("addthree.R")

addthree = robjects.r["addthree"]

import pyarrow

array = pyarrow.array((1, 2, 3))

# Enable rpy2-arrow converter so that R can receive the array.
with localconverter(arrowconverter):
    r_result = addthree(array)

# The result of the R function will be an R Environment
# we can convert the Environment back to a pyarrow Array
# using the rarrow_to_py_array function
py_result = rarrow_to_py_array(r_result)
print("RESULT", type(py_result), py_result)

Running the modified addthree.py should now execute the R function correctly and print the resulting PyArrow Array:

$ python addthree.py
RESULT <class 'pyarrow.lib.Int64Array'> [
  4,
  5,
  6
]

For more information, you can refer to the rpy2 documentation and the rpy2-arrow documentation.

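Calling Python functions from R

Python functions can be exposed to R through the reticulate library. For example, if we want to call pyarrow.compute.add() from R on an Array created in R, we can import pyarrow in R through reticulate.

A basic addthree.R script that calls add to add 3 to an R array looks like this: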

# Load arrow and reticulate libraries
library(arrow)
library(reticulate)

# Create a new array in R
a <- Array$create(c(1, 2, 3))

# Make pyarrow.compute available to R
pc <- import("pyarrow.compute")

# Invoke pyarrow.compute.add with the array and 3
# This will add 3 to all elements of the array and return a new Array
result <- pc$add(a, 3)

# Print the result to confirm it's what we expect
print(result)

Running the addthree.R script prints the result of adding 3 to all elements of the original Array$create(c(1, 2, 3)) array:

$ R --silent -f addthree.R
Array
<double>
[
  4,
  5,
  6
]

For more information, you can refer to the Reticulate documentation and the R Arrow documentation.

R to Python communication using the C Data Interface

Both of the solutions described above use the Arrow C Data interface under the hood.

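If we want to extend the previous addthree example to switch from rpy2-arrow to the plain C Data interface, we can do so with a few modifications to our codebase.

To enable importing the Arrow Array from the C Data interface, we have to wrap our addthree function in a function that does the extra work required to import an Arrow Array in R from the C Data interface.

That work is done by the addthree_cdata function, which invokes addthree once the Array has been imported.

Our addthree.R will thus contain both the addthree_cdata and the addthree functions: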

library(arrow)

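addthree_cdata <- function(array_ptr_s, schema_ptr_s) {
    # The arguments are strings holding the addresses of the
    # ArrowArray and ArrowSchema C structures exported by the caller.
    a <- Array$import_from_c(array_ptr_s, schema_ptr_s)

    return(addthree(a))
}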

addthree <- function(arr) {
    return(arr + 3L)
}

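We can now provide the array and its schema from Python to R through the array_ptr_s and schema_ptr_s arguments, so that R can rebuild an Array from them and then call addthree with that array.

Invoking addthree_cdata from Python involves building the Array we want to pass to R, exporting it to the C Data interface, and then passing the exported references to the R function.

Our addthree.py will thus become: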

# Get a reference to the addthree_cdata R function
import rpy2.robjects as robjects
r_source = robjects.r["source"]
r_source("addthree.R")
addthree_cdata = robjects.r["addthree_cdata"]

# Create the pyarrow array we want to pass to R
import pyarrow
array = pyarrow.array((1, 2, 3))

# Import the pyarrow module that provides access to the C Data interface
from pyarrow.cffi import ffi as arrow_c

# Allocate structures where we will export the Array data
# and the Array schema. They will be released when we exit the with block.
with arrow_c.new("struct ArrowArray*") as c_array, \
     arrow_c.new("struct ArrowSchema*") as c_schema:
    # Get the references to the C Data structures.
    c_array_ptr = int(arrow_c.cast("uintptr_t", c_array))
    c_schema_ptr = int(arrow_c.cast("uintptr_t", c_schema))

    # Export the Array and its schema to the C Data structures.
    array._export_to_c(c_array_ptr)
    array.type._export_to_c(c_schema_ptr)

    # Invoke the R addthree_cdata function passing the references
    # to the array and schema C Data structures.
    # Those references are passed as strings as R doesn't have
    # native support for 64bit integers, so the integers are
    # converted to their string representation for R to convert it back.
    r_result_array = addthree_cdata(str(c_array_ptr), str(c_schema_ptr))

    # r_result_array will be an Environment that contains the
    # arrow Array built in R as the return value of addthree.
    # To make it available as a Python pyarrow array we need to export
    # it as a C Data structure by invoking the Array$export_to_c R method.
    r_result_array["export_to_c"](str(c_array_ptr), str(c_schema_ptr))

    # Once the returned array is exported to the C Data structures
    # we can import it back into pyarrow using Array._import_from_c
    py_array = pyarrow.Array._import_from_c(c_array_ptr, c_schema_ptr)

print("RESULT", py_array)

Running the newly changed addthree.py will now print the Array that results from adding 3 to all elements of the original pyarrow.array((1, 2, 3)) array:

$ python addthree.py
R[write to console]: Attaching package: ‘arrow’
RESULT [
  4,
  5,
  6
]
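The comments in addthree.py note that the addresses are passed to R as strings because R has no native 64-bit integer type. R integers are 32 bits wide and R numerics are IEEE 754 doubles, which represent every integer exactly only up to 2**53. The sketch below simulates both routes in plain Python. The function names and the sample address are hypothetical, chosen only to show where a double would start to lose information.

```python
def via_double(address: int) -> int:
    """Simulate passing an address through an R numeric (IEEE 754 double)."""
    return int(float(address))


def via_string(address: int) -> int:
    """Simulate passing an address as its decimal string form."""
    return int(str(address))


# Hypothetical address just above 2**53, the last point where doubles
# can still represent every integer exactly.
address = 2**53 + 1

# Report whether each route returns the address unchanged.
print("double round trip exact:", via_double(address) == address)
print("string round trip exact:", via_string(address) == address)
```

Addresses on common platforms may well stay below that limit in practice, but the string form avoids depending on it.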