Pandas 集成#

为了与 pandas 交互，PyArrow 提供了各种转换例程来使用 pandas 结构并将其转换回来。

注意

虽然 pandas 使用 NumPy 作为后端，但它有足够的特性（例如不同的类型系统和对空值的支持），因此它与 NumPy 集成是一个单独的主题。

要遵循本文档中的示例，请确保运行

In [1]: import pandas as pd

In [2]: import pyarrow as pa

DataFrames#

Arrow 中 pandas DataFrame 的等价物是 Table。两者都由一组等长的命名列组成。虽然 pandas 仅支持平面列，但 Table 还提供嵌套列，因此它可以表示比 DataFrame 更多的数据，因此并非总是可以进行完全转换。

从 Table 转换为 DataFrame 通过调用 pyarrow.Table.to_pandas() 完成。反之，则通过使用 pyarrow.Table.from_pandas() 实现。

import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()

# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)

默认情况下，pyarrow 尝试尽可能准确地保留和恢复 .index 数据。请参阅下面的部分，了解更多相关信息以及如何禁用此逻辑。

Series#

在 Arrow 中，与 pandas Series 最相似的结构是 Array。它是一个向量，包含与线性内存相同类型的数据。您可以使用 pyarrow.Array.from_pandas() 将 pandas Series 转换为 Arrow Array。由于 Arrow Arrays 始终可以为空，因此您可以使用 mask 参数提供可选的掩码来标记所有空条目。

处理 pandas Indexes#

诸如 pyarrow.Table.from_pandas() 之类的方法具有 preserve_index 选项，该选项定义如何保留（存储）或不保留（不存储）相应 pandas 对象的 index 成员中的数据。此数据使用内部 arrow::Schema 对象中的模式级元数据进行跟踪。

preserve_index 的默认值为 None，其行为如下

RangeIndex 仅作为元数据存储，不需要任何额外的存储。
其他索引类型作为结果 Table 中的一个或多个物理数据列存储

要完全不存储索引，请传递 preserve_index=False。由于存储 RangeIndex 可能会在某些有限的情况下（例如将多个 DataFrame 对象存储在 Parquet 文件中）导致问题，为了强制将所有索引数据序列化到结果表中，请传递 preserve_index=True。

类型差异#

根据 pandas 和 Arrow 的当前设计，无法未经修改地转换所有列类型。这里的主要问题之一是 pandas 不支持任意类型的可为空列。此外，datetime64 当前固定为纳秒分辨率。另一方面，Arrow 可能仍然缺少对某些类型的支持。

pandas -> Arrow 转换#

源类型 (pandas)	目标类型 (Arrow)
`bool`	`BOOL`
`(u)int{8,16,32,64}`	`(U)INT{8,16,32,64}`
`float32`	`FLOAT`
`float64`	`DOUBLE`
`str` / `unicode`	`STRING`
`pd.Categorical`	`DICTIONARY`
`pd.Timestamp`	`TIMESTAMP(unit=ns)`
`datetime.date`	`DATE`
`datetime.time`	`TIME64`

Arrow -> pandas 转换#

源类型 (Arrow)	目标类型 (pandas)
`BOOL`	`bool`
`BOOL` 包含空值	`object`（包含值 `True`、`False`、`None`）
`(U)INT{8,16,32,64}`	`(u)int{8,16,32,64}`
`(U)INT{8,16,32,64}` 包含空值	`float64`
`FLOAT`	`float32`
`DOUBLE`	`float64`
`STRING`	`str`
`DICTIONARY`	`pd.Categorical`
`TIMESTAMP(unit=*)`	`pd.Timestamp` (`np.datetime64[ns]`)
`DATE`	`object`（包含 `datetime.date` 对象）
`TIME64`	`object`（包含 `datetime.time` 对象）

分类类型#

Pandas 分类列转换为 Arrow 字典数组，这是一种特殊的数组类型，经过优化，可以处理重复和有限数量的可能值。

In [3]: df = pd.DataFrame({"cat": pd.Categorical(["a", "b", "c", "a", "b", "c"])})

In [4]: df.cat.dtype.categories
Out[4]: Index(['a', 'b', 'c'], dtype='object')

In [5]: df
Out[5]: 
  cat
0   a
1   b
2   c
3   a
4   b
5   c

In [6]: table = pa.Table.from_pandas(df)

In [7]: table
Out[7]: 
pyarrow.Table
cat: dictionary<values=string, indices=int8, ordered=0>
----
cat: [  -- dictionary:
["a","b","c"]  -- indices:
[0,1,2,0,1,2]]

我们可以检查创建的表的 ChunkedArray，并看到与 Pandas DataFrame 相同的类别。

In [8]: column = table[0]

In [9]: chunk = column.chunk(0)

In [10]: chunk.dictionary
Out[10]: 
<pyarrow.lib.StringArray object at 0x7f9d813466e0>
[
  "a",
  "b",
  "c"
]

In [11]: chunk.indices
Out[11]: 
<pyarrow.lib.Int8Array object at 0x7f9d81346560>
[
  0,
  1,
  2,
  0,
  1,
  2
]

日期时间（时间戳）类型#

Pandas 时间戳在 Pandas 中使用 datetime64[ns] 类型，并转换为 Arrow TimestampArray。

In [12]: df = pd.DataFrame({"datetime": pd.date_range("2020-01-01T00:00:00Z", freq="h", periods=3)})

In [13]: df.dtypes
Out[13]: 
datetime    datetime64[ns, UTC]
dtype: object

In [14]: df
Out[14]: 
                   datetime
0 2020-01-01 00:00:00+00:00
1 2020-01-01 01:00:00+00:00
2 2020-01-01 02:00:00+00:00

In [15]: table = pa.Table.from_pandas(df)

In [16]: table
Out[16]: 
pyarrow.Table
datetime: timestamp[ns, tz=UTC]
----
datetime: [[2020-01-01 00:00:00.000000000Z,2020-01-01 01:00:00.000000000Z,2020-01-01 02:00:00.000000000Z]]

在本示例中，Pandas 时间戳具有时区感知（在本例中为 UTC），此信息用于创建 Arrow TimestampArray。

日期类型#

虽然可以使用 pandas 中的 datetime64[ns] 类型处理日期，但某些系统使用 Python 内置 datetime.date 对象的对象数组

In [17]: from datetime import date

In [18]: s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)])

In [19]: s
Out[19]: 
0    2018-12-31
1          None
2    2000-01-01
dtype: object

转换为 Arrow 数组时，默认情况下将使用 date32 类型

In [20]: arr = pa.array(s)

In [21]: arr.type
Out[21]: DataType(date32[day])

In [22]: arr[0]
Out[22]: <pyarrow.Date32Scalar: datetime.date(2018, 12, 31)>

要使用 64 位 date64，请显式指定

In [23]: arr = pa.array(s, type='date64')

In [24]: arr.type
Out[24]: DataType(date64[ms])

使用 to_pandas 转换回来时，将返回 datetime.date 对象的对象数组

In [25]: arr.to_pandas()
Out[25]: 
0    2018-12-31
1          None
2    2000-01-01
dtype: object

如果您想改用 NumPy 的 datetime64 dtype，请传递 date_as_object=False

In [26]: s2 = pd.Series(arr.to_pandas(date_as_object=False))

In [27]: s2.dtype
Out[27]: dtype('<M8[ms]')

警告

从 Arrow 0.13 开始，默认情况下参数 date_as_object 为 True。较旧的版本必须传递 date_as_object=True 才能获得此行为

时间类型#

Pandas 数据结构中的内置 datetime.time 对象将分别转换为 Arrow time64 和 Time64Array。

In [28]: from datetime import time

In [29]: s = pd.Series([time(1, 1, 1), time(2, 2, 2)])

In [30]: s
Out[30]: 
0    01:01:01
1    02:02:02
dtype: object

In [31]: arr = pa.array(s)

In [32]: arr.type
Out[32]: Time64Type(time64[us])

In [33]: arr
Out[33]: 
<pyarrow.lib.Time64Array object at 0x7f9d83bae980>
[
  01:01:01.000000,
  02:02:02.000000
]

转换为 pandas 时，将返回 datetime.time 对象的数组

In [34]: arr.to_pandas()
Out[34]: 
0    01:01:01
1    02:02:02
dtype: object

可空类型#

在 Arrow 中，所有数据类型都是可空的，这意味着它们支持存储缺失值。然而，在 pandas 中，并非所有数据类型都支持缺失数据。最值得注意的是，默认的整数数据类型不支持，并且当引入缺失值时，会被强制转换为浮点数。因此，当 Arrow 数组或表被转换为 pandas 时，如果存在缺失值，整数列将变为浮点数。

>>> arr = pa.array([1, 2, None])
>>> arr
<pyarrow.lib.Int64Array object at 0x7f07d467c640>
[
  1,
  2,
  null
]
>>> arr.to_pandas()
0    1.0
1    2.0
2    NaN
dtype: float64

Pandas 具有实验性的可空数据类型 (https://pandas.ac.cn/docs/user_guide/integer_na.html)。Arrow 支持这些类型的往返转换。

>>> df = pd.DataFrame({'a': pd.Series([1, 2, None], dtype="Int64")})
>>> df
      a
0     1
1     2
2  <NA>

>>> table = pa.table(df)
>>> table
Out[32]:
pyarrow.Table
a: int64
----
a: [[1,2,null]]

>>> table.to_pandas()
      a
0     1
1     2
2  <NA>

>>> table.to_pandas().dtypes
a    Int64
dtype: object

这种往返转换之所以可行，是因为有关原始 pandas DataFrame 的元数据存储在 Arrow 表中。但是，如果您有并非来自具有可空数据类型的 pandas DataFrame 的 Arrow 数据（或者例如 Parquet 文件），则默认转换为 pandas 将不会使用这些可空数据类型。

pyarrow.Table.to_pandas() 方法有一个 types_mapper 关键字，可用于覆盖生成的 pandas DataFrame 使用的默认数据类型。通过这种方式，您可以指示 Arrow 创建使用可空数据类型的 pandas DataFrame。

>>> table = pa.table({"a": [1, 2, None]})
>>> table.to_pandas()
     a
0  1.0
1  2.0
2  NaN
>>> table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
      a
0     1
1     2
2  <NA>

types_mapper 关键字需要一个函数，该函数将返回给定 pyarrow 数据类型要使用的 pandas 数据类型。通过使用 dict.get 方法，我们可以使用字典创建这样的函数。

如果您想使用 pandas 当前支持的所有可空数据类型，则该字典变为

dtype_mapping = {
    pa.int8(): pd.Int8Dtype(),
    pa.int16(): pd.Int16Dtype(),
    pa.int32(): pd.Int32Dtype(),
    pa.int64(): pd.Int64Dtype(),
    pa.uint8(): pd.UInt8Dtype(),
    pa.uint16(): pd.UInt16Dtype(),
    pa.uint32(): pd.UInt32Dtype(),
    pa.uint64(): pd.UInt64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.float32(): pd.Float32Dtype(),
    pa.float64(): pd.Float64Dtype(),
    pa.string(): pd.StringDtype(),
}

df = table.to_pandas(types_mapper=dtype_mapping.get)

当使用 pandas API 读取 Parquet 文件 (pd.read_parquet(..)) 时，也可以通过传递 use_nullable_dtypes 来实现。

df = pd.read_parquet(path, use_nullable_dtypes=True)

内存使用和零拷贝#

当使用各种 to_pandas 方法将 Arrow 数据结构转换为 pandas 对象时，必须偶尔注意与性能和内存使用相关的问题。

由于 pandas 的内部数据表示通常与 Arrow 列式格式不同，因此只有在某些有限的情况下才有可能进行零拷贝转换（无需内存分配或计算）。

在最坏的情况下，调用 to_pandas 会导致内存中存在两个版本的数据，一个用于 Arrow，一个用于 pandas，从而使内存占用量大约增加一倍。我们已经针对这种情况实施了一些缓解措施，尤其是在创建大型 DataFrame 对象时，我们将在下面进行描述。

零拷贝 Series 转换#

在某些特定情况下，可以从 Array 或 ChunkedArray 零拷贝转换为 NumPy 数组或 pandas Series

Arrow 数据存储为整数（有符号或无符号 int8 到 int64）或浮点类型（float16 到 float64）。这包括许多数字类型以及时间戳。
Arrow 数据没有空值（因为这些值使用 pandas 不支持的位图表示）。
对于 ChunkedArray，数据由单个块组成，即 arr.num_chunks == 1。由于 pandas 的连续性要求，多个块始终需要复制。

在这些情况下，to_pandas 或 to_numpy 将是零拷贝。在所有其他情况下，都需要复制。

减少 `Table.to_pandas` 中的内存使用#

在撰写本文时，pandas 应用了一种名为“合并”的数据管理策略，以将同类型的 DataFrame 列收集到二维 NumPy 数组中，在内部称为“块”。我们已经尽了最大的努力来构建精确的“合并”块，以便 pandas 在我们将数据传递给 pandas.DataFrame 后，不会执行任何进一步的分配或复制。这种合并策略的明显缺点是，它强制执行“内存加倍”。

为了尽量减少 Table.to_pandas 期间“内存加倍”的潜在影响，我们提供了几个选项

split_blocks=True，启用后，Table.to_pandas 会为每一列生成一个内部 DataFrame“块”，从而跳过“合并”步骤。请注意，许多 pandas 操作仍然会触发合并，但是峰值内存使用可能小于完全内存加倍的最坏情况。由于此选项，我们能够在与 Array 和 ChunkedArray 进行零拷贝的相同情况下，对列进行零拷贝转换。
self_destruct=True，这会在每个列 Table 对象中的内部 Arrow 内存缓冲区转换为 pandas 兼容的表示形式时销毁它们，从而可能会尽快将内存释放到操作系统。请注意，这会使调用的 Table 对象无法安全地进一步使用，并且调用的任何进一步方法都将导致您的 Python 进程崩溃。

一起使用，调用

df = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # not necessary, but a good practice

将在某些情况下产生明显较低的内存使用量。如果没有这些选项，to_pandas 将始终使内存加倍。

请注意，不能保证 self_destruct=True 会节省内存。由于转换是逐列进行的，因此内存也是逐列释放的。但是，如果多个列共享一个基础缓冲区，那么在所有这些列都转换之前，将不会释放任何内存。特别是，由于实现细节，来自 IPC 或 Flight 的数据容易出现这种情况，因为内存将按以下方式布局

Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
...

在这种情况下，即使使用 self_destruct=True，也无法在整个表转换之前释放任何内存。