Canonical Extension Examples
Parquet Variant Extension
Unshredded
In the simplest case, an unshredded variant always consists of exactly two fields: metadata and value. Any of the following storage types are valid (not an exhaustive list):
struct<metadata: binary non-nullable, value: binary nullable>
struct<value: binary nullable, metadata: binary non-nullable>
struct<metadata: dictionary<int8, binary> non-nullable, value: binary_view nullable>
Simple Shredding
Suppose we have a Variant field named measurement and, for efficiency, we want to shred int64 values into a separate column. In Parquet, this could be represented as:
required group measurement (VARIANT) {
required binary metadata;
optional binary value;
optional int64 typed_value;
}
The storage type of the corresponding arrow.parquet.variant Arrow extension type would therefore be:
struct<
metadata: binary non-nullable,
value: binary nullable,
typed_value: int64 nullable
>
If we assume a series of measurements consisting of:
34, null, "n/a", 100
The data should be stored/represented in Arrow as:
* Length: 4, Null count: 1
* Validity Bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|---------------|
| 00001101                 | 0 (padding)   |
* Children arrays:
* field-0 array (`VarBinary`)
* Length: 4, Null count: 0
* Offsets buffer:
| Bytes 0-19 | Bytes 20-63 |
|------------------|--------------------------|
| 0, 2, 4, 6, 8 | unspecified (padding) |
* Value buffer: (01 00 -> indicates version 1 empty metadata)
| Bytes 0-7 | Bytes 8-63 |
|-------------------------|--------------------------|
| 01 00 01 00 01 00 01 00 | unspecified (padding) |
* field-1 array (`VarBinary`)
* Length: 4, Null count: 2
* Validity Bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|---------------|
| 00000110 | 0 (padding) |
* Offsets buffer:
| Bytes 0-19 | Bytes 20-63 |
|------------------|--------------------------|
| 0, 0, 1, 5, 5 | unspecified (padding) |
* Value buffer: (`00` -> null,
`0x0D 0x6E 0x2F 0x61` -> variant encoding literal string "n/a")
| Bytes 0-4              | Bytes 5-63               |
|------------------------|--------------------------|
| 00 0x0D 0x6E 0x2F 0x61 | unspecified (padding)    |
* field-2 array (int64 array)
* Length: 4, Null count: 2
* Validity Bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|---------------|
| 00001001 | 0 (padding) |
* Value buffer:
| Bytes 0-31 | Bytes 32-63 |
|---------------------|--------------------------|
| 34, 00, 00, 100 | unspecified (padding) |
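Note
Note that the value array contains a variant literal null. The shredding specification requires this so that consumers can tell a missing field apart from a field whose value is null. Null elements must be encoded as a variant null: basic type 0 (primitive) and physical type 0 (null).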
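As a rough illustration of how a consumer might read the two child arrays shown above, the following Python sketch decodes them by hand using only the standard library. The helper names (`is_valid`, `encode_short_string`, `read_measurements`) are hypothetical and not part of any Arrow API. The short-string header is computed as `(length << 2) | 1`, which assumes the Variant encoding's short-string layout (basic type 1 in the low two bits); treat that as an assumption rather than something this page defines.

```python
import struct

VARIANT_NULL = b"\x00"  # basic type 0 (primitive), physical type 0 (null)


def is_valid(bitmap: bytes, i: int) -> bool:
    """Arrow validity bitmaps are LSB-ordered: bit i of the buffer covers slot i."""
    return bool(bitmap[i // 8] >> (i % 8) & 1)


def encode_short_string(text: str) -> bytes:
    """Assumed Variant short-string layout: header = (length << 2) | 1, then UTF-8."""
    data = text.encode("utf-8")
    assert len(data) < 64, "short strings carry their length in 6 bits"
    return bytes([(len(data) << 2) | 1]) + data


# value child (VarBinary): slots 1 and 2 are valid.
value_validity = bytes([0b00000110])
value_offsets = [0, 0, 1, 5, 5]
value_data = VARIANT_NULL + encode_short_string("n/a")

# typed_value child (int64): slots 0 and 3 are valid.
typed_validity = bytes([0b00001001])
typed_data = struct.pack("<4q", 34, 0, 0, 100)


def read_measurements(length: int = 4) -> list:
    """Return one tagged tuple per slot, preferring typed_value when it is set."""
    out = []
    for i in range(length):
        if is_valid(typed_validity, i):
            (number,) = struct.unpack_from("<q", typed_data, i * 8)
            out.append(("typed", number))
        elif is_valid(value_validity, i):
            raw = value_data[value_offsets[i]:value_offsets[i + 1]]
            out.append(("variant-null",) if raw == VARIANT_NULL else ("variant", raw))
        else:
            out.append(("missing",))
    return out


print(read_measurements())
```

Slot 1 comes back as a variant null rather than as missing, which is the distinction the note above describes.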
Shredding an Array
In the next example, we represent a shredded array of strings. Consider a column that looks like:
["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null
Representing this shredded variant in Parquet could look like:
optional group tags (VARIANT) {
required binary metadata;
optional binary value;
optional group typed_value (LIST) { # optional to allow null lists
repeated group list {
required group element { # shredded element
optional binary value;
optional binary typed_value (STRING);
}
}
}
}
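The array structure of the variant encoding does not allow missing elements, so every element of the array must be non-nullable. As a result, either typed_value or value (but not both!) must be non-null for each element.
The storage type used to represent this in Arrow as a Variant extension type would be: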
struct<
metadata: binary non-nullable,
value: binary nullable,
typed_value: list<element: struct<
value: binary nullable,
typed_value: string nullable
> non-nullable> nullable
>
Note
As usual, Binary could also be LargeBinary or BinaryView, String could also be LargeString or StringView, and List could also be LargeList or ListView.
The data would then be stored in Arrow as follows:
* Length: 4, Null count: 1
* Validity Bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|---------------|
| 00000111 | 0 (padding) |
* Children arrays:
* field-0 array (`VarBinary` metadata)
* Length: 4, Null count: 0
* Offsets buffer:
| Bytes 0-19 | Bytes 20-63 |
|------------------|--------------------------|
| 0, 2, 4, 6, 8 | unspecified (padding) |
* Value buffer: (01 00 -> indicates version 1 empty metadata)
| Bytes 0-7 | Bytes 8-63 |
|-------------------------|--------------------------|
| 01 00 01 00 01 00 01 00 | unspecified (padding) |
* field-1 array (`VarBinary` value)
* Length: 4, Null count: 3
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|---------------|
| 00001000 | 0 (padding) |
* Offsets buffer:
| Bytes 0-19 | Bytes 20-63 |
|------------------|--------------------------|
| 0, 0, 0, 0, 1 | unspecified (padding) |
* Value buffer: (00 -> variant null)
| Bytes 0 | Bytes 1-63 |
|--------------------|--------------------------|
| 00 | unspecified (padding) |
* field-2 array (`List<Struct<VarBinary, String>>` typed_value)
* Length: 4, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-------------|
| 00000111 | 0 (padding) |
* Offsets buffer (int32)
| Bytes 0-19 | Bytes 20-63 |
|-------------------|-----------------------|
| 0, 2, 4, 7, 7 | unspecified (padding) |
* Values array (`Struct<VarBinary, String>` element):
* Length: 7, Null count: 0
* Validity bitmap buffer: Not required
* Children arrays:
* field-0 array (`VarBinary` value)
* Length: 7, Null count: 6
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-------------|
| 00001000 | 0 (padding) |
* Offsets buffer (int32):
| Bytes 0-31 | Bytes 32-63 |
|---------------------------|--------------------------|
| 0, 0, 0, 0, 1, 1, 1, 1 | unspecified (padding) |
* Values buffer (`00` -> variant null):
| Bytes 0 | Bytes 1-63 |
|--------------------|--------------------------|
| 00 | unspecified (padding) |
* field-1 array (`String` typed_value)
* Length: 7, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-------------|
| 01110111 | 0 (padding) |
* Offsets buffer (int32):
| Bytes 0-31 | Bytes 32-63 |
|---------------------------------|--------------------------|
| 0, 6, 11, 17, 17, 23, 28, 35 | unspecified (padding) |
* Values buffer:
| Bytes 0-34                           | Bytes 35-63              |
|--------------------------------------|--------------------------|
| comedydramahorrorcomedydramaromance  | unspecified (padding)    |
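To make the nesting above easier to follow, here is a minimal Python sketch that walks the typed_value list array and rebuilds the logical rows. It uses only the standard library, copies the bitmaps, offsets, and string bytes from the layout above, and uses hypothetical helper names (`read_element`, `read_tags`). The only variant payload it understands is the single-byte variant null; anything else is deliberately left unimplemented.

```python
VARIANT_NULL = b"\x00"


def is_valid(bitmap: bytes, i: int) -> bool:
    """Arrow validity bitmaps are LSB-ordered: bit i of the buffer covers slot i."""
    return bool(bitmap[i // 8] >> (i % 8) & 1)


# typed_value list array (4 rows, the last one null).
list_validity = bytes([0b00000111])
list_offsets = [0, 2, 4, 7, 7]

# element.value child (7 slots, only slot 3 is set, to a variant null).
elem_value_validity = bytes([0b00001000])
elem_value_offsets = [0, 0, 0, 0, 1, 1, 1, 1]
elem_value_data = VARIANT_NULL

# element.typed_value child (7 slots, slot 3 is null).
elem_typed_validity = bytes([0b01110111])
elem_typed_offsets = [0, 6, 11, 17, 17, 23, 28, 35]
elem_typed_data = b"comedydramahorrorcomedydramaromance"


def read_element(j: int):
    typed_ok = is_valid(elem_typed_validity, j)
    value_ok = is_valid(elem_value_validity, j)
    if typed_ok == value_ok:
        # Array elements cannot be missing, and both columns must not be set.
        raise ValueError(f"element {j}: exactly one of value/typed_value must be set")
    if typed_ok:
        start, end = elem_typed_offsets[j], elem_typed_offsets[j + 1]
        return elem_typed_data[start:end].decode("utf-8")
    raw = elem_value_data[elem_value_offsets[j]:elem_value_offsets[j + 1]]
    if raw == VARIANT_NULL:
        return None
    raise NotImplementedError("general variant decoding is out of scope for this sketch")


def read_tags() -> list:
    rows = []
    for i in range(len(list_offsets) - 1):
        if not is_valid(list_validity, i):
            rows.append(None)
            continue
        rows.append([read_element(j) for j in range(list_offsets[i], list_offsets[i + 1])])
    return rows


print(read_tags())
```

The sketch rejects elements where both or neither column is set, reflecting the rule that array elements cannot be missing.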
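Shredding an Object
Consider a JSON column of "events" that contains a field named event_type (a string) and a field named event_ts (a timestamp), both of which we want to shred into separate columns. In Parquet, it could look something like this: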
optional group event (VARIANT) {
required binary metadata;
optional binary value; # variant, remaining fields/values
optional group typed_value { # shredded fields for variant object
required group event_type { # event_type shredded field
optional binary value;
optional binary typed_value (STRING);
}
required group event_ts { # event_ts shredded field
optional binary value;
optional int64 typed_value (TIMESTAMP(true, MICROS));
}
}
}
We can then translate this into the expected extension storage type:
struct<
metadata: binary non-nullable,
value: binary nullable,
typed_value: struct<
event_type: struct<
value: binary nullable,
typed_value: string nullable
> non-nullable,
event_ts: struct<
value: binary nullable,
typed_value: timestamp(us, UTC) nullable
> non-nullable
> nullable
>
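If a field does not exist in the variant object value, then both the value and typed_value columns for that row will be null. If a field is present but its value is null, then value must contain a variant null.
It is invalid for both value and typed_value to be non-null at a given index. A reader may choose not to error in this scenario, but if it does not error, it must use the value in the typed_value column for that index.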
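The rules above amount to a small decision table over the (value, typed_value) pair of one shredded field. The following Python sketch spells it out. The function name `resolve_field` and its `strict` flag are hypothetical and chosen only for illustration, and `None` stands in for an Arrow null slot.

```python
from typing import Any, Optional, Tuple

VARIANT_NULL = b"\x00"  # basic type 0 (primitive), physical type 0 (null)


def resolve_field(
    value: Optional[bytes], typed_value: Any, strict: bool = True
) -> Tuple[str, Any]:
    """Classify one shredded field slot.

    Returns a (state, payload) pair where state is one of
    "missing", "null", "typed", or "variant".
    """
    if value is not None and typed_value is not None:
        # Invalid combination. A lenient reader must fall back to typed_value.
        if strict:
            raise ValueError("value and typed_value are both non-null")
        return ("typed", typed_value)
    if typed_value is not None:
        return ("typed", typed_value)
    if value is None:
        return ("missing", None)  # the field is absent from the object
    if value == VARIANT_NULL:
        return ("null", None)  # the field is present and its value is null
    return ("variant", value)  # still variant-encoded, decode separately


print(resolve_field(None, None))
print(resolve_field(VARIANT_NULL, None))
print(resolve_field(None, "login"))
print(resolve_field(b"\x00", "login", strict=False))
```

Whether to raise or to fall back to typed_value is the reader's choice; the `strict` flag here only makes that choice explicit.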
Consider the following series of objects:
{"event_type": "noop", "event_ts": 1729794114937}
{"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"}
{"error_msg": "malformed..."}
"malformed: not an object"
{"event_ts": 1729794240241, "click": "_button"}
{"event_type": null, "event_ts": 1729794954163}
{"event_type": "noop", "event_ts": "2024-10-24"}
{}
null
*Entirely missing*
To represent these values as a column of Variant values using the Variant extension type, we get the following:
* Length: 10, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
|--------------------------|-----------|-----------------------|
| 11111111 | 00000001 | 0 (padding) |
* Children arrays
* field-0 array (`VarBinary` Metadata)
* Length: 10, Null count: 0
* Offsets buffer:
| Bytes 0-43 (int32) | Bytes 44-63 |
|------------------------------------------|-------------------------|
| 0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45 | unspecified (padding) |
* Value buffer: (01 00 -> version 1 empty metadata,
01 01 00 XX ... -> Version 1, metadata with 1 elem, offset 0, offset XX == len(string), ... is dict string bytes)
| Bytes 0-1 | Bytes 2-10        | Bytes 11-23           | Bytes 24-25 | Bytes 26-34       |
|-----------|-------------------|-----------------------|-------------|-------------------|
| 01 00     | 01 01 00 05 email | 01 01 00 09 error_msg | 01 00       | 01 01 00 05 click |
| Bytes 35-36 | Bytes 37-38 | Bytes 39-40 | Bytes 41-42 | Bytes 43-44 | Bytes 45-63 |
|-------------|-------------|-------------|-------------|-------------|-----------------------|
| 01 00 | 01 00 | 01 00 | 01 00 | 01 00 | unspecified (padding) |
* field-1 array (`VarBinary` Value)
* Length: 10, Null count: 5
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
|---------------------------|-----------|-----------------------|
| 00011110 | 00000001 | 0 (padding) |
* Offsets buffer (filled in based on lengths of encoded variants):
| ... |
* Value buffer:
| VariantEncode({"email": "user@example.com"}) | VariantEncode({"error_msg": "malformed..."}) |
| VariantEncode("malformed: not an object") | VariantEncode({"click": "_button"}) | 00 (null) |
* field-2 array (`Struct<...>` typed_value)
* Length: 10, Null count: 3
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
|--------------------------|-----------|-----------------------|
| 11110111 | 00000000 | 0 (padding) |
* Children arrays:
* field-0 array (`Struct<VarBinary, String>` event_type)
* Length: 10, Null count: 0
* Validity bitmap buffer: not required
* Children arrays
* field-0 array (`VarBinary` value)
* Length: 10, Null count: 9
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
|--------------------------|-----------|-----------------------|
| 00100000                 | 00000000  | 0 (padding)           |
* Offsets buffer (int32)
| Bytes 0-43 (int32) | Bytes 44-63 |
|---------------------------------|-------------------------|
| 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 | unspecified (padding) |
* Value buffer:
| Byte 0 | Bytes 1-63 |
|--------|------------------------|
| 00 | unspecified (padding) |
* field-1 array (`String` typed_value)
* Length: 10, Null count: 7
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
|--------------------------|-----------|-----------------------|
| 01000011 | 00000000 | 0 (padding) |
* Offsets buffer (int32)
| Bytes 0-43 (int32)                  | Bytes 44-63            |
|-------------------------------------|------------------------|
| 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding) |
* Value buffer:
| Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63 |
|-----------|-----------|------------|------------------------|
| noop | login | noop | unspecified (padding) |
* field-1 array (`Struct<VarBinary, Timestamp>` event_ts)
* Length: 10, Null count: 0
* Validity bitmap buffer: not required
* Children arrays
* field-0 array (`VarBinary` value)
* Length: 10, Null count: 9
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
|--------------------------|-----------|-----------------------|
| 01000000 | 00000000 | 0 (padding) |
* Offsets buffer (int32)
| Bytes 0-43 (int32) | Bytes 44-63 |
|---------------------------------|-------------------------|
| ... | unspecified (padding) |
* Value buffer:
| VariantEncode("2024-10-24") |
* field-1 array (`Timestamp(us, UTC)` typed_value)
* Length: 10, Null count: 6
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
|--------------------------|-----------|-----------------------|
| 00110011 | 00000000 | 0 (padding) |
* Value buffer:
| Bytes 0-7 | Bytes 8-15 | Bytes 16-31 | Bytes 32-39 | Bytes 40-47 | Bytes 48-63 |
|---------------|---------------|--------------|---------------|---------------|------------------------|
| 1729794114937 | 1729794146402 | unspecified | 1729794240241 | 1729794954163 | unspecified (padding) |
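The metadata offsets buffer in the layout above can be checked with a few lines of arithmetic. The Python sketch below rebuilds each row's metadata bytes exactly as they are drawn in the metadata value buffer table (`01 00` for an empty dictionary, `01 01 00 XX` followed by the key bytes for a single key) and accumulates their lengths. `encode_metadata` is a hypothetical helper that mirrors the bytes shown on this page; it is not meant as a spec-complete Variant metadata encoder.

```python
from itertools import accumulate


def encode_metadata(keys: list) -> bytes:
    """Mirror the byte layout drawn in the table above (one-byte sizes and offsets)."""
    if not keys:
        # Drawn in this document as version byte 01 followed by dictionary size 00.
        return bytes([0x01, 0x00])
    encoded = [key.encode("utf-8") for key in keys]
    offsets = [0, *accumulate(len(item) for item in encoded)]
    return bytes([0x01, len(encoded), *offsets]) + b"".join(encoded)


# Dictionary keys left in the unshredded `value` column, one entry per row.
rows = [[], ["email"], ["error_msg"], [], ["click"], [], [], [], [], []]

buffers = [encode_metadata(keys) for keys in rows]
offsets = [0, *accumulate(len(buf) for buf in buffers)]

print(buffers[1].hex(" "))
print(offsets)
```

Only the keys that remain in the unshredded value column appear here, because event_type and event_ts are carried by the typed_value struct instead.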
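Putting it all together
As mentioned, the typed_value field associated with a Variant value can be any shredded type. As long as the original rules are followed, you can therefore have as many levels of nesting as you need, depending on how you want to shred the object. For example, we might have more fields than just event_type to shred, perhaps an object that looks like this: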
{
"event_type": "login",
"event_ts": 1729794114937,
"location": {"longitude": 1.5, "latitude": 5.5},
"tags": ["foo", "bar", "baz"]
}
If we shred the extra fields and represent them in Parquet, it would look like:
optional group event (VARIANT) {
required binary metadata;
optional binary value; # variant, remaining fields/values
optional group typed_value { # shredded fields for variant object
required group event_type { # event_type shredded field
optional binary value;
optional binary typed_value (STRING);
}
required group event_ts { # event_ts shredded field
optional binary value;
optional int64 typed_value (TIMESTAMP(true, MICROS));
}
required group location { # location shredded field
optional binary value;
optional group typed_value {
required group longitude {
optional binary value;
optional double typed_value;
}
required group latitude {
optional binary value;
optional double typed_value;
}
}
}
required group tags { # tags shredded field
optional binary value;
optional group typed_value (LIST) {
repeated group list {
required group element {
optional binary value;
optional binary typed_value (STRING);
}
}
}
}
}
}
Finally, following the rules we established for constructing the storage type of the Variant extension type, we get:
struct<
metadata: binary non-nullable,
value: binary nullable,
typed_value: struct<
event_type: struct<value: binary nullable, typed_value: string nullable> non-nullable,
event_ts: struct<value: binary nullable, typed_value: timestamp(us, UTC) nullable> non-nullable,
location: struct<
value: binary nullable,
typed_value: struct<
longitude: struct<value: binary nullable, typed_value: double nullable> non-nullable,
latitude: struct<value: binary nullable, typed_value: double nullable> non-nullable
> nullable> non-nullable,
tags: struct<
value: binary nullable,
typed_value: list<struct<value: binary nullable, typed_value: string nullable> non-nullable> nullable
> non-nullable
> nullable
>
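The storage type above follows a purely mechanical pattern, which the Python sketch below tries to capture. It only assembles type descriptions as strings in the notation used on this page; the helper names (`shredded`, `shredded_object`, `shredded_list`, `variant_storage`) are hypothetical and do not correspond to any Arrow API.

```python
def shredded(typed: str) -> str:
    """A shredded slot: a nullable variant `value` next to a nullable `typed_value`."""
    return f"struct<value: binary nullable, typed_value: {typed} nullable>"


def shredded_object(fields: dict) -> str:
    """Shredded object fields are always non-nullable structs."""
    inner = ", ".join(f"{name}: {ftype} non-nullable" for name, ftype in fields.items())
    return f"struct<{inner}>"


def shredded_list(element: str) -> str:
    """List elements cannot be missing, so the element struct is non-nullable."""
    return f"list<{element} non-nullable>"


def variant_storage(typed: str) -> str:
    """Top-level storage type: metadata, value, and the shredded typed_value."""
    return (
        "struct<metadata: binary non-nullable, value: binary nullable, "
        f"typed_value: {typed} nullable>"
    )


location = shredded(shredded_object({
    "longitude": shredded("double"),
    "latitude": shredded("double"),
}))
tags = shredded(shredded_list(shredded("string")))

event = variant_storage(shredded_object({
    "event_type": shredded("string"),
    "event_ts": shredded("timestamp(us, UTC)"),
    "location": location,
    "tags": tags,
}))

print(event)
```

Apart from whitespace, the printed string should line up with the storage type listed above. Adding another nesting level only takes one more call to `shredded`.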