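Canonical Extension Examples

Parquet Variant Extension

Unshredded

In the simplest case, an unshredded variant always consists of exactly two fields: metadata and value. Any of the following storage types is valid (this is not an exhaustive list):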

  • struct<metadata: binary non-nullable, value: binary nullable>

  • struct<value: binary nullable, metadata: binary non-nullable>

  • struct<metadata: dictionary<int8, binary> non-nullable, value: binary_view nullable>
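The rule above (exactly two fields, in either order, with a binary-like type for each) can be sketched as a small check. This is a minimal illustration, not an Arrow API: the tuple-based field model, the `BINARY_LIKE` set, and the function name `is_valid_unshredded` are all assumptions made for this example.

```python
# Hypothetical, simplified model of a storage type: a list of
# (field_name, type_name, nullable) tuples. This is not a real Arrow API.
BINARY_LIKE = {"binary", "large_binary", "binary_view"}  # assumed set


def is_valid_unshredded(fields):
    """Return True if `fields` describes an unshredded variant storage type."""
    by_name = {name: (type_name, nullable) for name, type_name, nullable in fields}
    # Exactly two fields, named metadata and value, in any order.
    if len(fields) != 2 or set(by_name) != {"metadata", "value"}:
        return False
    meta_type, meta_nullable = by_name["metadata"]
    value_type, _ = by_name["value"]
    # A dictionary-encoded column is written "dictionary<index, values>" in
    # this sketch; only the dictionary's value type is checked.
    if meta_type.startswith("dictionary<") and meta_type.endswith(">"):
        meta_type = meta_type[len("dictionary<"):-1].split(",")[1].strip()
    return (
        not meta_nullable
        and meta_type in BINARY_LIKE
        and value_type in BINARY_LIKE
    )


print(is_valid_unshredded([("value", "binary", True),
                           ("metadata", "binary", False)]))  # True
```

Field order does not matter in this check because the fields are looked up by name, which mirrors the first two storage types listed above.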

Simple Shredding

Suppose we have a Variant field named measurement, and we want to shred its int64 values into a separate column for efficiency. In Parquet, this could be represented as:

required group measurement (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
}

The storage type of the corresponding arrow.parquet.variant Arrow extension type would then be:

struct<
  metadata: binary non-nullable,
  value: binary nullable,
  typed_value: int64 nullable
>

Suppose a series of measurements consists of:

34, null, "n/a", 100

This data should be stored/represented in Arrow as:

* Length: 4, Null count: 1
* Validity Bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63    |
  |--------------------------|---------------|
  | 00001101                 | 0 (padding)   |

* Children arrays:
  * field-0 array (`VarBinary`)
    * Length: 4, Null count: 0
    * Offsets buffer:

      | Bytes 0-19       | Bytes 20-63              |
      |------------------|--------------------------|
      | 0, 2, 4, 6, 8    | unspecified (padding)    |

    * Value buffer: (01 00 -> indicates version 1 empty metadata)

      | Bytes 0-7               | Bytes 8-63               |
      |-------------------------|--------------------------|
      | 01 00 01 00 01 00 01 00 | unspecified (padding)    |

  * field-1 array (`VarBinary`)
    * Length: 4, Null count: 2
    * Validity Bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63    |
      |--------------------------|---------------|
      | 00000110                 | 0 (padding)   |

    * Offsets buffer:

      | Bytes 0-19       | Bytes 20-63              |
      |------------------|--------------------------|
      | 0, 0, 1, 5, 5    | unspecified (padding)    |

    * Value buffer: (`00` -> null,
      `0x13 0x6E 0x2F 0x61` -> variant encoding literal string "n/a")

      | Bytes 0-4              | Bytes 5-63               |
      |------------------------|--------------------------|
      | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding)    |

  * field-2 array (int64 array)
    * Length: 4, Null count: 2
    * Validity Bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63    |
      |--------------------------|---------------|
      | 00001001                 | 0 (padding)   |

    * Value buffer:

      | Bytes 0-31          | Bytes 32-63              |
      |---------------------|--------------------------|
      | 34, 00, 00, 100     | unspecified (padding)    |

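Note

Notice that the value array contains a variant literal null. The shredding specification requires this so that a consumer can tell the difference between a missing field and a null field. A null element must be encoded as a variant null: basic type 0 (primitive) and physical type 0 (null).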
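The split between value and typed_value in this example can be reproduced with a short sketch. It is an illustration under stated assumptions: `shred_int64` and `bitmap` are hypothetical helper names, non-integer values are kept as plain Python objects rather than being variant-encoded, and only the validity bitmaps of the two child columns are derived.

```python
VARIANT_NULL = b"\x00"  # variant null: basic type 0, physical type 0


def bitmap(valid):
    """Render validity flags as bytes, least-significant bit first."""
    out = []
    for start in range(0, len(valid), 8):
        chunk = valid[start:start + 8]
        bits = sum(1 << i for i, ok in enumerate(chunk) if ok)
        out.append(format(bits, "08b"))
    return " ".join(out)


def shred_int64(values):
    """Split values into (value, typed_value) columns.

    Integers go to typed_value. None becomes a variant null in value.
    Anything else stays in value; a real writer would variant-encode it.
    """
    value_col, typed_col = [], []
    for v in values:
        if isinstance(v, int) and not isinstance(v, bool):
            value_col.append(None)
            typed_col.append(v)
        elif v is None:
            value_col.append(VARIANT_NULL)
            typed_col.append(None)
        else:
            value_col.append(v)
            typed_col.append(None)
    return value_col, typed_col


value_col, typed_col = shred_int64([34, None, "n/a", 100])
print(bitmap([v is not None for v in value_col]))  # 00000110
print(bitmap([v is not None for v in typed_col]))  # 00001001
```

The two printed bitmaps match the validity bitmaps of the field-1 and field-2 arrays in the layout above.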

Shredding an Array

In the next example, we represent a shredded array of strings. Consider a column that looks like this:

["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null

Representing this shredded variant in Parquet could look like:

optional group tags (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value (LIST) { # optional to allow null lists
    repeated group list {
      required group element {        # shredded element
        optional binary value;
        optional binary typed_value (STRING);
      }
    }
  }
}

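The array structure for Variant encoding does not allow missing elements, so all elements of the array must be non-nullable. As such, either typed_value or value (but not both!) must be non-null.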
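As a minimal sketch of that rule, the check below treats `None` as an Arrow null and accepts an element only when exactly one of the two columns is populated. The function name `check_array_element` is an assumption made for this example.

```python
def check_array_element(value, typed_value):
    """Validate one shredded array element; None stands for an Arrow null."""
    # Both null would be a missing element; both non-null is ambiguous.
    if (value is None) == (typed_value is None):
        raise ValueError("exactly one of value and typed_value must be non-null")
    return typed_value if typed_value is not None else value


print(check_array_element(None, "comedy"))  # comedy
```

A null inside an array therefore appears as a variant null in value (for example `b"\x00"`) rather than as two null slots.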

The storage type that represents this in Arrow as a Variant extension type would be:

struct<
  metadata: binary non-nullable,
  value: binary nullable,
  typed_value: list<element: struct<
    value: binary nullable,
    typed_value: string nullable
  > non-nullable> nullable
>

Note

As usual, Binary could also be LargeBinary or BinaryView, String could also be LargeString or StringView, and List could also be LargeList or ListView.

The data would then be stored in Arrow as follows:

* Length: 4, Null count: 1
* Validity Bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63    |
  |--------------------------|---------------|
  | 00000111                 | 0 (padding)   |

* Children arrays:
  * field-0 array (`VarBinary` metadata)
    * Length: 4, Null count: 0
    * Offsets buffer:

      | Bytes 0-19       | Bytes 20-63              |
      |------------------|--------------------------|
      | 0, 2, 4, 6, 8    | unspecified (padding)    |

    * Value buffer: (01 00 -> indicates version 1 empty metadata)

      | Bytes 0-7               | Bytes 8-63               |
      |-------------------------|--------------------------|
      | 01 00 01 00 01 00 01 00 | unspecified (padding)    |

  * field-1 array (`VarBinary` value)
    * Length: 4, Null count: 1
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63    |
      |--------------------------|---------------|
      | 00001000                 | 0 (padding)   |

    * Offsets buffer:

      | Bytes 0-19       | Bytes 20-63              |
      |------------------|--------------------------|
      | 0, 0, 0, 0, 1    | unspecified (padding)    |

    * Value buffer: (00 -> variant null)

      | Bytes 0            | Bytes 1-63               |
      |--------------------|--------------------------|
      | 00                 | unspecified (padding)    |

  * field-2 array (`List<Struct<VarBinary, String>>` typed_value)
    * Length: 4, Null count: 1
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63  |
      |--------------------------|-------------|
      | 00000111                 | 0 (padding) |

    * Offsets buffer (int32)

      | Bytes 0-19        | Bytes 20-63           |
      |-------------------|-----------------------|
      | 0, 2, 4, 7, 7     | unspecified (padding) |

    * Values array (`Struct<VarBinary, String>` element):
      * Length: 7, Null count: 0
      * Validity bitmap buffer: Not required

      * Children arrays:
        * field-0 array (`VarBinary` value)
          * Length: 7, Null count: 6
          * Validity bitmap buffer:

            | Byte 0 (validity bitmap) | Bytes 1-63  |
            |--------------------------|-------------|
            | 00001000                 | 0 (padding) |

          * Offsets buffer (int32):

            | Bytes 0-31                | Bytes 32-63              |
            |---------------------------|--------------------------|
            | 0, 0, 0, 0, 1, 1, 1, 1    | unspecified (padding)    |

          * Values buffer (`00` -> variant null):

            | Bytes 0            | Bytes 1-63               |
            |--------------------|--------------------------|
            | 00                 | unspecified (padding)    |

        * field-1 array (`String` typed_value)
          * Length: 7, Null count: 1
          * Validity bitmap buffer:

            | Byte 0 (validity bitmap) | Bytes 1-63  |
            |--------------------------|-------------|
            | 01110111                 | 0 (padding) |

          * Offsets buffer (int32):

            | Bytes 0-31                      | Bytes 32-63              |
            |---------------------------------|--------------------------|
            | 0, 6, 11, 17, 17, 23, 28, 35    | unspecified (padding)    |

          * Values buffer:

            | Bytes 0-34                           | Bytes 35-63              |
            |--------------------------------------|--------------------------|
            | comedydramahorrorcomedydramaromance  | unspecified (padding)    |
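The list offsets, string offsets, and string data in the layout above follow mechanically from the input column. The sketch below derives them; `shred_string_lists` is a hypothetical helper, and it leaves out the element-level value column (the one that holds the variant null for the null element).

```python
def shred_string_lists(rows):
    """Flatten rows of string lists into list offsets and string buffers.

    A row of None is a null list. A None element is a null inside a list:
    it gets a null typed_value slot here (the matching variant null in the
    element's value column is not built by this sketch).
    """
    list_offsets = [0]
    list_valid = []
    str_offsets = [0]
    str_valid = []
    data = bytearray()
    for row in rows:
        list_valid.append(row is not None)
        for item in row or []:
            str_valid.append(item is not None)
            if item is not None:
                data += item.encode("utf-8")
            str_offsets.append(len(data))
        list_offsets.append(len(str_valid))
    return list_offsets, list_valid, str_offsets, str_valid, bytes(data)


rows = [["comedy", "drama"], ["horror", None],
        ["comedy", "drama", "romance"], None]
list_offsets, list_valid, str_offsets, str_valid, data = shred_string_lists(rows)
print(list_offsets)  # [0, 2, 4, 7, 7]
print(str_offsets)   # [0, 6, 11, 17, 17, 23, 28, 35]
print(data)          # b'comedydramahorrorcomedydramaromance'
```

The null element repeats the previous string offset (17), and the null list repeats the previous list offset (7), so neither consumes any bytes.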

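Shredding an Object

Consider a JSON column of "events" that contains a field named event_type (a string) and a field named event_ts (a timestamp), both of which we want to shred into separate columns. In Parquet, it could look something like this: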

optional group event (VARIANT) {
  required binary metadata;
  optional binary value;        # variant, remaining fields/values
  optional group typed_value {  # shredded fields for variant object
    required group event_type { # event_type shredded field
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_ts {   # event_ts shredded field
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
  }
}

We can then translate this into the expected extension storage type:

struct<
  metadata: binary non-nullable,
  value: binary nullable,
  typed_value: struct<
    event_type: struct<
      value: binary nullable,
      typed_value: string nullable
    > non-nullable,
    event_ts: struct<
      value: binary nullable,
      typed_value: timestamp(us, UTC) nullable
    > non-nullable
  > nullable
>

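If a field does not exist in the variant object value, then both the value and typed_value columns for that row will be null. If a field is present but its value is null, then value must contain a variant null.

It is invalid for both value and typed_value to be non-null at a given index. A reader may choose not to raise an error in this scenario, but if it does not, it must use the value in the typed_value column for that index.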
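A reader's handling of one shredded field can be sketched as follows. This is an illustration rather than a reference implementation: `resolve_field`, the `MISSING` sentinel, and the `strict` flag are names assumed for this example, and the variant bytes in value are returned undecoded.

```python
MISSING = object()      # sentinel: the field is absent from the object
VARIANT_NULL = b"\x00"  # variant null as written in this document


def resolve_field(value, typed_value, strict=True):
    """Resolve one shredded object field; None stands for an Arrow null.

    Returns MISSING, None (a present-but-null field), the typed value, or
    the still-encoded variant bytes taken from `value`.
    """
    if value is not None and typed_value is not None:
        if strict:
            raise ValueError("value and typed_value are both non-null")
        return typed_value  # a lenient reader must prefer typed_value
    if typed_value is not None:
        return typed_value
    if value is None:
        return MISSING      # both null: the field does not exist
    if value == VARIANT_NULL:
        return None         # the field exists and is null
    return value            # the caller decodes the variant bytes


print(resolve_field(None, "login"))                 # login
print(resolve_field(None, None) is MISSING)         # True
print(resolve_field(VARIANT_NULL, None))            # None
```

Using a dedicated sentinel for a missing field keeps it distinct from a field that is present with a null value, which is the distinction the variant null exists to preserve.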

Let's consider the following series of objects:

{"event_type": "noop", "event_ts": 1729794114937}

{"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"}

{"error_msg": "malformed..."}

"malformed: not an object"

{"event_ts": 1729794240241, "click": "_button"}

{"event_type": null, "event_ts": 1729794954163}

{"event_type": "noop", "event_ts": "2024-10-24"}

{}

null

*Entirely missing*
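For these rows, only the keys that are not shredded (email, error_msg, click) need to appear in each row's metadata dictionary. The sketch below builds per-row metadata and the resulting offsets using the simplified byte layout this document uses (`01 00` for an empty dictionary, otherwise a version byte, a dictionary size, one-byte offsets, then the string bytes). It is an assumption-laden illustration: `metadata_bytes` is a hypothetical helper, and the actual Variant metadata encoding carries more detail than is modeled here.

```python
SHREDDED = {"event_type", "event_ts"}  # fields moved into typed_value


def metadata_bytes(row):
    """Build per-row metadata in the simplified layout used in this document."""
    keys = [k for k in row if k not in SHREDDED] if isinstance(row, dict) else []
    if not keys:
        return b"\x01\x00"  # version 1, empty dictionary
    encoded = [k.encode("utf-8") for k in keys]
    offsets = [0]
    for e in encoded:
        offsets.append(offsets[-1] + len(e))
    # version byte, dictionary size, offsets, then the string bytes
    return bytes([1, len(keys)]) + bytes(offsets) + b"".join(encoded)


rows = [
    {"event_type": "noop", "event_ts": 1729794114937},
    {"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"},
    {"error_msg": "malformed..."},
    "malformed: not an object",
    {"event_ts": 1729794240241, "click": "_button"},
    {"event_type": None, "event_ts": 1729794954163},
    {"event_type": "noop", "event_ts": "2024-10-24"},
    {},
    None,  # JSON null
    None,  # entirely missing row; metadata is still non-nullable
]

offsets = [0]
for row in rows:
    offsets.append(offsets[-1] + len(metadata_bytes(row)))
print(offsets)  # [0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45]
```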

To represent these values as a column of Variant values using the Variant extension type, we get the following:

* Length: 10, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            |
  |--------------------------|-----------|-----------------------|
  | 11111111                 | 00000001  | 0 (padding)           |

* Children arrays
  * field-0 array (`VarBinary` Metadata)
    * Length: 10, Null count: 0
    * Offsets buffer:

      | Bytes 0-43 (int32)                       | Bytes 44-63             |
      |------------------------------------------|-------------------------|
      | 0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45 | unspecified (padding)   |

    * Value buffer: (01 00 -> version 1 empty metadata,
                     01 01 00 XX ... -> Version 1, metadata with 1 elem, offset 0, offset XX == len(string), ... is dict string bytes)

      | Bytes 0-1 | Bytes 2-10        | Bytes 11-23           | Bytes 24-25 | Bytes 26-34       |
      |-----------|-------------------|-----------------------|-------------|-------------------|
      | 01 00     | 01 01 00 05 email | 01 01 00 09 error_msg | 01 00       | 01 01 00 05 click |

      | Bytes 35-36 | Bytes 37-38 | Bytes 39-40 | Bytes 41-42 | Bytes 43-44 | Bytes 45-63           |
      |-------------|-------------|-------------|-------------|-------------|-----------------------|
      | 01 00       | 01 00       | 01 00       | 01 00       | 01 00       | unspecified (padding) |

  * field-1 array (`VarBinary` Value)
    * Length: 10, Null count: 5
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap)  | Byte 1    | Bytes 2-63            |
      |---------------------------|-----------|-----------------------|
      | 00011110                  | 00000001  | 0 (padding)           |

    * Offsets buffer (filled in based on lengths of encoded variants):

      | ... |

    * Value buffer:

      | VariantEncode({"email": "user@example.com"}) | VariantEncode({"error_msg": "malformed..."}) |
      | VariantEncode("malformed: not an object")  | VariantEncode({"click": "_button"})          | 00 (null) |

  * field-2 array (`Struct<...>` typed_value)
    * Length: 10, Null count: 3
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            |
      |--------------------------|-----------|-----------------------|
      | 11110111                 | 00000000  | 0 (padding)           |

    * Children arrays:
      * field-0 array (`Struct<VarBinary, String>` event_type)
        * Length: 10, Null count: 0
        * Validity bitmap buffer: not required

        * Children arrays
          * field-0 array (`VarBinary` value)
            * Length: 10, Null count: 9
            * Validity bitmap buffer:

              | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            |
              |--------------------------|-----------|-----------------------|
              | 00100000                 | 00000000  | 0 (padding)           |

            * Offsets buffer (int32)

              | Bytes 0-43 (int32)              | Bytes 44-63             |
              |---------------------------------|-------------------------|
              | 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 | unspecified (padding)   |

            * Value buffer:

              | Byte 0 | Bytes 1-63             |
              |--------|------------------------|
              | 00     | unspecified (padding)  |

          * field-1 array (`String` typed_value)
            * Length: 10, Null count: 7
            * Validity bitmap buffer:

              | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            |
              |--------------------------|-----------|-----------------------|
              | 01000011                 | 00000000  | 0 (padding)           |

            * Offsets buffer (int32)

              | Byte 0-43                           | Bytes 44-63            |
              |-------------------------------------|------------------------|
              | 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding)  |

            * Value buffer:

              | Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63            |
              |-----------|-----------|------------|------------------------|
              | noop      | login     | noop       | unspecified (padding)  |


      * field-1 array (`Struct<VarBinary, Timestamp>` event_ts)
        * Length: 10, Null count: 0
        * Validity bitmap buffer: not required

        * Children arrays
          * field-0 array (`VarBinary` value)
            * Length: 10, Null count: 9
            * Validity bitmap buffer:

              | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            |
              |--------------------------|-----------|-----------------------|
              | 01000000                 | 00000000  | 0 (padding)           |

            * Offsets buffer (int32)

              | Bytes 0-43 (int32)              | Bytes 44-63             |
              |---------------------------------|-------------------------|
              | ...                             | unspecified (padding)   |

            * Value buffer:

              | VariantEncode("2024-10-24")     |

          * field-1 array (`Timestamp(us, UTC)` typed_value)
            * Length: 10, Null count: 6
            * Validity bitmap buffer:

              | Byte 0 (validity bitmap) | Byte 1    | Bytes 2-63            |
              |--------------------------|-----------|-----------------------|
              | 00110011                 | 00000000  | 0 (padding)           |

            * Value buffer:

              | Bytes 0-7     | Bytes 8-15    | Bytes 16-31  | Bytes 32-39   | Bytes 40-47   | Bytes 48-63            |
              |---------------|---------------|--------------|---------------|---------------|------------------------|
              | 1729794114937 | 1729794146402 | unspecified  | 1729794240241 | 1729794954163 | unspecified (padding)  |
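Several of the validity bitmaps in this layout can be derived directly from the ten input rows. The following sketch does so under stated assumptions: `ABSENT` is a hypothetical marker for the "entirely missing" row, timestamps are recognized simply as Python integers, values are not variant-encoded, and all helper names are invented for this example.

```python
VARIANT_NULL = b"\x00"
ABSENT = object()  # hypothetical marker for the "entirely missing" row
SHREDDED = {"event_type": str, "event_ts": int}


def bitmap(valid):
    """Render validity flags as bytes, least-significant bit first."""
    chunks = [valid[i:i + 8] for i in range(0, len(valid), 8)]
    return " ".join(
        format(sum(1 << i for i, ok in enumerate(chunk) if ok), "08b")
        for chunk in chunks
    )


def shred_field(obj, name, typed_cls):
    """Return (value, typed_value) for one shredded field of a dict."""
    if name not in obj:
        return None, None          # missing: both columns null
    v = obj[name]
    if v is None:
        return VARIANT_NULL, None  # present but null
    if isinstance(v, typed_cls):
        return None, v             # matches the shredded type
    return v, None                 # other type: stays in value (unencoded here)


def shred_events(rows):
    """Return one validity bitmap per column, keyed by a dotted path."""
    valid = {"struct": [], "value": [], "typed_value": []}
    for name in SHREDDED:
        valid[f"typed_value.{name}.value"] = []
        valid[f"typed_value.{name}.typed_value"] = []
    for row in rows:
        is_obj = isinstance(row, dict)
        leftover = is_obj and any(k not in SHREDDED for k in row)
        valid["struct"].append(row is not ABSENT)
        # value holds leftover fields, non-object values, or a variant null
        valid["value"].append(leftover or (not is_obj and row is not ABSENT))
        valid["typed_value"].append(is_obj)
        for name, cls in SHREDDED.items():
            v, t = shred_field(row, name, cls) if is_obj else (None, None)
            valid[f"typed_value.{name}.value"].append(v is not None)
            valid[f"typed_value.{name}.typed_value"].append(t is not None)
    return {path: bitmap(flags) for path, flags in valid.items()}


rows = [
    {"event_type": "noop", "event_ts": 1729794114937},
    {"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"},
    {"error_msg": "malformed..."},
    "malformed: not an object",
    {"event_ts": 1729794240241, "click": "_button"},
    {"event_type": None, "event_ts": 1729794954163},
    {"event_type": "noop", "event_ts": "2024-10-24"},
    {},
    None,
    ABSENT,
]
bitmaps = shred_events(rows)
print(bitmaps["value"])                            # 00011110 00000001
print(bitmaps["typed_value"])                      # 11110111 00000000
print(bitmaps["typed_value.event_ts.typed_value"]) # 00110011 00000000
```

Note that the empty object `{}` keeps typed_value valid while every shredded field inside it is null, whereas the non-object string and the JSON null leave typed_value null and are carried in value instead.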

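Putting it all together

As mentioned, the typed_value field associated with a Variant value can be any shredded type. As a result, as long as the original rules are followed, you can have an arbitrary number of nesting levels, depending on how you want to shred the object. For example, we might have more fields to shred alongside event_type, such as an object that looks like this: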

{
  "event_type": "login",
  "event_ts": 1729794114937,
  "location": {"longitude": 1.5, "latitude": 5.5},
  "tags": ["foo", "bar", "baz"]
}

If we shred the extra fields and represent this in Parquet, it looks like:

optional group event (VARIANT) {
  required binary metadata;
  optional binary value;        # variant, remaining fields/values
  optional group typed_value {  # shredded fields for variant object
    required group event_type { # event_type shredded field
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_ts {   # event_ts shredded field
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
    required group location {   # location shredded field
      optional binary value;
      optional group typed_value {
        required group longitude {
          optional binary value;
          optional double typed_value;
        }
        required group latitude {
          optional binary value;
          optional double typed_value;
        }
      }
    }
    required group tags {       # tags shredded field
      optional binary value;
      optional group typed_value (LIST) {
        repeated group list {
          required group element {
            optional binary value;
            optional binary typed_value (STRING);
          }
        }
      }
    }
  }
}

Finally, following the rules we set forth for constructing the Variant extension type storage type, we get:

struct<
  metadata: binary non-nullable,
  value: binary nullable,
  typed_value: struct<
    event_type: struct<value: binary nullable, typed_value: string nullable> non-nullable,
    event_ts: struct<value: binary nullable, typed_value: timestamp(us, UTC) nullable> non-nullable,
    location: struct<
      value: binary nullable,
      typed_value: struct<
        longitude: struct<value: binary nullable, typed_value: double nullable> non-nullable,
        latitude: struct<value: binary nullable, typed_value: double nullable> non-nullable
      > nullable> non-nullable,
    tags: struct<
        value: binary nullable,
        typed_value: list<struct<value: binary nullable, typed_value: string nullable> non-nullable> nullable
      > non-nullable
  > nullable
>
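The construction rules used throughout these examples are recursive, so the storage type can be generated from a compact description of what is shredded. The sketch below renders it as a single-line string. It is only an illustration: the spec format (a type name, a dict for objects, a one-element list for arrays) and the function names `shredded` and `storage_type` are assumptions made here, and the list element field is named `element` as in the array example.

```python
def shredded(spec):
    """Render the storage type of one shredded field from a nested spec.

    spec is a type name (str), a dict of field name -> spec (an object),
    or a one-element list [spec] (an array).
    """
    if isinstance(spec, dict):
        fields = ", ".join(
            f"{name}: {shredded(sub)} non-nullable" for name, sub in spec.items()
        )
        typed = f"struct<{fields}>"
    elif isinstance(spec, list):
        typed = f"list<element: {shredded(spec[0])} non-nullable>"
    else:
        typed = spec
    return f"struct<value: binary nullable, typed_value: {typed} nullable>"


def storage_type(spec):
    """Top level: same shape as a shredded field, plus the metadata field."""
    inner = shredded(spec)
    return "struct<metadata: binary non-nullable, " + inner[len("struct<"):]


print(storage_type("int64"))
# struct<metadata: binary non-nullable, value: binary nullable, typed_value: int64 nullable>

event = {
    "event_type": "string",
    "event_ts": "timestamp(us, UTC)",
    "location": {"longitude": "double", "latitude": "double"},
    "tags": ["string"],
}
print(storage_type(event))
```

Every nested group is non-nullable while its value and typed_value children are nullable, which is the same pattern at each level of the type shown above.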