9.11. Parquet Converter
The Parquet converter handles data written by Apache Parquet. To use the Parquet converter, specify type = "parquet" in your converter definition.
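As a minimal sketch, a converter definition that selects the Parquet converter might start as follows. The converter name example and the single color field are placeholders borrowed from the Example Usage section, which shows a complete definition.

```hocon
geomesa.converters.example = {
  // selects the Parquet converter
  type     = "parquet"
  id-field = "avroPath($0, '/id')"
  fields   = [
    { name = "color", transform = "avroPath($0, '/color')" }
  ]
}
```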
9.11.1. Configuration
The Parquet converter supports parsing whole Parquet files. Because the Parquet API requires random access to the file, the file path must be specified in the EvaluationContext. For the same reason, pure streaming conversion is not possible (i.e. piping data into the ingest or convert command through shell redirection).
As Parquet does not define any object model, standard practice is to parse a Parquet file into Avro GenericRecords. The Avro GenericRecord being parsed is available to field transforms as $0.
9.11.2. Avro Paths
Because Parquet files are converted into Avro records, Avro paths can be used to select elements. See Avro Converter for details on Avro paths. Note that the result of an Avro path expression is typed according to the Parquet column type (e.g. String, Double, or List).
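For illustration, assuming a record with a top-level color column and a nested physical group (column names borrowed from the Example Usage section), path selection might look like this:

```hocon
fields = [
  // top-level column
  { name = "color",  transform = "avroPath($0, '/color')" }
  // column nested inside the 'physical' group
  { name = "weight", transform = "avroPath($0, '/physical/weight')" }
]
```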
9.11.3. Parquet Transform Functions
GeoMesa defines several Parquet-specific transform functions, in addition to the ones defined under Avro Transform Functions.
9.11.3.1. parquetPoint
Description: Parses a nested Point structure from a Parquet record
Usage: parquetPoint($ref, $pathString)
$ref - a reference object (Avro root record or extracted object)
pathString - forward-slash delimited path string. See Avro Paths, above
The point function can parse GeoMesa-encoded Point columns, which consist of a Parquet group of two double-type columns named x and y.
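A sketch of how this function might appear in a converter definition, assuming the file contains a GeoMesa-encoded Point column named position (a hypothetical column name, not taken from any real schema):

```hocon
fields = [
  // 'position' is an assumed column: a group holding double columns x and y
  { name = "geom", transform = "parquetPoint($0, '/position')" }
]
```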
9.11.3.2. parquetLineString
Description: Parses a nested LineString structure from a Parquet record
Usage: parquetLineString($ref, $pathString)
$ref - a reference object (Avro root record or extracted object)
pathString - forward-slash delimited path string. See Avro Paths, above
The linestring function can parse GeoMesa-encoded LineString columns, which consist of a Parquet group of two repeated double-type columns named x and y.
9.11.3.3. parquetPolygon
Description: Parses a nested Polygon structure from a Parquet record
Usage: parquetPolygon($ref, $pathString)
$ref - a reference object (Avro root record or extracted object)
pathString - forward-slash delimited path string. See Avro Paths, above
The polygon function can parse GeoMesa-encoded Polygon columns, which consist of a Parquet group of two list-type columns named x and y. The list elements are repeated double-type columns.
9.11.3.4. parquetMultiPoint
Description: Parses a nested MultiPoint structure from a Parquet record
Usage: parquetMultiPoint($ref, $pathString)
$ref - a reference object (Avro root record or extracted object)
pathString - forward-slash delimited path string. See Avro Paths, above
The multi-point function can parse GeoMesa-encoded MultiPoint columns, which consist of a Parquet group of two repeated double-type columns named x and y.
9.11.3.5. parquetMultiLineString
Description: Parses a nested MultiLineString structure from a Parquet record
Usage: parquetMultiLineString($ref, $pathString)
$ref - a reference object (Avro root record or extracted object)
pathString - forward-slash delimited path string. See Avro Paths, above
The multi-linestring function can parse GeoMesa-encoded MultiLineString columns, which consist of a Parquet group of two list-type columns named x and y. The list elements are repeated double-type columns.
9.11.3.6. parquetMultiPolygon
Description: Parses a nested MultiPolygon structure from a Parquet record
Usage: parquetMultiPolygon($ref, $pathString)
$ref - a reference object (Avro root record or extracted object)
pathString - forward-slash delimited path string. See Avro Paths, above
The multi-polygon function can parse GeoMesa-encoded MultiPolygon columns, which consist of a Parquet group of two list-type columns named x and y. The list elements are also lists, and the nested list elements are repeated double-type columns.
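The other geometry functions follow the same calling pattern as parquetPoint. As a rough sketch, assuming a file with GeoMesa-encoded columns named track, area and regions (all hypothetical names), the transforms might be written as:

```hocon
fields = [
  // every path below names an assumed group column with x and y children
  { name = "track",   transform = "parquetLineString($0, '/track')" }
  { name = "area",    transform = "parquetPolygon($0, '/area')" }
  { name = "regions", transform = "parquetMultiPolygon($0, '/regions')" }
]
```

The functions differ only in how deeply the x and y columns are nested: flat doubles for a Point, repeated doubles for a LineString or MultiPoint, lists of repeated doubles for a Polygon or MultiLineString, and lists of lists for a MultiPolygon.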
9.11.4. Example Usage
For this example we’ll consider the following JSON file:
{ "id": 1, "number": 123, "color": "red", "physical": { "weight": 127.5, "height": "5'11" }, "lat": 0, "lon": 0 }
{ "id": 2, "number": 456, "color": "blue", "physical": { "weight": 150, "height": "5'11" }, "lat": 1, "lon": 1 }
{ "id": 3, "number": 789, "color": "green", "physical": { "weight": 200.4, "height": "6'2" }, "lat": 4.4, "lon": 3.3 }
This file can be converted to Parquet using Spark:
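import org.apache.spark.sql.SparkSession

// Start a local Spark session
val session = SparkSession.builder().appName("testSpark").master("local[*]").getOrCreate()
// Read the line-delimited JSON file; Spark infers the schema
val df = session.read.json("/tmp/example.json")
// Write gzip-compressed Parquet; Spark creates a directory of part files at this path
df.write.option("compression", "gzip").parquet("/tmp/example.parquet")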
The following SimpleFeatureType and converter would be sufficient to parse the resulting Parquet file:
{
  "geomesa" : {
    "sfts" : {
      "example" : {
        "fields" : [
          { "name" : "color", "type" : "String" }
          { "name" : "number", "type" : "Long" }
          { "name" : "height", "type" : "String" }
          { "name" : "weight", "type" : "Double" }
          { "name" : "geom", "type" : "Point", "srid" : 4326 }
        ]
      }
    },
    "converters" : {
      "example" : {
        "type" : "parquet",
        "id-field" : "avroPath($0, '/id')",
        "fields" : [
          { "name" : "color", "transform" : "avroPath($0,'/color')" },
          { "name" : "number", "transform" : "avroPath($0,'/number')" },
          { "name" : "height", "transform" : "avroPath($0,'/physical/height')" },
          { "name" : "weight", "transform" : "avroPath($0,'/physical/weight')" },
          { "name" : "geom", "transform" : "point(avroPath($0,'/lon'),avroPath($0,'/lat'))" }
        ],
        "options" : {
          "encoding" : "UTF-8",
          "error-mode" : "log-errors",
          "parse-mode" : "incremental",
          "validators" : [ "index" ]
        }
      }
    }
  }
}