11.4. Spatial RDD Providers
11.4.1. Accumulo RDD Provider
The AccumuloSpatialRDDProvider is a spatial RDD provider for Accumulo data stores. The core code is in the geomesa-accumulo-spark module, and the shaded JARs with dependencies are available in the geomesa-accumulo-spark-runtime-accumulo1 and geomesa-accumulo-spark-runtime-accumulo2 modules.
Note
The GeoMesa Spark runtime JARs are convenient bundles of all the required dependencies for each data store. There are two Accumulo Spark runtime JARs, one for Accumulo 1.x (geomesa-accumulo-spark-runtime-accumulo1) and one for Accumulo 2.x (geomesa-accumulo-spark-runtime-accumulo2). Make sure that you use the JAR corresponding to your Accumulo version.
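As a minimal sketch of the version check described in the note above, the following helper maps an Accumulo version string to the matching runtime module name. The object and method names (AccumuloRuntimeJar, moduleFor) are hypothetical and are not part of GeoMesa; only the two module names come from this section.

```scala
// Hypothetical helper, not part of GeoMesa: picks the runtime module
// named in this section from the major version of an Accumulo release.
object AccumuloRuntimeJar {
  def moduleFor(accumuloVersion: String): String =
    // Take the text before the first '.' as the major version
    accumuloVersion.trim.takeWhile(_ != '.') match {
      case "1" => "geomesa-accumulo-spark-runtime-accumulo1"
      case "2" => "geomesa-accumulo-spark-runtime-accumulo2"
      case other =>
        throw new IllegalArgumentException(
          s"No runtime JAR is listed here for Accumulo major version '$other'")
    }
}
```

Failing loudly on an unrecognized version is a deliberate choice in this sketch: putting the wrong runtime JAR on the classpath would otherwise only surface later, at connection time.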
This provider can read from and write to a GeoMesa AccumuloDataStore. The configuration parameters are the same as those passed to DataStoreFinder.getDataStore(). See Accumulo Data Store Parameters for details.
The feature type to access in GeoMesa is the type name of the query passed to the rdd() method. For example, to load an RDD of features of type gdelt from the geomesa Accumulo table:
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
val params = Map(
  "accumulo.instance.id" -> "mycloud",
  "accumulo.user"        -> "user",
  "accumulo.password"    -> "password",
  "accumulo.zookeepers"  -> "zoo1,zoo2,zoo3",
  "accumulo.catalog"     -> "geomesa")
val query = new Query("gdelt")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
11.4.2. HBase RDD Provider
The HBaseSpatialRDDProvider is a spatial RDD provider for HBase data stores. The core code is in the geomesa-hbase-spark module, and the shaded JARs with dependencies (which contain all the required dependencies for execution) are available in the geomesa-hbase-spark-runtime-hbase1 and geomesa-hbase-spark-runtime-hbase2 modules.
Note
The GeoMesa Spark runtime JARs are convenient bundles of all the required dependencies for each data store. There are two HBase Spark runtime JARs, one for HBase 1.x (geomesa-hbase-spark-runtime-hbase1) and one for HBase 2.x (geomesa-hbase-spark-runtime-hbase2). Make sure that you use the JAR corresponding to your HBase version.
This provider can read from and write to a GeoMesa HBaseDataStore. The configuration parameters are the same as those passed to DataStoreFinder.getDataStore(). See HBase Data Store Parameters for details.
Note
Connecting to HBase generally requires the hbase-site.xml file to be available on the Spark classpath. This may be accomplished by specifying it with --jars. For example:
$ spark-shell --jars file:///opt/geomesa/dist/spark/geomesa-hbase-spark-runtime-hbase1_${VERSION}.jar,file:///usr/lib/hbase/conf/hbase-site.xml
Alternatively, you may specify the zookeepers in the data store parameter map. However, this may not work for every HBase setup.
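The --jars value in the note above is a comma-separated list of file URIs. As a small sketch, assuming absolute local paths, the list can be assembled programmatically (for instance when generating a launch script). The object and method names here (SparkJarsArg, jarsArg) are hypothetical and are not part of GeoMesa or Spark.

```scala
// Hypothetical helper, not part of GeoMesa or Spark: builds the
// comma-separated file:// list used as the value of --jars.
object SparkJarsArg {
  def jarsArg(absolutePaths: Seq[String]): String = {
    // Assumption: every entry is an absolute local path such as /usr/lib/hbase/conf/hbase-site.xml
    require(absolutePaths.forall(_.startsWith("/")), "expected absolute local paths")
    // "file://" + "/opt/..." yields the "file:///opt/..." form shown in the note
    absolutePaths.map(p => s"file://$p").mkString(",")
  }
}
```

Passing the runtime JAR path followed by the hbase-site.xml path reproduces the shape of the spark-shell command above.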
The feature type to access in GeoMesa is the type name of the query passed to the rdd() method. For example, to load an RDD of features of type gdelt from the geomesa HBase table:
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
val params = Map("hbase.zookeepers" -> "zoo1,zoo2,zoo3", "hbase.catalog" -> "geomesa")
val query = new Query("gdelt")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
11.4.3. FileSystem RDD Provider
The FileSystemRDDProvider is a spatial RDD provider for GeoMesa file system data stores. The core code is in the geomesa-fs-spark module, and the shaded JAR with dependencies (which contains all the required dependencies for execution) is available in the geomesa-fs-spark-runtime module.
This provider can read from and write to a GeoMesa FileSystemDataStore. The configuration parameters are the same as those passed to DataStoreFinder.getDataStore(). See FileSystem Data Store Parameters for details.
The feature type to access in GeoMesa is the type name of the query passed to the rdd() method. For example, to load an RDD of features of type gdelt from an S3 bucket:
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
val params = Map("fs.path" -> "s3a://mybucket/geomesa/datastore")
val query = new Query("gdelt")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
See FileSystem Data Store Spark SQL Example for an example of using SparkSQL with the FileSystem data store.
11.4.4. Converter RDD Provider
The ConverterSpatialRDDProvider is provided by the geomesa-spark-converter module.
ConverterSpatialRDDProvider reads features from one or more data files in formats readable by the GeoMesa Convert library, including delimited and fixed-width text, Avro, JSON, and XML files. It takes the following configuration parameters:
geomesa.converter - the converter definition as a Typesafe Config string
geomesa.converter.inputs - input file paths, comma-delimited
geomesa.sft - the SimpleFeatureType, as a spec string, configuration string, or environment lookup name
geomesa.sft.name - (optional) the name of the SimpleFeatureType
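The parameter list above can be assembled with a small builder that joins multiple input paths and leaves out the optional name when it is not given. This is only a sketch: the object and method names (ConverterParams, build) are hypothetical and are not part of GeoMesa; only the four parameter keys come from the list above.

```scala
// Hypothetical helper, not part of GeoMesa: assembles the converter
// provider parameters listed above into a Map[String, String].
object ConverterParams {
  def build(
      converter: String,              // converter definition as a Typesafe Config string
      inputs: Seq[String],            // one or more input file paths
      sft: String,                    // spec string, configuration string, or lookup name
      sftName: Option[String] = None  // optional SimpleFeatureType name
  ): Map[String, String] = {
    require(inputs.nonEmpty, "at least one input path is required")
    // Paths are comma-delimited, so a comma inside a path would be ambiguous
    require(inputs.forall(p => !p.contains(",")), "input paths must not contain commas")
    val base = Map(
      "geomesa.converter"        -> converter,
      "geomesa.converter.inputs" -> inputs.mkString(","),
      "geomesa.sft"              -> sft)
    // Add geomesa.sft.name only when a name was supplied
    base ++ sftName.map(name => "geomesa.sft.name" -> name).toList
  }
}
```

The resulting map has the same shape as the hand-written params map used with GeoMesaSpark(params) in the converter example in this section.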
Consider the example data described in the Example Usage section of the GeoMesa Convert documentation. If the file example.csv contains the example data, and example.conf contains the Typesafe configuration file for the converter, the following Scala code can be used to load this data into an RDD:
import com.typesafe.config.ConfigFactory
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
val exampleConf = ConfigFactory.load("example.conf").root().render()
val params = Map(
  "geomesa.converter"        -> exampleConf,
  "geomesa.converter.inputs" -> "example.csv",
  "geomesa.sft"              -> "phrase:String,dtg:Date,geom:Point:srid=4326",
  "geomesa.sft.name"         -> "example")
val query = new Query("example")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
It is also possible to load the prepackaged converters for public data sources (GDELT, GeoNames, etc.) via Maven or SBT. See Prepackaged Converter Definitions for more details.
Warning
ConverterSpatialRDDProvider is read-only and does not support writing features to data files.
11.4.5. GeoTools RDD Provider
GeoToolsSpatialRDDProvider is provided by the geomesa-gt-spark module.
GeoToolsSpatialRDDProvider generates and saves RDDs of features stored in a generic GeoTools DataStore. The configuration parameters are the same as those passed to DataStoreFinder.getDataStore() to create the data store of interest, plus a required boolean parameter called “geotools” that tells the SPI to load GeoToolsSpatialRDDProvider. For example, to use the PostGIS DataStore with GeoMesa Spark, do the following:
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
val params = Map(
  "geotools" -> "true",
  "dbtype"   -> "postgis",
  "host"     -> "localhost",
  "user"     -> "postgres",
  "passwd"   -> "postgres",
  "port"     -> "5432",
  "database" -> "example")
val query = new Query("locations")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
The name of the feature type to access in the data store is passed as the type name of the query passed to the rdd() method. In the example above, this is “locations”.
Warning
Do not use the GeoTools RDD provider with a GeoMesa data store that has a provider implementation. The providers described above provide additional optimizations to improve read and write performance.
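One way to guard against the mistake described in the warning above is to refuse to add the “geotools” flag when the parameter map looks like it targets a GeoMesa data store. The sketch below is a heuristic, not GeoMesa behavior: the object and method names are hypothetical, and the key set is an assumption limited to the catalog and path keys used in the examples earlier in this section.

```scala
// Hypothetical guard, not part of GeoMesa: adds the "geotools" flag
// only when no GeoMesa-specific keys are present in the parameters.
object GeoToolsProviderGuard {
  // Assumption: these keys, taken from the examples in this section,
  // are enough to recognize a GeoMesa Accumulo, HBase, or FileSystem store.
  private val geoMesaKeys = Set("accumulo.catalog", "hbase.catalog", "fs.path")

  def withGeoToolsFlag(params: Map[String, String]): Map[String, String] = {
    val clash = params.keySet.intersect(geoMesaKeys)
    require(clash.isEmpty,
      s"parameters ${clash.mkString(", ")} indicate a GeoMesa store; use its own provider instead")
    params + ("geotools" -> "true")
  }
}
```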
If your data store supports it, use the save() method to save features:
GeoMesaSpark(params).save(rdd, params, "locations")