17.5. GeoMesa Spark SQL on Google Cloud DataprocΒΆ
GeoMesa can run Spark SQL with Bigtable as the underlying datastore. In order to set up Spark SQL, you will need to launch a Google Cloud Dataproc cluster. First, you will need to install the Google Cloud SDK command line tools. Instructions for doing so can be found here. Ensure that you have installed all the appropriate components for working with GCP Bigtable and GCP Dataproc.
$ gcloud components install cbt
$ gcloud components install alpha
$ gcloud components install beta
Then, provision a new Bigtable instance.
$ gcloud beta bigtable instances create geomesa --cluster geomesa --cluster-zone us-east1-b --cluster-num-nodes 3 --description GeoMesa
And provision a GCP Dataproc cluster.
$ gcloud dataproc clusters create geomesa
Once you have a GCP Dataproc cluster up and running, you can scp the GeoMesa Bigtable distribution to the master node. You can determine the name of the master node by using the following command line.
$ gcloud compute instances list
Find the master node instance and log into it using gcloud ssh as follows:
$ gcloud compute --project $PROJECTID ssh --zone $ZONEID $MASTER
Now, configure the installation by downloading and unpacking the GeoMesa distribution, editing the hbase-site.xml appropriately, and including the hbase-site.xml in the spark runtime jar. First set the version you want to use:
$ export TAG="3.5.2" # note: 2.12 is the Scala build version $ export VERSION="2.12-${TAG}"
Then download and configure the distribution:
$ wget "https://github.com/locationtech/geomesa/releases/download/geomesa-${TAG}/geomesa-bigtable_${VERSION}-bin.tar.gz"
$ tar xvf geomesa-bigtable_${VERSION}-bin.tar.gz
$ ln -s geomesa-bigtable_${VERSION} geomesa
$ export PATH=$PATH:~/geomesa/bin
$ export HADOOP_HOME=/usr/lib/hadoop
$ export HBASE_HOME=/usr/lib/hbase
$ vi geomesa/conf/hbase-site.xml
$ cp geomesa/conf/hbase-site.xml geomesa/dist/spark/
$ cd geomesa/dist/spark/
$ jar uvf geomesa-bigtable-spark-runtime_${VERSION} hbase-site.xml
Download sample GDELT data and ingest it as follows.
$ geomesa-bigtable ingest -t 8 -c geomesa.gdelt -s gdelt -C gdelt \*.csv
Now, you can run a spark shell and execute Spark SQL over your GeoMesa on Bigtable instance. Set the version to your installed version of GeoMesa.
$ spark-shell --num-executors 4 --master yarn --jars file://$HOME/geomesa/dist/spark/geomesa-bigtable-spark-runtime_${VERSION}.jar,file://$HOME/geomesa/lib/bigtable-hbase-1.2-0.9.4.jar,file://$HOME/geomesa/lib/netty-tcnative-boringssl-static-1.1.33.Fork19.jar
From the Spark shell prompt.
scala> val df = spark.read.format("geomesa").option("bigtable.catalog", "geomesa.gdelt").option("geomesa.feature", "gdelt").load()
scala> df.createOrReplaceTempView("gdelt")
scala> spark.sql("SELECT actor1Name,actor2Name,geom,dtg FROM gdelt WHERE st_contains(st_geomFromWKT('POLYGON((-80 35,-70 35,-70 40,-80 40,-80 35))'),geom)").show()