8.3. Ingest Commands
These commands are used to insert and delete simple features. Required parameters are indicated with a *.
8.3.1. delete-features
Delete specific features from a schema. Note that if deleting all features, it may be faster to delete the schema and re-create it.
Argument | Description |
---|---|
-c, --catalog * | The catalog table containing schema metadata |
-f, --feature-name * | The name of the schema |
-q, --cql | CQL filter to select features to delete |
--force | Suppress confirmation prompt |
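As a minimal sketch of how the arguments above combine, the following assembles a delete-features invocation and prints it for review rather than running it. The catalog, schema name and filter are hypothetical, the geomesa-accumulo binary is assumed from the piping example later in this section, and the data store connection parameters are omitted and would need to be added.

```shell
# Hypothetical catalog, schema and filter; substitute your own values.
CATALOG="example.catalog"
FEATURE="example-features"
FILTER="dtg BEFORE 2020-01-01T00:00:00Z"

# Assemble the invocation using only the flags listed in the table above.
# Data store connection parameters are omitted here and must be added.
# The filter is wrapped in single quotes so it reaches the tool as one argument.
DELETE_CMD="geomesa-accumulo delete-features -c $CATALOG -f $FEATURE -q '$FILTER' --force"

# Dry run: print the command for review instead of executing it.
echo "$DELETE_CMD"
```

Printing the command first is a deliberate choice here: --force suppresses the confirmation prompt, so it is worth checking the filter before the command is actually run.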
8.3.2. ingest
The ingest command takes files in various formats and ingests them as SimpleFeatures in GeoMesa.
Generally, a GeoMesa ‘converter’ definition is required to map input data to SimpleFeatures. GeoMesa
supports common input formats such as delimited text (TSV, CSV), fixed width files, JSON, XML, and Avro.
The converter framework is extensible via Java SPI, to allow support for custom formats. See
GeoMesa Convert for more information on converters.
See Moving and Migrating Data for details on how the export/import commands can be used to move data between clusters.
Argument | Description |
---|---|
-c, --catalog * | The catalog table containing schema metadata |
-f, --feature-name | The name of the schema |
-s, --spec | The SimpleFeatureType specification to create |
-C, --converter | The GeoMesa converter used to create SimpleFeatures |
--converter-error-mode | Override the error mode defined by the converter |
-t, --threads | Number of parallel threads used |
--input-format | Format of input files (csv, tsv, avro, shp, json, etc) |
--no-tracking | Close the application once the ingest job has been submitted. Useful for launching jobs with a script |
--run-mode | Must be one of local, distributed, or distributedcombine |
--split-max-size | Maximum size of a split in bytes (distributed jobs) |
--src-list | Input files are text files with lists of files, one per line, to ingest |
--force | Suppress any confirmation prompts |
<files>... | Input files to ingest |
The --converter argument may be any of the following:
- The name of a GeoMesa converter already available on the classpath
- A converter configuration string
- The name of a file containing a converter configuration
If a converter is not specified, GeoMesa will attempt to infer a converter definition based on the input files. Currently this supports GeoJSON, self-describing Avro, delimited text (TSV, CSV) or Shapefiles. If GeoMesa is able to infer a schema and converter definition, the user can accept them as-is, or alternatively use them as the basis for a fully custom converter. If desired, the user can persist the inferred converter to file, which allows for easy modification and reuse. When ingesting a large data set, it can be useful to ingest a single file in local mode, using schema inference to generate the converter. The converter definition can be persisted and tweaked to satisfaction, then used for the entire data set with a distributed ingest.
See Defining Simple Feature Converters for more details on specifying the converter.
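The two-stage workflow described above (infer on one file locally, then reuse the persisted converter for a distributed ingest) might look roughly like the sketch below. All paths and names are hypothetical, the converter file location is simply wherever you chose to save the inferred definition, and connection parameters are omitted. The commands are printed rather than executed.

```shell
# Hypothetical names and paths; connection parameters are omitted.
CATALOG="example.catalog"
FEATURE="example"
SAMPLE="/data/sample/part-0000.csv"
CONVERTER="/tmp/inferred-converter.conf"   # assumed: where the inferred converter was saved
FULL_SET="hdfs:///data/full/*.csv"

# Step 1: a single file in local mode with no --converter, so inference is attempted.
STEP1="geomesa-accumulo ingest -c $CATALOG -f $FEATURE --run-mode local $SAMPLE"

# Step 2: the whole data set in distributed mode, reusing the saved converter.
# The wildcard path is single-quoted so the shell passes the pattern through unchanged.
STEP2="geomesa-accumulo ingest -c $CATALOG -f $FEATURE -C $CONVERTER --run-mode distributed '$FULL_SET'"

# Dry run: print both commands for review.
echo "$STEP1"
echo "$STEP2"
```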
The --converter-error-mode argument may be used to override the error mode defined in the converter. It must be
one of skip-bad-records or raise-errors.
If --feature-name is specified and the schema already exists, then --spec is not required. Likewise,
if a converter is not defined, the schema will be inferred alongside the converter. Otherwise, --spec may be
any of the following:
- A string of attributes, for example name:String,dtg:Date,*geom:Point:srid=4326
- The name of a SimpleFeatureType already available on the classpath
- A string of attributes, defined as a TypeSafe configuration
- The name of a file containing one of the above
If the schema doesn’t exist, the --feature-name argument is required if it is not implied by
the specification string. It may also be used to override the implied feature name.
See Defining Simple Feature Types for more details on specifying the SimpleFeatureType.
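To illustrate how --feature-name, --spec and --converter fit together, here is a hedged sketch that builds an ingest invocation from the attribute string shown above. The catalog, schema name, converter file and input path are hypothetical, and connection parameters are omitted. The command is printed rather than executed.

```shell
# Hypothetical catalog, schema name and file paths; connection parameters omitted.
CATALOG="example.catalog"
FEATURE="example"
# Attribute string from the text. Keep it quoted: an unquoted '*' may be
# expanded by the shell before the tool ever sees it.
SPEC='name:String,dtg:Date,*geom:Point:srid=4326'
CONVERTER="example-converter.conf"   # assumed: a file holding a converter configuration
INPUT="/data/example.csv"

INGEST_CMD="geomesa-accumulo ingest -c $CATALOG -f $FEATURE -s '$SPEC' -C $CONVERTER --converter-error-mode skip-bad-records $INPUT"

# Dry run: print the command for review.
echo "$INGEST_CMD"
```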
The --input-format argument can be used to specify the type of files being ingested. Currently
GeoMesa supports Avro, CSV, TSV, JSON/GeoJSON, GML, and SHP. If not specified, the input file extensions
will be used to determine the file type.
The --no-tracking argument instructs the application to close when the ingest job has been submitted rather than
tracking and displaying the progress of the ingest. This is useful when a script is submitting the job or it is
undesirable to leave the JVM running. Note that supplying this parameter does not silence the application and it will
still provide information about the status of the job submission.
The --run-mode argument can be used to run ingestion locally or distributed (using map/reduce). Note that in
order to run in distributed mode, the input files must be in HDFS. By default, input files on the local filesystem
will be ingested in local mode, and input files in HDFS will be ingested in distributed mode. If using the
distributedcombine mode, multiple files will be processed by each mapper, up to the limit specified by
--split-max-size.
The --threads argument can be used to increase local ingest speed. However, there cannot be more threads
than there are input files. The --threads argument is ignored for distributed ingest.
The --split-max-size argument can be used to control the amount of data each mapper processes. This is useful
in conjunction with the distributedcombine --run-mode when input files are small or starting a mapper
for each one becomes prohibitively slow. For example, if you have 100 5MB files (500MB in total), then a setting of
100000000 (100MB) would schedule 5 mappers.
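The arithmetic behind that example can be checked with a few lines of shell. This is only an estimate under the assumption that files pack evenly into splits and that megabytes are decimal, as the 100000000 value suggests. The number of mappers actually scheduled depends on how Hadoop combines the files.

```shell
FILE_COUNT=100
FILE_SIZE_BYTES=5000000      # 5MB per file, using decimal megabytes as in the text
SPLIT_MAX_SIZE=100000000     # the value passed to --split-max-size (100MB)

TOTAL_BYTES=$((FILE_COUNT * FILE_SIZE_BYTES))

# Ceiling division: a partially filled final split still needs its own mapper.
MAPPERS=$(( (TOTAL_BYTES + SPLIT_MAX_SIZE - 1) / SPLIT_MAX_SIZE ))

echo "total bytes: $TOTAL_BYTES, estimated mappers: $MAPPERS"
```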
The --src-list argument is useful when you have more files to ingest than the command line will allow you to
specify. This argument instructs GeoMesa to treat each input file as a newline-separated list of files. As this makes
it very easy to run ingest jobs that can take days, it’s recommended to split lists into reasonable chunks that can be
completed in a couple of hours.
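One way to break a long file list into chunks is the standard split utility, sketched below. The generated paths stand in for a real listing, the chunk size of 250 is arbitrary, and the catalog, schema and converter names are hypothetical. The ingest commands are printed rather than executed, with connection parameters omitted.

```shell
# Work in a scratch directory so the example leaves nothing behind elsewhere.
WORK_DIR=$(mktemp -d)
cd "$WORK_DIR" || exit 1

# Stand-in for a real listing: 1000 hypothetical HDFS paths, one per line.
awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "hdfs:///data/in/part-%04d.csv\n", i }' > all-files.txt

# Break the list into chunks of 250 paths each: chunk-aa, chunk-ab, ...
split -l 250 all-files.txt chunk-

CHUNK_COUNT=$(ls chunk-* | wc -l | tr -d ' ')
echo "created $CHUNK_COUNT list files"

# Dry run: print one ingest command per list file.
for list in chunk-*; do
  echo "geomesa-accumulo ingest -c example.catalog -f example -C example-converter.conf --src-list $WORK_DIR/$list"
done
```

Each printed command can then be run as its own job, so a failure only requires re-running one chunk rather than the whole list.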
The --force argument can be used to suppress any confirmation prompts (generally from converter inference),
which can be useful when scripting commands.
The <files>... argument specifies the files to be ingested. * may be used as a wild card in file paths.
GeoMesa can handle gzip, bzip and xz file compression as long as the file extensions match the
compression type. GeoMesa supports ingesting files from local disks or HDFS. In addition, Amazon’s S3
and Microsoft’s Azure file systems are supported with a few configuration changes. See
Remote File System Support for details. Note: The behavior of this argument is changed by the --src-list
argument.
Instead of specifying files, input data may be piped directly to the ingest command using stdin shell redirection. Note that this will only work in local mode, and will only use a single thread for ingestion. Schema inference is disabled in this case, and progress indicators may not be entirely accurate, as the total size isn’t known up front. For example:
cat foo.csv | geomesa-accumulo ingest ...
For local ingests, feature writers will be pooled and only flushed periodically. The frequency of flushes can be
controlled via the system property geomesa.ingest.local.batch.size, and defaults to every 20,000 features.