14.4. Accumulo Command-Line Tools

The GeoMesa Accumulo distribution includes a set of command-line tools for feature management, ingest, export and debugging.

To install the tools, see Setting up the Accumulo Command Line Tools.

Once installed, the tools should be available through the command geomesa-accumulo:

$ geomesa-accumulo
INFO  Usage: geomesa-accumulo [command] [command options]
  Commands:
    ...

Commands that are common to multiple back ends are described in Command-Line Tools. The commands here are Accumulo-specific.

14.4.1. General Arguments

Most commands require you to specify the connection to Accumulo. This generally includes the instance name, ZooKeeper hosts, username, and password (or a Kerberos keytab file). Specify the instance with --instance-name and --zookeepers, and the username and password with --user and --password. The password argument may be omitted in order to avoid plaintext credentials in the bash history and process list; in this case, the password will be prompted for interactively. To use Kerberos authentication instead of a password, use --keytab with the path to a Kerberos keytab file containing an entry for the specified user. Since a keytab file allows authentication without any further credentials, it should be protected appropriately.
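
For example, the following invocation lists the schemas in a catalog table; the connection values shown are placeholders:

$ geomesa-accumulo get-type-names \
    --instance-name myInstance \
    --zookeepers zoo1,zoo2,zoo3 \
    --user myUser \
    --password myPassword \
    --catalog test_catalog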

Instead of specifying the cluster connection explicitly, an appropriate accumulo-client.properties may be added to the classpath. See the Accumulo documentation for information on the necessary configuration keys. Any explicit command-line arguments will take precedence over the configuration file.
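
As a rough sketch, a minimal accumulo-client.properties might look like the following; the values are placeholders, and the Accumulo documentation is the authority on the available keys:

# accumulo-client.properties (placeholder values)
instance.name=myInstance
instance.zookeepers=zoo1,zoo2,zoo3
auth.type=password
auth.principal=myUser
auth.token=myPassword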

The --auths argument corresponds to the AccumuloDataStore parameter geomesa.security.auths. See Data Security for more information.
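
For example, to run an export with specific scan authorizations (the catalog, schema, and authorization values here are placeholders):

$ geomesa-accumulo export -c test_catalog -f test_feature --auths user,admin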

14.4.2. Commands

14.4.2.1. add-index

Add or update indices for an existing feature type. This can be used to upgrade in place, converting an older index format to the latest version. See Upgrading Existing Indices for more information.

Argument                Description
-c, --catalog *         The catalog table containing schema metadata
-f, --feature-name *    The name of the schema
--index *               The name of the index to add (z2, z3, etc.)
-q, --cql               A filter to apply for back-filling data
--no-back-fill          Skip back-filling data

The --index argument specifies the index to add. It must be the name of one of the known index types, e.g. z3 or xz3. See Index Overview for available indices.

By default, the command will launch a map/reduce job to populate the new index with any existing features in the schema. For large data sets, this may not be desired. The --no-back-fill argument can be used to disable index population entirely, or --cql can be used to populate the index with a subset of the existing features.
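
For example, the following sketch adds a z3 index and back-fills only recent data; the catalog, schema, and filter values are placeholders, and dtg stands in for the schema's date attribute:

$ geomesa-accumulo add-index -c test_catalog -f test_feature --index z3 \
    --cql "dtg AFTER 2024-01-01T00:00:00Z"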

When running this command, ensure that the appropriate authorizations and visibilities are set. Otherwise data might not be back-filled correctly.

14.4.2.2. add-attribute-index

Add an index on an attribute. Attributes can be indexed individually during schema creation; this command can add a new index to an existing schema. See Attribute Index for more information on indices.

This command is a convenience wrapper for launching the map/reduce job described in Attribute Indexing.

Argument                Description
-c, --catalog *         The catalog table containing schema metadata
-f, --feature-name *    The name of the schema
-a, --attributes *      Attribute(s) to index, comma-separated
--coverage *            Type of index, either join or full

For a description of index coverage, see Attribute Indices.
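
For example, to add a join index on a hypothetical name attribute (catalog and schema names are placeholders):

$ geomesa-accumulo add-attribute-index -c test_catalog -f test_feature \
    --attributes name --coverage join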

14.4.2.3. bulk-copy

The bulk copy command directly copies Accumulo RFiles between two feature types, bypassing the normal write path. The main use case is to move data between different storage tiers, e.g. HDFS and S3. See Bulk Ingest in the Accumulo documentation for additional details.

Warning

The two feature types must be identical.

Argument                Description
--from-catalog *        Catalog table containing the source feature type
--from-instance *       Source Accumulo instance name
--from-zookeepers *     Zookeepers for the source instance (host[:port], comma-separated)
--from-user *           User name for the source instance
--from-password         Connection password for the source instance
--from-keytab           Path to a Kerberos keytab file for the source instance (instead of using a password)
--from-config           Additional Hadoop configuration file(s) to use for the source instance
--to-catalog *          Catalog table containing the destination feature type
--to-instance *         Destination Accumulo instance name
--to-zookeepers *       Zookeepers for the destination instance (host[:port], comma-separated)
--to-user *             User name for the destination instance
--to-password           Connection password for the destination instance
--to-keytab             Path to a Kerberos keytab file for the destination instance (instead of using a password)
--to-config             Additional Hadoop configuration file(s) to use for the destination instance
-f, --feature-name *    The name of the schema to copy
--export-path *         HDFS path to use for file export - the scheme and authority (e.g. bucket name) must match the destination table filesystem
--partition             Partition(s) to copy (if the schema is partitioned)
--partition-value       Value(s) used to indicate partitions to copy (e.g. 2024-01-01T00:00:00.000Z) (if the schema is partitioned)
-t, --threads           Number of index tables to copy concurrently, default 1
--file-threads          Number of files to copy concurrently, per table, default 2
--distcp                Use Hadoop DistCp to move files from one cluster to the other, instead of normal file copies
--resume                Resume a previously interrupted run from where it left off

Note

--partition and/or --partition-value may be specified multiple times in order to copy multiple partitions, or omitted to copy all existing partitions.
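
As a sketch, a copy from an HDFS-backed cluster to an S3-backed cluster might look like the following; all connection, schema, and path values are placeholders:

$ geomesa-accumulo bulk-copy \
    --from-instance hdfsInstance --from-zookeepers zoo1,zoo2,zoo3 \
    --from-user myUser --from-password myPassword --from-catalog test_catalog \
    --to-instance s3Instance --to-zookeepers zoo4,zoo5,zoo6 \
    --to-user myUser --to-password myPassword --to-catalog test_catalog \
    --feature-name test_feature \
    --export-path s3a://my-bucket/geomesa/export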

14.4.2.4. bulk-ingest

The bulk ingest command will ingest directly to Accumulo RFiles and then import the RFiles into Accumulo, bypassing the normal write path. See Bulk Ingest in the Accumulo documentation for additional details.

The data to be ingested must be in the same distributed file system that Accumulo is using, and the ingest must run in distributed mode as a map/reduce job.

In order to run efficiently, you should ensure that the data tables have appropriate splits, based on your input. This will avoid creating extremely large files during the ingest, and will also prevent the cluster from having to subsequently split the RFiles. See Configuring Index Splits for more information.

Note that some of the options below are inherited from the regular ingest command, but are not relevant to bulk ingest. See ingest for additional details on the available options.

Argument                  Description
-c, --catalog *           The catalog table containing schema metadata
--output *                The output directory used to write out RFiles
-f, --feature-name        The name of the schema
-s, --spec                The SimpleFeatureType specification to create
-C, --converter           The GeoMesa converter used to create SimpleFeatures
--converter-error-mode    Override the error mode defined by the converter
-q, --cql                 If using a partitioned store, a filter that covers the ingest data
-t, --threads             Number of parallel threads used
--input-format            Format of the input files (csv, tsv, avro, shp, json, etc.)
--index                   Write to a particular GeoMesa index, instead of all indices
--temp-path               A temporary path to write the output; when using Accumulo on S3, it may be faster to write the output to HDFS first using this parameter
--no-tracking             Exit the application once the ingest job is submitted; note that this requires manually importing the resulting RFiles
--run-mode                Must be distributed for bulk ingest
--split-max-size          Maximum size of a split in bytes (distributed jobs)
--src-list                Treat the input files as text files containing lists of files to ingest, one per line
--skip-import             Generate the RFiles but skip the bulk import into Accumulo
--force                   Suppress any confirmation prompts
<files>...                Input files to ingest

The --output directory will be interpreted as a distributed file system path. If it already exists, the user will be prompted to delete it before running the ingest.

The --cql parameter is required if using a partitioned schema (see Configuring Partitioned Indices for details). The filter must cover the partitions for all of the input data, so that the partition tables can be created appropriately. Any feature that doesn’t match the filter or doesn’t correspond to an existing table will fail to be ingested.

--skip-import can be used to skip the import of the RFiles into Accumulo. The files can be imported later through the importdirectory command in the Accumulo shell. Note that if --no-tracking is specified, the import will be skipped regardless.
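
As a sketch, a distributed bulk ingest might look like the following; the paths, converter, and schema names are placeholders:

$ geomesa-accumulo bulk-ingest -c test_catalog -f test_feature \
    --converter my_converter --input-format csv \
    --output hdfs://namenode:8020/tmp/geomesa-rfiles \
    --run-mode distributed \
    hdfs://namenode:8020/data/input/example.csv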

14.4.2.5. compact

Incrementally compact tables for a given feature type. Compactions in Accumulo will merge multiple data files into a single file, which has the side effect of permanently deleting rows which have been marked for deletion. Compactions can be triggered through the Accumulo shell; however queuing up too many compactions at once can impact the performance of a cluster. This command will handle compacting all the tables for a given feature type, and throttle the compactions so that only a few are running at one time.

Argument                Description
-c, --catalog *         The catalog table containing schema metadata
-f, --feature-name *    The name of the schema
--threads               Number of ranges to compact simultaneously, default 4
--from                  How long ago to compact data, based on the default date attribute, relative to the current time, e.g. '1 day', '2 weeks and 1 hour', etc.
--duration              Amount of time to compact data, based on the default date attribute, relative to --from, e.g. '1 day', '2 weeks and 1 hour', etc.
--z3-feature-ids        Indicates that feature IDs were written using the Z3FeatureIdGenerator; this allows optimization of compactions on the ID table, based on the configured time. See geomesa.feature.id-generator for more information

The --from and --duration parameters can be used to reduce the number of files that need to be compacted, based on the default date attribute for the schema. Due to table keys, this is mainly useful for the Z3 index, and the ID index when used with --z3-feature-ids. Other indices will typically be compacted in full, as they are not partitioned by date.

This command is particularly useful when using Feature Expiration, to ensure that expired rows are physically deleted from disk. In this scenario, the --from parameter should be set to the age-off period, and the --duration parameter should be set based on how often compactions are run. The intent is to only compact the data that may have aged-off since the last compaction. Note that the time periods align with attribute-based age-off; ingest time age-off may need a time buffer, assuming some relationship between ingest time and the default date attribute.
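
For example, with a one-day age-off and compactions run hourly, an invocation might look like the following (catalog and schema names are placeholders):

$ geomesa-accumulo compact -c test_catalog -f test_feature \
    --from '1 day' --duration '1 hour'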

This command can also be used to speed up queries by removing entries that are duplicated or marked for deletion. This may be useful for a static data set, which will not be automatically compacted by Accumulo once the size stops growing. In this scenario, the --from and --duration parameters can be omitted, so that the entire data set is compacted.

14.4.2.6. configure-age-off

List, add or remove age-off on a given feature type. See Feature Expiration for more information.

Warning

Any manually configured age-off iterators should be removed before using this command, as they may not operate correctly due to conflicting configuration names.

Argument                Description
-c, --catalog *         The catalog table containing schema metadata
-f, --feature-name *    The name of the schema
-l, --list              List any age-off configured for the schema
-r, --remove            Remove age-off for the schema
-s, --set               Set age-off for the schema (requires --expiry)
-e, --expiry            Duration before entries are aged off ('1 day', '2 weeks and 1 hour', etc.)
--dtg                   Use attribute-based age-off on the specified date field

The --list argument will display any configured age-off:

$ geomesa-accumulo configure-age-off -c test_catalog -f test_feature --list
INFO  Attribute age-off: None
INFO  Timestamp age-off: name:age-off, priority:10, class:org.locationtech.geomesa.accumulo.iterators.AgeOffIterator, properties:{retention=PT1M}

The --remove argument will remove any configured age-off:

$ geomesa-accumulo configure-age-off -c test_catalog -f test_feature --remove

The --set argument will configure age-off. This will remove any existing age-off configuration and replace it with the new specification. When using --set, --expiry must also be provided. --expiry can be any time duration string, specified in natural language.

If --dtg is provided, age-off will be based on the specified date-type attribute:

$ geomesa-accumulo configure-age-off -c test_catalog -f test_feature --set --expiry '1 day' --dtg my_date_attribute

Otherwise, age-off will be based on ingest time:

$ geomesa-accumulo configure-age-off -c test_catalog -f test_feature --set --expiry '1 day'

Warning

Ingest time expiration requires that logical timestamps are disabled in the schema. See Logical Timestamps for more information.

14.4.2.7. configure-stats

List, add, or remove the stats iterator configuration on a given catalog table. GeoMesa automatically configures an iterator on the summary statistics table (_stats). Generally this does not need to be modified; however, if the Accumulo classpath is misconfigured, or data becomes corrupted, it may be impossible to delete the table without first removing the iterator configuration.

Argument            Description
-c, --catalog *     The catalog table containing schema metadata
-l, --list          List any stats iterator configured for the catalog table
-r, --remove        Remove the stats iterator configuration for the catalog table
-a, --add           Add the stats iterator configuration for the catalog table

The --list argument will display any configured stats iterator.

The --remove argument will remove any configured stats iterator.

The --add argument will add the stats iterator.
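
For example, to display the current configuration (the catalog name is a placeholder):

$ geomesa-accumulo configure-stats -c test_catalog --list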

14.4.2.8. configure-table

This command will list and update properties on the Accumulo tables used by GeoMesa. It has two sub-commands:

  • list List the configuration options for a table

  • update Update a given configuration option for a table

To invoke the command, use the command name followed by the subcommand, then any arguments. For example:

$ geomesa-accumulo configure-table list --catalog ...

Argument                Description
-c, --catalog *         The catalog table containing schema metadata
-f, --feature-name *    The name of the schema
--index *               The index table to examine/update (z2, z3, etc.)
-k, --key               Property name to operate on (required for the update sub-command)
-v, --value *           Property value to set (only for the update sub-command)

The --index argument specifies the index to examine. It must be the name of one of the known index types, e.g. z3 or xz3. See Index Overview for available indices. Note that not all schemas will have all index types.

The --key argument can be used during both list and update. For list, it will filter the properties to only show the one requested. For update, it is required as the property to update.

The --value argument is only used during update.
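
For example, the following sketch inspects and then updates the z3 table for a schema; the names are placeholders, and table.bloom.enabled is a standard Accumulo table property used here for illustration:

$ geomesa-accumulo configure-table list -c test_catalog -f test_feature --index z3
$ geomesa-accumulo configure-table update -c test_catalog -f test_feature \
    --index z3 --key table.bloom.enabled --value true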

14.4.2.9. query-audit-logs

This command will query the audit logs produced by GeoMesa.

Argument                Description
-c, --catalog *         The catalog table containing schema metadata
-f, --feature-name *    The name of the schema
-b, --begin             Lower bound (inclusive) on the date of log entries to return, in ISO 8601 format
-e, --end               Upper bound (exclusive) on the date of log entries to return, in ISO 8601 format
-q, --cql               CQL predicate used to filter log entries
--output-format         Output format for results, either csv (default) or json

The --begin and --end arguments can be used to filter logs by date (based on when the query completed). For more advanced filtering, the --cql argument accepts GeoTools filter expressions. The schema to use for filtering is:

user:String,filter:String,hints:String:json=true,metadata:String:json=true,start:Date,end:Date,planTimeMillis:Long,scanTimeMillis:Long,hits:Long

The --output-format argument can be used to return logs as CSV or as JSON (JSON lines).
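
For example, the following sketch exports one month of log entries for a hypothetical user as JSON; the names and dates are placeholders:

$ geomesa-accumulo query-audit-logs -c test_catalog -f test_feature \
    --begin 2024-01-01T00:00:00Z --end 2024-02-01T00:00:00Z \
    --cql "user = 'admin'" --output-format json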

14.4.2.10. stats-analyze

This command will re-generate the cached data statistics maintained by GeoMesa. This may be desirable for several reasons:

  • Stats are compiled incrementally during ingestion, which can sometimes lead to reduced accuracy

  • Most stats are not updated when features are deleted, as they do not maintain enough information to handle deletes

  • Errors or data corruption can lead to stats becoming unreadable

Argument                Description
-c, --catalog *         The catalog table containing schema metadata
-f, --feature-name *    The name of the schema
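
For example (catalog and schema names are placeholders):

$ geomesa-accumulo stats-analyze -c test_catalog -f test_feature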