How to Set Up Elasticdump to Download Data
Elasticdump is a tool for moving and saving indices. It can be used to download data from, or restore data to, an OpenSearch instance.
Reference: https://www.npmjs.com/package/elasticdump
Notes & Recommendations
- When dumping data, be sure the target has sufficient storage space.
- Similarly, when restoring data to an OpenSearch index, be sure that there are a sufficient number of data nodes and shards. Otherwise, data will load extremely slowly.
- If you get the error "Error Emitted => unable to verify the first certificate", set the environment variable NODE_TLS_REJECT_UNAUTHORIZED=0.
- Elasticdump has a built-in --fsCompress option for compressing the data with gzip. In testing, it seemed faster to dump the data in raw format and then use pigz (or the equivalent) to compress it, as sketched below.
- Dumping indices in parallel did not seem to work well for indices with large numbers of documents (e.g., over 100 million). Usually, all but one of the indices would error out during the dump. It's possible that one or more of Elasticdump's knobs could be adjusted to prevent this or to improve performance; unfortunately, Elasticdump's documentation provides little guidance on tuning for very large indices.
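If you take the raw-dump-then-compress route, a helper along these lines can compress the dump files afterward. This is a minimal sketch, assuming pigz is installed on the host and that the dump files carry a .json extension; the output path is a placeholder.

# Compress raw dump files with pigz after the dump completes.
# Assumes pigz is installed; the path and *.json pattern are assumptions.
import glob
import subprocess

OUTPUT = "/path/to/work/directory/output"

for dump_file in glob.glob(OUTPUT + "/*.json"):
    # pigz compresses in place, replacing <name>.json with <name>.json.gz
    subprocess.run(["pigz", dump_file], check=True)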
Sample Script
Here's a simple Python script that uses multielasticdump to download a set of indices in parallel. The time required depends on the number of indices and the number of documents in each index, so you may need to experiment a bit to find the right settings for your environment.
Description of selected parameters:
- --direction: defaults to "dump"; set to "load" if restoring data to an OpenSearch instance
- --input: source environment (can be an OpenSearch URL or local files)
- --output: target environment (can be an OpenSearch URL or local files)
- --match: regex to match indices
- --noRefresh: disable input refresh; recommended for large indices
- --limit: number of docs to move in each batch
- --parallel: number of forks to run simultaneously
- --intervalCap: max requests within a concurrency interval
- --scrollTime: dumps will be resumed if scrollTime has not expired
- --ignoreChildError: allows the operation to continue if a child throws an error
- --prefix: adds a prefix to the index being created
- --suffix: adds a suffix to the index being created
Use the following syntax for HTTP basic authentication:
https://ES_USERNAME:ES_PASSWORD@elasticsearch.handu-phx.handu.developers.oracledx.com
See the Elasticdump package page linked above for more information.
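One caveat: if the username or password contains characters that are special in URLs (such as @, /, or :), they must be percent-encoded or the URL will not parse. A minimal sketch in Python (the credentials below are placeholders):

# Percent-encode credentials before embedding them in the connection URL.
from urllib.parse import quote

USER = "ES_USERNAME_HERE"
PWD = "ES_PASSWORD_HERE"  # may contain URL-special characters such as @ or /
ENV = "elasticsearch.handu-phx.handu.developers.oracledx.com"

INPUT = "https://%s:%s@%s" % (quote(USER, safe=""), quote(PWD, safe=""), ENV)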
#!/usr/local/bin/python
# Helper script to dump OpenSearch data to a local host in raw json format
import os, time
DUMP_PATH = "/path/to/node_modules/elasticdump/bin"
WORKING_DIR = "/path/to/work/directory"
DUMP_LOG = WORKING_DIR + "/dump.log"
######################################
# INPUT
######################################
ENV = "elasticsearch.handu-phx.handu.developers.oracledx.com"
USER = "ES_USERNAME_HERE"
PWD = "ES_PASSWORD_HERE"
INDICES = "^.*myIndex-2019.11.*$"
INPUT = "https://%s:%s@%s" % (USER, PWD, ENV)
######################################
OUTPUT = "%s/output" % WORKING_DIR
######################################
if not os.path.isdir(OUTPUT):
os.mkdir(OUTPUT)
# Redirect stdout and stderr to the log file; os.system runs under /bin/sh,
# which does not support bash's &> shortcut.
CMD = "%s/multielasticdump --direction=dump --input=%s --output=%s --match='%s' --noRefresh --limit 5000 --parallel=6 --intervalCap=25 --scrollTime=60m --ignoreChildError > %s 2>&1" % (DUMP_PATH, INPUT, OUTPUT, INDICES, DUMP_LOG)
# Environment variable required by elasticdump
os.environ["NODE_TLS_REJECT_UNAUTHORIZED"] = "0"
start = time.time()
os.system(CMD)
elapsed = time.time() - start
SEPARATOR = "============================"
with open(DUMP_LOG, "a") as log:
    log.write(SEPARATOR + "\nCommand executed:\n" + CMD + "\n\n")
    log.write(SEPARATOR + "\nDump completed.\n")
    log.write("Elapsed time: " + str(elapsed) + " seconds\n" + SEPARATOR + "\n")
Uploading Data to OpenSearch
The script above is easily modified to load data back into an OpenSearch instance. Here's a sample command to upload data to a running OpenSearch instance:
multielasticdump --direction=load --input=/path/to/dumped/data --output=https://ES_USER:ES_PASSWORD@elasticsearch.handu-phx.handu.developers.oracledx.com --match='^.*myIndex-2019.11.*$' --limit=5000 --parallel=6 --intervalCap=25
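For reference, here is a minimal sketch of the dump script reworked for loading. It only reverses the direction and swaps the input and output; the paths and credentials are placeholders, and the tuning flags mirror the command above.

#!/usr/local/bin/python
# Helper script to load previously dumped data back into an OpenSearch instance
import os

DUMP_PATH = "/path/to/node_modules/elasticdump/bin"
INPUT = "/path/to/dumped/data"  # directory produced by the dump script
ENV = "elasticsearch.handu-phx.handu.developers.oracledx.com"
USER = "ES_USERNAME_HERE"
PWD = "ES_PASSWORD_HERE"
OUTPUT = "https://%s:%s@%s" % (USER, PWD, ENV)
INDICES = "^.*myIndex-2019.11.*$"

CMD = ("%s/multielasticdump --direction=load --input=%s --output=%s "
       "--match='%s' --limit=5000 --parallel=6 --intervalCap=25") % (
    DUMP_PATH, INPUT, OUTPUT, INDICES)

# Required when the cluster presents a certificate the client cannot verify
os.environ["NODE_TLS_REJECT_UNAUTHORIZED"] = "0"
os.system(CMD)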