How to Set Up Elasticdump to Download Data
Elasticdump is a tool for moving and saving indices. It can be used to download data from, or restore data to, an OpenSearch instance.
Reference: https://www.npmjs.com/package/elasticdump
Notes & Recommendations
- When dumping data, be sure the target has sufficient storage space.
- Similarly, when restoring data to an OpenSearch index, be sure that there are a sufficient number of data nodes and shards. Otherwise, data will load extremely slowly.
- If you get the error "Error Emitted => unable to verify the first certificate", set the environment variable NODE_TLS_REJECT_UNAUTHORIZED=0.
- Elasticdump has a built-in --fsCompress option for compressing the data with gzip. In testing, it seemed faster to dump the data in raw format and then use pigz (or the equivalent) to compress it, as sketched below.
- Dumping indices in parallel did not seem to work well for indices with large numbers of documents (e.g., over 100 million). Usually, all but one of the indices would error out during the dump. It's possible that one or more of Elasticdump's knobs could be adjusted to prevent this or to improve performance; unfortunately, Elasticdump's documentation provides little guidance on tuning for very large indices.
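If you take the raw-dump-then-compress route, a helper along these lines can compress the dump files afterward. This is a minimal sketch, assuming pigz is installed on the host and that the dump files carry a .json extension; the output path is a placeholder.

# Compress raw dump files with pigz after the dump completes.
# Assumes pigz is installed; the path and *.json pattern are assumptions.
import glob
import subprocess

OUTPUT = "/path/to/work/directory/output"

for dump_file in glob.glob(OUTPUT + "/*.json"):
    # pigz compresses in place, replacing <name>.json with <name>.json.gz
    subprocess.run(["pigz", dump_file], check=True)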
Sample Script
Here's a simple Python script that uses multielasticdump to download a set of indices in parallel. The time required depends on the number of indices and the number of documents in each index, so you may need to experiment a bit to find the right settings for your environment.
Description of selected parameters:
- --direction: defaults to "dump"; set to "load" if restoring data to an OpenSearch instance
- --input: source environment (can be an OpenSearch URL or local files)
- --output: target environment (can be an OpenSearch URL or local files)
- --match: regex to match indices
- --noRefresh: disable input refresh; recommended for large indices
- --limit: number of docs to move in each batch
- --parallel: number of forks to run simultaneously
- --intervalCap: max requests within a concurrency interval
- --scrollTime: dumps will be resumed if scrollTime has not expired
- --ignoreChildError: allows the operation to continue if a child throws an error
- --prefix: adds a prefix to the index being created
- --suffix: adds a suffix to the index being created
Use the following syntax for HTTP basic authentication:
https://ES_USERNAME:ES_PASSWORD@elasticsearch.handu-phx.handu.developers.oracledx.com
See the Elasticdump package page linked above for more information.
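One caveat: if the username or password contains characters that are special in URLs (such as @, /, or :), they must be percent-encoded or the URL will not parse. A minimal sketch in Python (the credentials below are placeholders):

# Percent-encode credentials before embedding them in the connection URL.
from urllib.parse import quote

USER = "ES_USERNAME_HERE"
PWD = "ES_PASSWORD_HERE"  # may contain URL-special characters such as @ or /
ENV = "elasticsearch.handu-phx.handu.developers.oracledx.com"

INPUT = "https://%s:%s@%s" % (quote(USER, safe=""), quote(PWD, safe=""), ENV)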
#!/usr/local/bin/python
# Helper script to dump OpenSearch data to a local host in raw json format
import os, time
DUMP_PATH = "/path/to/node_modules/elasticdump/bin"
WORKING_DIR = "/path/to/work/directory"
DUMP_LOG = WORKING_DIR + "/dump.log"
######################################
# INPUT
######################################
ENV = "elasticsearch.handu-phx.handu.developers.oracledx.com"
USER = "ES_USERNAME_HERE"
PWD = "ES_PASSWORD_HERE"
INDICES = "^.*myIndex-2019.11.*$"
INPUT = "https://%s:%s@%s" % (USER, PWD, ENV)
######################################
OUTPUT = "%s/output" % WORKING_DIR
######################################
if not os.path.isdir(OUTPUT):
os.mkdir(OUTPUT)
# Redirect stdout and stderr to the log file; os.system runs under /bin/sh,
# which does not support bash's &> shortcut.
CMD = "%s/multielasticdump --direction=dump --input=%s --output=%s --match='%s' --noRefresh --limit 5000 --parallel=6 --intervalCap=25 --scrollTime=60m --ignoreChildError > %s 2>&1" % (DUMP_PATH, INPUT, OUTPUT, INDICES, DUMP_LOG)
# Environment variable required by elasticdump
os.environ["NODE_TLS_REJECT_UNAUTHORIZED"] = "0"
start = time.time()
os.system(CMD)
elapsed = time.time() - start
SEPARATOR = "============================"
with open(DUMP_LOG, "a") as log:
    log.write(SEPARATOR + "\nCommand executed:\n" + CMD + "\n\n")
    log.write(SEPARATOR + "\nDump completed.\n")
    log.write("Elapsed time: " + str(elapsed) + " seconds\n" + SEPARATOR + "\n")
Uploading Data to OpenSearch
The script above is easily modified to load data back into an OpenSearch instance. Here's a sample command to upload data to a running OpenSearch instance:
multielasticdump --direction=load --input=/path/to/dumped/data --output=https://ES_USER:ES_PASSWORD@elasticsearch.handu-phx.handu.developers.oracledx.com --match='^.*myIndex-2019.11.*$' --limit=5000 --parallel=6 --intervalCap=25
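For reference, here is a minimal sketch of the dump script reworked for loading. It only reverses the direction and swaps the input and output; the paths and credentials are placeholders, and the tuning flags mirror the command above.

#!/usr/local/bin/python
# Helper script to load previously dumped data back into an OpenSearch instance
import os

DUMP_PATH = "/path/to/node_modules/elasticdump/bin"
INPUT = "/path/to/dumped/data"  # directory produced by the dump script
ENV = "elasticsearch.handu-phx.handu.developers.oracledx.com"
USER = "ES_USERNAME_HERE"
PWD = "ES_PASSWORD_HERE"
OUTPUT = "https://%s:%s@%s" % (USER, PWD, ENV)
INDICES = "^.*myIndex-2019.11.*$"

CMD = ("%s/multielasticdump --direction=load --input=%s --output=%s "
       "--match='%s' --limit=5000 --parallel=6 --intervalCap=25") % (
    DUMP_PATH, INPUT, OUTPUT, INDICES)

# Required when the cluster presents a certificate the client cannot verify
os.environ["NODE_TLS_REJECT_UNAUTHORIZED"] = "0"
os.system(CMD)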