Elasticsearch with AWS spot instances

Why Elasticsearch?
Elasticsearch is a distributed, multitenant-capable full-text search engine that we recently used to optimize our analytics. A number of queries that are quite expensive in Postgres are extremely fast in Elasticsearch, and document-based storage allowed us to store all relevant data in a single index (table).
Elasticsearch (virtual?) hardware needs
However, Elasticsearch generally prefers relatively powerful hardware, as is discussed here:
A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB machines are also common. Less than 8 GB tends to be counterproductive (you end up needing many, many small machines)...
In addition, because it is distributed, it's also recommended that you have multiple hosts. Two hosts is the bare minimum for getting a green status, but it's recommended to have at least three.
Doing the math
In EC2, an m3.2xlarge (30 GB of RAM) costs $0.532/hr (in US East, at the time of this writing). For three nodes, that comes out to $1149.12/mo, which adds up to ~$14,000 a year!
Instance                 Per year   Per month   Per day   Per hour
m3.2xlarge (reserved)    $2840      $233        $7.7      $0.3242
Total (x3)               $8520      $699        $23.32    $0.971
Of course, you can save some money by purchasing reserved instances, but at $2840/yr each ($8520 for three), it's still a pretty expensive cluster.
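The on-demand figures above can be reproduced with a few lines of shell. This is only a sketch: the function names are made up for illustration, and it assumes a 30-day month, which is the assumption that yields the $1149.12/mo figure.

```shell
#!/bin/bash
# Hypothetical helpers: cost of N instances at an hourly rate,
# assuming a 30-day month and a 12-month year.
monthly_cost() {
  # $1 = hourly price in USD, $2 = number of instances
  LC_ALL=C awk -v rate="$1" -v n="$2" 'BEGIN { printf "%.2f\n", rate * n * 24 * 30 }'
}

yearly_cost() {
  # $1 = hourly price in USD, $2 = number of instances
  LC_ALL=C awk -v rate="$1" -v n="$2" 'BEGIN { printf "%.2f\n", rate * n * 24 * 30 * 12 }'
}

monthly_cost 0.532 3   # three on-demand m3.2xlarge nodes, per month
yearly_cost 0.532 3    # the same cluster, per year
```

Swapping in the reserved or spot hourly rate gives a rough comparison for the other pricing models.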
Why spot instances?
However, because Elasticsearch is distributed, it is capable of coping with failure. If a node is lost, or goes down for some reason, the remaining nodes should have all the data to continue as if nothing has happened.
This makes Elasticsearch a really good candidate for spot instances in AWS. Spot instances are EC2 instances that you bid for, and can come and go as the market price changes. So, when the market price exceeds your bid price, your instance shuts down, but when the market price is beneath your bid price, your instance comes back up.
However, failure of all nodes in the cluster isn't acceptable, so we'll still need to have at least one reserved instance. The cost breakdown looks something like this:
Instance                 Per year   Per month   Per day   Per hour
m3.2xlarge (reserved)    $2840      $233        $7.7      $0.3242
m3.2xlarge (spot)        $744       $61.2       $2.04     ~$0.085
m3.2xlarge (spot)        $744       $61.2       $2.04     ~$0.085
Total                    $4328      $355.4      $11.78    $0.4942
Spot instances don't cost exactly $0.085/hr, but looking at their pricing history, it seems like a reasonable estimate. What you actually pay also depends on your bid price: if the market price exceeds your bid, the instance shuts down and you won't be paying for it.
What is clear, however, is that significant savings can be had by bringing spot instances into the mix.
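To put a number on those savings, here is a rough sketch using the yearly figures quoted in this post ($2840 reserved, $744 estimated spot). The function name is hypothetical and the spot figure is only an estimate, so treat the result as approximate.

```shell
#!/bin/bash
# Hypothetical helper: percentage saved when moving from a baseline
# yearly cost to a new yearly cost.
savings_percent() {
  # $1 = baseline yearly cost, $2 = new yearly cost
  LC_ALL=C awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

# One reserved node plus two spot nodes, using the estimates above.
mixed=$((2840 + 2 * 744))
echo "mixed cluster: \$${mixed}/yr"

# Savings relative to three reserved instances ($8520/yr).
savings_percent 8520 "$mixed"
```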
Setting bid prices
There is definitely a way to set an optimal bid price, but to put it simply: you're better off bidding higher and paying a little more for that one hour when the market spikes. While Elasticsearch can handle failure, it's still generally preferable to avoid it.
Tip: If you can, spread your spot instances across different AZs and bid prices. This hopefully keeps a failure isolated to one price or one AZ instead of taking out multiple nodes.
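One way to keep that diversification explicit is to generate one request per zone and bid pair. The sketch below is a dry run that only prints the commands. The zone names, bid values, and spec-<zone>.json file names are assumptions for illustration, not values from our setup.

```shell
#!/bin/bash
# Dry run: print one persistent spot request per "zone:bid" pair.
# Each request is assumed to read its launch specification from a
# hypothetical per-zone file named spec-<zone>.json.
plan_spot_requests() {
  local pair zone bid
  for pair in "$@"; do
    zone="${pair%%:*}"   # text before the first colon
    bid="${pair##*:}"    # text after the last colon
    echo "aws ec2 request-spot-instances --spot-price \"$bid\" --type persistent --instance-count 1 --launch-specification file://spec-$zone.json"
  done
}

# Example pairs; both the zones and the bids are placeholders.
plan_spot_requests us-east-1a:0.30 us-east-1b:0.40
```

Reviewing the printed commands before running them makes it easy to confirm that no two nodes share both a zone and a bid.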
Awesome, where do I start?
While in theory there's lots of money to be saved with spot instances, the possibility that a node will go down at any time means that the process by which a node is brought up needs to be automated. This means the spot instance needs to be created with installation instructions that will bring the node up, install Elasticsearch, update the configuration, and add it to the cluster.
Using user-data
When you launch an instance in Amazon EC2, you have the option of passing user data to the instance that can be used to perform common automated configuration tasks and even run scripts after the instance starts
We'll want to set up something along these lines. At the very least, we'll obviously need to install Elasticsearch and run it, but we'll also need Java. Let's create a shell script called userdata.sh:
#!/bin/bash
add-apt-repository -y ppa:webupd8team/java
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://packages.elastic.co/elasticsearch/1.5/debian stable main" | sudo tee -a /etc/apt/sources.list
apt-get update && apt-get upgrade -y
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
# Install required packages
apt-get -y install oracle-java7-installer
apt-get -y install elasticsearch
/etc/init.d/elasticsearch restart
We can then use the AWS CLI to request a spot instance with that script like so (replace each <placeholder> with your own value):
aws ec2 request-spot-instances --spot-price "<bid-price>" --type persistent --instance-count 1 --launch-specification "{
  \"ImageId\": \"<ami-id>\",
  \"InstanceType\": \"<instance-type>\",
  \"KeyName\": \"<key-name>\",
  \"SecurityGroups\": [\"<security-group>\"],
  \"Placement\": {\"AvailabilityZone\": \"<availability-zone>\"},
  \"UserData\": \"$(base64 userdata.sh | tr -d '\n')\"
}"
Adding configurations
The standard recommendation is to give 50% of the available memory to Elasticsearch heap, while leaving the other 50% free
In order to configure this, we can add the following lines to our userdata.sh script:
# Update /etc/default/elasticsearch
cat >> /etc/default/elasticsearch << EOF
ES_HEAP_SIZE=15g
EOF
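If the instance type might change, the heap size could be derived from the machine's memory instead of being hard-coded. This is a minimal sketch of the 50% rule, assuming a Linux host that exposes /proc/meminfo. The function name heap_size_mb is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: half of total memory, expressed in megabytes
# with the "m" suffix that ES_HEAP_SIZE accepts.
heap_size_mb() {
  # $1 = total memory in kB (the unit used by /proc/meminfo)
  echo "$(( $1 / 1024 / 2 ))m"
}

# Only attempt this where /proc/meminfo exists (i.e. on Linux).
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  echo "ES_HEAP_SIZE=$(heap_size_mb "$total_kb")"
fi
```

The printed line could be appended to /etc/default/elasticsearch in place of the fixed 15g value.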
Similarly, you can make changes to the main configuration file (/etc/elasticsearch/elasticsearch.yml). For example, it's usually a good idea to disable multicast detection of nodes, since you'll want to specify exactly who the other nodes are.
cat >> /etc/elasticsearch/elasticsearch.yml << EOF
cluster.name: anythingbutelasticsearch
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [insert ips here]
index.number_of_replicas: 2
EOF
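The unicast host list has to be filled in with the addresses of the other nodes. As a sketch, a small helper can turn a list of IPs into the bracketed form the config expects. The function name and the example addresses are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: join the arguments into a YAML flow sequence,
# e.g. ["10.0.0.1", "10.0.0.2"].
unicast_hosts() {
  local out ip
  out=""
  for ip in "$@"; do
    # Prepend ", " only when the list already has an entry.
    out="${out:+$out, }\"$ip\""
  done
  echo "[$out]"
}

# Example addresses; substitute the private IPs of your own nodes.
echo "discovery.zen.ping.unicast.hosts: $(unicast_hosts 10.0.0.1 10.0.0.2)"
```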
Also, you may find it valuable to utilize plugins, such as HQ or bigdesk. You can also automate their installation:
cd /usr/share/elasticsearch/ && ./bin/plugin -install royrusso/elasticsearch-HQ
cd /usr/share/elasticsearch/ && ./bin/plugin -install lukas-vlcek/bigdesk
Final script
#!/bin/bash
add-apt-repository -y ppa:webupd8team/java
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://packages.elastic.co/elasticsearch/1.5/debian stable main" | sudo tee -a /etc/apt/sources.list
apt-get update && apt-get upgrade -y
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
# Install required packages
apt-get -y install oracle-java7-installer
apt-get -y install elasticsearch
# Update /etc/default/elasticsearch
cat >> /etc/default/elasticsearch << EOF
ES_HEAP_SIZE=15g
EOF
cat >> /etc/elasticsearch/elasticsearch.yml << EOF
cluster.name: anythingbutelasticsearch
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: []
index.number_of_replicas: 2
EOF
cd /usr/share/elasticsearch/ && ./bin/plugin -install royrusso/elasticsearch-HQ
cd /usr/share/elasticsearch/ && ./bin/plugin -install lukas-vlcek/bigdesk
/etc/init.d/elasticsearch restart
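Once the script has run, it can be useful to confirm that the node actually joined the cluster. The sketch below pulls the status field out of a cluster health response. The helper name is made up, and the sample JSON is illustrative rather than captured output.

```shell
#!/bin/bash
# Hypothetical helper: read cluster health JSON on stdin and print
# the value of its "status" field (green, yellow, or red).
cluster_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# On a running node, the input would come from the health endpoint:
#   curl -s 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s' | cluster_status

# Illustrative input so the sketch runs without a cluster.
echo '{"cluster_name":"anythingbutelasticsearch","status":"green","number_of_nodes":3}' | cluster_status
```

Parsing with sed keeps the check dependency-free, at the cost of being less robust than a real JSON parser.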
Brief note on security
It's also important to remember that securing your cluster is critical. Elasticsearch doesn't have any built-in security, and they strongly recommend that it be on an internal network or that you set up a firewall. Some helpful blog posts in this regard are by Found and Andy Brudtkuhl.
For security purposes, I won't discuss our security implementation here, but if you run into issues trying to handle it in AWS, feel free to bounce some ideas off of me at dennis@pixleeteam.com
Additional Notes
A cost discussion isn't completely valid without taking a look at Elasticsearch service providers:
Provider     Configuration                     Per year      Per month   Per day   Per hour
Found        32g RAM, 2 DC, 3 nodes, USE       $16,269.95    $1337.26    $44.58    $1.8573
Qbox         30g RAM, 1 DC?, ? nodes           $18,396       $1512       $50.4     $2.10
Compose.io   unclear pricing (wrt hardware)    $45/month for the first 2GB, $18 for each GB after