Recently I received a comment from someone asking me to share my solution for index management in ELK. The comment was posted on the blog post about how to monitor firewall traffic using the Elastic Stack.
So I decided, why not share the entire solution? To be honest, this may be the area I know least about, and by no means would I claim to be a subject matter expert here. However, I have found a working approach that I normally use when setting up solutions for my customers, and that is what I will share. After all, Elastic is a wonderful product that deserves to be shared even more.
Index management – Index what?
Before you proceed, I must state that by no means am I a subject matter expert, nor do I claim to know ELK as a certified professional specializing in it. These are my personal ideas and my understanding of ELK. If you disagree, or if I have completely misunderstood the concepts, please correct me by dropping me an email or leaving a comment in the section below. Anyhow, for lack of a better definition, I would explain it like this: an index (plural: indices) is how your data is organized within your ELK solution. You can compare it to primary keys in a relational database. As ELK, or simply put Elastic (since Logstash and Kibana are not directly relevant here), is a NoSQL solution, it uses indices to organize your data, meaning that whenever you fetch data it can be retrieved using the fields that have been indexed. To have a well-functioning, or at least best-performing, solution, your indices should be set up correctly. Indices are composed of shards, which in turn can be placed on several disks.
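To make this concrete, here is a minimal sketch of how an index and its shards are created over the REST API. The index name my-logs, the shard counts, and localhost:9200 (a local cluster) are assumptions for illustration only:

```
# Create an index with three primary shards and one replica of each
# (my-logs and the counts are example values, not from the blog setup)
curl -X PUT "localhost:9200/my-logs" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'
```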
In a typical scenario you are using Kibana to query Elastic. This is just an example, because you could be accessing the data directly from Elastic in several other ways, but let's stick with Kibana for the moment. Say you are searching for a specific term, for instance an IP address. When you enter the filter and hit enter, Kibana composes a search job that is distributed to the nodes in your cluster. Surely there are nuances to this: you might have different types of nodes in your cluster, so the search might not be able to utilize all of them. But for the sake of this discussion, let's say all the data nodes receive the instruction. As soon as they have received it, they start pulling out the relevant data, which starts showing up in Kibana. The search process feels much more efficient than it actually is, because each cluster node sends back some data that you can start viewing while more and more data is being retrieved in the background, and you sit there with highly efficient search tools. This is, at least according to my understanding, what is also called a data-lake-like solution.
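As a sketch of what such a search job looks like at the API level, here is the equivalent raw query. The index pattern ls-*, the field name src_ip, and the address are assumptions for illustration:

```
# Query every shard of every matching index for one IP address;
# each node returns its own hits, which is why results stream in quickly
curl -X GET "localhost:9200/ls-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "src_ip": "192.168.1.10" }
  }
}'
```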
So, how can you make this search process even more efficient? Well, you put the data where it belongs, meaning active data on high-performance SSD or M.2 disks and cold data on cheaper spinning disks. Index lifecycle management in Elasticsearch allows you to define policies that apply to specific indices or to all indices. The tool provides a lot of options, and it is up to you to define the policy that matches your requirements. I have worked with environments that have more than 800 million records in Elastic, but when you search for something, it is instant! Next time I will try to figure out the largest number of events in an ELK solution, as the environment I am talking about had no more than 700 GB of storage available, but I also have several customers with several TB of data. Maybe a topic for another blog post! However, I feel and hope that you now have a better understanding of what we are talking about, so let's delve into the technical solution.
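The "data where it belongs" part starts at the node level. In recent Elasticsearch versions you can tag nodes as hot or warm tier in elasticsearch.yml, and ILM then uses the tiers when moving indices around. This is just a sketch; which disks back which nodes is an assumption about your hardware:

```yaml
# elasticsearch.yml on a node backed by fast SSD/M.2 storage
node.roles: [ data_hot, data_content ]

# elasticsearch.yml on a node backed by cheaper spinning disks
# node.roles: [ data_warm ]
```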
Index Management for Palo Alto Firewall
Index management policies are normally created using curl. Those who are less tech savvy, and Windows people not normally working with Linux/Unix, tend to lean towards a GUI for this task. Then Kibana is your friend, just as it is for log analysis. You can create your own index management policy; there are also some pre-existing ones, which you should leave untouched unless you really know what you are doing.
Me being who I am, I just create one matching the index pattern. In the picture you can see the one called LogstashIndex, with the pattern ls*. Normally you would create the policy yourself, but I am sharing the details so that you can follow these instructions and settings as a guideline if you need the help.
As with regular expressions and so many other branches of IT, the star in the pattern ls* means match anything. So basically this policy hits all indices whose names start with ls. If you have followed my blog to set up monitoring, you will be using indices named ls plus the date they were created, so you should be good to go. Otherwise, it is just a matter of creating a matching pattern that identifies your indices. However, make sure that the scope of your policy does not encompass unintended indices.
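The ls* matching is ordinary glob-style matching, not a full regular expression. A quick shell sketch with made-up index names shows which ones the pattern would catch, and which it would leave alone:

```shell
# Glob-match some example index names against the ls* pattern
for idx in ls-2021-01-01 ls-2021-01-02 logstash-2021 winlogbeat-2021; do
  case "$idx" in
    ls*) echo "$idx: managed" ;;   # name starts with "ls"
    *)   echo "$idx: ignored" ;;   # everything else is out of scope
  esac
done
```

Note that logstash-2021 is ignored because it starts with "lo", not "ls"; a sloppier pattern like l* would have pulled it in, which is exactly the scoping mistake to avoid.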
Under the settings section you see all the settings that have been defined. Not too much has been added here, as you can see. However, rollover_alias has been defined, which describes the index name format: ls, four digits, a dash, two digits, another dash, two digits, and at last a rollover number N.
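For rollover to work, the first index behind the alias is normally bootstrapped by hand. A sketch, assuming a write alias called ls-alias and a date-plus-counter index name in the same spirit as the format described above (both names are assumptions, not taken from the screenshots):

```
# Create the first index and point the write alias at it;
# on rollover Elasticsearch increments the trailing number,
# creating ...-000002, ...-000003, and so on
curl -X PUT "localhost:9200/ls-2021-01-01-000001" -H 'Content-Type: application/json' -d'
{
  "aliases": {
    "ls-alias": { "is_write_index": true }
  }
}'
```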
Under the mappings section you see the mappings that are defined. Here you define the datatypes for the different fields being used: the ip datatype for IP addresses, keyword or text for characters and strings, and numeric types for the fields that consist only of digits.
Below is the continuation, showing src_ip, src_port, dst_port_nat, and so on.
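A sketch of what such a mapping looks like in an index template, using the field names mentioned above; the template name and the exact datatypes are assumptions based on common firewall-log setups, not a copy of the screenshots:

```
# Give each field an explicit datatype instead of letting
# Elasticsearch dynamically map everything as text/keyword
curl -X PUT "localhost:9200/_template/ls-template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["ls*"],
  "mappings": {
    "properties": {
      "src_ip":       { "type": "ip" },
      "src_port":     { "type": "integer" },
      "dst_port_nat": { "type": "integer" }
    }
  }
}'
```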
Why in the world would you want to do this? That is a good question, and the answer is very simple. If you are familiar with relational databases, you must also be familiar with the different datatypes like char, char2, varchar, and int vs. double. The reason you are encouraged to set the correct, or most appropriate, datatypes is to not waste the memory required for a specific field. In the same manner, defining the datatypes will save you A LOT of space and memory. For example, using numeric datatypes instead of plain strings can almost halve the amount of memory required for those fields. This means that if you are generating a very large number of EPM (events per minute), you might potentially end up saving hundreds of gigabytes of data, if not terabytes. Of course, it will depend on the amount of data you are storing.
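A rough back-of-the-envelope sketch of why the datatype matters: an IPv4 address kept as text costs one byte per character, while the ip datatype stores it in a fixed binary form (4 bytes for IPv4). The numbers below only illustrate the principle, not Elasticsearch's exact on-disk encoding:

```shell
# Count the bytes of an IPv4 address when stored as plain text
ip="192.168.100.200"
printf '%s' "$ip" | wc -c    # 15 bytes as text, versus 4 bytes in binary form
```

Multiply a saving like that by a few fields per event and millions of events per day, and the storage difference becomes very real.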
You also see the alias that has been defined.
Index Lifecycle policies
Based on the pattern matching criteria, your indices will now be managed and maintained by the policies you define here. Remember the part where I was describing hot and cold storage? Actually, I should have used warm and cold, but where is the fun in that? No one says warm and spicy! It is always hot and spicy. Fun and jokes aside, in the picture below you see 129 indices being managed by this specific policy. How are these indices managed? Let's have a look.
This is basically the place where you define how long you want to retain the data and on which of the nodes in the cluster, that is, which shards are supposed to be hosted where, where the data should be fetched from during searches, and so on.
The settings are quite self-explanatory, so I will not spend time rewriting what you can already read by just looking through them.
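Put together, a policy along these lines can be sketched over the API as follows. The rollover limits and day counts are assumed example values, not the ones in the screenshots; with data-tier node roles in place, shards migrate from hot to warm nodes automatically when the warm phase begins:

```
# Hot phase: roll over at a size/age limit; warm phase kicks in
# a week later (all thresholds here are example values)
curl -X PUT "localhost:9200/_ilm/policy/LogstashIndex" -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "30gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 50 }
        }
      }
    }
  }
}'
```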
As you can see, I have activated neither the Cold phase nor the Delete phase. This is because this blog post has borrowed pictures from a test environment, where I normally delete the indices manually.
If you end up in a predicament where you are running out of space in your ELK solution, you only need to go there, select the indices that are no longer required, and delete them. On the other hand, if you cannot delete any of the indices, your other option is to introduce more nodes into the cluster, which will add more storage space. Whatever solution works for you, the options are plentiful. Now tell me, is that not just another reason to fall in love with ELK? I am not paid or supported by Elastic in any manner; the only reason I love this product is that it is just awesome as a tool, and on top of that it is free, at least to some extent. If you are willing to put in the effort, this is THE product, more than sufficient for most SMBs and even large customers. On top of that you can add a license and start using the more advanced features like machine learning and the security modules.
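Deleting an index by hand is a single call. The index name below is just an example, and a DELETE is irreversible, so double-check the name before running it:

```
# Free up disk space by removing an index that is no longer needed
curl -X DELETE "localhost:9200/ls-2021-01-01-000001"
```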
I hope, as always, that this blog post helps someone out there in need. I know for sure that I spent quite a long time learning ELK, and still I feel like there are miles left just to reach beginner level with this fascinating and complex toolset.