ESXi server log management is one of those activities often ignored or poorly managed by many administrators. People are usually more concerned about the hosted virtual machines, and since physical servers are nowadays very reliable, as is ESXi itself, they end up thinking they will never need to access its logs.
In my opinion, this is completely wrong, because it eventually removes the possibility to analyze events or errors happening on an ESXi server. And they happen more frequently than you might think.
While meeting several customers, I have found three kinds of behaviour, which result in three different ways of managing ESXi logs. Let’s see them.
Sweep under the rug
Here, logs are just another bother to manage, so the quicker we get rid of them the better. On an ESXi server with local storage, the problem does not even surface, since by default ESXi saves logs in the /scratch/log directory if it finds local storage.
More and more frequently, however, new ESXi installations are done on USB sticks or SD cards, and those are not recognized as local storage for logs. In fact, once the installation is complete, we usually find an error like this one:
To solve it, the “sweep under the rug” method has the admin quickly select one of the available shared datastores and use it for logs. You can do this by configuring the advanced parameter Syslog.global.logDir and setting it to the desired datastore.
If you also want to use the same datastore for all your ESXi servers, you also need to configure Syslog.global.logDirUnique, in order to have different subfolders (one per ESXi server) under the common logDir directory.
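The two parameters above can also be set from the command line. What follows is only a minimal sketch, assuming the esxcli options --logdir and --logdir-unique correspond to Syslog.global.logDir and Syslog.global.logDirUnique on your ESXi version. The helper name syslog_dir_commands and the datastore path are hypothetical, and the code only builds the commands without running them.

```python
def syslog_dir_commands(log_dir, unique=True):
    """Build esxcli argument lists that point ESXi logs at a datastore path.

    log_dir is a placeholder for your own shared datastore directory.
    Nothing is executed here: the lists are meant to be reviewed first.
    """
    if not log_dir.strip():
        raise ValueError("log_dir must not be empty")
    unique_flag = "true" if unique else "false"
    return [
        # Set the log directory and ask for one subfolder per host.
        ["esxcli", "system", "syslog", "config", "set",
         f"--logdir={log_dir}",
         f"--logdir-unique={unique_flag}"],
        # Reload the syslog service so the new settings are picked up.
        ["esxcli", "system", "syslog", "reload"],
    ]


if __name__ == "__main__":
    # Hypothetical datastore path, used for illustration only.
    for cmd in syslog_dir_commands("/vmfs/volumes/shared-ds01/esxi-logs"):
        print(" ".join(cmd))
```

Keeping the commands as lists makes it easy to print them for a quick check before running them on each host, for example over SSH.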
This method has two problems. The first is the limited manageability of logs: when you need to check them, you can use for example the Log Browser inside the Web Client and retrieve the logs from an ESXi server in real time, but this task is really intensive and slow, since every time vCenter needs to retrieve the logs directly from the ESXi server.
The second problem is crystal clear: if the ESXi server is unreachable, how can you read logs saved inside the failed server itself? In that case you can only retrieve them if they were saved to an external datastore…
Remote Syslog
A better solution is surely to send logs to a remote syslog server, where we can safely save them and search them. This way, when you have problems with an ESXi server and its local logs are not available, you will still be able to check its logs on the syslog server. This is especially useful if you are running stateless servers deployed with Auto Deploy.
VMware gives you a syslog server directly in the vCenter installation. It’s easy to install (Jason Boche wrote a nice post about it) and on each ESXi server you only need to configure the advanced parameter Syslog.global.logHost with values like tcp://hostname:514 or udp://hostname:514, depending on the configuration you chose.
If you have several ESXi servers, you can even configure this parameter directly in Host Profiles, so every new server gets the right configuration.
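Since a typo in this value is easy to make, here is a small sketch of how the string for Syslog.global.logHost can be assembled and checked before applying it. The function name log_host_value and the host name in the example are hypothetical. Only the two protocols mentioned above are accepted here, which is a choice made for this sketch.

```python
ALLOWED_PROTOCOLS = ("tcp", "udp")  # the two forms mentioned in this post


def log_host_value(hostname, protocol="udp", port=514):
    """Return a value such as 'tcp://hostname:514' for Syslog.global.logHost."""
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    if not hostname or any(ch.isspace() for ch in hostname):
        raise ValueError("hostname must be non-empty and contain no spaces")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"{protocol}://{hostname}:{port}"


if __name__ == "__main__":
    # Hypothetical syslog server name, used for illustration only.
    print(log_host_value("syslog01.example.local", protocol="tcp"))
```

The resulting string is what you would paste into the advanced parameter, or into the Host Profile if you manage many servers.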
Data Analysis?
A simple syslog server, be it VMware’s own or something else, is surely a great solution, for all the reasons I explained before. But is it enough? Let’s suppose we have a problem in a software component of an ESXi server: maybe we would like to know whether the error has actually been happening at regular intervals for several months, each time at the same hour of the day? Or maybe the alarm appears on all the ESXi servers using the same hardware?
So, a richer data analysis solution would probably be a better fit than a simple syslog server. Among the many products available on the market, I usually use Splunk. It’s a powerful solution for log collection (it “swallows” every kind of text you give it, without any problem about format, source or size), but its real value is in the classification and tagging of every log entry.
Once the data is saved into Splunk, you can run whatever search you want, correlate data, and get results as text or as charts, which can also be generated automatically and sent to you by mail.
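To make the “same hour of the day” question more concrete, here is a minimal sketch in plain Python, independent of Splunk, that counts matching log lines per hour. It assumes each line starts with an ISO-style timestamp. The sample lines are invented for illustration and are not real ESXi output, and the function name errors_by_hour is hypothetical.

```python
import re
from collections import Counter

# Assumption: every line starts with a timestamp like 2013-03-01T02:00:04
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T(\d{2}):\d{2}:\d{2}")


def errors_by_hour(lines, keyword="error"):
    """Count lines containing keyword, grouped by hour of the day (0-23)."""
    hours = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and keyword.lower() in line.lower():
            hours[int(match.group(1))] += 1
    return hours


# Invented sample lines, for illustration only.
SAMPLE = [
    "2013-03-01T02:00:04Z esx01 vmkernel: storage path error on vmhba1",
    "2013-03-01T09:15:40Z esx01 hostd: task completed",
    "2013-03-02T02:00:03Z esx01 vmkernel: storage path error on vmhba1",
    "2013-03-03T02:00:05Z esx02 vmkernel: storage path error on vmhba1",
    "2013-03-03T14:22:10Z esx02 vpxa: heartbeat ok",
]

if __name__ == "__main__":
    for hour, count in sorted(errors_by_hour(SAMPLE).items()):
        print(f"{hour:02d}:00  {count} matching lines")
```

A dedicated tool does this at scale and across all hosts, but the idea is the same: once entries are parsed and grouped, a recurring pattern becomes visible.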
If you would like to test it, there is a free version available. Splunk is licensed on the maximum daily amount of data it can collect, and the free version has a daily limit of 500 MB. If this seems a high number to you, be aware that a single ESXi server can produce almost 1 GB of logs per day, much more than the limit of the free version.
There are many tutorials on the internet about how to install and configure Splunk (it’s really simple), but you are probably more interested in this thread in the Splunk forums, where a user explains how to reduce the daily amount of logs created by an ESXi server so that they can be collected with the free version of Splunk.
So?
What level of logging do you use for your logs?