Thoughts on Logging Strategy
Like many things in programming, doing a decent job logging isn't rocket science. And yet we've all seen it done badly so many times. Logging is often a critical part of your application. Fellow developers, here are a few thoughts to help you get it right and get some love from your support teams (and customers).
Backing up - why are we logging?
- Monitor the application lifecycle: Administrators need to know if the application is up, initialized and ready to go. They also want to see that it shuts down cleanly.
- Monitor errors: Serious system errors need to be logged for administrators and support. Stacktraces and other relevant details should be included in each problem report. The standard for ERROR logs should mean that action is necessary by the administrator.
- Debugging: Developers add debug logs to help see the state of the application. These logs should not be enabled in production and are likely meaningless to customers. Customers should not rely on these logs for any reason - they may change or go away.
- Tracing: When debugging particular use cases, administrators, support and developers want to trace the execution of the program. This means entry and exit messages at the method level. Ideally, this is not something you have to hand-code. Don't do this in production unless you have a way to trace just the transaction in question lest you kill performance.
- Checkpointing: A special case of tracing that is customer focused. Well-known status logs are written at important points in the execution of certain use cases. These logs are documented and map to documented use cases.
- Analytics: Customers will use the log files to find trends in user and system behavior. See "Big Data".
Good ideas in logging
Customers should control log configuration
Give customers control over what is logged and how. You will not be able to anticipate all their needs. Typically this means a log config file that controls what is logged, at what level and to which file (or other target). Allow this file to be changed at runtime - don't force them to restart the application to change the log config. If you're using a well-known logging framework (e.g., Log4j), customers will know how to configure logging - don't abstract this away in an attempt to "simplify" log control.
Quiet logs in production
In production, administrators want to see lifecycle events and serious errors - and nothing else. If they want more, they need to change the log configuration. Log files can get very large and become hard to analyze. Excessive logging can seriously slow a system. So minimal logs should be the default.
Rotate log files
Make sure the log files can be rotated based on time or size to avoid creating monstrously large files. Don't fill up a disk and crash the application with a run away log file. Customers should be in control of the parameters for rotation.
Well-defined log levels
Use a set of distinct, standardized log levels. Many apps use the following:
- ERROR indicates a system error (not a user error) and that administrator action is necessary. The log should indicate the action to take. Perhaps support should be notified.
- WARNING means there is a condition that will degrade functionality or performance. Administrators may want to take action. Logs should indicate an action to take (if any).
- INFO indicates lifecycle events like startup, shutdown and configuration changes. The absence of a particular INFO event may indicate trouble for administrators. Normal use by end users should rarely result in INFO logs.
- DEBUG is for developers only. Customers shouldn't ever need to turn this on.
Don't depend on debug logs
Customers should never need debug logs and should never depend on them. Other mechanisms should help customers configure their systems. Tracing/checkpointing tools are helpful for debugging. The application or accessory programs should help them validate and test their configuration.
Don't force customers to turn on system-wide debug logs to get a grip on a specific failure. Provide them a way to trace and debug execution for a single user, an IP address, a specific URL, a single use case, etc. This will allow them to trace without killing performance or generating too many log entries.
Allow customers to adjust log levels for specific log sources ("categories" or "loggers" in log4j terminology) . Often the category is the file or class producing the log. This is handy for developers, but not always best for customers. Consider adding categories for customer-focused concepts like "security" or for system components that the administrator understands like "caching".
System errors vs. User errors
Users provide bad inputs all the time. These are not problems that the administrator needs to correct and should not be logged as ERRORs. Operation Measures (e.g., event counters) give administrators a view of the number and type of user errors. Checkpointing and tracing mechanisms help administrators debug bad inputs.
Logs are an attack vector
Logs should never contain sensitive information like passwords. Even debug logs have to obey this rule if customers can turn them on. Customers should be able to share logs with support without fear of exposing sensitive info.
Log the system's environment
To help support, log the system's environment at startup:
- OS type and version
- Software version
- Host name, IP
- Versions of supporting software (e.g., Java)
- Relevant environment variables, system properties
- Locations of config files
- Values in config files
Number of log files
Not all logs should be written to a single file. Trace and debug messages often belong in a separate file from lifecycle logs. The reason is that these logs have different audiences. Admins want to lifecycle info and errors. Support wants to see the tracing for a specific use case. If you use multiple logs files, be very clear about which files support will need. Consider including a tool to bundle and send the files to support.
When logging an error, give a short, stable id as well as a descriptive message like "COMPX01234 Missing password, authentication failed." The "COMPX01234" value makes communication easier between administrators, support and developers. Be careful re-using error identifiers. Ideally, they should lead developers to a single source code location.
Especially in clustered or distributed environments, customers may want to aggregate their logs to a central place for analysis. Many logging frameworks allow logging to syslog, JMX, a database or a network pipe. Allow customers the flexibility to configure these features.
Log entry structure
Both humans and machines will be reading the log files. You need enough structure to support machine analysis, but the files should also be readable by human beings. A line-per-log format helps very much when using grep and search tools. XML is amenable to machine search, but is terrible for humans to read.
What goes in each log entry?
- Datetime - To the millisecond
- Thread Id
- Transaction or Session Id - Allows you to distinguish between transactions. In a distributed system, use the same transaction id across components.
- Source code location
- If error, Error Id
- A descriptive message - Take care to include the context. If this is a NumberFormatExecption, include the value that failed to parse as a number.
Too many debug logs are a code smell
I've worked on systems that had huge numbers of debug logs. They were all added by well-intentioned developers, but they cluttered both the source and log files. I think this may be symptomatic of a team that doesn't utilize unit testing and good debuggers sufficiently. Developers should consider removing the truly extraneous logs before check-in.
Logs are not a complete monitoring solution
Many aspects of the system should be visible to administrators without log analysis. Monitoring systems should be able to give them the current system state. Operational measures (e.g., counters and statistics over them) and Alarms are other ways to indicate system state. For example, an Alarm may indicate the connection to a back-end system is down. An Operational Measure can indicate how many times the connection has dropped in the last hour. Of course, it still might be appropriate to log the "connection down" event for auditing purposes.
What have I missed?
I'm a developer, though an experienced one who has shipped a number of servers. Admins, DevOps guys - what am I missing? What advice would you give developers about logging in enterprise-y environments?