Linux monitoring tools to keep your hardware cool


Image by: Modified by David Both

Have you ever noticed that light bulbs (the incandescent ones especially) seem to burn out most frequently at the instant they’re turned on? Or that electronic components like home theater systems or TVs worked fine yesterday but don’t today when you turn them on? I have, too.

Have you ever wondered why that happens?

Thermal stress

There are many factors that affect the longevity of electronic equipment. One of the most ubiquitous sources of failure is heat. In fact, the heat generated by most electronic devices as they perform their assigned tasks is the very heat that shortens their electronic lives.

When I worked at IBM in Boca Raton at the dawn of the PC era, I was part of a group that was responsible for the maintainability of computers and other hardware of all types. One task of the labs in Boca Raton was to ensure that hardware broke very infrequently, and that when it did, it was easy to repair. I learned some interesting things about the effects of heat on the life of computers while I was there.

Let’s go back to the light bulb because it is an easily visible, if somewhat infrequent, example.

Every time a light bulb is turned on, electric current surges into the filament and heats it very rapidly from room temperature to about 4,600° Fahrenheit (the temperature depends upon the wattage of the bulb). This causes thermal stress through vaporization of the metal of which the filament is made, as well as through the rapid expansion of the metal caused by the heating itself. When a light bulb is turned off, the thermal stress is repeated, though less severely, during the cooling phase as the filament shrinks. The more times a bulb is cycled on and off, the more the effects of this stress accumulate.

The primary effect of thermal stress is that some small parts of the filament—usually due to minute manufacturing variances—tend to become hotter than the other parts, causing the metal at those points to evaporate faster. This makes the filament even weaker at that point and more susceptible to rapid overheating in subsequent power-on cycles. Eventually, the last of the metal evaporates when the bulb is turned on and the filament dies in a very bright flash.

The electrical circuitry in computers is much the same as the filament in a light bulb. Repeated heating and cooling cycles can damage the computer’s internal electronic components just as the filament of the light bulb was damaged over time.

Cooling is essential

Keeping computers cool is essential for helping to ensure that they have a long life. Large data centers spend a great deal of energy to keep the computers in them cool. Without going into the details, designers need to ensure that the flow of cool air is directed into the data center and specifically into the racks of computers to keep them cool. It is even better if they can be kept at a fairly constant temperature.

Proper cooling is essential even in a home or office environment. In fact, it is even more essential in those environments because the ambient temperature is so much higher (as it is primarily for the comfort of the humans).

Temperature monitoring

One can measure the temperature of many different points in a data center as well as within individual racks. But how can the temperature of the internals of a computer be measured?

Fortunately, modern computers have many sensors built into various components to enable monitoring of temperatures, fan speeds, and voltages. If you have ever looked at the data available when a computer is in BIOS configuration mode, you have seen many of these values. But that view does not show what is happening inside the computer in a real-world situation, under loads of various types.

Linux has some software tools available to allow system administrators to monitor those internal sensors. Those tools are all based on the lm_sensors, SMART, and hddtemp library modules, which are available on all Red Hat-based distributions and most others as well.

The simplest tool is the sensors command. Before the sensors command can be used, run the sensors-detect command as root to detect as many of the sensors installed on the host system as possible. The sensors command then produces output including motherboard and CPU temperatures, voltages at various points on the motherboard, and fan speeds. The sensors command also displays the range of temperatures considered to be normal, high, and critical.
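The readings that sensors reports are also exposed by the kernel as plain files under /sys/class/hwmon, where each temp*_input file holds a temperature in thousandths of a degree Celsius. The following is a minimal Python sketch of reading those files directly. The function name read_hwmon_temps is made up for illustration, and the sketch assumes a kernel with hwmon drivers loaded; which chips and labels appear depends entirely on your hardware.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return a list of (chip, label, degrees_celsius) tuples.

    Hypothetical helper for illustration; it is not part of lm_sensors.
    """
    readings = []
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            # temp1_input may have a matching temp1_label with a readable name
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label")
            )
            label = (
                label_file.read_text().strip()
                if label_file.exists()
                else input_file.name
            )
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail when read; skip them
            readings.append((chip, label, millidegrees / 1000.0))
    return readings


if __name__ == "__main__":
    for chip, label, celsius in read_hwmon_temps():
        print(f"{chip:<12} {label:<16} {celsius:5.1f} °C")
```

Reading the files directly is handy in scripts, but the sensors command remains the easier choice at the command line because it also applies the chip-specific labels and limits from its configuration.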

The hddtemp command displays temperatures for a specified hard drive. The smartctl command shows the current temperature of the hard drive, various measurements that indicate the potential for hard drive failure, and, in some cases, an ASCII text history graph of the hard drive temperatures. This last output can be especially helpful when diagnosing some types of problems.
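For scripted checks, the drive temperature can be pulled out of the attribute table that smartctl -A prints. The sketch below is only an illustration: the function names are invented, it assumes an ATA drive that reports the Temperature_Celsius attribute with the temperature as the first number in the RAW_VALUE column, and it needs root privileges to query a real drive. Drives that report temperature differently (NVMe devices, for example) would need different parsing.

```python
import subprocess


def parse_smart_temperature(attribute_table):
    """Return the temperature in °C from `smartctl -A` text, or None.

    Hypothetical helper; assumes the usual ten-column attribute layout.
    """
    for line in attribute_table.splitlines():
        fields = line.split()
        # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Temperature_Celsius":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def smart_temperature(device):
    """Run smartctl against a device such as /dev/sda (requires root)."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses its exit status as a bit mask of conditions
    )
    return parse_smart_temperature(result.stdout)


if __name__ == "__main__":
    print(smart_temperature("/dev/sda"))
```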

When used with the appropriate library modules, the glances command can display hard drive temperatures as well as all of the same temperatures provided by the sensors command. glances is a top-like command that provides a lot of information about a running system including CPU and memory usage, I/O information about the network devices and hard drive partitions, as well as a list of the processes using the highest amounts of various system resources.

There are also a number of good graphical monitoring tools that can be used to monitor the thermal status of your computers. I like GKrellM for my desktop. There are plenty of others available for you to choose from.

I suggest installing these tools and monitoring the outputs on every newly installed system. That way, you can learn what temperatures are normal for your computers. Using a tool like glances allows you to monitor the temperatures in real time and understand how added loads of various types affect those temperatures. The other tools can be used to take snapshot looks at your computers.
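Once you have collected some readings from a healthy system, a baseline makes later readings easier to judge. Here is a small Python sketch of that idea. The function names and the 10-degree margin are assumptions chosen for illustration, not values recommended by any of the tools above; pick a margin that suits your own hardware.

```python
from statistics import mean


def summarize(samples):
    """Summarize a list of temperature samples (°C) from a healthy system."""
    return {"min": min(samples), "mean": mean(samples), "max": max(samples)}


def is_abnormal(reading, baseline, margin=10.0):
    """Flag a reading that exceeds the baseline maximum by more than margin.

    The default margin is an arbitrary illustrative value.
    """
    return reading > baseline["max"] + margin


if __name__ == "__main__":
    # Example samples typed in by hand; replace with your own readings.
    baseline = summarize([40.0, 42.0, 44.0])
    print(baseline)
    print(is_abnormal(58.0, baseline))
```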

Taking action

Doing something about high temperatures is pretty straightforward. It is usually a matter of replacing defective fans; installing newer, higher-capacity fans; and reducing the ambient temperature.

When building new computers or refurbishing older ones, I always install additional case fans or replace existing ones with larger ones where possible. Maximum airflow is important to efficient cooling. In some extreme environments, such as for gamers, liquid cooling can replace air cooling; most of us don’t need to take it to that level.

I also typically replace the standard CPU cooling units with high capacity ones. At the very least, I replace the thermal compound between the CPU and the cooling radiator. I find that the thermal compound from the factory or computer store is not always evenly distributed over the surface of the CPU, which can leave some areas of the CPU with insufficient heat dissipation.

I have a large room over my attached garage that my wife and I use for our offices. Altogether I have 10 running computers, two laser printers (in sleep mode most of the time), multiple external hard drive enclosures with one to four drives each, and six uninterruptible power supplies (UPS). These devices all generate significant amounts of heat.

Over the years I have had to deal with several window-mounted air-conditioning units to keep our home office at a reasonable temperature. A couple of years ago our HVAC unit died and it made sense to install a zoning system so that the upstairs office space would be cooled directly and the remaining cool air, being denser than any warm air downstairs, would flow down to the lower level. This works very well for me and keeps me and the computers at a comfortable temperature.

It is also possible to test the efficacy of your cooling solutions. There are a number of options and the one I prefer also performs useful work.

I have BOINC (Berkeley Open Infrastructure for Network Computing) installed on many of my computers and I run SETI@home to do something productive with all of my otherwise wasted CPU cycles. It also provides a great test of my cooling solutions. There are also commercially available test suites that allow stress testing of memory, CPU, and I/O devices, which can be used to test cooling solutions as a side benefit.
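If you just want a quick load to watch temperatures climb, a short script can keep every core busy for a fixed time. The Python sketch below is a stand-in for illustration only. It is not BOINC and does no useful work, the function names are made up, and the 60-second duration is an arbitrary choice. Run it in one terminal while watching sensors or glances in another.

```python
import multiprocessing
import os
import time


def burn(seconds):
    """Spin one CPU core for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def load_all_cores(seconds):
    """Start one busy worker per CPU and wait for all of them to finish."""
    workers = os.cpu_count() or 1
    with multiprocessing.Pool(workers) as pool:
        return pool.map(burn, [seconds] * workers)


if __name__ == "__main__":
    load_all_cores(60)  # arbitrary duration; lengthen it for a longer soak
```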

So keep cool and compute on!