The Secrets to Successfully Avoiding Unplanned OT Network Downtime
November 16, 2020

The Secrets to Successfully Avoiding Unplanned OT Network Downtime

To increase operational efficiency across the plant floor, many manufacturers are implementing Industrial Internet of Things (IIoT) technology and Industry 4.0 best practices. However, the resulting increased connectivity across the once technologically dark plant floor is adding new layers of complexity that manufacturers need to adapt to.

In the past, to connect equipment, a technician would run wires to devices that were controlled by relays or programmable logic controllers (PLCs). These devices would function in their own element, and not much information was passed between them or ‘up’ to business units. Today, this practice is being replaced with Ethernet-enabled devices connected to switching and network infrastructure that routes back for control and data acquisition.

While the subsequent increased availability of data and insight into operations is a positive advancement, as more connected systems are put in place, there are more opportunities for failure, especially with new systems that organizations may not know how to properly monitor or support. To minimize these new risks, a multilayer approach to monitoring devices on the OT network that includes implementing preventative maintenance, ensuring best practices are being following, and developing a strategy for redundancy, is needed.

Implementing Preventative Maintenance Practices on the Plant Floor

One key strategy for avoiding downtime is to perform continuous monitoring of all connected systems to identify trends that show when a failure is likely to occur. On the hardware side, this can be done by monitoring usage data for components such as motors. This data allows technicians to calculate when a seal or gasket is likely to wear out so it can proactively be replaced before a failure resulting in downtime occurs. Additionally, as part of this plan, it is also a good practice to integrate notifications for equipment manufacturer’s maintenance recommendations. On the software side, preventative maintenance practices can involve creating alerts regarding resource utilization, such as when storage space is nearly full, so that connections are never interrupted and data is not lost.

To ensure the effectiveness of a proactive maintenance strategy, a system for efficiently tackling maintenance work orders is also needed. A good way to do this is to integrate an organization’s CMMS with the data collected from the control systems so that maintenance staff knows in advance when they will need to work on certain equipment. Additionally, a system that allows you to document what work was done and when is also needed. Readily knowing this information will help further reduce future downtime and improve maintenance efficiency.

Making Network Glitches Miraculously Disappear by Following Best Practices

In general, many OT network issues can be prevented or resolved by following best practices including developing excellent documentation and adhering to manufacturer’s recommendations for implementing and maintaining physical and software components. To start, the Open Systems Interconnection (OSI) model (Figure 1) shows that a network’s physical layer is the foundation to the entire system.


Figure 1. The OSI model illustrates the hierarchy of the seven layers systems use to communicate over a network.

As a result, physical issues such as cable defects, improper terminations, or misplaced connections will cause the rest of the system to fail. A simple step to prevent physical issues is to follow manufacturer’s best practices for making physical connections – from guidance on avoiding using unshielded cables to keeping wire runs to recommended lengths.

Another best practice to follow is to make sure all systems are up to date. This includes having a documented plan to rollout the latest patches and upgrades in a timely manner as well as a system for monitoring these changes and ensuring there are no anomalies or conflicts between software.

Additionally, a lot of network downtime relates to how quickly an issue can be tracked. The ability to efficiently locate a problem all starts with proper network documentation. For example, when downtime is caused by someone incorrectly plugging in a device, having the ability to track the improper connection instead of blindly unplugging things until the issue is resolved will save a lot of time and resources.

A Strategy for Redundancy is Key

Most facilities are not truly operating 24/7/365. Therefore, if possible, it is best to plan upgrades for when there is scheduled downtime. However, many changes cannot wait and need to be made out-of-cycle. With the right redundant solutions in place, many changes can be done while the plant is running.

On the hardware side, with the proper design, a hot swap or live failover to a switch can be done. On the software side, to minimize downtime, updates and testing can occur on a back-up system if one is in place prior to making changes on the primary system. Still, it is rare that any system is designed to be perfectly redundant, thus, a carefully coordinated effort is likely needed to make upgrades.

Additionally, one often overlooked issue with redundancy is that backup systems cannot just be put into place and forgotten about until a problem arises. Doing this simply creates a second failure waiting to happen, which can result in redundancy failing. Without a system that also monitors any backups put into place, a redundancy failure may go unnoticed until the primary system fails and causes unplanned downtime because the backup had also long since failed. In short, redundant solutions keep the plant running in the event of an issue. No one noticing when an issue occurs is the goal.

Cybertrol Monitoring and Support Solutions Help Limit OT Network Disruptions

At the beginning of any project involving changes to an OT network, Cybertrol engineers perform a site assessment to create a road map of changes needed to the current infrastructure to meet the organization’s end goals. As part of this assessment, we outline any risks identified for older equipment and best practices for adding more “supportable” equipment.

After these initial steps, if the customer wants to implement a long-term support solution, we usually setup a periodic review plan with monthly check-in diagnostics. During this review, we look at all the back-end logs to make sure all systems have been acting normal. If an outlying event occurred that was not severe enough to lead to a push notification, we investigate this further to ensure an issue is not brewing.

Overall, our effectiveness is measured by how little actual disruptive events impact our customers. If everything is running perfectly, this is because our systems and periodic reviews are catching minor issues before they become real problems. To provide more insight to our customers in the future, we are developing a self-access portal that shows clients more of this backend data and integrates their work orders with ours.

While following manufacturer and general OT network design best practices are essential to avoiding unplanned downtime, this practice alone is not a cure all. Putting proper redundant solutions in place and implementing a monitoring and alert notification system to ensure preventative maintenance is implemented are also necessary. To be sure these practices are functioning properly, dedicate resources to monitoring these systems or engage with an experienced third-party company such as Cybertrol. Taking these steps will help maximize OT network uptime and keep your facility running smoothly.