While we are continuing to make updates to improve the recoverability of the backend components, we believe we have resolved the primary issues that caused this outage. As previously stated, the root cause was a timing issue between the backend services, exacerbated by a bug in the Linux kernel. We have updated both components and have seen no service interruption since.
Mar 3, 15:44 CST
For those who have found the status updates less than illuminating, I apologize. The issue we have been dealing with stems from a bug in the operating system kernel, which exacerbated a timing issue within the product and caused the backend data store to fall out of sync. We spent the entire day bringing the data back into sync. We expect little to no data loss from this outage because of the queueing technology (Kafka) we recently implemented to prevent data loss. We understand that this is cold comfort for our customers, since the real-time nature of Boundary is its chief selling point, but I want all of our customers to understand that this was an anomaly, not a fatal product flaw. We will continue to closely monitor the system for the rest of the week. We have modified, and will continue to modify, our test bed so that we can simulate the conditions that precipitated this outage, in an effort to put even more preventative measures in place and avoid the kind of situation we found ourselves in yesterday.
Mar 3, 08:38 CST
The read path is back online. We will continue to monitor the system.
Mar 3, 07:35 CST
The system is catching up on a day's worth of data. We will turn the read path back on in the morning, giving the backend a chance to process all of the data overnight. At that point we will systematically check the dashboards and verify that the OS and application fixes have improved the system's stability.
Mar 2, 23:42 CST
There are still issues with connectivity between the Boundary services, which are currently being investigated. For the moment, the read path has been disabled to allow faster processing of the data collected today. We will continue to update this incident as the situation changes.
Mar 2, 18:06 CST
All of the instances have been updated, and the system is being rolled to capture the backlog of data.
Mar 2, 16:23 CST
The problem appears to be related to a kernel patch released on February 24th addressing buffer overruns in the TCP/IP stack. We are applying this patch now and will see whether it has the intended effect.
Mar 2, 15:54 CST
We are investigating a problem with the dashboards not keeping up with the backend system. The problem is not on the data capture side, and no data is being lost; the issue is in the API connection to the backend. We are working to resolve it now and expect the system to be fully functional shortly.
Mar 2, 10:26 CST