A Government Printing Works (GPW) system crash that led to a massive data loss last year was caused by poor ICT maintenance, and the chief information officer (CIO) had a team that didn't know how to perform proper functions on the server.
This was the finding of a panel that Home Affairs Minister Aaron Motsoaledi appointed to investigate the system glitch.
The server supporting corporate services and e-Gazettes at the GPW crashed on 4 February 2021, resulting in a loss of critical information. Some of the information, the panel was informed, might never be recovered.
Papati Robert Malavi, one of the panel members, presented the findings to the Portfolio Committee on Home Affairs on Tuesday.
He told MPs that staff in the GPW's ICT division had told the panel the crash was caused by a power surge when electricity was restored after a blackout or load-shedding.
"The panel contacted Eskom and the City of Tshwane and established that there were no power outages on the relevant days. The panel subsequently found that the surge was caused by non-compliant electrical installations at Pavillion 2 (GPW) which housed the crashed server," Malavi said in his presentation.
He added: "The panel's key direct finding is that the incident was caused by poor maintenance of the ICT infrastructure due, essentially, to the fact that the CIO and his team did not know how to perform proper functions on the server, such as loading discs, scrubbing them before loading new data, ensuring that there is proper backups should there be a problem, because ICT equipment...fail.
"All of this [is] accompanied by a lack of support and maintenance contracts with service providers for the servicing of ICT-related equipment. Underpinning these issues, however, is a failure of management and supervision at various levels which is the ultimate cause of systemic failures at the GPW."
The investigation's findings were as follows:
- On 23 April 2019, the servicing of the two UPSes (uninterruptible power supplies), undertaken by Tescom SA, a third-party specialist in UPS products, identified the need to replace a parallel board, which had been overheating. The batteries of both UPSes were found to be in good condition.
- In October 2019, it was discovered that the backup library was not functioning optimally because tape drives were failing. The replacement of the tape library's disk drives was performed by the Deputy Director: Infrastructure Specialist.
- During 2020, the tape library was repaired in-house; delays were experienced in respect of the procurement of tape drives, and the robotic arm of the tape library was also found to be faulty and in need of repairs.
- In July 2020, more frequent EVA (enterprise virtual array) hard-drive failures were reported. The infrastructure specialist made a request for disks to be replaced, and the director of operations highlighted the need for the migration to be effected, which, at that point, had been delayed for close to two years. Records show that hard-drive failures had been occurring since 2015. An application for a firmware upgrade to the HP EVA was made by the then CIO.
The panel also found that a server was procured in 2017 to migrate data from the damaged Hewlett Packard Enterprise (HPE) Enterprise Virtual Array (EVA) server.
"The panel was informed by the EVA's original equipment manufacturer (OEM) that the damaged unit had been installed at the GPW in April 2011 and that support and the provision of patches had been terminated on May 31, 2017.
"After this date, Hewlett Packard Enterprise (HPE) informed the panel, the GPW had engaged HP partners on a time and material support basis when support was needed for the maintenance of the EVA.
"Though the ICT team had not been part of the decision-making around procuring the Hyperconverged Infrastructure (HCI), the period from 2017 presented an opportunity for the ICT team or the migration project team to decide on a Structured Query Language (SQL) database technology or SQL configuration supported by the new HCI," Malavi said.
During 2018, talk of migrating services from the HP EVA server to the Hyperconverged Infrastructure (HCI) began, he said.
"The last-known HPE support and maintenance contract had expired in August 2018. In October 2018, HPE provided a budgetary quote for data centre maintenance, which would be inclusive of all HPE assets. The panel was informed by the ICT team that this quote was rejected on the basis that it was too expensive," he added.
In November 2018, Malavi said, the CIO submitted requirements to the then chief financial officer for a tender process to secure a new hardware maintenance and support contract.
"The CIO submitted a reminder in January 2019. CIO response was 'noted'. Supply chain management indicated that no response was received from suppliers," he said.
The CIO resigned in January 2022.