Testing in the production environment: risks, security controls and accountability
Testing in the production environment
One of the first lessons I learned after entering the testing profession was that the production environment is sacred. Whenever testing ran into difficulties, a common joke among the local heroes of the IT organisation was “just test that in production!”. The joke revealed how much of a sin it was considered to prove functionality in the production environment.
Recently I’ve noticed that testing in the production environment happens on a continuous basis. One organisation I spoke with reported having over one thousand test clients in its production database, created by many different departments and projects. At this scale, test clients need to be filtered out of production reporting so that incorrect figures are not reported to the outside world. Discussions of this example with other organisations suggest that functional testing in the production environment (excluding performance and disaster recovery testing) is a trend rather than an incident. These tests may be part of defect or incident analysis, or a final test before go-live intended to eliminate any remaining risk or uncertainty.
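Filtering test clients out of production reporting only works if test clients are reliably marked as such. The minimal Python sketch below assumes a hypothetical `Client` record with an `is_test_client` flag; the marker an organisation actually uses (a registry, a naming convention, a product code) will differ. It shows the reporting side of that filter:

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Client:
    client_id: str
    # Hypothetical marker; stands in for however the organisation registers test clients.
    is_test_client: bool
    balance: Decimal


def reportable_total_balance(clients):
    """Sum client balances for external reporting, excluding registered test clients."""
    return sum((c.balance for c in clients if not c.is_test_client), Decimal("0"))


clients = [
    Client("C001", False, Decimal("1000.00")),
    Client("T001", True, Decimal("250000.00")),  # test client, must not be reported
    Client("C002", False, Decimal("500.00")),
]
print(reportable_total_balance(clients))  # 1500.00
```

Keeping the exclusion in one shared function, rather than repeating it in every report, reduces the chance that a single report silently includes test clients.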
Two separate drivers for this trend can be identified. First, organisations find it increasingly difficult to create test environments that offer the full functionality of the production environment. System landscapes are becoming more complex, budgets for improvement initiatives are lacking, and legacy applications are being kept alive. Meanwhile, business processes are becoming ever more integrated. This increases both the demand for test environments that are connected end-to-end and the difficulty of building a test environment equal to production.
Second, new technologies are driving system developments that span multiple organisations. An example is the development of mobile payment functionality using near field communication (NFC), which involves banks, telecom providers, merchants, mobile phone manufacturers, payment terminal vendors and end customers. Testing new functionality within the telecom provider or the bank would require an integrated test environment, and such an environment exists in full (without stubs) only in production.
This testing can be performed with test clients, but also with internal employees or external volunteers. Regardless of the type of client, testing in production obviously carries considerable risk:
· Testing in production can cause production incidents. The testing is done either because an incident needs to be investigated (or even reproduced), or to eliminate remaining risk or uncertainty. By definition, a technical defect may be uncovered that disturbs production.
· Security controls are not in place within the organization. Measures need to be taken to make sure that no issues arise from the use of test clients or volunteers. For example, a volunteer must not end up with an actual credit registration as a consequence of testing. Another example is a technical defect that multiplies a transferred amount by 100: harmless in a test environment, but potentially serious in the real world. The organization can suffer reputational damage when testing in production affects real clients, for example through unintended mail delivery. Towards external auditors, the organization must be able to account for all actions performed with test clients and/or volunteers.
Additionally, anti-fraud measures need to be in place. When using test clients (dummy clients), testers are usually given access to a large number of them, each with all the capabilities a real client would have. If a client can withdraw money, one hundred test clients can withdraw one hundred times that amount. Test efforts often leave the possibility of fraud, and the corresponding countermeasures, unaddressed.
· Security controls are not in place across partner organizations. Even when security measures are in place internally, any testing of functionality that spans multiple organizations makes an organization dependent on the degree of control in the other participating organizations.
· Deviations from regular production processes cause issues. Issues can arise, for example, when no formal contracts are created for client products, or when identification requirements are bypassed. Regular production staff may also be required to perform operations on test clients that they normally should not perform.
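The fraud risk is a matter of simple multiplication, which makes it easy to underestimate. The Python sketch below assumes a hypothetical per-client withdrawal limit and a hypothetical organisation-wide exposure cap; neither value comes from the article. It computes the worst-case exposure of a pool of test clients and checks it against that cap:

```python
from decimal import Decimal


def worst_case_exposure(per_client_limits):
    """Total amount that could be moved if every test client used its full limit."""
    return sum(per_client_limits, Decimal("0"))


def within_exposure_cap(per_client_limits, exposure_cap):
    """True if the combined exposure of the test client pool stays within the cap."""
    return worst_case_exposure(per_client_limits) <= exposure_cap


# One hundred test clients, each with a (hypothetical) withdrawal limit of 500.
pool = [Decimal("500")] * 100
print(worst_case_exposure(pool))                    # 50000
print(within_exposure_cap(pool, Decimal("10000")))  # False
```

A single client with a modest limit looks harmless; only the aggregate across the pool shows the real exposure.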
Implications for the IT governance function
Given both the necessity of testing in production and the risks involved, the IT governance and IT management functions within the organization should define their position on this topic. Ultimately, they should be able to state which security controls must be in place and which requirements the test organization must meet to achieve accountability.
From an organizational perspective, the central IT risk management department will have to set up and maintain a security control framework for production testing in cooperation with the organization’s Testing Center of Excellence. This framework should include high-level security controls, requirements for accountability, and a high-level way of working agreed with the relevant stakeholders in the organisation. Any test efforts by departments or projects will have to be discussed with the local IT risk manager, within the overall framework. The IT risk management function (as part of IT governance) should actively monitor all departments to maintain a complete overview of all production testing taking place.
Alternatively, the organization can permanently assign dedicated staff who are the only ones allowed to test in production, under continuous monitoring by the IT risk department. In this case, the organization must ensure that all testing departments and projects know how to find this central department.
Best practices for security controls and accountability
Because the risks involved differ depending on the circumstances of the department or project, IT risk management should work with testing departments and projects to define security controls for each product and/or process (or group of similar products and processes). The following examples, which are by no means exhaustive, come from existing centralized production testing departments:
· Legal requirements. In many organizations, client identification is mandatory in production. IT risk management should define minimum requirements for testers to adhere to, for example that documented management approval for deviating from policy is put on file in place of client identification. Another security control can be to set up all regular contractual agreements during testing.
· Measures regarding transaction size or position. To avoid issues with financial transactions that may go wrong, clear limits should be set for allowed transaction size and volume. Additionally, total exposure across test clients should be controlled by setting standards for maximum credit facility, maximum account balance, etc. Where a test case requires a certain minimum balance, additional measures such as blocking the account may be needed.
· Naming conventions. Standards should be set for naming test clients and, for example, for the commentary lines in their transactions. These commentary lines can be coded to contain the date of the transaction, the initials of the person performing it, and the test case, use case or incident number to which it relates. This increases the traceability of each transaction.
· Allowed workarounds. In some cases, following the regular production process in full will be too labour intensive. Deviations from the regular process can be agreed in cooperation with IT risk management. For example, the customer request through the call centre can be skipped, so that the call centre agent’s processing step becomes the first step of the business process that is actually executed.
· Clear cleanup requirements. After testing, it is the rule rather than the exception that test data is not cleaned up. In production, cleaning up test data is far more complex and labour intensive. Before testing starts, IT risk management should define clear requirements for the cleanup process and for the parties responsible for executing it.
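The naming convention and the transaction size limits can be combined into a simple gate in front of test transactions. The Python sketch below is illustrative only: the `TEST|<date>|<initials>|<reference>` comment format, the reference pattern and the `MAX_TRANSACTION_AMOUNT` value are all hypothetical, and real limits would be set per product by IT risk management.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Hypothetical convention: TEST|<YYYYMMDD>|<tester initials>|<test case, use case or incident ref>
COMMENT_PATTERN = re.compile(r"TEST\|(\d{8})\|([A-Z]{2,4})\|([A-Z]+-\d+)")

# Hypothetical per-transaction limit for test transactions.
MAX_TRANSACTION_AMOUNT = Decimal("100.00")


def build_comment(on: date, initials: str, reference: str) -> str:
    """Encode transaction date, tester initials and test reference into a comment line."""
    comment = f"TEST|{on:%Y%m%d}|{initials.upper()}|{reference}"
    if COMMENT_PATTERN.fullmatch(comment) is None:
        raise ValueError(f"comment does not follow the naming convention: {comment!r}")
    return comment


def parse_comment(comment: str):
    """Return (date, initials, reference), or None if the comment is not a valid test tag."""
    match = COMMENT_PATTERN.fullmatch(comment)
    if match is None:
        return None
    raw_date, initials, reference = match.groups()
    try:
        on = datetime.strptime(raw_date, "%Y%m%d").date()
    except ValueError:  # eight digits, but not a real calendar date
        return None
    return on, initials, reference


def allow_test_transaction(amount: Decimal, comment: str) -> bool:
    """Admit a test transaction only if it is traceable and within the size limit."""
    return parse_comment(comment) is not None and Decimal("0") < amount <= MAX_TRANSACTION_AMOUNT


comment = build_comment(date(2024, 3, 5), "ab", "INC-4711")
print(comment)                                              # TEST|20240305|AB|INC-4711
print(allow_test_transaction(Decimal("25.00"), comment))    # True
print(allow_test_transaction(Decimal("2500.00"), comment))  # False
print(allow_test_transaction(Decimal("25.00"), "refund"))   # False
```

Rejecting untagged transactions at the gate, rather than trying to trace them afterwards, makes both the cleanup and the audit trail easier.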
To ensure accountability towards IT risk management and external parties, the testing departments or projects will need to document their activities sufficiently, including at least:
· Documentation of all planned data setup and data mutations, including the expected test results
· Documentation of all actual data setup and data mutations, including contracts
· Archiving of this documentation should be aligned with record retention requirements
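Documenting both planned and actual mutations makes it possible to reconcile them mechanically before a test is signed off. The minimal Python sketch below assumes a hypothetical `Mutation` record and lists the mutations that were executed without being planned and the planned mutations that never happened. The usage example reuses the defect described earlier, where a transferred amount is multiplied by 100:

```python
from collections import Counter
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record of one data mutation on a test client."""
    client_id: str
    action: str  # e.g. "open_account", "transfer"
    amount: Decimal


def reconcile(planned, actual):
    """Return (unplanned, not_executed) mutations.

    Counter subtraction respects multiplicity, so a planned transfer that was
    executed twice shows up once in `unplanned`.
    """
    planned_count, actual_count = Counter(planned), Counter(actual)
    unplanned = list((actual_count - planned_count).elements())
    not_executed = list((planned_count - actual_count).elements())
    return unplanned, not_executed


planned = [
    Mutation("T001", "open_account", Decimal("0")),
    Mutation("T001", "transfer", Decimal("10.00")),
]
actual = [
    Mutation("T001", "open_account", Decimal("0")),
    Mutation("T001", "transfer", Decimal("1000.00")),  # defect: amount multiplied by 100
]
unplanned, not_executed = reconcile(planned, actual)
print(unplanned)
print(not_executed)
```

In this sketch, both lists would need to be empty, or each entry explained, before the documentation is archived.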
The best practices outlined above will create additional effort for testing departments and projects and will undoubtedly increase costs for some initiatives. It is important for IT risk management to make the risks involved transparent and to provide a framework of security controls that addresses those risks, while leaving the testing community sufficient flexibility to do its work.
Author: Jeroen Yntema, owner Yntema Consulting