Using production data to populate test environments is without question the most comprehensive and economical method available. Assuming the business has been operating for at least several years and/or the volume of business is high, the data in production reflects real-world conditions. Using that data in the test environment provides the best possible conditions for identifying regressions before they cause production outages. By employing subset techniques, data volume is limited while the same data coverage is preserved. The only issue, and it is a big one, is protecting the sensitive information (PII/PHI) associated with your customers.
When customers provide their sensitive information, there is an expectation that it will remain private and be used solely for the stated business purpose. Any other use is a violation of that trust and may violate data privacy laws. One such violation is using the data to populate test systems, where it is visible to developers, testers, analysts, and other individuals who have no business need to view it.
The solution is to de-identify the sensitive information (PII/PHI) before populating it into a test environment. This is a phased process as described below.
Phase I – Defining Sensitive Data
This is not simply a list of data types that are declared sensitive; you can get that list from hundreds of sources. The key is that, whatever the sensitive-data list is for your organization, you must create it and take it through an internal approval process. You cannot discover what you cannot define. Involve your auditors and regulators in determining what your sensitive data truly are.
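As an illustration of how an approved definition might be captured in a form that discovery and remediation tooling can consume, here is a minimal Python sketch. The categories, patterns, and remediation choices shown are assumptions for illustration only; your organization's approved list will differ.

```python
# Hypothetical example of an approved sensitive-data catalog captured as code.
# The categories, patterns, and remediation choices are illustrative only.
SENSITIVE_DATA_CATALOG = {
    "SSN": {
        "classification": "PII",
        "pattern": r"\b\d{3}-\d{2}-\d{4}\b",
        "remediation": "format-preserving encryption",
    },
    "EMAIL": {
        "classification": "PII",
        "pattern": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "remediation": "deterministic substitution",
    },
    "DIAGNOSIS_CODE": {
        "classification": "PHI",
        "pattern": r"\b[A-TV-Z]\d{2}(?:\.\w{1,4})?\b",  # ICD-10-style codes
        "remediation": "lookup-table substitution",
    },
}
```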
Phase II – Discovery
Once defined, you must discover where your customers’ sensitive information is located. Organizations often presume that their Subject Matter Experts (SMEs) can use existing documentation and sift through their tables to identify where it is stored. In reality, even highly knowledgeable SMEs commonly discover only about 50% of the sensitive data. There are many reasons for this. One example is a database field named USER_DEFINED_13: because of its obscure name it is overlooked, yet it actually contains a value that identifies an individual customer. Then there is semi-structured and unstructured data, which is orders of magnitude harder to discover than the example above.
To find your sensitive data you need specialized discovery software. The software does the searching by analyzing column names and sampling the data, looking for patterns that match data that could be classified as sensitive. The search results, combined with what the SMEs have already discovered, are used to create the final list: discovery output with false positives removed and false negatives (SME discoveries) added. This sounds straightforward, but it can become complicated. For example, the same tool used to discover structured data in database tables typically will not discover semi-structured or unstructured data stored in file folders on servers. If you have applications running on z/OS (mainframe), any sensitive data outside of DB2 requires special handling. Just because a discovery tool advertises that it can discover non-DB2 data does not mean it can do so effectively or without a lot of labor.
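For a sense of what such a scan does under the covers, the simplified Python sketch below checks both column names and sampled values against illustrative patterns. The name hints, regexes, and 80% match threshold are assumptions, and a SQLite connection stands in for whatever database you actually run; commercial tools ship with far richer rule sets.

```python
import re
import sqlite3

# Illustrative name hints and value patterns only.
NAME_HINTS = re.compile(r"ssn|social|email|phone|dob|birth", re.IGNORECASE)
VALUE_PATTERNS = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "PHONE": re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$"),
}

def scan_table(conn: sqlite3.Connection, table: str, sample_size: int = 100):
    """Flag columns whose name or sampled values look sensitive."""
    findings = []
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    for column in columns:
        rows = conn.execute(
            f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL LIMIT ?",
            (sample_size,),
        ).fetchall()
        values = [str(r[0]) for r in rows]
        # A column is flagged when most sampled values match a known pattern.
        hits = {
            label
            for label, pattern in VALUE_PATTERNS.items()
            if values
            and sum(bool(pattern.match(v)) for v in values) / len(values) > 0.8
        }
        if hits or NAME_HINTS.search(column):
            findings.append((table, column, sorted(hits) or ["name match only"]))
    return findings

# A column named USER_DEFINED_13 whose values look like 123-45-6789 would be
# flagged as a probable SSN even though its name gives nothing away.
```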
Selecting discovery software, allocating staff, arranging for help from outside consultants, training, and so on take a significant amount of time. Start early and build a comprehensive project plan to keep the effort on track.
The right discovery tools in experienced hands typically yield a list of sensitive data that is better than 90% complete. The science is not exact; any sensitive data that is missed is often discovered during testing or as part of an audit. This is to be expected.
When multiple applications are involved, the Discovery Phase needs to be at least one step ahead of the Remediation Phase so that the Remediation Team is not waiting on Discovery. For example, if there are four applications, make sure that the Discovery Phase for the second application is complete before the Remediation Phase is complete for the first application.
Phase III – Remediation
Like the Discovery Phase, the Remediation Phase requires specialized software. Depending on the types of fields remediated, Format Preserving Encryption (FPE), hashing, advanced mathematical algorithms, and other techniques may all be used. In addition, data exceptions (bad data) may require special handling. Writing your own remediation routines is complex, and this is a wheel best not reinvented. It is also a much better audit position to use industry-standard tools than to defend an in-house tool that you spent significant resources developing and must continue to maintain.
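To make the idea concrete, here is a simplified Python sketch of deterministic, format-preserving masking of a Social Security Number using a keyed hash. This is only a stand-in for the vetted algorithms (such as NIST-approved FPE) that commercial tools implement; the key handling and the routine itself are illustrative assumptions, not something to deploy.

```python
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-managed-secret"  # illustrative only

def mask_ssn(ssn: str) -> str:
    """Deterministically replace each digit of an SSN while keeping its format.

    The same input always produces the same output, so values remain
    consistent across tables and runs. This keyed-hash approach is a
    stand-in for true format-preserving encryption, not a vetted algorithm.
    """
    digest = hmac.new(SECRET_KEY, ssn.encode(), hashlib.sha256).digest()
    digit_stream = iter(digest)
    return "".join(
        str(next(digit_stream) % 10) if ch.isdigit() else ch for ch in ssn
    )

masked = mask_ssn("123-45-6789")
assert masked == mask_ssn("123-45-6789")   # deterministic across runs
assert len(masked) == len("123-45-6789")   # 3-2-4 format with dashes preserved
```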
Remediation requires its own dedicated environment. That environment must persist until the remediation of all related applications is complete. One possible workflow is that Application A is remediated, loaded and tested. When related Application B is loaded, it is tested standalone, and then the two are tested together and so on for the remaining related applications.
Similar to the situation described in the Discovery Phase, more than one tool may be required to de-identify different data types. Often the tool that de-identifies structured data in database tables does not support de-identification of XML, JSON, BLOBs, etc. Likewise, it may not be able to de-identify unstructured or semi-structured data such as docx, xlsx, and jpeg files stored in file systems on servers. Select the tool(s) that fit your requirements, and understand that staffing and training will take time.
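As a small illustration of the semi-structured case, the sketch below masks known sensitive keys inside a JSON document. The field names are hypothetical and would in practice come from your discovery results.

```python
import json

# Hypothetical field names; in practice the list comes from discovery results.
SENSITIVE_KEYS = {"ssn", "email", "phone", "date_of_birth"}

def mask_json(node):
    """Walk a parsed JSON document and blank out values of sensitive keys."""
    if isinstance(node, dict):
        return {
            key: "***MASKED***" if key.lower() in SENSITIVE_KEYS else mask_json(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [mask_json(item) for item in node]
    return node

raw = '{"customer": {"name": "A. Person", "email": "a.person@example.com"}}'
print(json.dumps(mask_json(json.loads(raw))))
# {"customer": {"name": "A. Person", "email": "***MASKED***"}}
```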
Remediation is an area where consultants who are experts in Test Data Management and Data Privacy (TDM/DP) are money well spent. They can help with software choices, prevent false starts, and avoid rework. There are many decision points, questions, and activities with which consultants can help. Some examples include:
- Full copy of production or subset?
- How to effectively subset?
- Can inherent data de-identification be leveraged?
- How to maintain referential integrity when masking keys? (see the sketch after this list)
- Which routines to use for which data types?
- Augmentation processes
- Custom tables for de-identification
- Setting expectations for management
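To illustrate the referential-integrity question above: if a customer number appears as a key in several tables, every occurrence must map to the same substitute value or joins will break after masking. The Python sketch below uses a simple in-memory cross-reference map; a real tool would persist this mapping securely, and the table and column names in the comments are hypothetical.

```python
import secrets

# In practice the cross-reference map lives in a protected table so every
# application and every load uses the identical substitution.
_key_map: dict[str, str] = {}

def mask_customer_id(customer_id: str, width: int = 10) -> str:
    """Return the same surrogate every time a given customer number appears."""
    if customer_id not in _key_map:
        while True:
            surrogate = "".join(secrets.choice("0123456789") for _ in range(width))
            if surrogate not in _key_map.values():  # avoid surrogate collisions
                _key_map[customer_id] = surrogate
                break
    return _key_map[customer_id]

# CUSTOMER.CUST_ID, ORDER.CUST_ID, and INVOICE.CUST_ID (hypothetical names)
# all resolve to the same masked value, so joins across tables still line up.
assert mask_customer_id("0004412987") == mask_customer_id("0004412987")
```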
If any of the items in the list above are not fully understood in detail, that underscores the need for outside expert assistance. Experience in TDM/DP is key to a successful project, and an experienced consultant can make a huge positive difference to your project and its timeline.
Phase IV – Testing
When Definition, Discovery, and Remediation are all complete, User Testing begins. How is that done? The answer is surprisingly straightforward. Organizations already have test plans used to test application programming changes; those same test plans are used to test the de-identified data. The only difference is that the data is being tested instead of the code. This means the code base supporting the test applications must be a copy of current production, so that any defects found can be attributed to the de-identification and addressed accordingly. A repeat of this testing is only necessary if changes are made to any of the de-identification algorithms or new applications are remediated.
Phase V – Augmentation
During testing, it is often necessary to augment the existing data with additional data. It may also be advantageous to reset existing data for a few customers to a previous state. Make sure that the chosen remediation tool is capable of supporting those tasks.
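If the remediation tool's augmentation support falls short, a small amount of scripted synthetic data can fill the gap. The sketch below uses the open-source Faker library; the column names and ID scheme are assumptions for illustration.

```python
from faker import Faker  # open-source synthetic-data library

fake = Faker()
Faker.seed(42)  # seeded so the augmented data is repeatable across test runs

def synthetic_customer(customer_id: str) -> dict:
    """Build one plausible-but-fictitious customer row for augmentation."""
    return {
        "CUST_ID": customer_id,          # hypothetical column names
        "NAME": fake.name(),
        "EMAIL": fake.email(),
        "SSN": fake.ssn(),
        "DATE_OF_BIRTH": fake.date_of_birth(minimum_age=18).isoformat(),
    }

# Five new customers with identifiers that cannot collide with real ones.
new_rows = [synthetic_customer(f"SYN{n:07d}") for n in range(1, 6)]
```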
Closing Thoughts
The process and the phases described herein are straightforward; however, during execution they expand into many tasks. The goal here was to provide an understandable framework and to get you started. This topic is complex enough that even an entire book would likely not cover all of its facets and variables. Use this article to think about how to begin a successful journey toward protecting the data with which you are entrusted.
Written by Steven David Alley, Senior Solution Delivery Engineer at ABM