Who is the best at disguising themselves with a mask? Zorro? Hannibal from the A-Team? Ethan Hunt from Mission Impossible?
From this brief list, you can see that the quality of masks varies enormously. Just because someone wears a mask, they won’t always be fully disguised. The same principle applies when applying a mask to your data.
Data masking is where techniques are used to disguise the original data. Some examples of these techniques are shown in the listing below:
a. Remove a substring from the data item.
b. Add a fixed number or substring to the data item.
c. Delete the data item, or replace it with a NULL.
d. Process the data item, or a subset, using a pre-defined procedure or function.
e. Add a randomly generated number, or replace an existing number with a random number. The random number could also be within a specified percentage higher or lower than the original value.
f. Re-arrange various components of the data item.
g. Replace the data item with data from a lookup column in another table.
The actual masking process could use one of these or similar techniques, or it could use a combination of different techniques.
Note: A “data item” is a column within a database table that has been identified as containing “sensitive” data.
The main reason for masking the data is security. Some people who access your data will not really need to see all the data in its original format. An example of this could be an external customer, or where you publish information to a website.
Another scenario would be where the security restrictions for your test, training and development environments are not as robust as those for your production environment. If these databases use a copy of production data, then they could be candidates for data masking.
Unfortunately, data masking is not always as straightforward as it might first appear. You should bear in mind the following:
1. You cannot disguise the data so much that it becomes unusable - masked data still needs to conform to application formatting requirements.
(e.g. if you mask credit card numbers, they will still need to comply with any checksums designed to check that the numbers are valid).
2. You need to have some way of tracing the data back to its original format.
If a developer or tester finds a problem, they need to be able to work out whether the issue was caused by application code, or an anomaly in the data.
3. Be aware of the impact on your reporting systems
If you have a column of sales figures and you just re-arrange the numbers, then your totals and averages could still be the same, but any drill-downs in your data could become inaccurate.
If you only have a few figures in a particular column then someone could easily work out what the original order was.
Ideally, you should try to maintain the sort order, length and data types of the original data.
If substituting items such as people’s first names, then double-check that you’re replacing female names with female names and male names with male names.
4. The referential integrity of your data needs to be maintained.
As a basic example - You have a table that stores details of cows in the FARM_ANIMAL table. This table is linked to the FARMER table via a foreign key. If you mask the ID column of the FARMER table, then make sure that you use the same format mask for the foreign key column of the FARM_ANIMAL table - otherwise the farmer will lose his cows.
Any related data items that may not have a formal foreign key constraint will also need to have the same format mask applied.
Check for any similar data in remote databases. If data in these remote columns is not masked in a similar fashion, then people could use these as a reference to figure out the original values of the masked items.
5. Be aware of the performance implications of masking your data.
If your database contains a large amount of data inconsistencies, then you may have to exclude a large amount of data from the format mask - with the associated performance impact, whilst you deal with these values.
In this situation, it would be useful to undertake a data cleansing exercise before trying to mask the data. Otherwise, you’ll be left with lots of unmasked data which could make it simple for someone to work out what format mask you have applied to the column. In the future you should probably think about restricting the acceptable data entry formats (e.g. by using check constraints etc.)
It is best to plan in advance which items you need to mask and what format mask will be applied to each data item. If you try to mask everything in the database, then that could take a very long time.
6. If any tables or data items are not essential outside of the production environment, then consider deleting them, truncating them, or replacing them with NULLs.
Oracle offers its own data masking tool in the form of the Oracle Enterprise Manager Data Masking Pack, which is part of Grid Control 10.2.0.4 or later. There is a useful tutorial for this tool at http://www.oracle.com/technology/obe/11gr1_db/security/datamask/datamask.htm.
Data Masking is a very large topic which can’t really be covered in a blog, but when carrying out any masking, you should always bear in mind the advice of Joseph Fieman from Gartner, when he wrote:
”Data-masking technologies should satisfy a simple, yet strict rule: The application that runs against masked data performs as if masked data is real.”
Make sure that your mask is a good one, otherwise people will be able to see behind it.