Strategies are described for sanitizing a data set, that is, obscuring
restricted data in the data set to maintain its secrecy. The
strategies operate by providing a production data set to a sanitizer. The
sanitizer applies a data directory table to identify the location of
restricted data items in the data set and to identify the respective
sanitization tools to be applied to the restricted data items. The
sanitizer then applies the identified sanitization tools to the
identified restricted data items to produce a sanitized data set. A test
environment receives the sanitized data set and uses it for testing, data
mining, or some other application.
Performing sanitization on a copy of the production data set, rather than
on the production data set itself, is advantageous because it preserves
the state of the production data
set. The data directory table also provides a flexible mechanism for
applying sanitization tools to the production data set.
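
The flow described above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the description: the in-memory layout of the data set (tables as lists of dictionaries), the table and column names, the entries of `DATA_DIRECTORY`, and the three sanitization tools `mask_all`, `hash_value`, and `nullify`. The sketch works on a deep copy so that the production data set is left untouched.

```python
import copy
import hashlib


# Hypothetical sanitization tools; each takes a value and returns an obscured one.
def mask_all(value):
    """Replace every character with an asterisk, keeping the length."""
    return "*" * len(str(value))


def hash_value(value):
    """Replace the value with a truncated SHA-256 digest of its text form."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def nullify(value):
    """Drop the value entirely."""
    return None


TOOLS = {"mask": mask_all, "hash": hash_value, "nullify": nullify}

# Hypothetical data directory table: each entry gives the location of a
# restricted data item (table and column) and the tool to apply to it.
DATA_DIRECTORY = [
    {"table": "customers", "column": "ssn", "tool": "mask"},
    {"table": "customers", "column": "email", "tool": "hash"},
    {"table": "orders", "column": "card_number", "tool": "nullify"},
]


def sanitize(production, directory=DATA_DIRECTORY, tools=TOOLS):
    """Return a sanitized copy of the data set; the input is not modified."""
    sanitized = copy.deepcopy(production)
    for entry in directory:
        tool = tools[entry["tool"]]
        for row in sanitized.get(entry["table"], []):
            if entry["column"] in row:
                row[entry["column"]] = tool(row[entry["column"]])
    return sanitized


# Illustrative production data set (made-up values).
production = {
    "customers": [
        {"name": "Ada", "ssn": "123-45-6789", "email": "ada@example.com"},
    ],
    "orders": [
        {"order_id": 1, "card_number": "4111111111111111"},
    ],
}

sanitized = sanitize(production)
```

Because the directory is plain data, changing which items are restricted or which tool handles them means editing a table entry rather than the sanitizer itself, which is one way to read the flexibility attributed to the data directory table.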