Ever needed to work with an anonymized data set but hated having
to manually recode your variables? This function was built to address
that particular challenge. It takes at least two arguments: df, a data
frame, and vars, a character vector of variable names. From there, the
function works through the referenced columns (vars) one at a time,
grouping each by its values and replacing those values with random numbers.
This masks the initial data set and ensures anonymity. And just like that,
your data is anonymized!
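The idea can be sketched in base R as follows. This is only a rough illustration under stated assumptions, not the package's actual implementation: the helper name sketch_encode is hypothetical, and it assumes each distinct value in a column is mapped to one randomly assigned integer.

```r
# Hypothetical helper: a rough sketch of the recoding idea,
# not the real encode_variables() implementation
sketch_encode <- function(df, vars) {
  for (v in vars) {
    # Distinct values in the column (each one forms a group)
    lvls <- unique(df[[v]])
    # Assign one random integer code per distinct value, without repeats
    codes <- sample.int(length(lvls))
    # Replace every value with the code assigned to its group
    df[[v]] <- codes[match(df[[v]], lvls)]
  }
  df
}

test <- data.frame(id = c("A", "B", "A", "C"), vals = 1:4)
out <- sketch_encode(test, "id")
# Rows that shared an id still share a code, but the original labels are gone
out
```

Columns not named in vars (here, vals) are left untouched.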
Arguments
- df
a data frame that will be encoded.
- vars
a character vector of variables within the data frame that will be recoded using a randomized numeric value.
- tag
a boolean; when TRUE, the recoded value will be prefixed with the variable name and an underscore (e.g., 'varname_123'). Defaults to FALSE.
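To show what the tag option changes in the shape of the output, the snippet below builds both forms by hand. The codes are made up for the illustration and the variable names are hypothetical; only the 'varname_123' pattern comes from the argument description.

```r
# Made-up codes standing in for the randomized values a recode might produce
codes <- c(17L, 4L, 17L, 9L)
var_name <- "id"

# tag = FALSE: the bare numeric codes
untagged <- codes

# tag = TRUE: variable name, an underscore, then the code
tagged <- paste0(var_name, "_", codes)
tagged
```

Note that the tagged form is a character vector, while the untagged form stays numeric.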
Details
While there are multiple ways to encode data, this approach uses label encoding to assign each distinct value of a variable a unique integer. This is a simpler method than most (e.g., one-hot encoding), but it has the potential drawback of inviting misinterpretation of the data, namely by giving the impression that the encoded values have an ordered relationship when they in fact do not.
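The contrast between the two encodings can be seen in a small base R sketch. The colour vector is made-up example data, and the label codes here are assigned in order of appearance rather than at random, purely to keep the illustration reproducible.

```r
# Made-up example data: an unordered categorical variable
colour <- c("red", "green", "blue", "green")

# Label encoding: one integer per distinct value
label_encoded <- match(colour, unique(colour))

# One-hot encoding: one 0/1 indicator column per distinct value
one_hot <- model.matrix(~ colour - 1, data = data.frame(colour = colour))

# The integers allow comparisons that mean nothing for unordered categories:
# this asks whether "blue" is greater than "red", which is only an artifact
# of the codes that happened to be assigned
label_encoded[3] > label_encoded[1]
```

One-hot encoding avoids the implied ordering at the cost of one extra column per distinct value, which is why label encoding is the simpler of the two.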
Examples
if (FALSE) {
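# Build a small data frame with an identifying column
test <- data.frame(id = LETTERS[1:20], vals = 1:20)
# Recode the 'id' column with randomized numeric values
encode_variables(df = test, vars = 'id')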
}