
Ever needed to work with an anonymized data set but hated having to manually recode your variables? This function was built to address that particular challenge. It takes at least two arguments: df, a data frame, and vars, a character vector of variable names. From there, the function repeatedly groups and updates the values of the referenced columns (vars) with random numbers. This masks the original values and helps keep the data anonymous. And just like that, your data is anonymized!

Usage

encode_variables(df, vars, tag = FALSE)

Arguments

df

a data frame that will be encoded.

vars

a character vector of variables within the data frame that will be recoded using a randomized numeric value.

tag

a boolean; when TRUE, each recoded value is prefixed with the variable name and an underscore (e.g., 'varname_123'). Defaults to FALSE.
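
To make the three arguments a bit more concrete, here is a rough base R sketch of the general idea. It is not the package's actual implementation: the helper name sketch_encode and the way the random codes are drawn (sample.int without replacement) are assumptions made purely for illustration.

```r
# Hypothetical helper, NOT the package's implementation: a base R sketch of
# recoding columns with random numeric codes.
sketch_encode <- function(df, vars, tag = FALSE) {
  stopifnot(is.data.frame(df), is.character(vars), all(vars %in% names(df)))
  for (v in vars) {
    # One code per distinct value, so equal inputs receive equal codes.
    lv <- unique(df[[v]])
    # Sampling without replacement keeps the codes distinct (an assumption;
    # the real function may draw its random numbers differently).
    codes <- sample.int(10L * length(lv), length(lv))
    new <- codes[match(df[[v]], lv)]
    # With tag = TRUE, prefix the code with the variable name and an underscore.
    df[[v]] <- if (tag) paste(v, new, sep = "_") else new
  }
  df
}

test <- data.frame(id = rep(LETTERS[1:5], times = 2), vals = 1:10)
encoded <- sketch_encode(test, "id")
tagged <- sketch_encode(test, "id", tag = TRUE)
```

In this sketch, rows that shared an id before encoding still share a code afterwards, and columns not listed in vars are left untouched.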

Details

While there are multiple ways to encode data, this approach uses label encoding to assign each distinct value of a variable its own integer. This is a simpler method than most (e.g., one-hot encoding), but it has the potential drawback of inviting misinterpretation of the data: the integers can give the impression that the encoded values have an ordered relationship when they in fact do not.
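
For a quick feel of the trade-off, the sketch below contrasts the two encodings on a toy vector using only base R. The colour vector is made up for illustration, and factor() here merely stands in for label encoding in general; it is not how encode_variables assigns its random codes.

```r
# Toy categorical data (hypothetical, for illustration only)
colour <- c("red", "green", "blue", "green")

# Label encoding: one integer per category. The integers are arbitrary, yet
# they can be compared or averaged as if the categories were ordered.
label <- as.integer(factor(colour))

# One-hot encoding: one 0/1 indicator column per category, so no order is
# implied. `- 1` drops the intercept so every category gets its own column.
one_hot <- model.matrix(~ colour - 1)
```

Here label is a single integer column, while one_hot needs one column per category, which is the extra cost that buys freedom from the implied ordering.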

Examples

if (FALSE) {
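# Toy data: 20 letter IDs alongside the values 1 to 20
test <- data.frame(id = LETTERS[1:20], vals = 1:20)
# Recode the 'id' column with randomized numeric values
encode_variables(df = test, vars = 'id')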
}