--- title: "How to recode variables" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{How to recode variables} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>") ``` We have examples to demonstrate how to recode variables with the _recodeflow_ function [`rec_with_table()`](../reference/rec_with_table.html) **Our examples use following packages:** Package _recodeflow_ Steps on how to install `recodeflow` are in [how to install](../how_to/install.html) ```{r results= 'hide', message = FALSE, warning=FALSE} #Load the package library(recodeflow) ``` Package [dplyr](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8) to combine datasets (function: `bind_rows`). ```{r results = 'hide', message = FALSE, warning =FALSE} library(dplyr) ``` **Our examples use example data** Our examples use the dataset `pbc` from the package [survival](https://cran.r-project.org/web/packages/survival/index.html). We've split this dataset in two (tester1 and tester2) to mimic real data e.g., the same survey preformed in separate years. For our examples, we've also added columns (`agegrp5` and `agegrp10`) to this dataset. ```{r} test1 <- survival::pbc[1:209,] test2 <- survival::pbc[210:418,] #Adapting the data for How To examples. Breaking cont age variable into categories - 5 and 10 year age groups. agegrp <- cut(test1$age, breaks = c(25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80), right = FALSE) agegrp <- as.numeric(agegrp) tester1 <- cbind(test1, agegrp) agegrp <- cut(test2$age, breaks = c(20, 30, 40, 50, 60, 70, 80), right = FALSE) agegrp <- as.numeric(agegrp) tester2 <- cbind(test2, agegrp) ``` ## Example 1. Recode a single variable from a single dataset In our example datasets, the variable `sex` contains the values: m for males and f for females. Using dataset tester1, we'll recode the variable `sex` into a harmonized `sex` variable. The harmonized `sex` variable has the values: 0 for males and 1 for females. 1) Recode the `sex` variable in tester1. ```{r} sex_1 <- rec_with_table(data = tester1, variables = "sex", variable_details = recodeflow::tester_variable_details, log = TRUE, var_labels = c(sex = "sex"), database_name = 'tester1' ) ``` ```{r, echo=FALSE} head(sex_1) ``` ## Example 2. Recode a single variable across multiple datasets We'll recode and combine the variable `sex` for our two datasets. 1) Recode the `sex` variable in tester1 and tester2. ```{r, warning = FALSE} sex_1 <- rec_with_table(data = tester1, variables = "sex", variable_details = recodeflow::tester_variable_details, log = TRUE, var_labels = c(sex = "Sex"), database_name = 'tester1' ) head(sex_1) sex_2 <- rec_with_table(data = tester2, variables = "sex", variable_details = recodeflow::tester_variable_details, log = TRUE, var_labels = c(sex = "Sex"), database_name = 'tester2' ) tail(sex_2) ``` 2) Combine the harmonized sex variable from tester1 to the harmonized sex variable in tester2. ```{r, warning=FALSE} sex_combined <- bind_rows(sex_1, sex_2) ``` ```{r, echo=FALSE} head(sex_combined) tail(sex_combined) ``` 3) Set labels Labels are lost during the database merging. Use `set_data_labels()` to label the variables in your final dataset. `set_data_labels()` sets the labels with the original information in `variables` and `variable_details`. ```{r} labeled_sex_combined <- set_data_labels( data_to_label = sex_combined, variable_details = recodeflow::tester_variable_details, variables_sheet = recodeflow::tester_variables ) ``` ## Example 3. Recode a single variable, with different categories, from multiple datasets {#example3} You could have a situation where a variable is the same across datasets but its categories change. In our example data the variable `agegrp` is different in tester1 and tester2. - In tester1 the `agegrp` variable is 5-year age groups: 20-24, 25-29, 30-34, etc. - In tester2 the `agegrp` variable is 10-year age groups: 20-29, 30-39, 40-49, etc. There are three options to facilitate the use of variables with inconsistent categories across datasets. ### Option 1: recode category `agegrp` variable into a common variable for only datasets with the same category responses Recode the `agegrp` variable into a common variable only in datasets were the categories are the same. If the categories are different between datasets, separate columns will be created. The categories in the `agegrp` variable in tester1 are different than the categories of `agegrp` in tester2. Therefore, it is not possible to have the same `agegrp` categories across our example data sets. 1) Recode `agegrp5` in tester1 and recode `agegrp10` in tester2. ```{r, warning=FALSE} agegrp_1 <- rec_with_table(data = tester1, variables = "agegrp5", variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester1' ) head(agegrp_1) agegrp_2 <- rec_with_table(data = tester2, variables = "agegrp10", variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester2') head(agegrp_2) ``` 2) Combine the harmonized variable `agegrp5` in tester1 with the harmonized `agegrp10` in tester2. ```{r, warning=FALSE} agegrp_combined <- bind_rows(agegrp_1, agegrp_2) ``` ```{r, echo=FALSE} head(agegrp_combined) tail(agegrp_combined) ``` ### Option 2: recode the categorical `agegrp` variable into a continuous `age_cont` variable Recode categorical variable `agegrp` into a single harmonized continuous variable `age_cont`. `age_cont` takes the midpoint age of each category for 'agegrp' across datasets. With this option, the categorical variable 'agegrp' from each dataset can be combined into a single dataset. 1) Recode variable `agegrp` in tester1 and `agegrp` in tester2 to the harmonized continuous variable `age_cont`. ```{r, warning=FALSE} agegrp_1_cont <- rec_with_table(data = tester1, variables = "age_cont", variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester1') head(agegrp_1_cont) agegrp_2_cont <- rec_with_table(data = tester2, variables = "age_cont", variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester2') head(agegrp_2_cont) ``` 2) Combine the harmonized continous variable `age_cont` from tester1 and tester2. ```{r, warning= FALSE} agegrp_cont_combined <- bind_rows(agegrp_1_cont, agegrp_2_cont) ``` ```{r, echo=FALSE} head(agegrp_cont_combined) tail(agegrp_cont_combined) ``` ### Option 3: recode the categorical `agegrp` variable into a harmonized categorical variable Dataset tester1 has 5-year age groups (e.g., 30-34, 35-39), and tester2 has 10-year age groups (e.g., 30-39). Therefore, we can collapse the 5-year age groups in dataset tester1 to the same 10-year age groups in dataset tester2. 1) Recode variable `agegrp` in tester1 into `agegrp10`. recode variable `agegrp` in tester2 into `agegrp10`. ```{r, warning=FALSE} agegrp10_1 <- rec_with_table(data = tester1, variables = "agegrp10", variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester1') head(agegrp10_1) agegrp10_2 <- rec_with_table(data = tester2, variables = "agegrp10", variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester2') head(agegrp10_2) ``` 2) Combine the harmonized categorical variable `age_cat` from tester1 and tester2. ```{r, warning= FALSE} agegrp10_combined <- bind_rows(agegrp10_1, agegrp10_2) ``` ```{r, echo=FALSE} head(agegrp10_combined) tail(agegrp10_combined) ``` ## Example 4. Recode multiple variables from multiple datasets {#example4} The variables argument in `rec_with_table()` allows multiple variables to be recoded from a dataset. In this example, the `age` and `sex` variables from the tester1 and tester2 datasets will be recoded and labeled using `rec_with_table()`. We'll then combine the two recoded datasets into a single dataset and labeled using `set_data_labels()`. 1) Recode `age` and `sex` in dataset tester1 and tester2 ```{r, warning=FALSE} age_sex_1 <- rec_with_table(data = tester1, variables = c("age", "sex"), variable_details = recodeflow::tester_variable_details, log = TRUE, var_labels = c(age = "Age", sex = "Sex"), database_name = 'tester1') head(age_sex_1) age_sex_2 <- rec_with_table(data = tester2, variables = c("age", "sex"), variable_details = recodeflow::tester_variable_details, log = TRUE, var_labels = c(age = "Age", sex = "Sex"), database_name = 'tester2' ) head(age_sex_2) ``` 2) Combine the harmonized variables `age` and `sex` from tester1 and tester2. ```{r, warning=FALSE} combined_age_sex <- bind_rows(age_sex_1, age_sex_2) head(combined_age_sex) ``` 3) Set labels Use `set_data_labels()` to label the variables in your final dataset. `set_data_labels()` sets the labels with the original information in `variables` and `variable_details`. `var_labels` can be used all the variables in `variables.csv` or a subset of variables. ```{r, warning=FALSE} labeled_combined_age_sex <- set_data_labels( data_to_label = combined_age_sex, variable_details = recodeflow::tester_variable_details, variables_sheet = recodeflow::tester_variables ) ``` You can check if labels have been added to your recoded dataset by using `get_label()`. ```{r, warning=FALSE} library(sjlabelled) get_label(labeled_combined_age_sex) ``` For more information on `get_label()` and other label helper functions, please refer to the [sjlabelled](https://strengejacke.github.io/sjlabelled/reference/) package. ## Example 5. Recode all variables in the variables worksheet {#example5} All the variables listed in `variables` worksheet can be recoded with `rec_with_table()`. In this example, all variables specified in the `variables` worksheet will be recoded and combined for the datasets tester1 and tester2. ```{r, echo=FALSE} options(htmlwidgets.TOJSON_ARGS = list(na = "string")) ``` 1) Recode all variables listed in the variables worksheet, for dataset tester1 and dataset tester2 ```{r, warning=FALSE} recoded1 <- rec_with_table(data = tester1, variables = recodeflow::tester_variables, variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester1' ) recoded2 <- rec_with_table(data = tester2, variables = recodeflow::tester_variables, variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester2' ) ``` 2) Combine recoded datasets ```{r, warning=FALSE} combined_dataset <- bind_rows(recoded1, recoded2) ``` 3) Set labels for the combined recoded dataset ```{r} labeled_combined <- set_data_labels(data_to_label = combined_dataset, variable_details = recodeflow::tester_variable_details, variables_sheet = recodeflow::tester_variables ) ``` ## Example 6: Add the data origin in combined datasets {#example6} To know the origin of each row of data, you can use the `rec_with_table` argument `attach_data_name`. When the argument `attach_data_name` is set to true it will add a column with the name of the dataset the row is from. 1) Recode variables `age` and `sex` and attach dataset name for tester1 and tester2. ```{r, warning=FALSE} age_sex_1 <- rec_with_table(data = tester1, variables = c("age", "sex"), variable_details = recodeflow::tester_variable_details, var_labels = c(age = "Age", sex = "Sex"), log = TRUE, attach_data_name = TRUE, database_name = 'tester1' ) age_sex_2 <- rec_with_table(data = tester2, variables = c("age", "sex"), variable_details = recodeflow::tester_variable_details, var_labels = c(age = "Age", sex = "Sex"), log = TRUE, attach_data_name = TRUE, database_name = 'tester2' ) ``` 2) Combine the harmonized datasets ```{r, warning=FALSE} combined_age_sex <- bind_rows(age_sex_1, age_sex_2) head(combined_age_sex) tail(combined_age_sex) ``` ## Example 7. Recode derived variables {#example7} Derived variables are variables that are not in the original dataset; rather they are created using variables from the original dataset. Descriptions of derived functions are in the article [derived functions](../articles/derived_variables.html) To recode a derived variable, you must: - create a customized function, - defined the derived variable on the worksheets `variables` and `variable_details`, - recode the variables that make up the derived variable. Our example derived variable `example_der` equals `chol` times `bili`. 1) Recode the underlying variables: `chol` and `bili` and the derived variable `example_der` for tester1 and tester2. ```{r, warning=FALSE} derived1 <- rec_with_table(data = tester1, variables = c("chol", "bili","example_der"), variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester1') derived2 <- rec_with_table(data = tester2, variables = c("chol", "bili","example_der"), variable_details = recodeflow::tester_variable_details, log = TRUE, database_name = 'tester2') ``` 2) Combine the harmonized variables: `chol`, `bili`, and `exampler_der` ```{r, warning=FALSE} combined_der <- bind_rows(derived1, derived2) ```