Title: | Contains functions to interface with variable details sheets, including recoding variables and converting them to PMML |
---|---|
Description: | Recode and harmonize data using variable and details sheets. |
Authors: | Yulric Sequeira [aut, cre], Luke Bailey [aut], Rostyslav [aut] |
Maintainer: | Yulric Sequeira <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1 |
Built: | 2025-02-14 05:58:22 UTC |
Source: | https://github.com/big-life-lab/recodeflow |
Returns the name of the table for a table start variable
get_table_name(table_feeder_var)
table_feeder_var | string: the table start variable |
string The extracted table name
# Extract table names from table feeder variables
get_table_name("$table:lookup_codes") # Returns "lookup_codes"
get_table_name("$table:reference") # Returns "reference"
get_table_name("$table:values") # Returns "values"
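The prefix stripping shown in these examples can be sketched in base R. This is not the package's implementation: the helper name extract_table_name is hypothetical, and the assumption that a table feeder variable always starts with the literal prefix "$table:" is inferred only from the examples above.

```r
# Hypothetical helper (not part of recodeflow): remove a leading "$table:"
# prefix and return whatever follows it.
extract_table_name <- function(table_feeder_var) {
  # "\\$" escapes the dollar sign so it is matched literally
  sub("^\\$table:", "", table_feeder_var)
}

extract_table_name("$table:lookup_codes")
extract_table_name("$table:reference")
```

Strings that do not start with the prefix are returned unchanged by this sketch; the real function may treat that case differently.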
Unlike the base "==" operator in R, which returns NA whenever either operand is NA, this function returns TRUE when both values are NA and FALSE when only one of them is NA.
is_equal(v1, v2)
v1 | variable 1 |
v2 | variable 2 |
boolean value of whether or not v1 and v2 are equal
is_equal(1,2) # FALSE
is_equal(1,1) # TRUE
1==NA # NA
is_equal(1,NA) # FALSE
NA==NA # NA
is_equal(NA,NA) # TRUE
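The NA handling illustrated above can be reproduced with a few lines of base R. The sketch below is an assumption about how such a comparison might be written for single values; the name na_aware_equal is hypothetical and this is not the package's source.

```r
# Hypothetical scalar comparison that never returns NA.
na_aware_equal <- function(v1, v2) {
  # Both missing: treated as equal
  if (is.na(v1) && is.na(v2)) return(TRUE)
  # Exactly one missing: treated as not equal
  if (is.na(v1) || is.na(v2)) return(FALSE)
  # Neither missing: fall back to the base comparison
  v1 == v2
}

na_aware_equal(1, NA)
na_aware_equal(NA, NA)
```

Checking for NA before calling "==" is what keeps the result a plain TRUE or FALSE, which matters when the comparison feeds an if() condition.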
Attaches labels to the data_to_label to preserve metadata
label_data(label_list, data_to_label)
label_list | The label list object that contains extracted labels from variable details |
data_to_label | The data that is to be labeled |
Returns labeled data
The pbc dataset
pbc
A data frame with 418 observations and 20 variables.
id | case number |
time | number of days between registration and the earlier of death, transplantation, or study analysis time |
status | status at endpoint, 0/1/2 for censored, transplant, dead |
trt | 1/2/NA for D-penicillamine, placebo, or not randomized |
age | age in years |
sex | m/f |
ascites | presence of ascites |
hepato | presence of hepatomegaly or enlarged liver |
spiders | blood vessel malformations in the skin |
edema | 0 no edema, 0.5 untreated or successfully treated, 1 edema despite diuretic therapy |
bili | serum bilirubin (mg/dl) |
chol | serum cholesterol (mg/dl) |
albumin | serum albumin (g/dl) |
copper | urine copper (ug/day) |
alk.phos | alkaline phosphatase (U/liter) |
ast | aspartate aminotransferase (U/ml) |
trig | triglycerides (mg/dl) |
platelet | platelet count |
protime | standardised blood clotting time |
stage | histologic stage of disease (1, 2, 3, or 4) |
https://cran.r-project.org/web/packages/survival/survival.pdf
Metadata for the pbc dataset using the DCMI standard
pbc_metadata
A list containing DCMI metadata:
title
creator
subject
description
publisher
date
type
format
identifier
source
language
rights
references
Variable details sheet for the pbc dataset
pbc_variable_details
A data frame with 69 rows and 16 columns:
variable name
dummy variable name
end type
database start
variable start
start type
record end
record start
category label
category long label
number of valid categories (numeric)
logical indicating presence of units
logical indicating presence of notes
category start label
variable start short label
variable start label
Variables sheet for the pbc dataset
pbc_variables
A data frame with 24 rows and 11 columns:
variable name
variable label
variable label long
subject
section
variable type
database start
units
variable start
logical indicating presence of notes
logical indicating presence of description
Creates new variables by recoding variables in a dataset using the rules specified in a variables details sheet
rec_with_table(
  data,
  variables = NULL,
  database_name = NULL,
  variable_details = NULL,
  else_value = NA,
  append_to_data = FALSE,
  log = FALSE,
  notes = TRUE,
  var_labels = NULL,
  custom_function_path = NULL,
  attach_data_name = FALSE,
  id_role_name = NULL,
  name_of_environment_to_load = NULL,
  append_non_db_columns = FALSE,
  tables = list()
)
data | A dataframe containing the variables to be recoded. Can also be a named list of dataframes. |
variables | Character vector containing the names of the new variables to recode to, or a dataframe containing a variables sheet. |
database_name | A string containing the name of the database that holds the original variables; it should match a database in the databaseStart column of the variable details sheet. If data is a named list, this should be a character vector in which each item matches a name in the data list and a value in the databaseStart column of the variable details sheet. |
variable_details | A dataframe containing the specifications for recoding. |
else_value | Value (string, number, integer, logical or NA) used to replace any values that fall outside the specified ranges (values with no recoding rule). |
append_to_data | Logical, if |
log | Logical, if |
notes | Logical, if |
var_labels | Labels vector to attach to variables in variables |
custom_function_path | String containing the path to the file containing functions to run for derived variables. This file will be sourced and its functions loaded into the R environment. |
attach_data_name | Logical to attach name of database to end table |
id_role_name | Name for the role to be used to generate id column |
name_of_environment_to_load | Name of package to load variables and variable_details from |
append_non_db_columns | Boolean determining whether data not present in this cycle should be appended as NA |
tables | Named list of data.frames: reference tables that can be passed as parameters into the function for a derived variable |
The variable_details dataframe needs the following columns:
Name of the new variable created. The name of the new variable can be the same as the original variable if it does not change the original variable definition
Type of the new variable: cat = categorical, cont = continuous
Names of the databases that the original variable can come from. Each database name should be separated by a comma, e.g., "cchs2001_p, cchs2003_p,cchs2005_p,cchs2007_p"
Names of the original variables within each database specified in the databaseStart column, e.g., "cchs2001_p::RACA_6A,cchs2003_p::RACC_6A,ADL_01". The final variable specified is the name of the variable for all other databases specified in databaseStart but not in this column. In this example, ADL_01 would be the original variable name in the cchs2005_p and cchs2007_p databases.
Variable type of the start variable: cat = categorical or factor variable, cont = continuous variable (real number or integer)
Value to recode to
Value/range being recoded from
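The databaseStart/variableStart convention described above (database-specific names written as "database::variable", with a final unprefixed name acting as the default) can be sketched as follows. The helper resolve_start_variable is hypothetical and is not a recodeflow function; it also assumes the comma-separated form shown above and does not handle other notations such as the bracketed "[time]" form used in the examples.

```r
# Hypothetical helper: find the original variable name for one database.
resolve_start_variable <- function(variable_start, database_name) {
  entries <- trimws(strsplit(variable_start, ",", fixed = TRUE)[[1]])
  # Look for an explicit "database::variable" entry first
  for (entry in entries) {
    pieces <- strsplit(entry, "::", fixed = TRUE)[[1]]
    if (length(pieces) == 2 && pieces[1] == database_name) {
      return(pieces[2])
    }
  }
  # Otherwise fall back to the final entry, if it carries no database prefix
  last_entry <- entries[length(entries)]
  if (!grepl("::", last_entry, fixed = TRUE)) {
    return(last_entry)
  }
  NA_character_
}

start <- "cchs2001_p::RACA_6A,cchs2003_p::RACC_6A,ADL_01"
resolve_start_variable(start, "cchs2003_p")
resolve_start_variable(start, "cchs2005_p")
```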
Each row in the variable details sheet encodes the rule for recoding value(s) of the original variable to a category in the new variable. The categories of the new variable are encoded in the recTo column, and the value(s) of the original variable that recode to this new value are encoded in the recFrom column. These recode columns follow a syntax similar to the sjmisc::rec() function. Whereas in sjmisc::rec() the recoding rules are written in one string, in the variable details sheet they are spread over multiple rows and two columns (recFrom and recTo). For example, a recoding rule in the sjmisc function would look like "1=2;2=3", whereas in the variable details sheet it would be encoded over two rows: the first row has recFrom and recTo values of 1 and 2, and the second row has 2 and 3. The rules for describing recoding pairs are shown below:
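To make the correspondence concrete, the sketch below expands the "1=2;2=3" string from the paragraph above into one row per recode pair. The helper rules_to_rows is hypothetical (it is not part of recodeflow or sjmisc), and it only handles the simple "old=new" pairs separated by semicolons.

```r
# Hypothetical helper: split an sjmisc-style rule string into recFrom/recTo rows.
rules_to_rows <- function(rule_string) {
  # Split into "old=new" pairs, then split each pair at the equals sign
  pairs <- strsplit(strsplit(rule_string, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)
  data.frame(
    recFrom = vapply(pairs, function(p) trimws(p[1]), character(1)),
    recTo = vapply(pairs, function(p) trimws(p[2]), character(1)),
    stringsAsFactors = FALSE
  )
}

rows <- rules_to_rows("1=2;2=3")
rows
```

The values are kept as character strings here because rule cells can also hold tokens such as ranges or keywords.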
Each recode pair is a row
Multiple values from the old variable that should be recoded into a new category of the new variable should be separated with a comma, e.g., recFrom = "1,2"; recTo = 1 will recode values of 1 and 2 in the original variable to 1 in the new variable
A value range is indicated by a colon, e.g., recFrom = "1:4"; recTo = 1 will recode all values from 1 to 4 into 1
Minimum and maximum values are indicated by min (or lo) and max (or hi), e.g., recFrom = "min:4"; recTo = 1 will recode all values from the minimum value of the original variable to 4 into 1
All other values, which have not been specified yet, are indicated by else, e.g., recFrom = "else"; recTo = NA will recode all other values (not specified in other rows) of the original variable to NA
The else token can be combined with copy, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g., recFrom = "else"; recTo = "copy"
NA values are allowed both for the original and the new variable, e.g., recFrom = "NA"; recTo = 1 recodes all NA into 1, and recFrom = "3:5"; recTo = "NA" recodes all values from 3 to 5 into NA in the new variable
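The rules above can be read as a small matching language. The following sketch shows one way a single recFrom cell could be tested against a numeric vector; it is an illustration under stated assumptions, not the package's parser. The helper matches_rule is hypothetical, ranges are assumed to include both endpoints, and the else token is left out because it depends on which values earlier rows have already claimed.

```r
# Hypothetical helper: which elements of x are covered by one recFrom rule?
matches_rule <- function(x, rec_from) {
  parts <- trimws(strsplit(rec_from, ",", fixed = TRUE)[[1]])
  hit <- rep(FALSE, length(x))
  for (part in parts) {
    if (part == "NA") {
      # Literal NA token matches missing values
      hit <- hit | is.na(x)
    } else if (grepl(":", part, fixed = TRUE)) {
      # Range, with min/lo and max/hi resolved from the data
      bounds <- strsplit(part, ":", fixed = TRUE)[[1]]
      lower <- if (bounds[1] %in% c("min", "lo")) min(x, na.rm = TRUE) else as.numeric(bounds[1])
      upper <- if (bounds[2] %in% c("max", "hi")) max(x, na.rm = TRUE) else as.numeric(bounds[2])
      hit <- hit | (!is.na(x) & x >= lower & x <= upper)
    } else {
      # Single value
      hit <- hit | (!is.na(x) & x == as.numeric(part))
    }
  }
  hit
}

x <- c(1, 2, 3, 5, NA)
matches_rule(x, "1,2")
matches_rule(x, "min:3")
matches_rule(x, "NA")
```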
a dataframe that is recoded according to rules in variable_details.
# Variable details sheet: one row per recode rule (31 rows)
var_details <- data.frame(
  "variable" = c("time", rep("status", times = 3), rep("trt", times = 2), "age",
    rep("sex", times = 2), rep("ascites", times = 2), rep("hepato", times = 2),
    rep("spiders", times = 2), rep("edema", times = 3), "bili", "chol",
    "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime",
    rep("stage", times = 4)),
  "dummyVariable" = c("NA", "status0", "status1", "status2", "trt1", "trt2",
    "NA", "sexM", "sexF", "ascites0", "ascites1", "hepato0", "hepato1",
    "spiders0", "spiders1", "edema0.0", "edema0.5", "edema1.0",
    rep("NA", times = 9), "stage1", "stage2", "stage3", "stage4"),
  "typeEnd" = c("cont", rep("cat", times = 3), rep("cat", times = 2), "cont",
    rep("cat", times = 2), rep("cat", times = 2), rep("cat", times = 2),
    rep("cat", times = 2), rep("cat", times = 3), rep("cont", times = 9),
    rep("cat", times = 4)),
  "databaseStart" = rep("tester1, tester2", times = 31),
  "variableStart" = c("[time]", rep("[status]", times = 3),
    rep("[trt]", times = 2), "[age]", rep("[sex]", times = 2),
    rep("[ascites]", times = 2), rep("[hepato]", times = 2),
    rep("[spiders]", times = 2), rep("[edema]", times = 3), "[bili]", "[chol]",
    "[albumin]", "[copper]", "[alk.phos]", "[ast]", "[trig]", "[platelet]",
    "[protime]", rep("[stage]", times = 4)),
  "typeStart" = c("cont", rep("cat", times = 3), rep("cat", times = 2), "cont",
    rep("cat", times = 2), rep("cat", times = 2), rep("cat", times = 2),
    rep("cat", times = 2), rep("cat", times = 3), rep("cont", times = 9),
    rep("cat", times = 4)),
  "recEnd" = c("copy", "0", "1", "2", "1", "2", "copy", "m", "f", "0", "1",
    "0", "1", "0", "1", "0.0", "0.5", "1.0", rep("copy", times = 9),
    "1", "2", "3", "4"),
  "catLabel" = c("", "status 0", "status 1", "status 2", "trt 1", "trt 2", "",
    "sex m", "sex f", "ascites 0", "ascites 1", "hepato 0", "hepato 1",
    "spiders 0", "spiders 1", "edema 0.0", "edema 0.5", "edema 1.0",
    rep("", times = 9), "stage 1", "stage 2", "stage 3", "stage 4"),
  "catLabelLong" = c("", "status 0", "status 1", "status 2", "trt 1", "trt 2",
    "", "sex m", "sex f", "ascites 0", "ascites 1", "hepato 0", "hepato 1",
    "spiders 0", "spiders 1", "edema 0.0", "edema 0.5", "edema 1.0",
    rep("", times = 9), "stage 1", "stage 2", "stage 3", "stage 4"),
  "recStart" = c("else", "0", "1", "2", "1", "2", "else", "m", "f", "0", "1",
    "0", "1", "0", "1", "0.0", "0.5", "1.0", rep("else", times = 9),
    "1", "2", "3", "4"),
  "catStartLabel" = c("", "status 0", "status 1", "status 2", "trt 1", "trt 2",
    "", "sex m", "sex f", "ascites 0", "ascites 1", "hepato 0", "hepato 1",
    "spiders 0", "spiders 1", "edema 0.0", "edema 0.5", "edema 1.0",
    rep("", times = 9), "stage 1", "stage 2", "stage 3", "stage 4"),
  "variableStartShortLabel" = c("time", rep("status", times = 3),
    rep("trt", times = 2), "age", rep("sex", times = 2),
    rep("ascites", times = 2), rep("hepato", times = 2),
    rep("spiders", times = 2), rep("edema", times = 3), "bili", "chol",
    "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime",
    rep("stage", times = 4)),
  "variableStartLabel" = c("time", rep("status", times = 3),
    rep("trt", times = 2), "age", rep("sex", times = 2),
    rep("ascites", times = 2), rep("hepato", times = 2),
    rep("spiders", times = 2), rep("edema", times = 3), "bili", "chol",
    "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime",
    rep("stage", times = 4)),
  "units" = rep("NA", times = 31),
  "notes" = rep("This is sample survival pbc data", times = 31)
)

# Variables sheet: one row per new variable (19 rows)
var_sheet <- data.frame(
  "variable" = c("time", "status", "trt", "age", "sex", "ascites", "hepato",
    "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", "ast",
    "trig", "platelet", "protime", "stage"),
  "label" = c("time", "status", "trt", "age", "sex", "ascites", "hepato",
    "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", "ast",
    "trig", "platelet", "protime", "stage"),
  "labelLong" = c("time", "status", "trt", "age", "sex", "ascites", "hepato",
    "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", "ast",
    "trig", "platelet", "protime", "stage"),
  "section" = rep("tester", times = 19),
  "subject" = rep("tester", times = 19),
  "variableType" = c("cont", "cat", "cat", "cont", "cat", "cat", "cat", "cat",
    "cat", rep("cont", times = 9), "cat"),
  "databaseStart" = rep("tester1, tester2", times = 19),
  "units" = rep("NA", times = 19),
  "variableStart" = c("[time]", "[status]", "[trt]", "[age]", "[sex]",
    "[ascites]", "[hepato]", "[spiders]", "[edema]", "[bili]", "[chol]",
    "[albumin]", "[copper]", "[alk.phos]", "[ast]", "[trig]", "[platelet]",
    "[protime]", "[stage]")
)

# Split the survival pbc data into two sample databases
library(survival)
tester1 <- survival::pbc[1:209,]
tester2 <- survival::pbc[210:418,]
db_name1 <- "tester1"
db_name2 <- "tester2"

# Recode each database using the same variables and variable details sheets
rec_sample1 <- rec_with_table(data = tester1, variables = var_sheet, variable_details = var_details, database_name = db_name1)
rec_sample2 <- rec_with_table(data = tester2, variables = var_sheet, variable_details = var_details, database_name = db_name2)
Selects variables from variables sheet based on passed roles
select_vars_by_role(roles, variables)
roles | a vector containing a single or multiple roles to match by |
variables | the variables sheet containing variable info |
a vector containing the variable names that match the passed roles
Sets labels for the passed database using the names of the final variables in variable_details/variables_sheet as well as the labels contained in the passed dataframes
set_data_labels(data_to_label, variable_details, variables_sheet = NULL)
data_to_label | newly transformed dataset |
variable_details | variable_details.csv |
variables_sheet | variables.csv |
labeled data_to_label
Example variable details sheet for vignettes
tester_variable_details
A data frame with 69 rows and 16 columns:
variable name
dummy variable name
end type
database start
variable start
start type
record end
record start
category label
category long label
number of valid categories (numeric)
logical indicating presence of units
logical indicating presence of notes
category start label
variable start short label
variable start label
Example variables sheet for vignettes
tester_variables
A data frame with 24 rows and 11 columns:
variable name
variable label
variable label long
subject
section
variable type
database start
units
variable start
logical indicating presence of notes
logical indicating presence of description