| Title: | Building Sets of Variables in a Probabilistic Framework |
|---|---|
| Description: | Create sets of variables based on a mutual information approach. In this context, a set is a collection of distinct elements (e.g., variables) that can also be treated as a single entity. Mutual information, a concept from probability theory, quantifies the dependence between two variables by expressing how much information about one variable can be gained from observing the other. Furthermore, you can analyze and visualize these sets to better understand the relationships among variables. |
| Authors: | Nicolas Leenaerts [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-2421-6845>), Aaron Fisher [aut, cph] (ORCID: <https://orcid.org/0000-0001-9754-4618>) |
| Maintainer: | Nicolas Leenaerts <[email protected]> |
| License: | CC BY 4.0 |
| Version: | 1.0.0 |
| Built: | 2026-06-05 09:25:42 UTC |
| Source: | https://github.com/nicolasleenaerts/setweaver |
Computes the conditional entropy \(H(Y \mid X)\) for two binary
vectors 'y' (outcome) and 'x' (predictor).
ce(y, x)
y |
A binary outcome vector (0/1 or logical). Must be the same length as 'x'. |
x |
A binary predictor vector (0/1 or logical). Must be the same length as 'y'. |
A numeric scalar giving the conditional entropy \(H(Y \mid X)\).
ce(misimdata$y,misimdata$x1)
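As an illustration of the quantity described above, here is a minimal base-R sketch of the conditional entropy of a binary outcome given a binary predictor. The helper name `cond_entropy` is hypothetical and this is not the package's implementation; the use of base-2 logarithms (bits) and the pairwise removal of missing values are assumptions.

```r
# Hypothetical helper, not setweaver's ce(): H(Y | X) in bits (assumed log base).
cond_entropy <- function(y, x) {
  keep <- !is.na(y) & !is.na(x)            # assumption: drop rows with NA in either vector
  y <- as.integer(y[keep])
  x <- as.integer(x[keep])
  h <- 0
  for (j in 0:1) {
    px <- mean(x == j)                     # P(x = j)
    for (i in 0:1) {
      pxy <- mean(y == i & x == j)         # P(y = i, x = j)
      if (pxy > 0) h <- h - pxy * log2(pxy / px)
    }
  }
  h
}

y <- c(1, 1, 0, 0, 1, 0, 1, 0)
x <- c(1, 1, 0, 0, 1, 0, 0, 1)
cond_entropy(y, x)   # uncertainty left in y once x is known
cond_entropy(y, y)   # a variable fully determines itself, so this is 0
```

A value close to the marginal entropy of 'y' would indicate that 'x' carries little information about the outcome.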
Computes the conditional probability \(P(y = 1 \mid x = 1)\) for
two binary vectors 'y' and 'x'. Rows with missing values in either vector are
excluded.
cprob(y, x)
y |
A binary outcome vector (0/1 or logical). Must be the same length as 'x'. |
x |
A binary predictor vector (0/1 or logical). Must be the same length as 'y'. |
A numeric scalar giving the conditional probability that 'y = 1' given 'x = 1'.
cprob(misimdata$y,misimdata$x1)
Computes the conditional probability \(P(y = 1 \mid x = 0)\)
for two binary vectors 'y' and 'x'. Rows with missing values in either vector
are excluded.
cprob_inv(y, x)
y |
A binary outcome vector (0/1 or logical). Must be the same length as 'x'. |
x |
A binary predictor vector (0/1 or logical). Must be the same length as 'y'. |
A numeric scalar giving the conditional probability that 'y = 1' given 'x = 0'.
cprob_inv(misimdata$y,misimdata$x1)
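The two conditional probabilities documented above differ only in the predictor level that is conditioned on. The following base-R sketch makes that explicit; `cond_prob` and its `level` argument are hypothetical names used for illustration and are not part of the package.

```r
# Hypothetical helper (not the package's code): P(y = 1 | x = level).
cond_prob <- function(y, x, level = 1) {
  keep <- !is.na(y) & !is.na(x)   # exclude rows with a missing value in either vector
  y <- as.integer(y[keep])
  x <- as.integer(x[keep])
  mean(y[x == level] == 1)        # NaN if no complete row has x == level
}

y <- c(1, 0, 1, 1, NA, 0)
x <- c(1, 1, 1, 0, 1, NA)
cond_prob(y, x, level = 1)   # analogous to cprob():     2 of 3 rows with x = 1 have y = 1
cond_prob(y, x, level = 0)   # analogous to cprob_inv(): the single row with x = 0 has y = 1
```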
Computes a set of descriptive diagnostics for a binary outcome 'y' against one or more predictors in 'x', including marginal probability, conditional probability, absolute and proportional differences between marginal and conditional probabilities, and analogous measures based on entropy.
entfuns(y, x)
y |
A binary outcome vector (0/1 or logical). Length 'n'. |
x |
A data frame of binary predictors (columns). Must have 'n' rows; each column is analyzed separately against 'y'. |
Inputs are treated as binary (0/1 or logical). Missing values are removed pairwise for each predictor (rows with 'NA' in either the outcome or the predictor are excluded for that predictor's calculations).
A data frame with one row per predictor and the following columns:
Predictor name.
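Marginal probability of the outcome, computed on complete cases for that predictor.
Marginal probability of the predictor.
Conditional probability of the outcome given the predictor.
Absolute difference between the conditional and marginal probabilities.
Percent difference relative to the marginal probability.
Entropy of the outcome.
Conditional entropy of the outcome given the predictor.
Absolute difference between the entropy and the conditional entropy.
Percent difference in entropy relative to the marginal entropy.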
entfuns(misimdata$y,misimdata[,2:5])
Returns the marginal entropy of a binary variable.
entropy(x)
x |
A binary vector (numeric coded as 0/1 or logical). Must be length >= 1. |
A numeric scalar giving the entropy of 'x'.
entropy(misimdata$x1)
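For readers who want to see the calculation behind a marginal entropy, here is a small base-R sketch. The name `binary_entropy` is hypothetical, and the choice of base-2 logarithms and the dropping of missing values are assumptions rather than documented behaviour of the package.

```r
# Hypothetical helper, not setweaver's entropy(): entropy of a 0/1 vector in bits.
binary_entropy <- function(x) {
  p <- mean(x == 1, na.rm = TRUE)   # proportion of ones among non-missing values
  probs <- c(p, 1 - p)
  probs <- probs[probs > 0]         # treat 0 * log(0) as 0
  -sum(probs * log2(probs))
}

binary_entropy(c(0, 1, 0, 1))   # balanced variable: maximal entropy of 1 bit
binary_entropy(c(1, 1, 1))      # constant variable: no uncertainty, entropy 0
```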
Given a character vector of sets (each set encoded as variable names joined by a separator), returns the subset of sets that are minimal: no returned set is a strict superset of another. Duplicates and ordering differences are handled according to the implementation.
find_minimal_sets(str_vec, sep = "_")
str_vec |
Character vector of set strings for which to find minimally sufficient sets (e.g., 'c("x1_x2", "x1_x2_x3")'). |
sep |
Character string used as the separator between variables in each set. Defaults to '"_"'. |
A character vector containing the minimally sufficient sets (i.e., sets that are not strict supersets of any other set in 'str_vec').
pairmiresult = pairmi(misimdata[,2:6])
results_probstat <- probstat(misimdata$y,pairmiresult$expanded.data,nfolds=5)
find_minimal_sets(results_probstat$xvars[results_probstat$cprob >= 0.20])
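The idea of a minimal set can be sketched in base R as follows. The helper `minimal_sets` is hypothetical and only illustrates the definition given above (drop every set that is a strict superset of another); how the package treats duplicates or reordered members is not specified here, so this sketch simply keeps sets that are equal to one another. It also assumes that variable names do not themselves contain the separator.

```r
# Hypothetical illustration of the "minimal set" filter, not the package's code.
minimal_sets <- function(str_vec, sep = "_") {
  # Split each set string into its member variables.
  parts <- lapply(strsplit(str_vec, sep, fixed = TRUE), unique)
  is_minimal <- vapply(seq_along(parts), function(i) {
    # Set i is minimal if no other set is a strict subset of it.
    !any(vapply(seq_along(parts), function(j) {
      j != i &&
        length(parts[[j]]) < length(parts[[i]]) &&
        all(parts[[j]] %in% parts[[i]])
    }, logical(1)))
  }, logical(1))
  str_vec[is_minimal]
}

minimal_sets(c("x1_x2", "x1_x2_x3", "x4", "x3_x4"))
# "x1_x2_x3" contains "x1_x2" and "x3_x4" contains "x4", so both are dropped
```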
Computes the likelihood-ratio test statistic (G statistic) from the mutual information and the joint count of two variables:
\[ G = 2 \cdot N \cdot I \]
where \(N\) is the joint sample size and \(I\) is the mutual information.
gstat(mi, count)
mi |
Numeric scalar; the mutual information between two variables. |
count |
Integer scalar; the joint count (sample size) used in computing 'mi'. |
A numeric scalar giving the G statistic value.
gstat(mi(misimdata$y,misimdata$x1),jtct(misimdata$y,misimdata$x1))
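To show how a G statistic is typically turned into a significance decision, the sketch below multiplies a mutual information value by twice the sample size and compares the result with a chi-squared distribution. The helper `g_statistic` is hypothetical; the sketch assumes the mutual information is expressed in nats (natural logarithm) and that one degree of freedom applies, as for a 2 x 2 table. If the mutual information were in bits, an extra factor of `log(2)` would be needed.

```r
# Hypothetical helper, not setweaver's gstat(): G = 2 * N * I, with I in nats (assumed).
g_statistic <- function(mi_nats, n) {
  2 * n * mi_nats
}

g <- g_statistic(mi_nats = 0.01, n = 500)
g                                              # test statistic
p_value <- pchisq(g, df = 1, lower.tail = FALSE)  # df = 1 assumed for two binary variables
p_value
```

Even a small mutual information can be significant when the sample is large, because the statistic scales linearly with the sample size.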
Computes the joint probability \(P(x = 1, y = 1)\) for two binary vectors
'x' and 'y'. Rows with missing values in either vector are excluded.
joint(y, x)
y |
A binary outcome vector (0/1 or logical). Must be the same length as 'x'. |
x |
A binary predictor vector (0/1 or logical). Must be the same length as 'y'. |
A numeric scalar giving the joint probability that both 'x = 1' and 'y = 1', calculated as the joint count divided by the number of complete cases.
joint(misimdata$y,misimdata$x1)
Counts the number of complete observations where both a binary outcome 'y' and a binary predictor 'x' equal 1. Missing values are excluded pairwise (rows with 'NA' in either 'x' or 'y' are ignored).
jtct(y, x)
y |
Outcome vector (binary: 0/1 or logical). Must be the same length as 'x'. |
x |
Predictor vector (binary: 0/1 or logical). Must be the same length as 'y'. |
An integer scalar giving the number of observations where 'x == 1' and 'y == 1', after excluding missing values.
jtct(misimdata$y,misimdata$x1)
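The relationship between the joint count and the joint probability described above can be sketched in base R. The helper names `joint_count` and `joint_prob` are hypothetical and only mirror the documented definitions; they are not the package's code.

```r
# Hypothetical helpers mirroring the documented definitions.
joint_count <- function(y, x) {
  keep <- !is.na(y) & !is.na(x)            # pairwise exclusion of missing values
  sum(y[keep] == 1 & x[keep] == 1)         # rows where both equal 1
}

joint_prob <- function(y, x) {
  # Joint count divided by the number of complete cases.
  joint_count(y, x) / sum(!is.na(y) & !is.na(x))
}

y <- c(1, 1, 0, NA, 1)
x <- c(1, 0, 1, 1, NA)
joint_count(y, x)   # only the first row has y = 1 and x = 1 among complete cases
joint_prob(y, x)    # one of three complete rows
```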
Computes the mutual information (MI) between an outcome 'y' and a predictor 'x', using the standard definition:
\[ I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \]
mi(y, x)
y |
Outcome vector (binary: 0/1 or logical). |
x |
Predictor vector (binary: 0/1 or logical). Must be the same length as 'y'. |
A numeric scalar giving the mutual information between 'x' and 'y'.
mi(misimdata$y,misimdata$x1)
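The standard definition of mutual information for two binary vectors can be written out directly from the 2 x 2 table of joint proportions. The sketch below does this in base R; `mutual_info` is a hypothetical name, and both the natural logarithm and the pairwise removal of missing values are assumptions, not documented properties of the package function.

```r
# Hypothetical helper, not setweaver's mi(): mutual information in nats (assumed log base).
mutual_info <- function(y, x) {
  keep <- !is.na(y) & !is.na(x)            # assumption: drop rows with NA in either vector
  y <- as.integer(y[keep])
  x <- as.integer(x[keep])
  total <- 0
  for (i in 0:1) {
    for (j in 0:1) {
      pxy <- mean(y == i & x == j)         # joint proportion p(y = i, x = j)
      if (pxy > 0) {
        total <- total + pxy * log(pxy / (mean(y == i) * mean(x == j)))
      }
    }
  }
  total
}

y <- c(1, 1, 0, 0, 1, 0, 1, 0)
x <- c(1, 1, 0, 0, 1, 0, 0, 1)
mutual_info(y, x)                            # positive: x and y mostly agree
mutual_info(c(0, 0, 1, 1), c(0, 1, 0, 1))    # independent by construction: 0
```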
A data set with 10 predictors and 1 outcome that can be used to try out the functions of the setweaver package.
misimdata
A data frame with 2500 rows and 11 variables:
y |
Outcome |
x1 |
First binary predictor |
x2 |
Second binary predictor |
x3 |
Third binary predictor |
x4 |
Fourth binary predictor |
x5 |
Fifth binary predictor |
x6 |
Sixth binary predictor |
x7 |
Seventh binary predictor |
x8 |
Eighth binary predictor |
x9 |
Ninth binary predictor |
x10 |
Tenth binary predictor |
Calculates the mutual information for sets of variables, computes the G statistic for each set, tests each set for significance, and keeps only the significant sets.
pairmi(data, alpha = 0.05, MI.threshold = NULL, n_elements = 5, sep = "_")
data |
A data frame containing the variables to be paired/combined. Columns should be binary. |
alpha |
Numeric p-value threshold for significance. Defaults to '0.05'. |
MI.threshold |
Numeric mutual information threshold. If provided, it overrides 'alpha'-based filtering. |
n_elements |
Integer giving the maximum size of sets to evaluate (e.g., '2' for pairs, '3' for triplets). Must be >= 2. |
sep |
String used to join variable names when forming set identifiers (e.g., '"_"'). |
A list with the following components:
expanded.data |
A data frame containing the original variables and the columns for significant sets (e.g., pair/triplet indicators). |
original.variables |
Character vector of the original variable names. |
sets |
A data frame describing significant sets, including their members, size, MI, G statistic, p-value, and constructed name. |
pairmi(misimdata[,2:6])
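The screening step described for this function (mutual information, G statistic, significance filter) can be sketched for pairs only in base R. Everything here is illustrative: `pair_screen` and the column names of its result are hypothetical, the mutual information is assumed to be in nats, the G statistic is assumed to be compared with a chi-squared distribution with one degree of freedom, and the sketch does not build larger sets or expanded data as the package function does.

```r
# Hypothetical pairs-only sketch of an MI-based screen; not the package's pairmi().
pair_screen <- function(data, alpha = 0.05, sep = "_") {
  mi_nats <- function(a, b) {
    total <- 0
    for (i in 0:1) for (j in 0:1) {
      pab <- mean(a == i & b == j)
      if (pab > 0) total <- total + pab * log(pab / (mean(a == i) * mean(b == j)))
    }
    total
  }
  pairs <- combn(names(data), 2, simplify = FALSE)   # all unordered column pairs
  rows <- lapply(pairs, function(p) {
    keep <- !is.na(data[[p[1]]]) & !is.na(data[[p[2]]])
    a <- data[[p[1]]][keep]
    b <- data[[p[2]]][keep]
    mi <- mi_nats(a, b)
    g <- 2 * sum(keep) * mi                          # G statistic from MI and sample size
    data.frame(set = paste(p, collapse = sep), mi = mi, g = g,
               p = pchisq(g, df = 1, lower.tail = FALSE),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  out[out$p < alpha, , drop = FALSE]                 # keep only significant pairs
}

# Toy data: 'b' copies 'a', while 'c' is independent of both by construction.
toy <- data.frame(a = rep(c(0, 1), 50),
                  b = rep(c(0, 1), 50),
                  c = rep(c(0, 0, 1, 1), 25))
res <- pair_screen(toy)
res$set   # only the dependent pair survives the filter
```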
Creates a network-style graph showing how a set of predictors ('x_vars') are related to an outcome ('y_var'). Relationships can be displayed either as conditional probabilities or as effects estimated by logistic regression.
plot_prob(
  data, y_var, x_vars,
  var_labels = NULL, prob_digits = 2, method = "conditional", title = NULL,
  vertex_color = "lightblue", vertex_frame_color = "darkblue",
  vertex_label_color = "black", edge_color = "darkgrey",
  edge_label_color = "black", min_arrow_width = 1, max_arrow_width = 10,
  node_size = 45, label_cex = 0.8
)
data |
A data frame containing the outcome ('y_var') and predictors ('x_vars'). |
y_var |
Character string giving the name of the outcome variable in 'data'. |
x_vars |
Character vector of predictor variable names in 'data'. |
var_labels |
Optional character vector of display labels for the predictors. Must match the length of 'x_vars'. |
prob_digits |
Integer; number of decimal places to round conditional probabilities. Defaults to '2'. |
method |
Character string indicating how to quantify associations: '"conditional"' (the default shown in the usage) for conditional probabilities or '"logistic"' for logistic regression effects. |
title |
Character string; title of the plot. |
vertex_color |
Character string giving the fill color of nodes. |
vertex_frame_color |
Character string giving the color of node borders. |
vertex_label_color |
Character string giving the color of node labels. |
edge_color |
Character string giving the color of edges. |
edge_label_color |
Character string giving the color of edge labels. |
min_arrow_width |
Numeric value for the minimum edge width. |
max_arrow_width |
Numeric value for the maximum edge width. |
node_size |
Numeric value controlling the size of nodes. |
label_cex |
Numeric value controlling the size of node labels. |
A graph object (typically an ['igraph::igraph'] object or similar) is returned and plotted. Nodes represent variables and edges represent associations. Node labels include variable names and marginal probabilities. Edge labels display either conditional probabilities or logistic regression effects.
plot_prob(misimdata,'y',colnames(misimdata[,3:6]),method='logistic')
Computes the marginal probability \(P(x = 1)\) for a binary
vector 'x', ignoring missing values.
prob(x)
x |
A numeric or logical vector coded as 0/1 (or 'FALSE'/'TRUE'). Values other than 0, 1, 'FALSE', 'TRUE', or 'NA' will be ignored. |
A numeric scalar giving the proportion of entries equal to 1 among the non-missing values of 'x'.
prob(c(0, 1, 1, 0, 1))
Computes marginal, conditional, and information-theoretic summaries for a binary outcome 'y' against one or more predictors in 'x'. Performs either Fisher's exact test or a generalized linear mixed model (GLMM) for inference.
probstat(y, x, test = "Fisher", ri, nfolds, seed = 10101)
y |
A binary outcome vector (logical or numeric coded as 0/1). Length 'n'. |
x |
A data frame of predictors (typically the expanded data returned by [pairmi()]). Must have 'n' rows; columns are treated as candidate predictors. |
test |
Character string selecting the inferential method; one of 'c("fisher", "glmm")'. Defaults to '"fisher"' if missing. |
ri |
Optional vector/factor giving the grouping variable for a random intercept in the GLMM. Must be length 'n'. Ignored if 'test = "fisher"'. |
nfolds |
Integer; number of folds used for cross-validation. |
seed |
Integer seed for fold randomization. |
A data frame with one row per evaluated predictor (or pair) and the following columns:
Marginal probability of .
Marginal probability of .
Conditional probability .
Conditional probability .
Inverse conditional probability .
Difference .
Percent difference relative to .
Entropy of .
Entropy of .
Conditional entropy of .
Difference between marginal and conditional entropy of .
Percent difference in entropy.
p-value from Fisher's exact test or the GLMM (as applicable).
pairmiresult = pairmi(misimdata[,2:6])
probstat(misimdata$y,pairmiresult$expanded.data,nfolds=5)
Creates a set map visualization from the output of [pairmi()], showing which original variables compose the derived sets at a specified depth.
setmapmi(original_variables = NULL, sets = NULL, n_elements = NULL)
original_variables |
Character vector of names for the original variables that were paired (typically 'pairmi_result$original.variables'). |
sets |
A data frame returned by [pairmi()] describing the sets. Must contain the columns required by 'setmapmi()' (e.g., identifiers for sets and their constituent variables). |
n_elements |
Integer scalar giving the set size (depth) to visualize (e.g., '2' for pairs, '3' for triplets). Must be >= 1 and present in 'sets'. |
A set map showing which original variables make up the sets at the specified depth.
pairmiresult = pairmi(misimdata[,2:6])
setmapmi(pairmiresult$original.variables,pairmiresult$sets,2)
Computes the z-score for testing whether the proportion (probability) of successes in 'x' differs from zero.
zprob(x)
x |
A numeric or logical vector representing binary outcomes (e.g., 0/1 or TRUE/FALSE), from which the proportion is calculated. |
A numeric value giving the z-score for the observed proportion.
zprob(misimdata$x1)
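The documentation above does not state which standard error is used for the z-score, so the following base-R sketch should be read as one possible reading only. It assumes a Wald-type standard error, sqrt(p(1 - p) / n), computed on the non-missing values; the helper name `z_for_proportion` is hypothetical.

```r
# Hypothetical sketch: z = p / sqrt(p * (1 - p) / n), a Wald-type z-score (assumed form).
z_for_proportion <- function(x) {
  x <- x[!is.na(x)]                 # assumption: ignore missing values
  n <- length(x)
  p <- mean(x == 1)                 # observed proportion of successes
  p / sqrt(p * (1 - p) / n)         # Inf when p is 1, NaN when p is 0
}

z <- z_for_proportion(c(1, 0, 1, 1))
z   # p = 0.75 and n = 4 give a standard error of about 0.217
```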