Quick introduction to foodingraph

library(foodingraph)

Introduction

A simple R package to infer food networks from categorical and binary variables.

Displays a weighted undirected graph from an adjacency matrix. Can perform confidence-interval bootstrap inference with mutual information or maximal information coefficient.

How it works

From an adjacency matrix, the package can infer the network with confidence-interval (CI) bootstraps of the distribution of mutual information1 values or maximal information coefficients (MIC)2for each pairwise association. The CI bootstrap calculated is compared to the CI bootstraps from simulated independent pairwise associations. The CI bootstrap from simulated independent pairwise variables is used to define a threshold of non-significance in the network. Our approach is to use a threshold for each pairwise variable type : two ordinal variables, two binary variables, one ordinal variable and one ordinal variable.

For example, For each pairwise association, if the 99th percentile of the simulated CI is higher than the 1th percentile of the sample bootstrap distribution, the edge is removed.

From the inferred adjacency matrix, the package can then display the graph using ggplot23, igraph4 and ggraph5.

See R documentation for more information.

Example data set

For the purpose of this example, I invented some food intakes data on n = 13 subjects and f = 8 food groups : o = 6 ordinally-encoded (from 0 to 13) and b = 2 binary-encoded (0 or 1). Therefore, do not expect these examples to reflect reality.

# Food intakes (ordinaly- or binary-encoded)
obs_data <- data.frame(
#| Foods | Subject 1   2   3   4   5   6   7   8   9  10  11  12  13  |
#|-------|------------------------------------------------------------|
  alcohol_cat  = c(8,  1,  3,  0, 10,  5,  1, 10,  2,  8,  1,  3,  9),
  bread_cat    = c(7,  4,  3,  4,  0,  9,  4,  5,  7,  3,  4,  0,  9),
  coffee_cat   = c(3,  6,  6,  6,  2,  3,  5,  8,  8,  6,  6,  2,  3),
  duck_cat     = c(0,  3,  1,  0,  0,  2, 13,  1,  0,  0,  2, 13,  1),
  eggs_cat     = c(5,  5,  4,  5,  8,  8,  6,  9,  6,  8,  2,  3,  1),
  fruit_cat    = c(1,  7,  5,  8,  2,  3,  1,  0,  7,  7,  5,  8,  2),
  gin_bin      = c(1,  0,  1,  0,  1,  0,  0,  1,  0,  0,  1,  0,  1),
  ham_bin      = c(1,  1,  1,  1,  1,  1,  1,  0,  1,  1,  1,  0,  1)
)

head(obs_data)
#>   alcohol_cat bread_cat coffee_cat duck_cat eggs_cat fruit_cat gin_bin ham_bin
#> 1           8         7          3        0        5         1       1       1
#> 2           1         4          6        3        5         7       0       1
#> 3           3         3          6        1        4         5       1       1
#> 4           0         4          6        0        5         8       0       1
#> 5          10         0          2        0        8         2       1       1
#> 6           5         9          3        2        8         3       0       1
# The legend for the graph
legend <- data.frame(
  name   = colnames(obs_data),
  title  = c("Alcohol", "Bread",   "Coffee",    "Duck",    "Eggs", "Fruit", "Gin",     "Ham"),
  family = c("Alcohol", "Cereals", "Beverages", "Poultry", "Eggs", "Fruit", "Alcohol", "Meats")
)

# Transform family intro factors?

Now let’s calculate the maximal information coefficient6 adjacency matrix, with the foodingraph function mic_adj_matrix.

adjacency_matrix <-  mic_adj_matrix(obs_data)

Network inference

This step is optional. If you want to visualize the network, jump to Network visualization

Arbitrary threshold

Foodingraph allows to select edges on the basis of a threshold value in the adjacency matrix. It can either be applied to the adjacency matrix by the functions graph_from_matrix() or links_nodes_from_mat(), with two parameters:

  1. threshold (default is 0) : the threshold value
  2. abs_threshold (bool, default TRUE) : if the threshold should apply to the absolute values of the edges or not. If TRUE, it will not convert the values of the adjacency matrix to absolute values, only compare the threshold to the absolute values.

Confidence-interval bootstrap inference

Foodingraph allows to perform confidence-interval (CI) bootstrap inference, by comparing the CI bootstrap of simulated independent data to the CI bootstrap of each pairwise association of the dataset. Two methods to calculate the CI bootstrap exist : mutual information7 or maximal information coefficient8.

NOTE If you want to use mutual information, be sure to install the minet package available on Bioconductor. It will not be automatically downloaded when installing foodingraph.

CI bootstrap of independent simulated data

Let’s start by simulating independent data. As our dataset is comprised of ordinal and binary variables, we will simulate independent :

  • pairwise ordinal variables
  • pairwise binary variables
  • pairwise ordinal & binary variables.

This will allow to compare each pairwise association of the dataset to the corresponding type of threshold.

For this example, we will use MIC.

# Ordinal vs. ordinal
thresh_ord_ord <- boot_simulated_cat_bin("cat", method = "mic", size = 500)
#> Confidence-interval bootstrap on simulated independent variables of type: cat
#> Number of simulations: 10
#> Number of bootstraps per simulations: 5000
#> Sample size for each simulation: 500
#> Contigency table of the simulated data:
#> 9161471730066462894213020125400185610007401010412000000120000100000
#> Simulation 1 : 0.0325299899129651
#> Simulation 2 : 0.0323083797191614
#> Simulation 3 : 0.032432277027209
#> Simulation 4 : 0.0319265107370138
#> Simulation 5 : 0.0323957860888818
#> Simulation 6 : 0.0327634852271982
#> Simulation 7 : 0.0326330065387264
#> Simulation 8 : 0.0329643699517739
#> Simulation 9 : 0.0315401581257434
#> Simulation 10 : 0.0318476115790957
#> Mean of the percentiles: 0.0323341574907769
# Binary vs. binary
thresh_bin_bin <- boot_simulated_cat_bin("bin", method = "mic", size = 500)
#> Confidence-interval bootstrap on simulated independent variables of type: bin
#> Number of simulations: 10
#> Number of bootstraps per simulations: 5000
#> Sample size for each simulation: 500
#> Contigency table of the simulated data:
#> 2051925026
#> Simulation 1 : 0.00972426379492631
#> Simulation 2 : 0.00989299752571164
#> Simulation 3 : 0.00937051852871288
#> Simulation 4 : 0.00972871782283215
#> Simulation 5 : 0.00986344038036663
#> Simulation 6 : 0.0102193594612705
#> Simulation 7 : 0.00964083533948696
#> Simulation 8 : 0.00964160371333008
#> Simulation 9 : 0.0103418659240403
#> Simulation 10 : 0.00981311454186768
#> Mean of the percentiles: 0.00982367170325451
# Ordinal vs. binary
thresh_ord_bin <- boot_simulated_cat_bin("bincat", method = "mic", size = 500)
#> Confidence-interval bootstrap on simulated independent variables of type: bincat
#> Number of simulations: 10
#> Number of bootstraps per simulations: 5000
#> Sample size for each simulation: 500
#> Contigency table of the simulated data:
#> 171512420019114671241071
#> Simulation 1 : 0.0239665945183287
#> Simulation 2 : 0.0223714028126321
#> Simulation 3 : 0.0237178455590487
#> Simulation 4 : 0.0228545853060847
#> Simulation 5 : 0.0222746800652629
#> Simulation 6 : 0.0227178692065114
#> Simulation 7 : 0.022871385647938
#> Simulation 8 : 0.023434137980914
#> Simulation 9 : 0.0221290536412375
#> Simulation 10 : 0.023291194508422
#> Mean of the percentiles: 0.022962874924638

CI bootstrap inference

Now let’s perform the CI bootstrap inference on the observed data. To do this, foodingraph needs a list of the ordinal (a.k.a. categorical) and binary variables, so it can accurately compare the correct threshold to the correct pairwise variables.

As the computations can take some time, a progress bar is built into the function. You can deactivate it by setting the parameter show_progress to FALSE (function boot_cat_bin). Recommended if the output is in a Rmarkdown document.

cat_var <- c("alcohol_cat", "bread_cat", "coffee_cat", "duck_cat", "eggs_cat",
             "fruit_cat")
bin_var <- c("gin_bin", "ham_bin")

inferred_adj_matrix <- boot_cat_bin(obs_data,
                                    list_cat_var = cat_var,
                                    list_bin_var = bin_var,
                                    method = "mic",
                                    threshold_cat = thresh_ord_ord,
                                    threshold_bin = thresh_bin_bin,
                                    threshold_bin_cat = thresh_ord_bin,
                                    boots = 5000,
                                    show_progress = FALSE)
#> Performing boostrap inference with method :  mic
# Print how many edges have been removed
n_null_before <- (length(which(adjacency_matrix==0))-ncol(obs_data))/2
n_null_after <- (length(which(inferred_adj_matrix==0))-ncol(obs_data))/2
print(paste(n_null_after - n_null_before, "edges have been removed"))
#> [1] "9 edges have been removed"

Network visualization

Quick start: directly from the adjacency matrix

graph1 <- graph_from_matrix(adjacency_matrix, legend, main_title = "My graph", layout = "graphopt")
graph1
#> $igraph
#> IGRAPH 8f5cf50 UNW- 8 36 -- 
#> + attr: name (v/c), title (v/c), family (v/c), size (v/n), weight
#> | (e/n), width (e/n), sign (e/c), alpha (e/n)
#> + edges from 8f5cf50 (vertex names):
#>  [1] alcohol_cat--alcohol_cat alcohol_cat--bread_cat   alcohol_cat--coffee_cat 
#>  [4] alcohol_cat--duck_cat    alcohol_cat--eggs_cat    alcohol_cat--fruit_cat  
#>  [7] alcohol_cat--gin_bin     alcohol_cat--ham_bin     bread_cat  --bread_cat  
#> [10] bread_cat  --coffee_cat  bread_cat  --duck_cat    bread_cat  --eggs_cat   
#> [13] bread_cat  --fruit_cat   bread_cat  --gin_bin     bread_cat  --ham_bin    
#> [16] coffee_cat --coffee_cat  coffee_cat --duck_cat    coffee_cat --eggs_cat   
#> [19] coffee_cat --fruit_cat   coffee_cat --gin_bin     coffee_cat --ham_bin    
#> + ... omitted several edges
#> 
#> $net

#> 
#> $deg
#>      name degrees
#> 1 Alcohol       9
#> 2   Bread       9
#> 3  Coffee       9
#> 4    Duck       9
#> 5    Eggs       9
#> 6   Fruit       9
#> 7     Gin       9
#> 8     Ham       9
#> 
#> attr(,"class")
#> [1] "list"        "foodingraph"

Save the graph in a file

save_graph(graph1)

Customization

Many options and layouts exist to customize the graph.

library(ggplot2)

custom1 <- graph_from_matrix(adjacency_matrix, legend, main_title = "Node label as name",
                             layout = "graphopt", node_label_title = F, node_label_size = 5)
custom2 <- graph_from_matrix(adjacency_matrix, legend, main_title = "Node type as label",
                             layout = "graphopt", node_type = "label")
custom3 <- graph_from_matrix(adjacency_matrix, legend, main_title = "Grid layout",
                             layout = "grid", node_label_size = 5)
custom4 <- graph_from_matrix(adjacency_matrix, legend, main_title = "Circle layout",
                             layout = "circle", node_label_size = 5)
custom1$net
custom2$net
custom3$net
custom4$net

Compare graphs

Foodingraph provides a useful graph comparison function, which harmonizes the graphs’ weights and node degree sizes, in order to facilitate the visual comparison.

First, let’s generate a second graph.

# New set of observation data
obs_data_2 <- matrix(c(round(runif(78, 0, 13)), round(runif(26))), nrow = 13, ncol = 8)
colnames(obs_data_2) <- colnames(obs_data)

# Compute the MIC adjacency matrix
adjacency_matrix_2 <- mic_adj_matrix(obs_data_2)

graph2 <- graph_from_matrix(adjacency_matrix_2, legend, main_title = "My graph 2",
                            layout = "graphopt")

Then let’s compare the first graph and this one on a single, unified plot using compare_graphs().

comp1_2 <- compare_graphs(graph1, graph2, position = "horizontal")
#> Warning in get_plot_component(plot, "guide-box"): Multiple components found;
#> returning the first one. To return all, use `return_all = TRUE`.
comp1_2

You can also save this new graph. It will automatically have a bigger size.

save_graph(comp1_2)

References


  1. Meyer, Patrick E, Frédéric Lafitte, and Gianluca Bontempi. “Minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information.” BMC Bioinformatics 9, no. 1 (December 2008). https://doi.org/10.1186/1471-2105-9-461.↩︎

  2. Albanese, Davide, Michele Filosi, Roberto Visintainer, Samantha Riccadonna, Giuseppe Jurman, and Cesare Furlanello. “Minerva and Minepy: A C Engine for the MINE Suite and Its R, Python and MATLAB Wrappers.” Bioinformatics 29, no. 3 (February 1, 2013): 407–8. https://doi.org/10.1093/bioinformatics/bts707.↩︎

  3. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.↩︎

  4. Csardi G, Nepusz T: The igraph software package for complex network research, InterJournal, Complex Systems 1695. 2006. http://igraph.org↩︎

  5. Thomas Lin Pedersen, https://ggraph.data-imaginist.com/↩︎

  6. Albanese, Davide, Michele Filosi, Roberto Visintainer, Samantha Riccadonna, Giuseppe Jurman, and Cesare Furlanello. “Minerva and Minepy: A C Engine for the MINE Suite and Its R, Python and MATLAB Wrappers.” Bioinformatics 29, no. 3 (February 1, 2013): 407–8. https://doi.org/10.1093/bioinformatics/bts707.↩︎

  7. Meyer, Patrick E, Frédéric Lafitte, and Gianluca Bontempi. “Minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information.” BMC Bioinformatics 9, no. 1 (December 2008). https://doi.org/10.1186/1471-2105-9-461.↩︎

  8. Reshef, D. N., Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. “Detecting Novel Associations in Large Data Sets.” Science 334, no. 6062 (December 16, 2011): 1518–24. https://doi.org/10.1126/science.1205438.↩︎