---
title: "Quick introduction to foodingraph"
author: "Victor Gasque, Cecilia Samieri, Boris Hejblum"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quick introduction to foodingraph}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 7, fig.height = 6, comment = "#>")
```

```{r}
library(foodingraph)
```

## Introduction

foodingraph is a simple R package to infer food networks from categorical and binary variables. It displays a weighted undirected graph from an adjacency matrix, and it can perform confidence-interval bootstrap inference with mutual information or the maximal information coefficient.

#### How it works

From an adjacency matrix, the package can infer the network with confidence-interval (CI) bootstraps of the distribution of mutual information[^1] values or maximal information coefficients (MIC)[^2] for each pairwise association. The CI bootstrap of each observed association is compared to the CI bootstrap obtained from simulated independent pairwise associations, which defines a threshold of non-significance in the network.

Our approach is to use one threshold for each type of variable pair: two ordinal variables, two binary variables, or one ordinal variable and one binary variable. For example, for each pairwise association, if the 99th percentile of the simulated CI is higher than the 1st percentile of the sample bootstrap distribution, the edge is removed.

From the inferred adjacency matrix, the package can then display the graph using `ggplot2`[^3], `igraph`[^4] and `ggraph`[^5].

See the R documentation for more information.

## Example data set

For the purpose of this example, I invented some food intake data on $n=13$ subjects and $f=8$ food groups: $o=6$ ordinally-encoded (from 0 to 13) and $b=2$ binary-encoded (0 or 1). Therefore, do not expect these examples to reflect reality.

```{r}
# Food intakes (ordinally- or binary-encoded), one column per food group
obs_data <- data.frame(
  # Subject:       1   2   3   4   5   6   7   8   9  10  11  12  13
  alcohol_cat = c( 8,  1,  3,  0, 10,  5,  1, 10,  2,  8,  1,  3,  9),
  bread_cat   = c( 7,  4,  3,  4,  0,  9,  4,  5,  7,  3,  4,  0,  9),
  coffee_cat  = c( 3,  6,  6,  6,  2,  3,  5,  8,  8,  6,  6,  2,  3),
  duck_cat    = c( 0,  3,  1,  0,  0,  2, 13,  1,  0,  0,  2, 13,  1),
  eggs_cat    = c( 5,  5,  4,  5,  8,  8,  6,  9,  6,  8,  2,  3,  1),
  fruit_cat   = c( 1,  7,  5,  8,  2,  3,  1,  0,  7,  7,  5,  8,  2),
  gin_bin     = c( 1,  0,  1,  0,  1,  0,  0,  1,  0,  0,  1,  0,  1),
  ham_bin     = c( 1,  1,  1,  1,  1,  1,  1,  0,  1,  1,  1,  0,  1)
)
head(obs_data)

# The legend for the graph
legend <- data.frame(
  name = colnames(obs_data),
  title = c("Alcohol", "Bread", "Coffee", "Duck", "Eggs", "Fruit", "Gin", "Ham"),
  family = c("Alcohol", "Cereals", "Beverages", "Poultry", "Eggs", "Fruit", "Alcohol", "Meats")
)
# Transform family into factors?
```

Now let's calculate the maximal information coefficient[^2] adjacency matrix with the `foodingraph` function `mic_adj_matrix()`.

```{r}
adjacency_matrix <- mic_adj_matrix(obs_data)
```

## Network inference

This step is optional. If you only want to visualize the network, jump to [Network visualization](#netviz).

### Arbitrary threshold

Foodingraph can select edges based on a threshold value applied to the adjacency matrix. The threshold can be set in either `graph_from_matrix()` or `links_nodes_from_mat()`, through two parameters:

1. `threshold` (default is 0): the threshold value.
2. `abs_threshold` (bool, default TRUE): whether the threshold applies to the absolute values of the edges. If TRUE, the values of the adjacency matrix are *not* converted to absolute values; the threshold is only compared to their absolute values.

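The effect of these two parameters can be illustrated with a minimal base-R sketch. The helper `apply_threshold()` and the toy matrix below are hypothetical and are not part of foodingraph. The sketch also assumes that an edge is dropped (set to 0) when its compared value is strictly below the threshold, which may differ from the package's exact rule.

```{r}
# Toy symmetric adjacency matrix (hypothetical values, not package output)
adj <- matrix(c( 0.0,  0.6, -0.2,
                 0.6,  0.0, -0.7,
                -0.2, -0.7,  0.0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Hypothetical helper mimicking the behaviour described above
apply_threshold <- function(mat, threshold = 0, abs_threshold = TRUE) {
  # Compare either |weight| or the raw weight to the threshold
  compared <- if (abs_threshold) abs(mat) else mat
  # Assumption: weights strictly below the threshold are set to 0;
  # the weights that are kept are left unchanged (signs included)
  mat[compared < threshold] <- 0
  mat
}

kept_abs <- apply_threshold(adj, threshold = 0.5, abs_threshold = TRUE)
kept_raw <- apply_threshold(adj, threshold = 0.5, abs_threshold = FALSE)
kept_abs
kept_raw
```

With `abs_threshold = TRUE`, the edge of weight -0.7 is kept with its sign intact, because its absolute value exceeds 0.5. With `abs_threshold = FALSE`, the same edge is dropped, because -0.7 is below 0.5.
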
### Confidence-interval bootstrap inference

Foodingraph can perform confidence-interval (CI) bootstrap inference by comparing the CI bootstrap of simulated independent data to the CI bootstrap of each pairwise association of the dataset.

Two methods are available to calculate the CI bootstrap: mutual information[^1] or maximal information coefficient[^6].

**NOTE** If you want to use mutual information, be sure to install the `minet` package, available on Bioconductor. It is not downloaded automatically when installing foodingraph.

#### CI bootstrap of independent simulated data

Let's start by simulating independent data. As our dataset is composed of ordinal and binary variables, we will simulate independent:

- pairwise ordinal variables
- pairwise binary variables
- pairwise ordinal & binary variables.

This allows each pairwise association of the dataset to be compared with the threshold of the corresponding type.

For this example, we will use MIC.

```{r}
# Ordinal vs. ordinal
thresh_ord_ord <- boot_simulated_cat_bin("cat", method = "mic", size = 500)

# Binary vs. binary
thresh_bin_bin <- boot_simulated_cat_bin("bin", method = "mic", size = 500)

# Ordinal vs. binary
thresh_ord_bin <- boot_simulated_cat_bin("bincat", method = "mic", size = 500)
```

#### CI bootstrap inference

Now let's perform the CI bootstrap inference on the observed data. To do this, foodingraph needs the lists of ordinal (a.k.a. categorical) and binary variables, so that it can compare each pair of variables to the threshold of the matching type.

As the computations can take some time, a progress bar is built into the function. You can deactivate it by setting the `show_progress` parameter of `boot_cat_bin()` to FALSE.
*Recommended if the output is in an R Markdown document.*

```{r}
cat_var <- c("alcohol_cat", "bread_cat", "coffee_cat",
             "duck_cat", "eggs_cat", "fruit_cat")
bin_var <- c("gin_bin", "ham_bin")

inferred_adj_matrix <- boot_cat_bin(obs_data,
                                    list_cat_var = cat_var,
                                    list_bin_var = bin_var,
                                    method = "mic",
                                    threshold_cat = thresh_ord_ord,
                                    threshold_bin = thresh_bin_bin,
                                    threshold_bin_cat = thresh_ord_bin,
                                    boots = 5000,
                                    show_progress = FALSE)

# Print how many edges have been removed:
# count the zero entries, subtract the diagonal, and halve (symmetric matrix)
n_null_before <- (length(which(adjacency_matrix == 0)) - ncol(obs_data)) / 2
n_null_after <- (length(which(inferred_adj_matrix == 0)) - ncol(obs_data)) / 2
print(paste(n_null_after - n_null_before, "edges have been removed"))
```

## Network visualization {#netviz}

### Quick start: directly from the adjacency matrix

```{r}
graph1 <- graph_from_matrix(adjacency_matrix, legend, main_title = "My graph",
                            layout = "graphopt")
graph1
```

### Or from a list of links and nodes

This is useful when you want to alter the links before plotting.

```{r}
# Extract the links and nodes from the adjacency matrix
links_nodes <- links_nodes_from_mat(adjacency_matrix, legend)

# Transform negative weights into positive ones
links_nodes$links <- transform(links_nodes$links, weight = abs(weight))

# Display the graph
graph2 <- graph_from_links_nodes(links_nodes, main_title = "My graph")
graph2
```

### Save the graph in a file

```{r eval=FALSE}
save_graph(graph1)
```

### Customization

Many options and layouts exist to customize the graph.

```{r message=FALSE}
library(ggplot2)

custom1 <- graph_from_matrix(adjacency_matrix, legend,
                             main_title = "Node label as name",
                             layout = "graphopt",
                             node_label_title = FALSE, node_label_size = 5)

custom2 <- graph_from_matrix(adjacency_matrix, legend,
                             main_title = "Node type as label",
                             layout = "graphopt", node_type = "label")

custom3 <- graph_from_matrix(adjacency_matrix, legend,
                             main_title = "Grid layout",
                             layout = "grid", node_label_size = 5)

custom4 <- graph_from_matrix(adjacency_matrix, legend,
                             main_title = "Circle layout",
                             layout = "circle", node_label_size = 5)
```

```{r eval=FALSE}
custom1$net
custom2$net
custom3$net
custom4$net
```

```{r echo=FALSE}
# Cookbook for R, simplified here
multiplot <- function(..., cols = 2) {
  library(grid)
  plots <- list(...)
  numPlots <- length(plots)

  # Matrix giving the position of each plot in the grid
  layout <- matrix(seq(1, cols * ceiling(numPlots / cols)),
                   ncol = cols, nrow = ceiling(numPlots / cols))

  grid.newpage()
  pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

  # Print each plot in the cell that holds its index
  for (i in seq_len(numPlots)) {
    matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
    print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                    layout.pos.col = matchidx$col))
  }
}

multiplot(custom1$net + theme(legend.position = "none"),
          custom2$net + theme(legend.position = "none"),
          custom3$net + theme(legend.position = "none"),
          custom4$net + theme(legend.position = "none"))
```

### Compare graphs

Foodingraph provides a graph comparison function that harmonizes the graphs' edge weights and node degree sizes to make visual comparison easier.

First, let's generate a second graph.

```{r}
# New set of observation data
obs_data_2 <- matrix(c(round(runif(78, 0, 13)), round(runif(26))),
                     nrow = 13, ncol = 8)
colnames(obs_data_2) <- colnames(obs_data)

# Compute the MIC adjacency matrix
adjacency_matrix_2 <- mic_adj_matrix(obs_data_2)

graph2 <- graph_from_matrix(adjacency_matrix_2, legend, main_title = "My graph 2",
                            layout = "graphopt")
```

Then let's compare the first graph and this one on a single, unified plot using `compare_graphs()`.

```{r, fig.width = 7, fig.height = 5}
comp1_2 <- compare_graphs(graph1, graph2, position = "horizontal")
comp1_2
```

You can also save this new graph. It will automatically have a bigger size.

```{r eval=FALSE}
save_graph(comp1_2)
```

## References

[^1]: Meyer, Patrick E., Frédéric Lafitte, and Gianluca Bontempi. “Minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information.” BMC Bioinformatics 9, no. 1 (December 2008). https://doi.org/10.1186/1471-2105-9-461.

[^2]: Albanese, Davide, Michele Filosi, Roberto Visintainer, Samantha Riccadonna, Giuseppe Jurman, and Cesare Furlanello. “Minerva and Minepy: A C Engine for the MINE Suite and Its R, Python and MATLAB Wrappers.” Bioinformatics 29, no. 3 (February 1, 2013): 407–8. https://doi.org/10.1093/bioinformatics/bts707.

[^3]: Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

[^4]: Csardi, Gabor, and Tamas Nepusz. “The igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695 (2006). http://igraph.org

[^5]: Pedersen, Thomas Lin. ggraph. https://ggraph.data-imaginist.com/

[^6]: Reshef, D. N., Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. “Detecting Novel Associations in Large Data Sets.” Science 334, no. 6062 (December 16, 2011): 1518–24. https://doi.org/10.1126/science.1205438.