--- title: "mixpoissonreg in the tidyverse" author: "Alexandre B. Simas" date: '`r format(Sys.time(), "%Y-%m-%d")`' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{mixpoissonreg in the tidyverse} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # `augment`, `glance` and `tidy` methods The `mixpoissonreg` package provides the `augment`, `glance` and `tidy` methods following the same spirit as the corresponding methods found in the `broom` package. We will now provide a few examples of outputs and applications of these methods to `mixpoissonreg` objects. ## `augment` The `augment` method for `mixpoissonreg` objects provides a `tibble::tibble` with several important quantities such as residuals, Cook's distance, generalized Cook's distance, the model's matrix, confidence intervals for the means, etc. ```{r warning=FALSE} library(mixpoissonreg) fit <- mixpoissonreg(daysabs ~ gender + math + prog | prog, model = "PIG", data = Attendance) augment(fit) ``` One can also set the type of fitted values by changing the `type.predict` argument, where the options are "response" (the default), "link", "precision", "variance". ```{r} augment(fit, type.predict = "link") ``` If `type.predict` is "link", one can set `se_fit` to `TRUE` to obtain standard errors for the mean parameters in the linear predictor scale. ```{r} augment(fit, type.predict = "link", se_fit = TRUE) ``` By default the `augment` method returns the Pearson's residuals. To return the score residuals one simply has to set the `type.residuals` argument to "score": ```{r} augment(fit, type.residuals = "score") ``` It is possible to obtain predictions for newdata by using the `newdata` argument. To such an end, the `newdata` argument must be set to a `data.frame` object containing the values of the covariates for which the predictions are desired. For example: ```{r} augment(fit, newdata = data.frame(gender = c("male", "female"), math = c(34, 85), prog = factor(c("General", "Academic"), levels = c("General", "Academic", "Vocational")))) ``` Notice that in the above example the confidence intervals for the means were returned by default. However, it is important to obtain prediction intervals, i.e., intervals for which the future observations will fall with the prescribed probability. It is important to notice that since we did not obtain an approximate distribution for the response variable, we obtain the prediction intervals by simulation, which is computationally intensive. To obtain prediction intervals, one must set the `pred_int` argument to `TRUE`. The prescribed probability can be set by entering the `level` argument, the default level is 0.95. With respect to the simulation to obtain prediction intervals, one can change the number of mean and prediction parameters generated by setting the `nsim_pred` argument to the desired value, the default is 100, and also can change the number of response variables *y* generated for each pair of mean and precision parameters by setting the `nsim_pred_y` to the desired value, the default is 100. Finally, we can also set `conf_int` to `FALSE` to remove the confidence intervals for the mean. 
```{r}
augment(fit, newdata = data.frame(gender = c("male", "female"), math = c(34, 85),
                                  prog = factor(c("General", "Academic"),
                                                levels = c("General", "Academic", "Vocational"))),
        pred_int = TRUE, level = 0.99, nsim_pred = 50, nsim_pred_y = 50, conf_int = FALSE)
```

With the `augment` method it is simple to sort the observations by their Cook's distance or by their generalized Cook's distance, along with their indexes and the model's matrix, in order to facilitate the study of their influence:

```{r message=FALSE}
library(dplyr)

augment(fit) %>% mutate(.index = row_number()) %>%
  arrange(desc(.cooksd)) %>%
  select(.index, daysabs, gender, math, prog, .cooksd)
```

and

```{r}
augment(fit) %>% mutate(.index = row_number()) %>%
  arrange(desc(.gencooksd)) %>%
  select(.index, daysabs, gender, math, prog, .gencooksd)
```

Finally, it is easy to use the `augment` method to build useful plots. For instance, let us consider a new model containing only "math" as a covariate, and let us create confidence intervals for the means and prediction intervals for the response variables. First, the plot with confidence intervals for the means:

```{r message=FALSE, warning=FALSE}
library(ggplot2)

fit_math <- mixpoissonreg(daysabs ~ math, data = Attendance)

fit_data <- augment(fit_math) %>% dplyr::select(math, daysabs) %>%
  dplyr::rename("Math Score" = math, "Days Absent" = daysabs)

new_math <- seq(0, 100, by = 0.25)

fit_int <- augment(fit_math, newdata = data.frame(math = new_math)) %>%
  dplyr::rename("Math Score" = math) %>%
  mutate("Days Absent" = 0)

ggplot(fit_data, aes(x = `Math Score`, y = `Days Absent`)) + geom_point() +
  geom_function(fun = function(x) {
    exp(fit_math$coefficients$mean[1] + fit_math$coefficients$mean[2] * x)
  }, colour = "blue") +
  geom_ribbon(data = fit_int, aes(ymin = .fittedlwrconf, ymax = .fitteduprconf),
              fill = "grey70", alpha = 0.7)
```

Observe that the confidence intervals tend to be narrow: they are confidence intervals for the means, not for the response variables. The intervals that play the role of "confidence intervals" for the response variables are the prediction intervals, which are much wider and provide estimates of intervals within which future observations will fall with the prescribed probability. Now the plot containing the prediction intervals for the response variables:

```{r message=FALSE, warning=FALSE}
fit_pred <- augment(fit_math, newdata = data.frame(math = new_math), pred_int = TRUE,
                    nsim_pred = 50, nsim_pred_y = 50) %>%
  dplyr::rename("Math Score" = math) %>%
  mutate("Days Absent" = 0)

ggplot(fit_data, aes(x = `Math Score`, y = `Days Absent`)) + geom_point() +
  geom_function(fun = function(x) {
    exp(fit_math$coefficients$mean[1] + fit_math$coefficients$mean[2] * x)
  }, colour = "blue") +
  geom_ribbon(data = fit_pred, aes(ymin = .fittedlwrpred, ymax = .fitteduprpred),
              fill = "grey70", alpha = 0.7)
```

## `glance`

The `glance` method for `mixpoissonreg` objects takes only one argument, the fitted `mixpoissonreg` object. It returns a `tibble::tibble` with useful information for constructing tests and for comparing different fits:

```{r}
glance(fit)
```

Notice that it also contains Efron's pseudo-$R^2$, which is essentially an indicator of how good future predictions will typically be. Basically, if the variance of the response is very high, then even though the model may provide a good estimate of the mean, the mean may be a poor estimate of the response variable.
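For intuition, one common form of Efron's pseudo-$R^2$ is $1 - \sum_i (y_i - \hat{\mu}_i)^2 / \sum_i (y_i - \bar{y})^2$, which can be computed by hand from the `augment` output, as sketched below. Depending on the exact definition adopted by the package, the value reported by `glance` may be computed slightly differently (e.g., as the squared correlation between observed and fitted values).

```{r}
# Efron's pseudo-R^2 computed by hand from the augment output:
# 1 - SSE/SST with the fitted means as predictions.
aug <- augment(fit)
1 - sum((aug$daysabs - aug$.fitted)^2) / sum((aug$daysabs - mean(aug$daysabs))^2)
```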
## `tidy`

The `tidy` method for `mixpoissonreg` objects provides the parameter estimates along with their standard errors, *z*-statistics and *p*-values. Besides the model, it has two arguments: a logical `conf.int`, indicating whether confidence intervals for the parameters should be returned, and `conf.level`, indicating the confidence level (the default is 0.95). For example:

```{r}
tidy(fit)
```

and

```{r}
tidy(fit, conf.int = TRUE)
```

One can also use the `tidy` method to build plots of the mean-related and precision-related coefficients with confidence intervals:

```{r}
tidy(fit, conf.int = TRUE) %>%
  ggplot(aes(x = term, y = estimate)) + geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  theme(axis.text.x = element_text(angle = 45)) +
  scale_x_discrete(name = "Coefficient") +
  scale_y_continuous(name = "Estimate") +
  facet_wrap(~ component)
```

or separately (since the precision regression structure does not contain all the covariates):

```{r message=FALSE}
library(gridExtra)

pmean <- tidy(fit, conf.int = TRUE) %>% filter(component == "mean") %>%
  ggplot(aes(x = term, y = estimate)) + geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  theme(axis.text.x = element_text(angle = 45)) +
  scale_x_discrete(name = "Coefficient") +
  scale_y_continuous(name = "Estimate") +
  facet_wrap(~ component)

pprecision <- tidy(fit, conf.int = TRUE) %>% filter(component == "precision") %>%
  ggplot(aes(x = term, y = estimate)) + geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  theme(axis.text.x = element_text(angle = 45)) +
  scale_x_discrete(name = "Coefficient") +
  scale_y_continuous(name = element_blank()) +
  facet_wrap(~ component)

grid.arrange(pmean, pprecision, ncol = 2)
```

# `tidy_local_influence` and `local_influence_benchmarks` methods

The `tidy_local_influence` method for `mixpoissonreg` objects has, besides the `model` argument, all the arguments of the `local_influence` method for `mixpoissonreg` objects. For details on these arguments we refer the reader to the [Global and local influence analysis with the *mixpoissonreg* package](influence-mixpoissonreg.html) vignette. When called, the `tidy_local_influence` method returns a tidy `tibble::tibble` containing the curvatures under the chosen perturbation schemes. For example:

```{r}
tidy_local_influence(fit)
```

One can, for instance, change the curvature and the direction:

```{r}
tidy_local_influence(fit, curvature = "normal", direction = "max.eigen")
```

Useful plots of the local influence measures can be obtained with the `local_influence_autoplot` method.

The `local_influence_benchmarks` method for `mixpoissonreg` objects returns the benchmarks for the different perturbation schemes. We use the benchmarks suggested by [Zhu and Lee (2001)](https://doi.org/10.1111/1467-9868.00279) for conformal normal curvatures and the benchmarks suggested by [Verbeke and Molenberghs (2000, sect. 11.3)](https://www.springer.com/gp/book/9781441902993) for normal curvatures. For `direction = "max.eigen"`, no benchmark is provided. When called, the `local_influence_benchmarks` method returns a `tibble::tibble` containing the benchmarks for the chosen perturbation schemes. For example:

```{r}
local_influence_benchmarks(fit)
```

One possible application is the following: select the observations whose curvatures are above the benchmarks, then `left_join` this `tibble` with a subset of the `tibble` returned by the `augment` method containing the model's matrix.
This facilitates the study of the influential observations.

```{r warning=FALSE, message=FALSE}
library(tidyr)

inf_tbl <- tidy_local_influence(fit) %>%
  mutate(.index = row_number()) %>%
  pivot_longer(!.index, names_to = "perturbation", values_to = "curvature")

bench_tbl <- local_influence_benchmarks(fit) %>%
  pivot_longer(everything(), names_to = "perturbation", values_to = "benchmark")

inf_bench_tbl <- left_join(inf_tbl, bench_tbl, by = "perturbation") %>%
  mutate(influential = curvature > benchmark) %>%
  filter(influential == TRUE) %>%
  select(-influential, -benchmark, -curvature)

data_tbl <- augment(fit) %>%
  mutate(.index = row_number()) %>%
  select(.index, daysabs, gender, math, prog)

influential <- left_join(inf_bench_tbl, data_tbl, by = ".index")
influential
```

The above `tibble` is very helpful in the study of influential observations. One can also easily count how many observations have curvature above the benchmark under at least one of the perturbation schemes:

```{r}
influential %>% select(.index) %>% unique() %>% count()
```

# `autoplot` and `local_influence_autoplot` methods

The `autoplot` method for `mixpoissonreg` objects provides 6 useful plots for diagnostic analysis. By default, it provides the following plots: "Residuals vs Obs. number", "Normal Q-Q", "Cook's dist vs Generalized Cook's dist" and "Response vs Fitted means". These plots are obtained by simply calling the `autoplot` method:

```{r}
autoplot(fit)
```

Notice that it automatically identifies the `label.n` most extreme points. The default value of `label.n` is 3.

If the model is fitted with simulated envelopes, then the envelopes are automatically plotted. In the example below we consider the estimates obtained by direct maximization of the likelihood function to reduce the computational cost.

```{r warning=FALSE}
fit_env <- mixpoissonregML(daysabs ~ gender + math + prog | prog, model = "PIG",
                           data = Attendance, envelope = 100)
autoplot(fit_env)
```

We can easily plot Cook's distance and generalized Cook's distance by including in the `which` argument the numbers 3 (Cook's distance) and 4 (generalized Cook's distance):

```{r warning=FALSE}
autoplot(fit, which = c(3, 4))
```

There are also several options to customize the appearance of the plots. Several of these options are presented in the [Building and customizing ggplot2-based diagnostic plots with the mixpoissonreg package](https://vpnsctl.github.io/mixpoissonreg/articles/ggplot2-plots-mixpoissonreg.html) vignette.

The `local_influence_autoplot` method provides plots of the curvatures under the perturbation schemes for all the possible combinations of arguments of the `local_influence` method. We refer the reader to the [Global and local influence analysis with the *mixpoissonreg* package](influence-mixpoissonreg.html) vignette for further details on these arguments.

If the `direction` argument is set to "canonical" (the default), then the `n.influential` points above the *benchmarks* are automatically identified; the default value of `n.influential` is 5. Recall that we use the benchmarks suggested by [Zhu and Lee (2001)](https://doi.org/10.1111/1467-9868.00279) for conformal normal curvatures and the benchmarks suggested by [Verbeke and Molenberghs (2000, sect. 11.3)](https://www.springer.com/gp/book/9781441902993) for normal curvatures. For `direction = "max.eigen"`, no benchmark is provided; in this case, the `local_influence_autoplot` method automatically identifies the `n.influential` most extreme points.

Let us build these plots.
First, the standard arguments provide the plots of the conformal normal curvature in the canonical directions for the "case_weights", "hidden_variable", "mean_explanatory" and "precision_explanatory" perturbation schemes, with "case_weights" and "hidden_variable" being computed for all parameters, and the explanatory perturbations being computed for all covariates:

```{r}
local_influence_autoplot(fit)
```

To change to the normal curvature, simply set the `curvature` argument to "normal":

```{r}
local_influence_autoplot(fit, curvature = "normal")
```

We can change the perturbation schemes to be displayed by providing a list or vector containing the numbers of the desired perturbations: number 1 is the *case weights* perturbation, number 2 is the *hidden variable* perturbation, number 3 is the *mean explanatory* perturbation, number 4 is the *precision explanatory* perturbation and number 5 is the *simultaneous explanatory* perturbation.

```{r}
local_influence_autoplot(fit, which = c(1, 2))
```

We can draw the benchmark line (when `direction` is "canonical") by setting the `draw.benchmark` argument to `TRUE`:

```{r}
local_influence_autoplot(fit, draw.benchmark = TRUE)
```
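These options can be combined freely. For instance, the call below requests the normal curvature along the direction of largest curvature, identifying the 10 most extreme points; the specific argument values here are chosen purely for illustration.

```{r}
# Combine curvature, direction and the number of identified points
local_influence_autoplot(fit, curvature = "normal", direction = "max.eigen",
                         n.influential = 10)
```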