--- title: "visStatistics" author: "Sabine Schilling" date: "`r Sys.Date()`" output: html_vignette: number_sections: yes toc: yes toc_depth: 3 bibliography: visstat.bib vignette: > %\VignetteIndexEntry{visStatistics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( echo = TRUE, collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 5, out.width = "100%" ) ``` ```{r setup} library(visStatistics) ``` # Abstract {.unnumbered} The R package `visStatistics` provides means to quickly visualise and analyse raw data by selecting, based on a decision tree, the statistical hypothesis test with the highest statistical power between the dependent variable (response) named `varsample` and the independent variable (feature) named `varfactor` in a `data.frame` named `dataframe`. A minimal function call has the structure: `visstat(dataframe,varsample,varfactor)` The data in the provided `dataframe` must be structured by column, where `varsample` and `varfactor` are the character strings of the column names of the dependent (response) and independent (feature) variables, respectively. The choice of statistical tests performed by the core function `visstat()` depends on wether the data are numerical or categorical, the number of levels of categorical data and the data distributions (normal versus non-normal). Data of `class` `"numeric"` or `"integer"` will be referred in the reminder of this vignette as "numerical", data of `class` `"factor"` will be referred in the remainder of this vignette as "categorical". The function returns the corresponding test statistics, including any post-hoc-analysis, and generates a graph showing the key statistics of the underlying test. `visStatistics` provides a fully automated workflow. It has been successfully used for the unbiased analysis of raw medical data. The remainder of this vignette focuses on the the algorithm underlying the decision tree of `vistat()`, while a call to `?visstat` documents all parameter settings of the function. All implemented statistical tests are called with their default parameter sets, except for `conf.level`, which can be adjusted in the `visstat()` function call. As a consequence, no paired tests are available. # Comparing central tendencies If the feature consists of data of `class` `"factor"` with two or more levels and the response consists of data of `class` `"numeric"` or `"integer"` (both of mode `"numeric"`), tests are applied to compare the central tendencies. ## Two sample tests: Welch's t-test or Wilcoxon rank sum test If the feature has exactly two levels,Welch's t-test or its non-parametric alternative, the Wilcoxon rank sum test, is performed. Welch's t-test tests the null hypothesis that the two samples have the same mean. While Student's t-test requires homoscedasticity, Welch's t-test does not. It loses little robustness compared to Student's t-test when assumptions of Student's t-test are met [@Moser:1992fp; @Delacre:2017iv]. Therefore the implementation of Studen't-test has been omitted. The null hypothesis of the Wilcoxon rank sum test is that both samples (levels) are drawn from the same population or have identical distributions. The test choice follows the algorithm below: - If the sample size for both levels is greater than 30, always perform Welch's t-test (`t.test()`) [@ @Rasch:2011vl; @Lumley2002dsa]. 
The graphical representation consists of box plots overlaid with jitter plots showing each data point. In the case of Welch's t-test, the `conf.level` $\cdot\, 100\,\%$ confidence intervals are also shown. The test statistics of the chosen test as well as the summary statistics of the generated box plots are returned as a list.

### Examples

#### Welch's t-test

As an example we use the motor trend car road test data set (`mtcars`), which consists of 32 observations. In the example below, `mpg` denotes miles per US gallon and `am` the transmission type (0 = automatic, 1 = manual).

```{r}
mtcars$am <- as.factor(mtcars$am)
t_test_statistics <- visstat(mtcars, "mpg", "am")
# t_test_statistics # Uncomment this line to print out the test statistics
```

Increasing the confidence level `conf.level` from the default 0.95 to 0.99 results in wider confidence intervals:

```{r}
mtcars$am <- as.factor(mtcars$am)
t_test_statistics_99 <- visstat(mtcars, "mpg", "am", conf.level = 0.99)
```

#### Wilcoxon rank sum test

```{r}
grades_gender <- data.frame(
  sex = as.factor(c(rep("girl", 21), rep("boy", 23))),
  grade = c(
    19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4, 20.3, 9.3, 11.3,
    18.2, 17.5, 10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.0, 16.5,
    5.1, 15.3, 17.1, 14.8, 15.4, 14.4, 7.5, 15.5, 6.0, 17.4,
    7.3, 14.3, 13.5, 8.0, 19.5, 13.4, 17.9, 17.7, 16.4, 15.6,
    17.3, 19.9, 4.4, 2.1
  )
)
wilcoxon_statistics <- visstat(grades_gender, "grade", "sex")
```

## One-way test, ANOVA or Kruskal-Wallis test

If the feature consists of data of `class` `"factor"` with more than two levels and the response is of `mode` `"numeric"`, `visstat()` performs an analysis of variance (ANOVA) if both of the following null hypotheses cannot be rejected at the chosen probability of error $\alpha = 1-$`conf.level`:

1. normality of the standardised residuals and
2. homoscedasticity.

If only the first condition, normality of the residuals, is met, `visstat()` performs a one-way test (see `oneway.test()`). If the normality of the residuals cannot be assumed, a Kruskal-Wallis test (`kruskal.test()`) is used. These assumptions are tested by the `visAnovaAssumptions()` function.

### Checking the ANOVA assumptions

#### Residual analysis {#aov_residuals}

The function `visAnovaAssumptions()` checks the standardised residuals of the ANOVA fit for normality using both the Shapiro-Wilk test `shapiro.test()` and the Anderson-Darling test `ad.test()`. It plots the standardised residuals against the fitted means of the linear model for each level of the feature `varfactor` and generates the Q-Q plot of the standardised residuals.

#### Homoscedasticity: homogeneity of variances in each level (Bartlett test)

Both `aov()` and `oneway.test()` test whether two or more samples from normal distributions have the same mean. While `aov()` requires homogeneity of variances in each level (group), `oneway.test()` does not. Homoscedasticity is assessed using the Bartlett test (see `bartlett.test()`) under the null hypothesis that the variances in each of the levels are equal. A sketch of such an assumption check follows below.
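For illustration, both assumptions can also be checked manually. The following minimal sketch (not the package's internal code) mirrors what `visAnovaAssumptions()` reports for the `npk` example discussed below:

```{r}
# Hedged sketch of a manual assumption check (not the internal code)
fit <- aov(yield ~ block, data = npk)
shapiro.test(rstandard(fit))              # normality of standardised residuals
bartlett.test(yield ~ block, data = npk)  # homogeneity of variances across levels
```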
### One-way test and ANOVA

Depending on the p-value of the `bartlett.test()`, the corresponding test is shown in the figure title:

- If the p-value of the `bartlett.test()` is greater than $1-$`conf.level`, we assume homogeneity of variances in each level (group) and the p-value of `aov()` is displayed.
- Otherwise, homoscedasticity cannot be assumed and the p-value of `oneway.test()` is reported.

#### Post-hoc analysis: Tukey's honestly significant differences (HSD) and Sidak-corrected confidence intervals

Simple multiple comparisons of the means for the factor levels in an analysis of variance inflate the probability of declaring a significant difference when, in fact, there is none. The family-wise error rate (also called the probability of a type I error) is the probability of at least one false positive comparison, in which the null hypothesis is falsely rejected, when multiple comparisons are performed.

##### Tukey's honestly significant differences (HSD)

`visstat()` reduces this probability of a type I error by using Tukey's honestly significant differences (HSD, see `TukeyHSD()`). It creates a set of confidence intervals on the differences between the means per factor level with the specified family-wise confidence level `conf.level`. If a confidence interval does not include zero, there is a significant difference between the pair. The set of confidence intervals for all pairwise comparisons is returned along with the Tukey HSD-adjusted p-values. In the graphical representation of the one-way test and ANOVA, the green letters between two levels differ only if the Tukey's HSD-corrected p-value for these two levels is smaller than $\alpha = 1-$`conf.level`.

#### Sidak-corrected confidence intervals

Tukey's HSD procedure is based on **pairwise** comparisons of the differences between the means at each factor level and produces a set of corresponding confidence intervals. The Sidak procedure, on the other hand, addresses the problem of a type I error by lowering the acceptable probability of a type I error for **all** comparisons of the levels of the independent, categorical variable. The Sidak-corrected acceptable probability of error [@Sidak] is defined as $\alpha_{Sidak} = 1-$`conf.level`$^{1/M}$, where $M = \frac{n \cdot (n-1)}{2}$ is the number of pairwise comparisons of the $n$ levels of the categorical variable. In the graphical display of the one-way test and ANOVA, `visstat()` displays the `conf.level` $\cdot\, 100\,\%$ confidence intervals alongside the larger, Sidak-corrected $(1-\alpha_{Sidak}) \cdot 100\,\%$ confidence intervals.

#### Limitations

Note that the current structure of `visstat()` does not allow the study of interactions between the different levels of an independent variable.

### Kruskal-Wallis test

If the p-value of the standardised residuals computed by `shapiro.test()` is smaller than the error probability $1-$`conf.level`, `visstat()` chooses a non-parametric alternative, the Kruskal-Wallis rank sum test. `kruskal.test()` tests the null hypothesis that the medians are equal in each group level. As post-hoc analysis the pairwise Wilcoxon rank sum test `pairwise.wilcox.test()` is used, applying the default Holm method for multiple comparisons [@Holm1979ASS]. If the Holm-adjusted p-value for a pair is smaller than $1-$`conf.level`, the green letters under the corresponding two box plots will differ. In all other respects, the graphical representation of the Kruskal-Wallis test resembles that of the Wilcoxon rank sum test described above. A list with the test statistics of the Kruskal-Wallis rank sum test and the p-values of the pairwise comparisons adjusted by the Holm method is returned.
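As an illustration of this post-hoc step, the Holm-adjusted pairwise comparisons can be reproduced directly with base R (a minimal sketch using the `iris` data analysed below):

```{r}
# Kruskal-Wallis test followed by Holm-adjusted pairwise Wilcoxon rank sum tests
kruskal.test(Petal.Width ~ Species, data = iris)
pairwise.wilcox.test(iris$Petal.Width, iris$Species, p.adjust.method = "holm")
```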
### Examples

#### One-way test {-}

The `npk` data set reports the yield of peas in pounds/block on six blocks, where the application of nitrogen (N), phosphate (P) or potassium (K) fertilisers was varied. Either no, one, two or three different fertilisers were applied to the blocks.

```{r}
oneway_npk <- visstat(npk, "yield", "block")
```

Based on the scatter plots of the standardised residuals, the normal quantile-quantile plot (Q-Q plot), and the p-values of both the Shapiro-Wilk test and the Anderson-Darling test, we can assume that the residuals are normally distributed. But at the given confidence level the homogeneity of variances cannot be assumed ($p < \alpha$ as calculated with the `bartlett.test()`), and the p-value of the `oneway.test()` is displayed. Post-hoc analysis with `TukeyHSD()` shows no significant difference between the yields of the different blocks (all green letters are equal).

#### ANOVA {-}

The `InsectSprays` data set gives the counts of insects in agricultural experimental units treated with six different insecticides. To stabilise the variance of the counts, we transform the count data of the `InsectSprays` data set with the square root.

```{r}
insect_sprays_tr <- InsectSprays
insect_sprays_tr$count_sqrt <- sqrt(InsectSprays$count)
visstat(insect_sprays_tr, "count_sqrt", "spray")
```

After the transformation, the homogeneity of variances can be assumed ($p > \alpha$ as calculated with the `bartlett.test()`), and the p-value of the `aov()` is displayed.

#### Kruskal-Wallis rank sum test {-}

The `iris` data set gives the measurements of the petal width in cm for three different iris species.

```{r}
visstat(iris, "Petal.Width", "Species")
```

In the `iris` data example, the graphical analysis of the scatter plots of the standardised residuals as well as the Q-Q plot suggest that the residuals are not normally distributed. This visual inspection is confirmed by the very small p-values of the implemented tests of normality, the Shapiro-Wilk test and the Anderson-Darling test. If **both** p-values of the Shapiro-Wilk test and the Anderson-Darling test are smaller than $\alpha = 1-$`conf.level` (as in the example above), `visstat()` switches to the non-parametric alternative `kruskal.test()`. Post-hoc analysis with `pairwise.wilcox.test()` reveals significant differences between the petal widths of all three species (all green letters differ).

# Linear Regression

If both the feature `varfactor` and the response `varsample` consist of numerical data (of `class` `"numeric"` or `"integer"`), `visstat()` performs a simple linear regression.

## Residual analysis

`visstat()` checks the normal distribution of the standardised residuals derived from `lm()` both graphically and with the Shapiro-Wilk and Anderson-Darling tests (analogous to section [Residual analysis](#aov_residuals)). If the p-values of both the Shapiro-Wilk and the Anderson-Darling test of the null hypothesis that the standardised residuals are normally distributed are smaller than $1-$`conf.level`, the title of the residual plot will display the message "Requirement of normally distributed residuals not met". Regardless of the result of the residual analysis, `visstat()` performs the regression itself in the next step. This residual check can also be reproduced manually, as sketched below.
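The following minimal sketch reproduces the residual check for the `cars` example below. It assumes that `ad.test()` is provided by the `nortest` package and is not the package's internal code:

```{r}
# Hedged sketch of the residual check for a simple linear regression
fit <- lm(dist ~ speed, data = cars)
res <- rstandard(fit)   # standardised residuals
shapiro.test(res)       # Shapiro-Wilk test of normality
nortest::ad.test(res)   # Anderson-Darling test of normality
```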
The title of the graphical output indicates the chosen confidence level `conf.level`, the regression parameters with their confidence intervals and p-values, and the adjusted $R^2$. The graph shows the raw data, the regression line and both the confidence and prediction bands corresponding to the chosen `conf.level`. `visstat()` returns a list with the test statistics of the linear regression, the p-values of the normality tests of the standardised residuals and the pointwise estimates of the confidence and prediction bands.

## Examples

### Data set: cars

The `cars` data set reports the speed of cars in mph and the distance (`dist`) in ft taken to stop.

```{r}
linreg_cars <- visstat(cars, "dist", "speed")
```

Increasing the confidence level `conf.level` from the default 0.95 to 0.99 results in wider confidence and prediction bands:

```{r}
linreg_cars <- visstat(cars, "dist", "speed", conf.level = 0.99)
```

p-values greater than $\alpha = 1-$`conf.level` in both the Anderson-Darling normality test and the Shapiro-Wilk test of the standardised residuals indicate that the normality assumption of the residuals underlying the linear regression is met.

### Data set: trees

The `trees` data set provides measurements of the diameter (called "Girth") in inches and the height in feet of black cherry trees.

```{r}
linreg_trees <- visstat(trees, "Height", "Girth", conf.level = 0.9)
```

Both the graphical analysis of the standardised residuals and p-values smaller than $\alpha = 1-$`conf.level` in the Anderson-Darling normality test and the Shapiro-Wilk test of the standardised residuals suggest that the condition of normally distributed residuals of the regression model is not met. Furthermore, the linear regression model explains only 24% of the total variance of the dependent variable "Height" of the cherry trees. The user might consider other regression models. These further tests are not provided by `visstat()`.

# ${\chi}^2$- and Fisher test

If both the feature `varfactor` and the response `varsample` are categorical of type `factor`, `visstat()` tests the null hypothesis that the feature and the response are independent of each other.

Categorical data are usually presented as multidimensional arrays, called contingency tables. The values in each cell of the contingency table are the observed frequencies for each unique combination of feature and response. Based on the observed frequencies, `visstat()` calculates the expected frequencies under the null hypothesis. If the expected frequencies are large, where large is defined as at least 80% of the expected frequencies being greater than 5 and none of the expected frequencies being less than 1, `chisq.test()` is performed; otherwise, `fisher.test()` is used [@Cochran]. In the case of 2 by 2 contingency tables, the continuity correction is applied to `chisq.test()`.

`visstat()` generates a grouped column plot with the p-value of the corresponding test in the title, as well as a mosaic plot with Pearson residuals (for details see the documentation of the function `mosaic()` in the `vcd` package).

## Converting a contingency table to a `data.frame`

`visstat()` needs a `data.frame` with a column structure as input. Contingency tables can be transformed into this structure with the helper function `counts_to_cases()`, whose effect is sketched below.
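Conceptually, `counts_to_cases()` expands a table of counts into one row per observed case. A hypothetical re-implementation of this idea (for illustration only, not the actual helper code) could look like this:

```{r}
# Hypothetical sketch of the idea behind counts_to_cases() (not the actual code)
counts <- as.data.frame(HairEyeColor)  # one row per cell, with a Freq column
cases <- counts[rep(seq_len(nrow(counts)), counts$Freq), -ncol(counts)]
str(cases)                             # one row per case, Freq column dropped
```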
## Examples based on the data set: HairEyeColor

### Creating a `data.frame` from a contingency table

`counts_to_cases()` transforms the contingency table `HairEyeColor` into a `data.frame` named `HairEyeColorDataFrame`:

```{r}
HairEyeColorDataFrame <- counts_to_cases(as.data.frame(HairEyeColor))
```

### Pearson's Chi-squared test

```{r}
hair_eye_color_df <- counts_to_cases(as.data.frame(HairEyeColor))
visstat(hair_eye_color_df, "Hair", "Eye")
```

### Pearson's Chi-squared test with Yates' continuity correction

For 2 by 2 contingency tables, `visstat()` applies the continuity correction to the test statistic of the Chi-squared test. In the following example, we select only participants with black or brown hair and brown or blue eyes, resulting in a 2 by 2 contingency table.

```{r}
hair_black_brown_eyes_brown_blue <- HairEyeColor[1:2, 1:2, ]
# Transform to data frame
hair_black_brown_eyes_brown_blue_df <-
  counts_to_cases(as.data.frame(hair_black_brown_eyes_brown_blue))
# Chi-squared test
visstat(hair_black_brown_eyes_brown_blue_df, "Hair", "Eye")
```

### Fisher's exact test and mosaic plot with Pearson residuals

Again, we select a 2 by 2 contingency table from the full data set, this time keeping only male participants with black or brown hair and hazel or green eyes. Pearson's Chi-squared test on this 2 by 2 contingency table would result in an expected frequency less than 5 in one of the four cells (corresponding to 25% of all cells), violating the requirement of the Chi-squared test that the expected frequencies should be at least 5 in the majority (80%) of the cells [@Cochran]. Therefore, Fisher's exact test is chosen.

```{r}
hair_eye_color_male <- HairEyeColor[, , 1]
# Slice out a 2 by 2 contingency table
black_brown_hazel_green_male <- hair_eye_color_male[1:2, 3:4]
# Transform to data frame
black_brown_hazel_green_male <- counts_to_cases(as.data.frame(black_brown_hazel_green_male))
# Fisher test
fisher_stats <- visstat(black_brown_hazel_green_male, "Hair", "Eye")
```

# Saving the graphical output

The generated graphics can be saved in the file formats supported by `Cairo()`: "png", "jpeg", "pdf", "svg", "ps" and "tiff". In the following example we store the graphics file in the output format "png" in the `plotDirectory` `tempdir()`. The naming convention used for the graphics file reflects the chosen statistical test and the variable names.

```{r}
visstat(black_brown_hazel_green_male, "Hair", "Eye",
  graphicsoutput = "png", plotDirectory = tempdir()
)
```

Remove the graphical output from `plotDirectory`:

```{r}
file.remove(file.path(tempdir(), "chi_squared_or_fisher_Hair_Eye.png"))
file.remove(file.path(tempdir(), "mosaic_complete_Hair_Eye.png"))
```

# Overview of implemented tests

`t.test()`, `wilcox.test()`, `aov()`, `kruskal.test()`, `lm()`, `fisher.test()`, `chisq.test()`

## Implemented tests to check the normal distribution of standardised residuals

`shapiro.test()` and `ad.test()`

## Implemented post-hoc tests

`TukeyHSD()` for `aov()` and `pairwise.wilcox.test()` for `kruskal.test()`.

# Bibliography