Category: Tips

"What Stat...?"

2/7/2017

Another great resource for the analyst is "What Statistical Analysis Should I Use?", courtesy of the Institute for Digital Research and Education (IDRE) at UCLA. The linked document outlines different tests very clearly, with working examples in Stata. IDRE also provides examples for SAS and SPSS, which you can look up in this handy table if you are so inclined.

Each statistical test gets a short paragraph describing what it tests, for which types of variables it's appropriate, and how the test relates to other methods. I especially like that important assumptions, which can easily be overlooked, are pointed out explicitly.

Two independent samples t-test

An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups.

Take, for example, the two-sample t-test above. For any interval variable you can calculate the t-statistic and then use its reference distribution (Student's t) to obtain a confidence interval or p-value. However, that distribution holds exactly only when the variable is normally distributed; for other distributions it is at best a large-sample approximation. So if your variable of interest cannot be assumed to be normal, especially in a small sample, a t-test may be inappropriate.
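
A minimal sketch of that workflow, written in R rather than the Stata used by the source document. The data are simulated, and the variable names, sample sizes, and parameters are my own assumptions for illustration only.

```r
# Simulated data: names, sizes, and parameters are illustrative assumptions.
set.seed(42)
group_a <- rnorm(30, mean = 10, sd = 2)
group_b <- rnorm(30, mean = 11, sd = 2)

# Informal check of the normality assumption in each group (Shapiro-Wilk).
shapiro_a <- shapiro.test(group_a)
shapiro_b <- shapiro.test(group_b)
c(group_a = shapiro_a$p.value, group_b = shapiro_b$p.value)

# Classic pooled-variance two-sample t-test.
# (R's default, var.equal = FALSE, would run Welch's test instead.)
t_result <- t.test(group_a, group_b, var.equal = TRUE)

t_result$statistic  # the t-statistic
t_result$parameter  # degrees of freedom: 30 + 30 - 2 = 58
t_result$p.value    # two-sided p-value from the t distribution
t_result$conf.int   # 95% confidence interval for the difference in means
```

A normality check like this is only a rough guide: with small samples it has little power to detect non-normality, so it should not be read as proof that the assumption holds.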

As with most things Stata, the document is geared towards causal analysis, so terms like "dependent" and "independent" variable appear throughout. In a two-sample example, the "dependent" variable is the variable of interest and the "independent" variable is an indicator of which sample a given observation belongs to.

I came across this document when trying to answer the question: how do I test whether two non-normal samples arise from the same distribution? In my case, the Wilcoxon-Mann-Whitney test (or the Kruskal-Wallis test on two samples) would be more appropriate than a two-sample t-test.

Wilcoxon-Mann-Whitney test

The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal).

Kruskal Wallis test

The Kruskal Wallis test is used when you have one independent variable with two or more levels and an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a generalized form of the Mann-Whitney test method since it permits 2 or more groups.
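
To make the link between the two tests concrete, here is a rough sketch in R (not the Stata of the source document). The samples are simulated and all names and parameters are assumptions of mine. With exactly two groups, the Kruskal-Wallis test should agree with the Wilcoxon-Mann-Whitney test when the latter uses the normal approximation without a continuity correction.

```r
# Simulated skewed (non-normal) samples; names and parameters are illustrative.
set.seed(1)
sample_x <- rexp(40, rate = 1.0)
sample_y <- rexp(40, rate = 0.6)

# Wilcoxon-Mann-Whitney: normal approximation, no continuity correction.
wmw <- wilcox.test(sample_x, sample_y, exact = FALSE, correct = FALSE)

# Kruskal-Wallis on the same data: one vector of values plus a group factor.
values <- c(sample_x, sample_y)
group  <- factor(rep(c("x", "y"),
                     times = c(length(sample_x), length(sample_y))))
kw <- kruskal.test(x = values, g = group)

# With two groups the chi-squared statistic has 1 degree of freedom,
# and the two p-values should match.
kw$parameter
c(wilcoxon = wmw$p.value, kruskal = kw$p.value)
```

By default, wilcox.test may use an exact p-value or a continuity correction, in which case the two results will be close but not identical.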

The list of methods provided by "What Stat...?" is far from exhaustive. Other tests that I have found and believe would also be appropriate are Kolmogorov-Smirnov and Anderson-Darling. That raises a new question: how can the analyst reconcile contradictory results from different non-parametric tests? That is the subject of a later post, I believe.
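
As a small illustration of one of these alternatives, here is a sketch of the two-sample Kolmogorov-Smirnov test in R on simulated data (names and parameters are my own assumptions). The two samples share the same mean but differ in shape, which is the kind of difference a test of location alone can miss. Anderson-Darling is left out because base R's stats package does not provide it.

```r
# Two simulated samples with the same mean (1) but different shapes.
set.seed(7)
sample_x <- rexp(50, rate = 1)
sample_y <- rgamma(50, shape = 2, rate = 2)

# Two-sample Kolmogorov-Smirnov test.
ks <- ks.test(sample_x, sample_y)
ks$statistic  # D: the largest vertical gap between the two empirical CDFs
ks$p.value

# The same statistic computed by hand from the empirical CDFs.
pooled   <- sort(c(sample_x, sample_y))
d_manual <- max(abs(ecdf(sample_x)(pooled) - ecdf(sample_y)(pooled)))
d_manual
```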

The R Inferno

9/17/2016

I recently came across The R Inferno, written by Patrick Burns. In a satirical style following Dante's Inferno, he discusses pretty much every Stack Exchange question I've ever looked up.

Among the many new things I learned and immediately implemented:

the amazing "drop = FALSE" argument that prevents subscripted matrices from dropping a dimension and becoming vectors

As Burns so helpfully states twice on page 67:

"NOTE: Failing to use drop=FALSE inside functions is a major source of bugs."

Suppose you have a matrix M and you want to take some subset my.subset of the rows in M.
> M[ my.subset, ]
If my.subset has length 1, then the above code will return a plain vector rather than a one-row matrix.
> M[ my.subset, , drop = FALSE]
Even if my.subset has length 1, the above will return a matrix with a single row.
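
A small runnable sketch of the difference. The 3 x 2 matrix and the helper row_count are hypothetical, made up here purely for illustration.

```r
# Hypothetical 3 x 2 matrix; M and my.subset mirror the names used above.
M <- matrix(1:6, nrow = 3, ncol = 2)
my.subset <- 2  # a subset of length 1

dropped <- M[my.subset, ]                # dimension silently dropped
kept    <- M[my.subset, , drop = FALSE]  # dimension preserved

is.matrix(dropped)  # FALSE
dim(dropped)        # NULL
is.matrix(kept)     # TRUE
dim(kept)           # 1 2

# Why this bites inside functions: code that expects a matrix
# gets NULL from nrow() when handed the dropped vector.
row_count <- function(m) nrow(m)
row_count(dropped)  # NULL
row_count(kept)     # 1
```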