I have a nominal variable (different topics of conversation, coded as topic0=0 etc) and a number of scale variables (DV) such as the length of a conversation.

How can I derive correlations between the nominal and scale variables?

The title of this question suggests a fundamental misunderstanding. The most basic idea of correlation is "as one variable increases, does the other variable increase (positive correlation), decrease (negative correlation), or stay the same (no correlation)?", on a scale where perfect positive correlation is +1, no correlation is 0, and perfect negative correlation is -1. What "perfect" means depends on the measure of correlation used: for Pearson correlation it means the points on a scatter plot lie exactly on a straight line (sloped upwards for +1 and downwards for -1); for Spearman correlation it means the ranks exactly agree (or exactly disagree, so first is paired with last, for -1); for Kendall's tau it means all pairs of observations have concordant ranks (or discordant for -1). An intuition for how this works in practice can be gleaned from the Pearson correlations for the following scatter plots (image credit):

[Figure: Pearson correlations for various scatter plots]
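
As a quick illustration in R of what "perfect" means for each measure (made-up numbers, not from the plots above): a relationship that is strictly monotone but not linear scores exactly 1 on the two rank-based measures, while the Pearson correlation falls short of 1.

x <- 1:20
y <- x^3  # increases with x, but along a curve rather than a straight line
cor(x, y, method = "pearson")   # less than 1: points do not lie on a line
cor(x, y, method = "spearman")  # exactly 1: the ranks agree perfectly
cor(x, y, method = "kendall")   # exactly 1: every pair of observations is concordant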

Further insight comes from considering Anscombe's Quartet where all four data sets have Pearson correlation +0.816, even though they follow the pattern "as x increases, y tends to increase" in very different ways (image credit):

[Figure: Scatter plots for Anscombe's Quartet]

If your independent variable is nominal then it doesn't make sense to talk about what happens "as x increases". In your case, "Topic of conversation" doesn't have a numerical value that can go up and down. So you can't correlate "Topic of conversation" with "Duration of conversation". But as @ttnphns wrote in the comments, there are measures of strength of association you can use that are somewhat analogous. Here is some fake data and accompanying R code:

# Fake data: four conversations on each of three topics
data.df <- data.frame(
    topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
    duration = c(6:9, 2:5, 4:7)
)
print(data.df)
# Box plots of duration by topic
boxplot(duration ~ topic, data = data.df, ylab = "Duration of conversation")

Which gives:

> print(data.df)
     topic duration
1   Gossip        6
2   Gossip        7
3   Gossip        8
4   Gossip        9
5   Sports        2
6   Sports        3
7   Sports        4
8   Sports        5
9  Weather        4
10 Weather        5
11 Weather        6
12 Weather        7

[Figure: Box plots of duration by topic for the fake data]

By using "Gossip" as the reference level for "Topic", and defining binary dummy variables for "Sports" and "Weather", we can perform a multiple regression.

> model.lm <- lm(duration ~ topic, data = data.df)
> summary(model.lm)

Call:
lm(formula = duration ~ topic, data = data.df)

Residuals:
   Min     1Q Median     3Q    Max 
 -1.50  -0.75   0.00   0.75   1.50 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.5000     0.6455  11.619 1.01e-06 ***
topicSports   -4.0000     0.9129  -4.382  0.00177 ** 
topicWeather  -2.0000     0.9129  -2.191  0.05617 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.291 on 9 degrees of freedom
Multiple R-squared: 0.6809,     Adjusted R-squared: 0.6099 
F-statistic:   9.6 on 2 and 9 DF,  p-value: 0.005861 

We can interpret the estimated intercept as giving the mean duration of Gossip conversations as 7.5 minutes, and the estimated coefficients for the dummy variables as showing that Sports conversations were on average 4 minutes shorter than Gossip ones, while Weather conversations were 2 minutes shorter than Gossip. Part of the output is the coefficient of determination R² = 0.6809. One interpretation of this is that our model explains 68% of the variance in conversation duration. Another interpretation of R² is that by taking its square root, we can find the multiple correlation coefficient R.

> rsq <- summary(model.lm)$r.squared
> rsq
[1] 0.6808511
> sqrt(rsq)
[1] 0.825137
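
The same R² can be reached from first principles, as one minus the ratio of the residual sum of squares to the total sum of squares; a hand check using the fitted model above:

ss_res <- sum(residuals(model.lm)^2)                          # within-group scatter: 15
ss_tot <- sum((data.df$duration - mean(data.df$duration))^2)  # total scatter: 47
1 - ss_res / ss_tot  # 0.6808511, matching summary(model.lm)$r.squared

Those two sums of squares, 15 and 32 = 47 − 15, will reappear in the ANOVA table later.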

Note that 0.825 isn't the correlation between Duration and Topic - we can't correlate those two variables because Topic is nominal. What it actually represents is the correlation between the observed durations, and the ones predicted (fitted) by our model. Both of these variables are numerical so we are able to correlate them. In fact the fitted values are just the mean durations for each group:

> print(model.lm$fitted)
  1   2   3   4   5   6   7   8   9  10  11  12 
7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 
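
A direct calculation of the group means confirms this:

# Mean duration per topic: Gossip 7.5, Sports 3.5, Weather 5.5
aggregate(duration ~ topic, data = data.df, FUN = mean)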

Just to check, the Pearson correlation between observed and fitted values is:

> cor(data.df$duration, model.lm$fitted)
[1] 0.825137

We can visualise this on a scatter plot:

# Scatter plot of observed against fitted durations, with least-squares line
plot(x = model.lm$fitted, y = data.df$duration,
     xlab = "Fitted duration", ylab = "Observed duration")
abline(lm(data.df$duration ~ model.lm$fitted), col = "red")

[Figure: Scatter plot of observed versus fitted durations, visualising the multiple correlation coefficient]

The strength of this relationship is visually very similar to those of the Anscombe's Quartet plots, which is unsurprising as they all had Pearson correlations about 0.82.

You might be surprised that with a categorical independent variable, I chose to do a (multiple) regression rather than a one-way ANOVA. But in fact this turns out to be an equivalent approach.

library(heplots)  # provides etasq() for eta-squared
model.aov <- aov(duration ~ topic, data = data.df)
summary(model.aov)

This gives a summary with identical F statistic and p-value:

            Df Sum Sq Mean Sq F value  Pr(>F)   
topic        2     32  16.000     9.6 0.00586 **
Residuals    9     15   1.667                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
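
The equivalence also runs the other way: applying base R's anova() to the regression fit reproduces this table.

anova(model.lm)  # same Df, Sum Sq (32 and 15), F = 9.6 and p = 0.005861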

Again, the ANOVA model fits the group means, just as the regression did:

> print(model.aov$fitted)
  1   2   3   4   5   6   7   8   9  10  11  12 
7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 

This means that the correlation between fitted and observed values of the dependent variable is the same as it was for the multiple regression model. The "proportion of variance explained" measure R² for multiple regression has an ANOVA equivalent, η² (eta squared). We can see that they match.

> etasq(model.aov, partial = FALSE)
              eta^2
topic     0.6808511
Residuals        NA
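
Because the design is one-way, η² can also be read straight off the ANOVA table as the between-groups sum of squares divided by the total; a quick hand check:

ss <- summary(model.aov)[[1]][["Sum Sq"]]  # 32 (topic) and 15 (residuals)
ss[1] / sum(ss)  # 32 / 47 = 0.6808511, agreeing with etasq()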

In this sense, the closest analogue to a "correlation" between a nominal explanatory variable and a continuous response would be η, the square root of η², which is the equivalent of the multiple correlation coefficient R for regression. This explains the comment that "The most natural measure of association / correlation between a nominal (taken as IV) and a scale (taken as DV) variables is eta". If you are more interested in the proportion of variance explained, then you can stick with eta squared (or its regression equivalent R²). For ANOVA, one often comes across the partial eta squared. As this ANOVA was one-way (there was only one categorical predictor), the partial eta squared is the same as eta squared, but things change in models with more predictors.

> etasq(model.aov, partial = TRUE)
          Partial eta^2
topic         0.6808511
Residuals            NA
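
Pulling this together, the single number playing the role of a "correlation" for these data is η itself, which reproduces the multiple correlation R from the regression (reusing etasq() from heplots, loaded above):

sqrt(etasq(model.aov, partial = FALSE)[1, 1])  # 0.825137, identical to R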

However it's quite possible that neither "correlation" nor "proportion of variance explained" is the measure of effect size you wish to use. For instance, your focus may lie more on how means differ between groups. This question and answer contain more information on eta squared, partial eta squared, and various alternatives.

  • @Zhubarb the hard part was getting R ≈ 0.82 for the bogus data... – Silverfish Nov 19 '14 at 23:19
  • +1 for a very nicely explained answer! Here you argue that the sign of η or R is always positive, because of course any decently fit model will result in fitted values positively (rather than negatively) correlated with the DV. Maybe I could add that in some cases the sign can be meaningfully assigned to η, e.g. if IV is ordered (I believe this is then called "ordinal" instead of "nominal"), or at least partially ordered. Imagine that topics in the OP range from arts to math; then we could use the sign of correlation between nerdiness and DV and assign it to η. – amoeba Dec 23 '14 at 12:13
  • @amoeba Herein I think lies a subtle point. Suppose we run a simple linear regression and obtain PMCC r = −0.9 - then as x increases, y tends to decrease (this is the sort of directional effect you are talking about). Nevertheless the multiple correlation coefficient for such a regression is still R = 0.9 (as fitted value of y increases, observed value tends to increase). Now η is more like R than r... – Silverfish Dec 23 '14 at 15:00
  • That is correct, but I guess what I am saying is that sometimes it can make sense to consider "signed η" that is more like r than like R. – amoeba Dec 23 '14 at 15:09
  • @amoeba, you could just multiply η by ±1, but it really is creating a new measure that you will have to explain every time, & I don't see how it has really done anything meaningful for you. – gung Dec 23 '14 at 16:38
