I have a nominal variable (different topics of conversation, coded as topic0=0 etc) and a number of scale variables (DV) such as the length of a conversation.

How can I derive correlations between the nominal and scale variables?

The title of this question suggests a fundamental misunderstanding. The most basic idea of correlation is "as one variable increases, does the other variable increase (positive correlation), decrease (negative correlation), or stay the same (no correlation)?", on a scale where perfect positive correlation is +1, no correlation is 0, and perfect negative correlation is -1. What "perfect" means depends on the measure of correlation used: for Pearson correlation it means the points on a scatter plot lie exactly on a straight line (sloped upwards for +1 and downwards for -1); for Spearman correlation it means the ranks exactly agree (or exactly disagree, so first is paired with last, for -1); for Kendall's tau it means all pairs of observations have concordant ranks (or discordant for -1). An intuition for how this works in practice can be gleaned from the Pearson correlations for the following scatter plots (image credit):

[Figure: Pearson correlations for various scatter plots]
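
As a quick illustration in R of what "perfect" means for each measure (made-up numbers, not from the plots above): a relationship that is strictly monotone but not linear scores exactly 1 on the two rank-based measures, while the Pearson correlation falls short of 1.

x <- 1:20
y <- x^3  # increases with x, but along a curve rather than a straight line
cor(x, y, method = "pearson")   # less than 1: points do not lie on a line
cor(x, y, method = "spearman")  # exactly 1: the ranks agree perfectly
cor(x, y, method = "kendall")   # exactly 1: every pair of observations is concordant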

Further insight comes from considering Anscombe's Quartet where all four data sets have Pearson correlation +0.816, even though they follow the pattern "as x increases, y tends to increase" in very different ways (image credit):

[Figure: Scatter plots for Anscombe's Quartet]

If your independent variable is nominal then it doesn't make sense to talk about what happens "as x increases". In your case, "Topic of conversation" doesn't have a numerical value that can go up and down. So you can't correlate "Topic of conversation" with "Duration of conversation". But as @ttnphns wrote in the comments, there are measures of strength of association you can use that are somewhat analogous. Here is some fake data and accompanying R code:

# Fake data: four conversations on each of three topics
data.df <- data.frame(
    topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
    duration = c(6:9, 2:5, 4:7)
)
print(data.df)
# Box plots of duration by topic
boxplot(duration ~ topic, data = data.df, ylab = "Duration of conversation")

Which gives:

> print(data.df)
     topic duration
1   Gossip        6
2   Gossip        7
3   Gossip        8
4   Gossip        9
5   Sports        2
6   Sports        3
7   Sports        4
8   Sports        5
9  Weather        4
10 Weather        5
11 Weather        6
12 Weather        7

[Figure: Box plots of duration by topic for the fake data]

By using "Gossip" as the reference level for "Topic", and defining binary dummy variables for "Sports" and "Weather", we can perform a multiple regression.

> model.lm <- lm(duration ~ topic, data = data.df)
> summary(model.lm)

Call:
lm(formula = duration ~ topic, data = data.df)

Residuals:
   Min     1Q Median     3Q    Max 
 -1.50  -0.75   0.00   0.75   1.50 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.5000     0.6455  11.619 1.01e-06 ***
topicSports   -4.0000     0.9129  -4.382  0.00177 ** 
topicWeather  -2.0000     0.9129  -2.191  0.05617 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.291 on 9 degrees of freedom
Multiple R-squared: 0.6809,     Adjusted R-squared: 0.6099 
F-statistic:   9.6 on 2 and 9 DF,  p-value: 0.005861 

We can interpret the estimated intercept as giving the mean duration of Gossip conversations as 7.5 minutes, and the estimated coefficients for the dummy variables as showing that Sports conversations were on average 4 minutes shorter than Gossip ones, while Weather conversations were 2 minutes shorter than Gossip. Part of the output is the coefficient of determination R² = 0.6809. One interpretation of this is that our model explains 68% of the variance in conversation duration. Another interpretation of R² is that by taking its square root, we can find the multiple correlation coefficient R.

> rsq <- summary(model.lm)$r.squared
> rsq
[1] 0.6808511
> sqrt(rsq)
[1] 0.825137
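
The same R² can be reached from first principles, as one minus the ratio of the residual sum of squares to the total sum of squares; a hand check using the fitted model above:

ss_res <- sum(residuals(model.lm)^2)                          # within-group scatter: 15
ss_tot <- sum((data.df$duration - mean(data.df$duration))^2)  # total scatter: 47
1 - ss_res / ss_tot  # 0.6808511, matching summary(model.lm)$r.squared

Those two sums of squares, 15 and 32 = 47 − 15, will reappear in the ANOVA table later.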

Note that 0.825 isn't the correlation between Duration and Topic - we can't correlate those two variables because Topic is nominal. What it actually represents is the correlation between the observed durations, and the ones predicted (fitted) by our model. Both of these variables are numerical so we are able to correlate them. In fact the fitted values are just the mean durations for each group:

> print(model.lm$fitted)
  1   2   3   4   5   6   7   8   9  10  11  12 
7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 
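
A direct calculation of the group means confirms this:

# Mean duration per topic: Gossip 7.5, Sports 3.5, Weather 5.5
aggregate(duration ~ topic, data = data.df, FUN = mean)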

Just to check, the Pearson correlation between observed and fitted values is:

> cor(data.df$duration, model.lm$fitted)
[1] 0.825137

We can visualise this on a scatter plot:

# Scatter plot of observed against fitted durations, with least-squares line
plot(x = model.lm$fitted, y = data.df$duration,
     xlab = "Fitted duration", ylab = "Observed duration")
abline(lm(data.df$duration ~ model.lm$fitted), col = "red")

[Figure: Scatter plot of observed versus fitted durations, visualising the multiple correlation coefficient]

The strength of this relationship is visually very similar to those of the Anscombe's Quartet plots, which is unsurprising as they all had Pearson correlations about 0.82.

You might be surprised that with a categorical independent variable, I chose to do a (multiple) regression rather than a one-way ANOVA. But in fact this turns out to be an equivalent approach.

library(heplots)  # provides etasq() for eta-squared
model.aov <- aov(duration ~ topic, data = data.df)
summary(model.aov)

This gives a summary with identical F statistic and p-value:

            Df Sum Sq Mean Sq F value  Pr(>F)   
topic        2     32  16.000     9.6 0.00586 **
Residuals    9     15   1.667                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
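
The equivalence also runs the other way: applying base R's anova() to the regression fit reproduces this table.

anova(model.lm)  # same Df, Sum Sq (32 and 15), F = 9.6 and p = 0.005861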

Again, the ANOVA model fits the group means, just as the regression did:

> print(model.aov$fitted)
  1   2   3   4   5   6   7   8   9  10  11  12 
7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 

This means that the correlation between fitted and observed values of the dependent variable is the same as it was for the multiple regression model. The "proportion of variance explained" measure R² for multiple regression has an ANOVA equivalent, η² (eta squared). We can see that they match.

> etasq(model.aov, partial = FALSE)
              eta^2
topic     0.6808511
Residuals        NA
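
Because the design is one-way, η² can also be read straight off the ANOVA table as the between-groups sum of squares divided by the total; a quick hand check:

ss <- summary(model.aov)[[1]][["Sum Sq"]]  # 32 (topic) and 15 (residuals)
ss[1] / sum(ss)  # 32 / 47 = 0.6808511, agreeing with etasq()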

In this sense, the closest analogue to a "correlation" between a nominal explanatory variable and a continuous response would be η, the square root of η², which is the equivalent of the multiple correlation coefficient R for regression. This explains the comment that "The most natural measure of association / correlation between a nominal (taken as IV) and a scale (taken as DV) variables is eta". If you are more interested in the proportion of variance explained, then you can stick with eta squared (or its regression equivalent R²). For ANOVA, one often comes across the partial eta squared. As this ANOVA was one-way (there was only one categorical predictor), the partial eta squared is the same as eta squared, but things change in models with more predictors.

> etasq(model.aov, partial = TRUE)
          Partial eta^2
topic         0.6808511
Residuals            NA
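
Pulling this together, the single number playing the role of a "correlation" for these data is η itself, which reproduces the multiple correlation R from the regression (reusing etasq() from heplots, loaded above):

sqrt(etasq(model.aov, partial = FALSE)[1, 1])  # 0.825137, identical to R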

However it's quite possible that neither "correlation" nor "proportion of variance explained" is the measure of effect size you wish to use. For instance, your focus may lie more on how means differ between groups. This question and answer contain more information on eta squared, partial eta squared, and various alternatives.

  • @Zhubarb the hard part was getting R ≈ 0.82 for the bogus data... – Silverfish Nov 19 '14 at 23:19
  • +1 for a very nicely explained answer! Here you argue that the sign of η or R is always positive, because of course any decently fit model will result in fitted values positively (rather than negatively) correlated with the DV. Maybe I could add that in some cases the sign can be meaningfully assigned to η, e.g. if IV is ordered (I believe this is then called "ordinal" instead of "nominal"), or at least partially ordered. Imagine that topics in the OP range from arts to math; then we could use the sign of correlation between nerdiness and DV and assign it to η. – amoeba Dec 23 '14 at 12:13
  • @amoeba Herein I think lies a subtle point. Suppose we run a simple linear regression and obtain PMCC r = −0.9 - then as x increases, y tends to decrease (this is the sort of directional effect you are talking about). Nevertheless the multiple correlation coefficient for such a regression is still R = 0.9 (as fitted value of y increases, observed value tends to increase). Now η is more like R than r... – Silverfish Dec 23 '14 at 15:00
  • That is correct, but I guess what I am saying is that sometimes it can make sense to consider "signed η" that is more like r than like R. – amoeba Dec 23 '14 at 15:09
  • @amoeba, you could just multiply η by ±1, but it really is creating a new measure that you will have to explain every time, & I don't see how it has really done anything meaningful for you. – gung Dec 23 '14 at 16:38
