FAQs
Other student questions from previous years (unsorted)
Gender Imbalance Effect on t-test
Student question (2021): I wanted to ask a general question related to both the lecture and practical session: how much does gender imbalance influence the results of a t-test? Is there a technique to test for this possible effect and, in this case, how could one account for the imbalance?
Answer given:
In general, an imbalanced dataset can be used for a t-test as long as the assumptions of the test are met and there are enough datapoints in each of the two compared groups (but in theory the test also works with very small sample sizes).
The assumptions of the t-test are that the datapoints are independently and identically distributed following a normal distribution. So as long as there is no evidence for a violation of these assumptions, you can use the test (you will see in the coming weeks how to check whether the assumptions are met or not!).
By the way, the classical t-test assumes equal variances in the two groups. If that assumption is not met and you set var.equal=FALSE in R (which is actually the default in t.test), Welch's version of the test estimates the variance in each group separately and hence there are fewer degrees of freedom left.
Regarding how to counteract data imbalance, one possible option could be to create a balanced dataset by randomly subsampling from the more abundant group. But as I’ve said, it is not needed in this case.
More stuff to think about: it is maybe even more important to think about whether the data is representative of the population that we want to infer something about. I do not know what the gender ratio is in the class, but it could be that the data imbalance is there because one gender was more willing to participate than the other. If that is the case, then there could be a confounding variable that influenced both the outcome (the reaction time) and the willingness to participate. For instance, it could be that in one gender only persons with a fast reaction time participated, while in the other group everyone participated regardless of reaction time; the dataset would then not be representative of the actual class and the t-test result would no longer be valid.
Imbalanced means that the two compared groups consist of an unequal number of datapoints.
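As a runnable sketch of the above (group sizes, means and object names are invented for illustration), a Welch t-test works fine on imbalanced groups:

```r
# Hypothetical reaction-time data: 40 observations in one group, 12 in the other
set.seed(1)
group_a <- rnorm(40, mean = 300, sd = 25)
group_b <- rnorm(12, mean = 315, sd = 25)

# var.equal = FALSE (the default) gives Welch's t-test, which does not
# assume equal variances and adjusts the degrees of freedom accordingly
res <- t.test(group_a, group_b, var.equal = FALSE)
res$p.value    # the test runs despite the imbalance
res$parameter  # Welch degrees of freedom (usually not an integer)
```

The imbalance itself is not a problem for the test; only the assumptions and the sample sizes per group matter.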
How to get extreme values
Student question (2021):
In Unit2 Question3 Exercise1 it is written that it would be complicated to programmatically check whether the extreme values belong to one single individual. I think I have found a way to do so:
```r
extremes <- slice(df, -c(1:252)) # generate empty data frame
for (i in colnames(df)) {
  print(filter(df, df[i] == max(df[i])))
}
```
df is my dataframe.
So my question is whether this is a correct approach and whether it would lead to the correct answer. My second question is: how can I add the output of the statement to the empty data frame I created above? In Python I would solve it with a list or something similar, but I am not quite sure how I should or can do it in R.
Looking forward to an answer.
Answer given:
You found a good start to do it! Here is how you could add it to the dataframe:
```r
extremes <- df %>% slice(0) # empty data frame with the same columns as df
for (i in colnames(df)) {
  max_row <- filter(df, df[i] == max(df[i]))
  extremes <- rbind(extremes, max_row)
}
```
rbind() stands for “row bind” and does just that: it binds rows together. There is also cbind(), for columns.
Note: it is worth pointing out that in this way you “only” get the maximum value of every given variable. However, it is a matter of how extremes are defined: it could be that the max value of a variable is not an extreme because it is perfectly in line with the other values. Similarly, there could be more than one extreme value for a given variable, with respect to the other values. The same goes for small values, as those can be extremes as well.
More info: a common definition of extreme values is: smaller than the first quartile minus 1.5·IQR (IQR = the interquartile range) OR larger than the third quartile plus 1.5·IQR. The outliers in boxplots are for instance calculated in this way. This is not needed for what we are doing in the course and what you did is perfectly appropriate!
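The 1.5·IQR rule can be sketched in a few lines of R (the vector x here is made up for illustration):

```r
# Hypothetical data: 30 is clearly far away from the rest
x <- c(2, 3, 3, 4, 5, 5, 6, 30)

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1                    # same as IQR(x)

# flag values below the lower or above the upper "fence"
extreme <- x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
x[extreme]                        # -> 30
```

This is exactly the rule boxplot() uses to decide which points to draw individually.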
Lecture 3 linear regression Hypothesis and p value
Student question (2021): I am a bit confused… In lecture 3 on slide 31 it is written: “In H0 included is the assumption that the data follow the simple linear regression model.” Then on slide 35 it is written that such a small p-value suggests that it is very unlikely to see such a slope if there were no correlation. But I have heard several times before this course that a small p-value suggests rejecting the Null hypothesis. Therefore I am confused: shouldn’t the Null hypothesis be that there is no correlation, and because of the small p-value we reject it and conclude that there is a correlation?
Can somebody help me?
Answer given: I think that you understood it correctly, but you just got confused by that sentence on slide 31.
What is meant with “In H0 included is the assumption that the data follow the simple linear regression model” is that in addition to H0 there is the assumption that the data can be analyzed with the chosen regression model. It can be analyzed like this if the modelling assumptions (see slide 25 in lecture 3) are met (you will see in the coming weeks how to check whether they are met or not). Only if the modelling assumptions are met does it make sense to test the actual H0.
As you correctly understood, the actual Null hypothesis H0 is that the slope beta = 0. At the same time, if the slope is 0 it also means that the correlation is 0. But for completeness’ sake I also want to mention that there is a difference between the regression slope and the correlation, which is nicely explained here.
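For simple linear regression the two quantities are directly linked by the identity slope = cor(x, y) · sd(y) / sd(x); a small sketch with invented data confirms it:

```r
# Made-up data for illustration only
set.seed(5)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40, sd = 0.3)

slope <- unname(coef(lm(y ~ x))["x"])

# The OLS slope equals the correlation rescaled by the ratio of the
# standard deviations, so slope = 0 if and only if cor(x, y) = 0
all.equal(slope, cor(x, y) * sd(y) / sd(x))  # TRUE
```

This is also why testing "slope = 0" and "correlation = 0" amount to the same null hypothesis in this setting, even though the two numbers themselves differ.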
Plotting Regression line
Student question (2021): Dear BIO144 team, I tried to graph my results using the code below. However, when I tried to draw the regression line using geom_smooth, it gives me 5 different regression lines in different colors, one for each continent (could not upload the graph here). What if I want only one regression line for all my data, and at the same time different colors for the continent points?
```r
ggplot(health_filtered, aes(x=logExp, y=logMort, colour=continent)) +
  geom_point(size=0.8, alpha=0.9) +
  geom_smooth(method="lm", se=FALSE) +
  xlab("Log of Child Mortality") +
  ylab("Log of healthcare Expenditure") +
  ylim(0,3) +
  xlim(1,4) +
  theme_light()
```
Answer given: You have specified the aesthetic mapping colour = continent in the ggplot() function. Any aesthetic mapping specified in the ggplot() function is inherited by all other geoms, i.e. it applies automatically to all other geoms unless we say otherwise. So your geom_smooth is also using the colour = continent aesthetic mapping, and is therefore doing a separate regression for each continent.
If we want different colours for each point, but one regression, then here are two solutions:
- Remove the colour = continent aesthetic mapping from the ggplot function, and put it only in the geom_point function.
OR
- Remove the inherited colour = continent aesthetic mapping from the geom_smooth by adding mapping = aes(colour = NULL), i.e. use `geom_smooth(method="lm", mapping = aes(colour = NULL))`
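The first solution could look like this; the data frame below is a made-up stand-in for health_filtered, just so the sketch is self-contained:

```r
library(ggplot2)

# Hypothetical stand-in for health_filtered (invented values)
set.seed(1)
health_filtered <- data.frame(
  logExp    = runif(50, 1, 4),
  logMort   = runif(50, 0, 3),
  continent = sample(c("Africa", "Asia", "Europe"), 50, replace = TRUE)
)

# colour = continent only in geom_point, so geom_smooth fits ONE line
p <- ggplot(health_filtered, aes(x = logExp, y = logMort)) +
  geom_point(aes(colour = continent), size = 0.8, alpha = 0.9) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_light()
```

Moving the mapping into geom_point keeps the points coloured while the smoother sees only the global x/y mapping.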
Interpreting the table lecture 4 page 29 2021
Student question (2021):
The conclusions I get from interpreting the table is that
- bmi has a strong correlation with bodyfat (steep slope)
- age probably doesn’t have that strong of an influence on bodyfat (only 0.13 slope)
- the p-values all show significance but in the case of age it might not play an important biological role
- the confidence interval of the intercept is rather wide so it could be useful to generate more data points (?)
Did I miss anything or did I get it wrong?
Answer given:
Hi, solid interpretation! Some comments:
- You raise a good point about statistical significance versus effect size (and the corresponding clinical significance, in this case). It can happen that a coefficient is estimated to be significantly different from the null effect, but at the same time the estimated coefficient is so small (i.e. a small effect size) that it does not matter.
- However, it is not always so easy to determine whether an effect size is big or small, and in the case of the covariates age and bmi I argue that both might be important; here is why. At first glance the two estimated slopes are very different, with the one for bmi being roughly 14 times as big as the one for age. Yet, age is a continuous variable and the slope means that for every year that passes the bodyfat percentage is estimated to increase by 0.13. Hence, after roughly 14 years the bodyfat percentage is predicted to increase by 1.82, roughly the same as if the bmi had increased by 1. What I want to say is that 0.13 might not seem much, but it adds up over the years. On the other hand, an increase of 1 in bmi is actually not so little and thus it might be less likely to happen.
Regarding the intercept:
- In general more data would increase the precision of the estimates (i.e. decrease the standard errors), but it is often not possible to have more data.
- Think about what the intercept is in this model. Are we interested in it at all? The answer is “probably not”: it is the bodyfat percentage for when age=0 and bmi=0, both of which are not possible. This is also why the estimated intercept does not make sense (-31 bodyfat percentage?). In this model we are only interested in the estimated slopes.
In any case, good start with interpretation! You could further try to interpret the confidence intervals (in the sense: what do they mean?).
More info: related to the point above, another thing to keep in mind when interpreting coefficients is the units of the variables. For age it’s years, so as I wrote above a one-unit increase in age (i.e. 1 more year) results in an expected increase of bodyfat percentage by 0.13. If however the unit of the variable age was decades instead of years, then the slope would have been estimated to be (about) 10 × 0.13 = 1.3; in this case it would have roughly been the same as for bmi and our first “impression” of its size would have been different.
Overdispersion Index
Student question (2021): In the GSWR book it says only to worry about dispersion if the dispersion index is above 2, but in the mock exam it said that there is overdispersion even though the index was below 2. Does that mean there is always overdispersion if the index is above 1 but you just don’t worry about it?
Answer given: Hi. Yes, if the dispersion parameter is >1 there is overdispersion, but how much larger than 1 it has to be to become problematic is up for debate. I cannot put it better than what is written in the book:
“What if the dispersion index had been 1.2, 1.5, 2.0, or even 10? When should you start to worry about overdispersion? That is a tricky question to answer. One common rule of thumb is that when the index is greater than 2, it is time to start worrying. Like all rules of thumb, this is only meant to be used as a rough guide. In reality, the worry threshold depends on things like sample size and the nature of the overdispersion.”
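A common way to compute such a dispersion index by hand is the sum of squared Pearson residuals divided by the residual degrees of freedom. Here is a sketch with simulated, well-behaved Poisson data (all values invented):

```r
# Simulated count data that genuinely follows a Poisson model
set.seed(42)
x <- runif(100)
y <- rpois(100, lambda = exp(1 + 2 * x))

mod <- glm(y ~ x, family = poisson)

# Dispersion index: Pearson chi-squared statistic / residual df
dispersion <- sum(residuals(mod, type = "pearson")^2) / mod$df.residual
dispersion  # close to 1 here, since the data are not overdispersed
```

With real biological data this index often comes out well above 1, which is when a quasi-Poisson or negative binomial model becomes worth considering.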
IC10 Abalone - quasipoisson and cor
Student question (2021): I was wondering why there’s no AIC when doing the dropterm function for a full model with family=quasipoisson? `dropterm(full_glm_quasi, sorted=TRUE)`
Also I did not quite understand the last line in the sample solution script:
`cor(cbind(x=fitted(noViscera), y=fitted(lm_model)))`
how does this work and what does the result mean?
Answer given: The short answer is this: you might remember that to be calculated, the AIC needs the likelihood. I don’t think that we formally defined the likelihood in this course, so it’s ok if you do not really know what it is. What is important here is that for the GLM with quasi-Poisson the likelihood is no longer defined and thus the AIC cannot be calculated. I do not actually know whether dropterm is useful in this case (note that the exercise does not ask you to use it with the quasi-Poisson model, it only asks you to change the selected Poisson model to a quasi-Poisson model).
Regarding the second question: with
`cor(cbind(x=fitted(noViscera), y=fitted(lm_model)))`
a matrix with two columns is passed to cor, and from the help page of cor we know that in this case the correlations between the columns are calculated (x vs x, x vs y, y vs x and y vs y, hence 4 values). The values on the diagonal of the produced correlation matrix are obviously 1, because we correlated x vs x and y vs y respectively. The values on the off-diagonal are the ones of interest and are identical because cor(x, y) = cor(y, x). As there are only 2 columns in this case, I find it easier to just use cor(fitted(noViscera), fitted(lm_model)), i.e. to pass the 2 vectors separately, in which case only the correlation between the two vectors is calculated. As the correlation between the fitted values of the two models is almost 1 (it is 0.9733156), you rightfully state that there is no practical difference in the predicted values between the two models (but the glm is still the model that should be used because the response is count data).
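A minimal demo of the two calls, with made-up vectors in place of the fitted values:

```r
# Stand-ins for fitted(noViscera) and fitted(lm_model)
x <- c(1, 2, 3, 4, 5)
y <- c(1.1, 1.9, 3.2, 3.9, 5.1)

cor(cbind(x = x, y = y))  # 2x2 matrix: 1 on the diagonal, cor(x, y) twice off it
cor(x, y)                 # the single number of interest
```

Both calls compute the same off-diagonal value; passing the vectors separately just skips the redundant entries.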
Exercise 7c correlation between parameters
Student question (2021): The scatter plot of the Betas indicates a possible correlation between B0 and B1 as well as between B0 and B2. On the other hand, B1 and B2 do not seem to correlate. Can you help me understand why?
Is it correct to interpret this the following way: keeping one point fixed (say B2), then the variation in the intercept will define the variation in B1?
Answer given: This is what I think is going on:
- First, B1 and B2 are not correlated because x1 and x2 are not correlated either, so their coefficients are independent of each other.
- Second, by adding random noise to the response y it can happen that the apparent relation between e.g. x1 and y weakens (slope B1 closer to 0) or strengthens (the opposite). At the same time, because the data (x1) did not change, if the slope (B1) changes then the intercept (B0) changes as well, otherwise the regression line would not fit the data. It is not so easy to explain; maybe it helps if you try to draw it (i.e. add more noise and see what happens to the intercept and the slope). The same goes for B2 and B0.
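A small simulation can make this visible: refit the same model on many noisy copies of y and look at how the coefficient estimates co-vary (all names and values below are invented for illustration):

```r
set.seed(1)
x1 <- runif(50, 0, 10)
x2 <- runif(50, 0, 10)  # generated independently of x1

# 1000 re-fits, each with fresh noise on the same underlying relationship
betas <- t(replicate(1000, {
  y <- 2 + 0.5 * x1 + 0.5 * x2 + rnorm(50, sd = 2)
  coef(lm(y ~ x1 + x2))
}))

round(cor(betas), 2)
# Typical pattern: clear negative correlation of the intercept with each
# slope, near-zero correlation between the two slopes themselves.
```

The negative intercept–slope correlation arises because x1 and x2 take only positive values: tilting a line up forces it down at x = 0.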
Question about Linear Model from Practical 4
Student question (2021):
I have a question about the practical we did in week 4, the one with the milk data set.
So there we used the following linear model:
lm(formula = kcal.per.g ~ neocortex.perc + mass, data = milkdat)
Now if we check the table you gave us in the “useful table” section from Lesson 7, this would constitute Case D: same slope and different intercepts.
Now in the Practical from week 4 one question is:
What is the estimated slope of the relationship between kcal.per.g and mass?
And the answer is -0.0054
Second question is: What is the estimated slope of the relationship between kcal.per.g and neocortex.perc?
And the answer is 0.018
So obviously there’s a contradiction here. Because in the table it says if we use the “+” in the linear model, so a model without interaction, the slopes for the two relationships between explanatory variables and response variable should be the same. But this is not the case here.
So is there a mistake I do in thinking? I would be very glad if you could clear this up!
Answer given:
The “useful table” is based on an ANCOVA (topic of week 7), i.e. there is a continuous variable (density) and a categorical variable (season). The slope is for density and in the case of no interaction (i.e. a “+” between the explanatory variables) the slope associated with density is the same for all values of season. For the categorical variable, k-1 (k being the number of levels of the variable) intercepts are estimated.
In the milk example, neocortex.perc and mass are both continuous variables, meaning that a slope is estimated separately for each. Hence the questions are “What is the estimated slope of the relationship between kcal.per.g and mass?” and “What is the estimated slope of the relationship between kcal.per.g and neocortex.perc?”
Note that an interaction between two continuous variables would also be possible. It would mean that the slope of one continuous variable on the response variable changes as the values of a second continuous variable change.
Follow-up question:
So does this mean that if I have both continuous variables, if I do a linear model with “+” (no interaction) in the summary table I get one intercept for the alphabetically first explanatory variable, then the 2 slopes of the 2 explanatory variables? And in the case of ANCOVA (hence one continuous variable and one categorical variable) I get the intercept for the alphabetically first explanatory variable, and then the slope which is the same, and then the difference between the first intercept (the reference) and the intercept of the 2nd explanatory variable (alphabetically second)? Is this correct?
Answer given:
Almost… The intercept is not for the continuous variables. It would be what you say it is if you had a categorical variable. I suggest that you try to write down the model to help you interpret the summary table until you’re more confident with it. In this case, with y = kcal.per.g, x1 = neocortex.perc and x2 = mass, the model is:
y_i = \beta_0 + \beta_1 x_i^{(1)} + \beta_2 x_i^{(2)} + \epsilon_i
So you are estimating 1 intercept (beta0) and 2 slopes (one for each explanatory variable, beta1 and beta2). Note that the intercept is basically the value of y_i (minus the error) when both x1=0 and x2=0, that is when both neocortex.perc and mass are 0.
You could try to write down the model for the case where you also have a categorical variable.
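A sketch of fitting and reading this model in R, using simulated stand-in data (the milk dataset is not reproduced here; the coefficients used for simulation are invented):

```r
# Invented stand-in data with the same variable names as the milk example
set.seed(7)
neocortex.perc <- runif(30, 55, 75)
mass           <- runif(30, 1, 50)
kcal.per.g     <- 0.2 + 0.018 * neocortex.perc - 0.005 * mass + rnorm(30, sd = 0.05)

fit <- lm(kcal.per.g ~ neocortex.perc + mass)
coef(fit)
# First entry "(Intercept)" is beta0 (the value when both predictors are 0),
# then one slope per continuous explanatory variable (beta1, beta2).
```

With two continuous predictors there is always exactly one intercept and one slope per variable, regardless of alphabetical order.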
ANOVA vs Multiple linear Regression
Student question (2021):
Hello, I’m a bit confused about when I have to test with an ANOVA and when with a multiple linear regression. For example: the hybrid maize yield example in lecture 6 compares the differences of the means of the groups (categorical variable). The earthworm example also has a categorical (explanatory) variable, and there we test with a multiple linear regression.
(We convert the categorical variable into a dummy variable, but could I still also test there with an ANOVA?)
Answer given:
Hi. ANOVA and (multiple) linear regression are both linear models that investigate the relation between explanatory variables and a response variable. In fact, ANOVA can be seen as a special case of linear regression in which all explanatory variables are categorical.
In general, if there is a categorical variable (for instance, in an ANOVA) we do not directly look at the summary() table (which gives coefficient estimates and corresponding t-test results) but we look at the anova() table, which gives the estimated SS, MS and corresponding F-test results. What is the difference? In the summary() table each coefficient is separately tested with a t-test against the Null hypothesis that it is 0. For a categorical variable with k levels this means k-1 tests (k-1 dummy variables), and if k is large we run into the multiple comparison problem. So, instead of testing each level of that categorical variable separately we run 1 single test (the F-test) that tells us whether at least 2 levels are different from each other (see slide 17 lecture 6). Afterwards we can look at the estimated coefficients with summary(). For continuous variables we can directly look at the summary() output. In the earthworm example there were two explanatory variables, 1 categorical (Gattung) and 1 continuous (Magenumf). For the former you first look at the anova() table; for the latter you can directly look at the summary() table.
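This workflow can be sketched with simulated stand-in data; the variable names mimic the earthworm example, but all values are invented:

```r
# Invented data: one categorical (3 levels) and one continuous predictor
set.seed(3)
Gattung     <- factor(rep(c("L", "N", "Oc"), each = 20))
Magenumf    <- runif(60, 1, 5)
log.gewicht <- c(0, 0.5, 0.6)[as.integer(Gattung)] +
               0.4 * Magenumf + rnorm(60, sd = 0.2)

mod <- lm(log.gewicht ~ Gattung + Magenumf)

anova(mod)    # ONE F-test per variable: Gattung tested as a whole (2 df)
summary(mod)  # k-1 = 2 dummy coefficients for Gattung, 1 slope for Magenumf
```

The anova() table answers "does Gattung matter at all?" in a single test; summary() then shows which levels differ from the reference.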
Self test question 8 Spread and shape of distributions
Student question (2021):
I find question 8 of “Spread and shape of distributions” in the self test not well formulated. The tail is the thinner part of the curve, so if there is a long tail of high values it could mean that the data are more concentrated in the lower part and therefore have a mean lower than the median.
Answer given:
Hi! I can understand that it can be confusing. In statistics, what is meant with “long tail of high values” (sometimes also called “fat tail” or “heavy tail”) is that there is a bigger probability of getting (very) large values when compared to the reference distribution (often a normal or an exponential distribution). See for instance here. So the sentence “A distribution of data with a long tail of high values” always means that high values are more frequent than they would be in the distribution of reference.
As you correctly identified, the mean is not a robust measure as its value is influenced by extreme values in the data, while the median is more robust. So in the case of a distribution of data with a long tail of high values, the mean is bigger than the median.
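A quick check of this in R with a right-skewed distribution (the exponential has a long tail of high values):

```r
set.seed(10)
x <- rexp(10000, rate = 1)  # exponential: long right tail

mean(x) > median(x)  # TRUE: the tail pulls the mean above the median
```

For the exponential with rate 1 the theoretical mean is 1 while the median is log(2) ≈ 0.69, so the gap is not just sampling noise.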
ggpair() graph exclusions
Student question (2021):
Dear BIO144 teammates. In unit 2, exercise 1, question 7, we use ggpairs() to look at relationships among variables. However, as you have all seen, it gives us all the graphs. This has some cons: some of them are not needed for answering this question, and the graphs are so small that it is hard to see the relationships. Therefore I was wondering: is there a way to exclude all the graphs other than the ones we need (the ones that show the relationship of bodyfat with all other variables)?
Answer given:
Hi! You could do it like this:
```r
bodyfat_dataset %>%
  select(bodyfat, age, abdomen, height, weight) %>%
  ggpairs()
```
Lecture4 page32
Student question (2021):
Could someone please look at our answers for the questions on this page? we are not sure if our answers are correct and which other interpretations are possible.
- x1 is an important explanatory variable because if only x1 is used, then R2 and the adjusted R2 are high and the p-value is small (which means that the slope for x1 differs significantly from zero?)
- same answer for x2
- In this model we only need x1 or x2, not both. Reason: the R2 does not become much higher if we use both compared to the situation where we only use one of them. x1 and x2 are also positively correlated: if x1 increases, then x2 will increase too.
- Interpretation: There is a positive correlation between y and each of the two x’s: if x1 or x2 increases, then the catheter length will increase too.
Thank you very much in advance for your answer.
Answer given:
- I think that your answers for questions 1 and 2 are fine (but see the answer to question 3, which is relevant here as well).
- Question 3: you will see this later in the course (so for now it is completely fine to just answer these questions as well as you can, and if you want to you can ask, like you did), but the answer to this question is less straightforward and depends on our goals. In general, in regression there are 2 goals: to predict and to explain.
- To predict means that we want to be able to predict the response variable y as well as we can, and we do not really care how we achieve this (i.e. which variables we use); we might just look at the adjusted r-squared value and pick the model with the highest value. There is a mistake on slide 31 and the adjusted r-squared for the model with both x1 and x2 is not shown, but it is 0.76. So in this case we would probably just take the model with just x2 (adjusted r-squared: 0.78).
- To explain means that we are interested in the relation between the explanatory variables and the response variable (e.g. what does a unit increase in x1 mean for y?). In this context, we can for instance fit two separate models for the two explanatory variables, or if we are only interested in one of the two, we just use that one. As I said, you will see this again later in the course.
- Question 4: with this new information and what I already wrote above please try again on your own to interpret the model with both x1 and x2 (e.g. are the slopes estimated to be significant?).
I hope this clears things up a little bit. You will see that things will get clearer as you progress through the course as these things will come up more than once. But feel free to ask for further clarifications.
Anova Degrees of Freedom
Student question (2021):
Hello, I am quite confused about the degrees of freedom in ANOVA. In the BC reading it says that the degrees of freedom are the number of groups one has. For the one-way ANOVA example in the lecture this would mean 4, so 20-4 = 16. However, in the slides it says the degrees of freedom are n-1, which would result in 19. When should I use which method, or in other words, when asked for the degrees of freedom of an ANOVA, which number would be expected in an exam? Thanks for your help!
Answer given: There is more than one type of degree of freedom involved, so first some theory. If you look at slide 18 of lecture 6, you can see that it is the total variability SS_{total} that has n-1 degrees of freedom (i.e. we need 1 degree of freedom to calculate it). In ANOVA, we partition this total variability into the variability explained by the model, SS_{between~groups}, and the residual variability, SS_{within~groups}. That is:

SS_{total} = SS_{between~groups} + SS_{within~groups}

To calculate the explained variability we need g-1 degrees of freedom (with g the number of groups), which leaves us with n-1-(g-1) = n-g degrees of freedom for the residual variability. We then test with the F-test whether the explained variability is significant. Under H0 the calculated test statistic

F = (SS_{between~groups}/(g-1)) / (SS_{within~groups}/(n-g))

follows an F-distribution with degrees of freedom g-1 and n-g, i.e. F ~ F_{g-1, n-g}, so that’s why we need those degrees of freedom.
Now, notice that the number of degrees of freedom used for the explained variability is g-1, i.e. one less than there are groups. In addition to this, the intercept of the model is estimated as well, which uses up another degree of freedom. Hence the total number of degrees of freedom used by the ANOVA model is g-1+1 = g, and the remaining degrees of freedom are n-g (in your example 20-4 = 16). Note that the intercept does not come into play in the variance decomposition for the F-test, hence it is not listed in the ANOVA table on slide 18.
I hope this clears things up a little.
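The 20-4 = 16 example can be reproduced with simulated data, using 4 groups of 5 observations each (group names and means are made up):

```r
# n = 20 observations in g = 4 groups
set.seed(2)
group <- factor(rep(c("A", "B", "C", "D"), each = 5))
y     <- rnorm(20, mean = c(10, 12, 11, 13)[as.integer(group)])

aov_tab <- anova(lm(y ~ group))
aov_tab
# Df column: g-1 = 3 for group (explained), n-g = 16 for the Residuals
```

Note that the n-1 = 19 total degrees of freedom do not appear as a row here; they are simply the sum 3 + 16.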
Unit 10 Abalone Age
Student question (2021):
I have a question concerning the interpretation of the data and I was a bit confused with the terms of over-dispersion and under-dispersion. In the interpretation of the data it says:
“So, interestingly, and rather unusually for biological data, the data is under-dispersed” and “This is another example of a model being anti-conservative.”
And in the summary table we can see quite low p-values.
But in the lecture slides (lecture 10, slide 38) we learned that “When there is unaccounted over-dispersion, the p-values that are calculated are usually too small!” And on slide 41, which talks about under-dispersion, it says “In that case, your p-values are usually too large, that is, the results are conservative”
So I don’t understand the correct connection between over/under-dispersion, low/high p-values and (anti-)/conservative.
Answer given:
It is just a matter of slightly unlucky placement: the sentence “This is another example of a model being anti-conservative.” comes after a horizontal line (denoting a new topic/paragraph/etc.) and is preceded by the sentence “So the poisson glm had fewer significant terms than the lm.” (Note: this refers to the old version on openedx, and not to the version on Olat.) Hence it refers to that sentence and has nothing to do with whether the model is over- or underdispersed. In fact, there are many reasons why something (a p-value, a confidence interval, etc.) can be (anti-)conservative, one being that a wrong model is used: the lm is the wrong model (because it’s count data) and it produced smaller p-values than the glm, thus in this case (!) the lm is anti-conservative. What is written about the dispersion parameter in the slides is correct.
Chr vs Factors
Student question (2021):
When I import my data, I often have “chr” rather than “factor”. This was the case for country and continent in the healthcare_financing.csv.
Should I do something specific to get the data directly as factors? If not, is there a way to change them all at once, rather than one after the other using as.factor?
Answer given:
As of R 4.0.0, character variables are read into R as characters instead of factors. Ahead of this course it was decided to keep it like this because linear models can be fitted with both types. So it is suggested that you do not change the variables to factors (in fact, the only time you might/will want to change them is when you want to change the levels within a factor… and for now at least this is not needed!). If you nevertheless want to change characters to factors, it might actually be good to do that one variable at a time, as you will then only convert what you need to convert and it gives you more control over it.
PS: if you really want to convert all characters to factors, you can do this (but again, no need to do it!)
`dd <- dd %>% mutate(across(where(is_character), as_factor))`
Changing reference in earthworm video example
Student question (2021):
In the earthworm analysis of the correlation between log.gewicht and Gattung: is it correct to think that if the reference Gattung had been “N”, which seems to have a similar mean to Oc, then the p-value in the linear regression model would only have been significant for L and would not have been significant for Oc? I am assuming that the means for N and Oc are statistically similar.
Answer given:
Yes, exactly. You can change the reference level with the following code, then run the model with that new Gattung variable, and check the summary table. (You must install the forcats package to use the fct_relevel function.)
```r
library(forcats)
dd <- dd %>%
  mutate(Gattung_refN_ = fct_relevel(Gattung, "N", after = 0))
```
Decomposing R^2
Student question (2021):
I have a question concerning the decomposition of R^2. In the lecture we learned that we should calculate the relative importance with the package relaimpo and the function calc.relimp. However, in the IC material, we learned how to calculate it more manually (R^2 of model_both - R^2 of model_weight). These two methods do not produce the same result so when should we use which method? Or did I understand something incorrectly in general?
Answer given:
There is more than one way to calculate relative importances of variables, and the various methods differ in the produced results. If you are asked to calculate them, you will be told which approach to use.
