# Darren’s revision essay on regression

Regression

Predicting future outcomes is something that fascinates even the simplest of laypeople. The ability to take certain factors and create a mathematical model that significantly predicts the likelihood of something occurring is of extreme interest in many areas, including business, molecular biology and the social sciences.

To create a statistical model that is useful for making such predictions, it’s important to be aware of the different factors that influence an outcome. The relationship between the predictor variables and the outcome variable will define how effective such a model is likely to be. We shall call this a correlation between predictors and an outcome.

If there is some kind of strong relationship between predictors and an outcome, then various methods can be applied in order to create a model which best fits the data, allowing us to have a standard model that can be used to predict future outcomes. This type of model will be referred to from here on as a regression model.

Now, it’s a given that every straight line has an equation of the form Y = Mx + C, where:

Y = the outcome

M = the ‘slope’ of the line

x = the predictor

C = the height at which the line crosses the Y axis

A real-world example might help at this stage. Imagine that you are an events promoter who organises dance music raves and you want to create a model that predicts how successful a future event will be if you go ahead and hold it. As a promoter, you might feel that the best predictor of whether your planned event goes well is how many flyers you print to promote it. After all, it makes sense that printing 20 flyers and passing them out to your friends is very different from printing 40,000 flyers and distributing them to thousands of clubbers as they leave other venues.

With these factors in mind, you might have an equation in your mind that reads:

Success of event = (flyers_printed x .43) + 5.67

(NB. The numerical figures are random and have been thrown in to illustrate the point)

Graphically, this would look like a straight line drawn through a scatter plot, where the plotted red dots are the success rates of past events (y) and their relation to the number of flyers printed (x).
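To make the idea of fitting a line concrete, here is a minimal sketch in Python. The event data are invented, and the success scores are generated directly from the illustrative equation above, so the fit simply recovers the slope and intercept we started with. The names `flyers_printed`, `success` and `predict_success` are assumptions made for this example.

```python
import numpy as np

# Hypothetical past events: flyers printed (x) and a success score (y).
# The scores come straight from the illustrative equation in the text,
# so there is no noise and the fitted line reproduces it.
flyers_printed = np.array([20.0, 500.0, 1000.0, 5000.0, 10000.0, 40000.0])
success = 0.43 * flyers_printed + 5.67

# Least-squares fit of a straight line; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(flyers_printed, success, deg=1)

def predict_success(n_flyers):
    """Predict the success score of a planned event from the fitted line."""
    return slope * n_flyers + intercept

print(f"M = {slope:.2f}, C = {intercept:.2f}")
print(f"Predicted success with 2000 flyers: {predict_success(2000):.2f}")
```

With real data the points would scatter around the line rather than sit on it, and the fitted M and C would be estimates rather than exact values.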

Multiple regression

Up until this point, the explanation has contained just one predictor: the number of flyers printed to promote an event, and its relation to how successful the event will be. Now, it’s obvious that staging an event has more to it than printing flyers; several other factors are likely to predict whether an event will be successful. Things that an event promoter has to consider include:

 Size of promotional team

 Time of year

 How many competitors are holding similar types of event in the same month

 Level of confidence that the promoter has

 And much much more.

If the promoters wished to expand on the original model, they would have to include these additional variables and give the variables a value that implies the strength of the relationship with the success of the event. The formula for this looks something like:

Y = m1x1 + m2x2 + m3x3 + C

Or in plain English, something more like

Success of event = (flyers_printed x .43) + (teamsize x .64) + (expenditure x .32) + 5.67

(NB. The numerical figures are random and have been thrown in to illustrate the point)
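The same kind of fit works with several predictors at once. The sketch below is only an illustration: the six events are made up, the outcomes are generated from the illustrative equation above, and the variable names are assumptions. It uses ordinary least squares to recover m1, m2, m3 and C.

```python
import numpy as np

# Hypothetical data for six past events (all values invented for illustration).
flyers_printed = np.array([100.0, 500.0, 1000.0, 2000.0, 5000.0, 8000.0])
teamsize = np.array([2.0, 3.0, 5.0, 4.0, 8.0, 10.0])
expenditure = np.array([50.0, 200.0, 150.0, 400.0, 900.0, 700.0])

# Outcomes generated exactly from the illustrative equation in the text.
success = 0.43 * flyers_printed + 0.64 * teamsize + 0.32 * expenditure + 5.67

# Design matrix: one column per predictor plus a column of ones for C.
X = np.column_stack([flyers_printed, teamsize, expenditure, np.ones(len(success))])

# Ordinary least squares: finds the coefficients that minimise squared error.
coefs, *_ = np.linalg.lstsq(X, success, rcond=None)
m1, m2, m3, c = coefs

print(f"m1 = {m1:.2f}, m2 = {m2:.2f}, m3 = {m3:.2f}, C = {c:.2f}")
```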

Enter (or forced entry)

This is where all predictors are put into the model at the same time. This can be problematic if predictors share some of the variance they explain in the outcome variable, because the shared variance is not credited to any single predictor when their individual contributions are assessed.
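To see what shared variance looks like in practice, here is a rough sketch using simulated data. Everything in it is an assumption for the demo: two predictors are deliberately made to correlate, and each predictor's unique contribution is taken to be the drop in R2 when it is removed from the full model. Whatever is left of the full model's R2 after the two unique parts are subtracted is the part they share.

```python
import numpy as np

def r_squared(predictors, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ coefs) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 500
flyers = rng.normal(size=n)
# Team size is made to correlate strongly with flyers (an assumption for the demo).
teamsize = 0.8 * flyers + 0.6 * rng.normal(size=n)
success = flyers + teamsize + rng.normal(size=n)

r2_both = r_squared([flyers, teamsize], success)
# Unique contribution = drop in R^2 when that predictor is left out.
unique_flyers = r2_both - r_squared([teamsize], success)
unique_team = r2_both - r_squared([flyers], success)
shared = r2_both - unique_flyers - unique_team

print(f"R2 with both predictors: {r2_both:.3f}")
print(f"Unique to flyers: {unique_flyers:.3f}, unique to team size: {unique_team:.3f}")
print(f"Shared between them: {shared:.3f}")
```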

Hierarchical regression

Sometimes the variables involved in the success of an event are heavily related to each other. This creates an issue: if two predictor variables are related, how do you know which one should be credited with an effect on the outcome when a proportion of that effect is shared by both predictors?

In this type of regression, the general idea is the same, but the predictors are entered into the regression equation in turn. This way, each predictor can account for its share of the outcome before the later predictors have a chance to contribute theirs. Predictors can be entered into the model at different times in order to ascertain how much each one contributes towards the outcome. The different methods of inclusion may include:

Stepwise

o Forward

 Enters one variable at a time, starting with the one that accounts for most of the variance in the outcome variable. If a predictor falls below a certain threshold, it is excluded from the model

o Backward

 Does the opposite of Forward, where all variables are entered simultaneously and then if any variable that affects the model ve
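The forward method can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular statistics package does it: the data are simulated, the function names are made up, and the entry threshold `min_gain` is a hypothetical cut-off on the gain in R2 (packages typically use a significance test instead).

```python
import numpy as np

def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ coefs) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def forward_selection(predictors, y, min_gain=0.01):
    """Greedy forward entry. `predictors` maps a name to a 1-D array.
    `min_gain` is a hypothetical threshold on the improvement in R^2."""
    selected = []
    current_r2 = 0.0
    remaining = dict(predictors)
    while remaining:
        # R^2 obtained by adding each remaining candidate to the current model.
        trial = {
            name: r_squared([predictors[s] for s in selected] + [col], y)
            for name, col in remaining.items()
        }
        best = max(trial, key=trial.get)
        if trial[best] - current_r2 < min_gain:
            break  # no remaining predictor clears the threshold
        selected.append(best)
        current_r2 = trial[best]
        del remaining[best]
    return selected, current_r2

rng = np.random.default_rng(1)
n = 300
data = {
    "flyers_printed": rng.normal(size=n),
    "teamsize": rng.normal(size=n),
    "promoter_confidence": rng.normal(size=n),  # unrelated to the outcome here
}
success = 2.0 * data["flyers_printed"] + 1.0 * data["teamsize"] + rng.normal(size=n)

order, final_r2 = forward_selection(data, success)
print("Entry order:", order)
print(f"Final R2: {final_r2:.3f}")
```

In this simulated example the strongest predictor enters first, and the predictor that has nothing to do with the outcome never clears the threshold.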

Logistic regression

This type of regression is ultimately doing the same thing as a linear model in terms of aiming to predict an outcome. The key difference is that the outcome variable is categorical. Using the example of the event promoter, this might mean that a model is needed to predict whether or not it’s a good idea to hold an event at all.

With logistic regression, the researcher is predicting a dichotomous outcome coded as 1 or 0, and the model expresses that categorical outcome as a probability value between 0 and 1. This poses problems for the assumption of linear regression that the errors (residuals) are normally distributed.

Instead, they are more likely to follow a logistic distribution. When using the logistic distribution, we need to make an algebraic conversion (taking the log of the odds of the outcome) to arrive at our usual linear regression equation (Y = Mx + C).
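A small sketch may help show what this conversion does. The linear part of the model has the familiar Mx + C form, but it lives on the log-odds scale; the logistic function then turns it into a probability. The coefficient values below are invented purely for illustration, and the function names are assumptions.

```python
import math

def log_odds(flyers_printed, m=0.0005, c=-2.0):
    """Linear predictor, same form as Y = Mx + C, on the log-odds scale.
    The values of m and c are invented for illustration."""
    return m * flyers_printed + c

def probability(log_odds_value):
    """Logistic function: converts log odds into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds_value))

def logit(p):
    """Inverse of the logistic function: the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

p_small = probability(log_odds(20))    # very few flyers
p_large = probability(log_odds(8000))  # many flyers

print(f"P(success) with 20 flyers:   {p_small:.2f}")
print(f"P(success) with 8000 flyers: {p_large:.2f}")
```

However large or small the linear part becomes, the resulting probability always stays between 0 and 1, which is what a straight line on its own cannot guarantee.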

With logistic regression, there is no standardized solution printed. To make things more complicated, the unstandardized solution does not have the same straightforward interpretation as it does with linear regression.

One other difference between the linear model and logistic regression is that there is no R2 to gauge the variance accounted for in the overall model (at least not one that has been agreed upon by statisticians). Instead, a chi-square test is used to indicate how well the logistic regression model fits the data. The model can also be tested for goodness of fit using the log likelihood, which assesses the model much as the residual sum of squares would in linear regression. To assess the contribution of a variable to the model, the Wald statistic is calculated by dividing each coefficient by its standard error, in much the same way as the t test in linear regression.
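The sketch below shows, under stated assumptions, how these quantities could be computed by hand. The data are simulated, the fitting routine is a plain Newton-Raphson loop written for this example rather than any package's method, and the baseline is taken to be an intercept-only model. The Wald value is computed as coefficient divided by standard error, as described above; some packages report the square of this ratio instead.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.
    X must already include a column of ones for the intercept.
    Returns the coefficients and their standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        information = X.T @ (X * w[:, None])
        gradient = X.T @ (y - p)
        beta = beta + np.linalg.solve(information, gradient)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    covariance = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, np.sqrt(np.diag(covariance))

def log_likelihood(X, y, beta):
    """Log likelihood of the observed 0/1 outcomes under the model."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

rng = np.random.default_rng(2)
n = 400
flyers = rng.normal(size=n)  # standardised, hypothetical predictor
true_p = 1.0 / (1.0 + np.exp(-(0.3 + 1.2 * flyers)))
event_succeeded = (rng.random(n) < true_p).astype(float)

X_model = np.column_stack([np.ones(n), flyers])
X_base = np.ones((n, 1))  # baseline model: intercept only

beta, se = fit_logistic(X_model, event_succeeded)
beta_base, _ = fit_logistic(X_base, event_succeeded)

neg2ll_model = -2.0 * log_likelihood(X_model, event_succeeded, beta)
neg2ll_base = -2.0 * log_likelihood(X_base, event_succeeded, beta_base)
model_chi_square = neg2ll_base - neg2ll_model  # improvement over the baseline
wald = beta / se                               # coefficient / standard error

print(f"-2LL baseline: {neg2ll_base:.1f}, -2LL model: {neg2ll_model:.1f}")
print(f"Model chi-square: {model_chi_square:.1f}")
print(f"Wald for flyers: {wald[1]:.2f}")
```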

Comparisons & Contrasts

Variables can, if necessary, be entered into the model in the order specified by the researcher, in a stepwise fashion, as with linear regression.

Logistic regression can predict categorical outcomes; linear regression cannot.

Assumptions of Logistic regression

Linearity: logistic regression uses a categorical outcome, which cannot be linearly related to the predictors as it stands, but the model works with the log of the odds, so the binary outcome (0, 1) is modelled through a probability (0 to 1). Therefore, a linear relationship is assumed between the predictors and the log odds (logit) of the outcome.

Independence of errors (same as for linear regression): the data are unrelated, e.g. a single person is sampled only once.

Multicollinearity: if multiple predictors are used, they should not be too highly correlated (otherwise their combination will be redundant!).

The Wald statistic is used to assess how well a predictor is contributing to the model.

The log likelihood is the logistic equivalent of the method of least squares: it compares the log likelihood of the baseline model to that of the actual model in order to find how good a fit it is for the data.

Linear regression: R2, F, t

Logistic regression: -2LL (compares model to baseline), Wald (assesses contribution of each predictor)

Logistic Regression

Predicting a categorical outcome (Y) from a continuous variable (X).

Builds model to predict the probability of outcome: P(Y)

Model is non-linear and based on natural logarithms.

Log-likelihood is used to compare baseline model (most frequent outcome) to model based on predictors.

A derived form of R2 (e.g. Nagelkerke) is used to assess goodness of fit.

Contribution of each predictor identified using Wald Statistic.
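As a final illustration, Nagelkerke's R2 can be worked out from the two log likelihoods. The sketch below uses the standard formulas for the Cox and Snell R2 and its Nagelkerke rescaling; the log-likelihood values and the sample size are made up for the example.

```python
import math

def cox_snell_r2(ll_baseline, ll_model, n):
    """Cox and Snell R^2 from the baseline and model log likelihoods."""
    return 1.0 - math.exp(2.0 * (ll_baseline - ll_model) / n)

def nagelkerke_r2(ll_baseline, ll_model, n):
    """Nagelkerke R^2: Cox and Snell R^2 rescaled so its maximum is 1."""
    max_cox_snell = 1.0 - math.exp(2.0 * ll_baseline / n)
    return cox_snell_r2(ll_baseline, ll_model, n) / max_cox_snell

# Hypothetical log likelihoods for a sample of 100 events.
ll_baseline = -69.0
ll_model = -50.0
r2 = nagelkerke_r2(ll_baseline, ll_model, n=100)

print(f"Nagelkerke R2: {r2:.3f}")
```

A model that is no better than the baseline gives a value of 0, and a model that predicts every outcome perfectly gives a value of 1.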