Regression analysis aims to summarise the relationship between a ‘dependent’ variable and one or more ‘independent’ variables. It shows how well we can estimate a respondent’s score on the dependent variable from knowledge of their scores on the independent variables. It is often undertaken to support a claim that the phenomena measured by the independent variables cause the phenomenon measured by the dependent variable. However, the causal ordering, if any, between the variables cannot be verified or falsified by the technique. Causality can only be inferred through special experimental designs or through assumptions made by the analyst.
All regression analysis assumes that the relationship between the dependent and each of the independent variables takes a particular form. In linear regression, it is assumed that the relationship can be adequately summarised by a straight line. This means that a one-unit increase in the value of an independent variable is assumed to have the same impact on the value of the dependent variable on average, irrespective of the previous values of those variables.
Strictly speaking the technique assumes that both the dependent and the independent variables are measured on an interval-level scale, although it may sometimes still be applied even where this is not the case. For example, one can use an ordinal variable (e.g. a Likert scale) as a dependent variable if one is willing to assume that there is an underlying interval-level scale and the difference between the observed ordinal scale and the underlying interval scale is due to random measurement error. Often the answers to a number of Likert-type questions are averaged to give a dependent variable that is more like a continuous variable. Categorical or nominal data can be used as independent variables by converting them into dummy or binary variables; these are variables where the only valid scores are 0 and 1, with 1 signifying membership of a particular category and 0 otherwise.
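The conversion of a categorical variable into dummy variables can be sketched as follows. This is a minimal illustration only: the function name `dummy_code`, the tenure data and the choice of reference category are all hypothetical. It also assumes the common practice of leaving one category (the reference) without a dummy of its own, which the text above does not discuss.

```python
def dummy_code(values, reference):
    """Turn a categorical variable into 0/1 dummy variables.

    Assumption: one category (the reference) gets no dummy of its own,
    so that the dummies are not perfectly collinear with the intercept.
    """
    categories = sorted(set(values) - {reference})
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical data: housing tenure of five respondents.
tenure = ["owner", "renter", "owner", "other", "renter"]
dummies = dummy_code(tenure, reference="owner")
print(dummies)  # {'other': [0, 0, 0, 1, 0], 'renter': [0, 1, 0, 0, 1]}
```

Each dummy scores 1 for respondents in that category and 0 otherwise; respondents in the reference category score 0 on every dummy.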
The assumptions of linear regression cause particular difficulties where the dependent variable is binary. The assumption that the relationship between the dependent and the independent variables is a straight line means that it can produce estimated values for the dependent variable of less than 0 or greater than 1. In this case it may be more appropriate to assume that the relationship between the dependent and the independent variables takes the form of an S-curve, where the impact on the dependent variable of a one-point increase in an independent variable becomes progressively smaller as the value of the dependent variable approaches 0 or 1. Logistic regression is an alternative form of regression which fits such an S-curve rather than a straight line. The technique can also be adapted to analyse multinomial dependent variables, that is, non-interval-level variables which classify respondents into more than two categories.
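The contrast between the straight line and the S-curve can be illustrated with a short sketch. The coefficients below are hypothetical, chosen only to make the shapes visible; they are not estimates from any survey.

```python
import math


def linear_probability(x):
    """Straight-line model with hypothetical coefficients."""
    return 0.5 + 0.25 * (x - 4)


def logistic_probability(x, intercept=-4.0, slope=1.0):
    """S-curve model: the logistic function of a straight line."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Impact of a one-point increase in x at each starting value 0..7.
effects = [logistic_probability(x + 1) - logistic_probability(x) for x in range(8)]

for x in range(9):
    print(x, round(linear_probability(x), 3), round(logistic_probability(x), 3))
```

Under these assumptions the straight line yields estimates below 0 and above 1 at the extremes, whereas the logistic estimates always lie between 0 and 1, and the values in `effects` are largest in the middle of the range and smallest at either end.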
The two statistical scores most commonly reported from the results of regression analyses are:
A measure of variance explained: This summarises how well all the independent variables combined can account for the variation in respondents’ scores in the dependent variable. The higher the measure, the more accurately we are able in general to estimate the correct value of each respondent’s score on the dependent variable from knowledge of their scores on the independent variables.
A parameter estimate: This shows how much the dependent variable will change on average, given a one-unit change in the independent variable (while holding all other independent variables in the model constant). The parameter estimate has a positive sign if an increase in the value of the independent variable results in an increase in the value of the dependent variable. It has a negative sign if an increase in the value of the independent variable results in a decrease in the value of the dependent variable. If the parameter estimates are standardised, it is possible to compare the relative impact of different independent variables; those variables with the largest standardised estimates can be said to have the biggest impact on the value of the dependent variable.
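Both scores can be computed by hand in the simplest case of a single independent variable. The sketch below makes that simplifying assumption (so there are no other variables to hold constant), and the data and the function name `simple_regression` are hypothetical.

```python
from statistics import mean, pstdev


def simple_regression(x, y):
    """Least-squares fit of y on a single independent variable x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx                      # parameter estimate
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]
    ss_residual = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1 - ss_residual / ss_total  # variance explained
    # Standardised estimate: change in y, in standard deviations,
    # for a one standard deviation change in x.
    standardised = slope * pstdev(x) / pstdev(y)
    return slope, intercept, r_squared, standardised


# Hypothetical scores for five respondents.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, r_squared, standardised = simple_regression(x, y)
print(slope, intercept, r_squared, standardised)
```

For these figures the parameter estimate is 0.6 and the variance explained is 0.6 (60 per cent). With several independent variables the same quantities are normally obtained from a statistical package rather than by hand.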
Regression also tests for the statistical significance of parameter estimates. A parameter estimate is said to be significant at the five per cent level if the values encompassed by its 95 per cent confidence interval (see also section on sampling errors) are either all positive or all negative. This means that, if no relationship actually existed in the general population, there would be less than a five per cent chance of sampling error alone producing an association as strong as the one we have found between the dependent variable and the independent variable.
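The confidence interval rule described above can be expressed in a few lines. This sketch assumes a large sample, so that the interval can be based on the normal distribution; the estimates and standard errors are hypothetical, and the function names are illustrative.

```python
from statistics import NormalDist


def confidence_interval(estimate, standard_error, level=0.95):
    """Large-sample (normal approximation) confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95 per cent
    return estimate - z * standard_error, estimate + z * standard_error


def is_significant(estimate, standard_error, level=0.95):
    """True if the whole interval lies on one side of zero."""
    lower, upper = confidence_interval(estimate, standard_error, level)
    return lower > 0 or upper < 0


# Hypothetical parameter estimates and standard errors.
print(is_significant(0.6, 0.2))   # interval is entirely positive
print(is_significant(0.6, 0.4))   # interval spans zero
print(is_significant(-0.5, 0.1))  # interval is entirely negative
```

An estimate of 0.6 with a standard error of 0.2 has an interval of roughly 0.21 to 0.99 and so is significant at the five per cent level; the same estimate with a standard error of 0.4 is not, because its interval includes zero.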
Factor analysis is a statistical technique which aims to identify whether there are one or more apparent sources of commonality to the answers given by respondents to a set of questions. It ascertains the smallest number of factors (or dimensions) which can most economically summarise all of the variation found in the set of questions being analysed. Factors are established where respondents who gave a particular answer to one question in the set tended to give the same answer as each other to one or more of the other questions in the set. The technique is most useful when a relatively small number of factors are able to account for a relatively large proportion of the variance in all of the questions in the set.
The technique produces a factor loading for each question (or variable) on each factor. Where questions have a high loading on the same factor, respondents who gave a particular answer to one of these questions tended to give similar answers to the others. The technique is most commonly used in attitudinal research to try to identify the underlying ideological dimensions which apparently structure attitudes towards the subject in question.
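How loadings arise from the correlations between questions can be sketched with a simplified example. Note the assumptions: this uses principal-components extraction by power iteration, which is only one of several ways of extracting factors and is not necessarily the method used for the analyses reported here; there is no rotation; and the correlation matrix is hypothetical, with two pairs of questions that correlate within each pair but not across pairs.

```python
import math


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def leading_component(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    eigenvalue = sum(v * p for v, p in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


def principal_loadings(correlations, n_factors):
    """Loadings of each question on the first n_factors components."""
    residual = [row[:] for row in correlations]
    size = len(residual)
    loadings = []
    for _ in range(n_factors):
        eigenvalue, vector = leading_component(residual)
        loadings.append([v * math.sqrt(eigenvalue) for v in vector])
        # Remove the extracted component before looking for the next one.
        residual = [
            [residual[i][j] - eigenvalue * vector[i] * vector[j] for j in range(size)]
            for i in range(size)
        ]
    return loadings


# Hypothetical correlations between four questions.
correlations = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.7, 1.0],
]
loadings = principal_loadings(correlations, n_factors=2)
for factor in loadings:
    print([round(value, 3) for value in factor])
```

In this artificial case the first two questions load highly (about 0.95) on the first factor and not at all on the second, while the last two questions load highly (about 0.92) on the second factor only, so two factors summarise most of the variation in four questions.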
- Until 1991 all British Social Attitudes samples were drawn from the Electoral Register (ER). However, following concern that this sampling frame might be deficient in its coverage of certain population subgroups, a ‘splicing’ experiment was conducted in 1991. We are grateful to the Market Research Development Fund for contributing towards the costs of this experiment. The purpose of the experiment was to investigate whether a switch to the Postcode Address File (PAF) would disrupt the time-series – for instance, by lowering response rates or affecting the distribution of responses to particular questions. In the event, it was concluded that the change from ER to PAF was unlikely to affect time trends in any noticeable way, and that no adjustment factors were necessary. Since significant differences in efficiency exist between PAF and ER, and because we considered it untenable to continue to use a frame that is known to be biased, we decided to adopt PAF as the sampling frame for future British Social Attitudes surveys. For details of the PAF/ER ‘splicing’ experiment, see Lynn and Taylor (1995).
- This includes households not containing any adults aged 18 or over, vacant dwelling units, derelict dwelling units, non-resident addresses and other deadwood.
- In 1993 it was decided to mount a split-sample experiment designed to test the applicability of Computer-Assisted Personal Interviewing (CAPI) to the British Social Attitudes survey series. CAPI has been used increasingly over the past decade as an alternative to traditional interviewing techniques. As the name implies, CAPI involves the use of a laptop computer during the interview, with the interviewer entering responses directly into the computer. One of the advantages of CAPI is that it significantly reduces both the amount of time spent on data processing and the number of coding and editing errors. There was, however, concern that a different interviewing technique might alter the distribution of responses and so affect the year-on-year consistency of British Social Attitudes data.
Following the experiment, it was decided to change over to CAPI completely in 1994 (the self-completion questionnaire still being administered in the conventional way). The results of the experiment are discussed in the British Social Attitudes 11th Report (Lynn and Purdon, 1994).
- Interview times recorded as less than 20 minutes were excluded, as these timings were likely to be errors.
- An experiment was conducted on the 1991 British Social Attitudes survey (Jowell et al., 1992) which showed that sending advance letters to sampled addresses before fieldwork begins has very little impact on response rates. However, interviewers do find that an advance letter helps them to introduce the survey on the doorstep, and a majority of respondents have said that they preferred some advance notice. For these reasons, advance letters have been used on British Social Attitudes surveys since 1991.
- Because of methodological experiments on scale development, the exact items detailed in this section have not been asked on all versions of the questionnaire each year.
- In 1994 only, this item was replaced by: Ordinary people get their fair share of the nation’s wealth. [Wealth1]
- In constructing the scale, a decision had to be taken on how to treat missing values (“Don’t know” and “Not answered”). Respondents who had more than two missing values on the left–right scale, or more than three missing values on the libertarian–authoritarian or welfarism scale, were excluded from that scale. For respondents with fewer missing values, “Don’t know” was recoded to the mid-point of the scale and “Not answered” was recoded to the scale mean for that respondent on their valid items.
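The treatment of missing values described in this note can be sketched as follows. The sketch rests on assumptions not stated in the note: items are taken to be scored from 1 to 5 (so the mid-point is 3), the scale score is taken to be the mean of the items after recoding, and the codes `"DK"` and `"NA"` and the function name `scale_score` are purely illustrative.

```python
DONT_KNOW = "DK"      # illustrative code for "Don't know"
NOT_ANSWERED = "NA"   # illustrative code for "Not answered"


def scale_score(answers, max_missing, midpoint=3.0):
    """Return a respondent's scale score, or None if they are excluded.

    Assumptions: items run from 1 to 5 (mid-point 3) and the score is
    the mean of the items after missing values have been recoded.
    """
    valid = [a for a in answers if a not in (DONT_KNOW, NOT_ANSWERED)]
    missing = len(answers) - len(valid)
    if missing > max_missing or not valid:
        return None  # too many missing values: excluded from the scale
    respondent_mean = sum(valid) / len(valid)
    recoded = []
    for answer in answers:
        if answer == DONT_KNOW:
            recoded.append(midpoint)          # mid-point of the scale
        elif answer == NOT_ANSWERED:
            recoded.append(respondent_mean)   # mean of their valid items
        else:
            recoded.append(answer)
    return sum(recoded) / len(recoded)


# Hypothetical respondents on a five-item scale allowing two missing values.
print(scale_score([1, 2, "DK", 2, "NA"], max_missing=2))
print(scale_score([1, "DK", "DK", "NA", 2], max_missing=2))  # None: excluded
```

Here `max_missing` would be 2 for the left–right scale and 3 for the libertarian–authoritarian and welfarism scales.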