Regression

  1. The term ‘regression’ was first used by Sir Francis Galton in connection with his studies of the statures of fathers and sons.
  2. It is a technique which determines a relationship between two variables to estimate one of the variables (dependent) for a given value of the other variable (independent).
  3. The variable whose value is to be estimated is called dependent variable (y) whereas the variable whose value is given is called independent variable (x).
  4. Examples of dependent and independent variables are:

Independent                 Dependent
Price                       Demand
Rainfall                    Yield
Credit sales                Bad debts
Volume of production        Manufacturing expenses

  5. The values of the independent variable are assumed to be fixed; hence it is not a random variable.  On the other hand, the dependent variable, whose values are determined on the basis of the independent variable, is a random variable.

  6. If x is the independent variable and y is the dependent variable, then a relationship between x and y described by a straight line (y = a + bx) is called a ‘linear relationship’.

Regression Lines:

  1. If we plot the paired observations (X1, Y1), (X2, Y2), …, (Xn, Yn) on a graph, the resulting set of points is called a ‘scatter diagram’.
  2. A scatter diagram indicates a relationship between the variables X and Y and the dots of the scatter diagram tend to cluster around a curve or a line.  Such a curve or line is known as ‘curve of regression’ or ‘line of regression’.

Linear Regression Model:

  1. For a fixed value of the independent variable x, if the value of the dependent variable y is observed a large number of times, different values are obtained each time because of the random error involved in the measurement process.  The mean of these y-values is called the ‘conditional mean of y given x’ and is denoted by μy/x.
  2. The linear relationship between μy/x and x is called the ‘population regression equation of y on x’:

μy/x = α + βx

where α and β are the parameters of the equation.

  3. An observation yi is the sum of the population mean μy/x and a component εi called the ‘random error’ (read as “epsilon”):

yi = μy/x + εi

or

yi = α + βxi + εi

This equation is called the ‘linear regression model of y on x’, and εi is a random variable with mean equal to zero and variance σ².
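The model above can be illustrated with a small simulation. The parameter values and sample size below are illustrative assumptions, not values from the text:

```python
import random

# Simulate the linear regression model y_i = alpha + beta*x_i + eps_i,
# where eps_i is a random error with mean 0 and variance sigma**2.
# alpha, beta and sigma are illustrative values (not from the text).
random.seed(0)                       # reproducible draws
alpha, beta, sigma = 2.0, 0.5, 1.0

xs = list(range(1, 101))             # x is fixed, not random
ys = [alpha + beta * x + random.gauss(0, sigma) for x in xs]

# Each y differs from the conditional mean alpha + beta*x only by the error:
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
```

Averaging many residuals gives a value near zero, reflecting that the error has mean zero.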

  4. The line μy/x = α + βx represents the population line of regression of Y on X.  The parameter α, which is the expected value of Y when X = 0, is called the Y-intercept.  The parameter β is the slope of the population regression line and is known as the ‘population regression coefficient’.  When the line slopes downward to the right, the value of β is negative; it then represents the amount of decrease in Y for each unit increase in X.

  5. In practice, the population regression line is unknown.  Since the line is defined by the Y-intercept α and the slope β, the task of estimating the population regression line involves obtaining estimates of α and β from sample data.  Thus the ‘population regression line’ (μy/x = α + βx) is estimated by the ‘sample regression line’ or ‘sample regression equation’:

ŷ = a + bx ------------------------ (i)

The problem of estimating the regression parameters α and β can be considered as fitting the best model on the scatter diagram.  One method for this purpose is the ‘method of least squares’.

Method of Least Squares:

  1. According to the principle of least squares, a line or curve is best fitted if the sum of squares of the deviations of the estimated values of y from the observed values of y is minimum.  Such a line or curve is called the ‘least squares line’ or ‘least squares curve’, and the sum of squares is called the ‘Error Sum of Squares (ESS)’.  Therefore, the quantity to be minimised is:

ESS = Σ(yi – ŷi)²

where   ESS     :       error sum of squares

        yi      :       observed values

        ŷi      :       estimated values, i.e., ŷi = a + bxi

Substituting ŷi, it is further elaborated as:

ESS = Σ(yi – a – bxi)²

  2. The statistic b, which is an estimator of β, is known as the ‘sample regression coefficient’.  It measures the change in y per unit change in x and therefore represents the slope of the regression line.  Mathematically it is represented as below:

b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)² ------------------------ (ii)(a)

b = [nΣxy – (Σx)(Σy)] / [nΣx² – (Σx)²] ------------------------ (ii)(b)

  3. The statistic a, the estimator of α, is called the ‘sample regression constant’; it measures the y-intercept of the sample regression line:

a = ȳ – bx̄ ------------------------ (iii)
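Formulas (ii)(b) and (iii) can be applied directly in code; a minimal sketch in Python (the data values are hypothetical, for illustration only):

```python
def fit_y_on_x(xs, ys):
    """Least-squares estimates a, b for the sample line y-hat = a + b*x,
    using b = [n*Sum(xy) - Sum(x)*Sum(y)] / [n*Sum(x^2) - (Sum(x))^2]  (ii)(b)
    and   a = y-bar - b * x-bar                                        (iii)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * (sx / n)
    return a, b

# Illustrative data (hypothetical):
a, b = fit_y_on_x([1, 2, 3, 4], [2, 4, 5, 8])   # a = 0.0, b = 1.9
```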

  4. Now assume y to be the independent variable and x to be the dependent variable.  The ‘regression equation of x on y’ is as follows:

x̂ = c + dy ------------------------ (i)

d = Σ(x – x̄)(y – ȳ) / Σ(y – ȳ)² ------------------------ (ii)(a)

d = [nΣxy – (Σx)(Σy)] / [nΣy² – (Σy)²] ------------------------ (ii)(b)

c = x̄ – dȳ ------------------------ (iii)
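The regression of x on y uses the same least-squares logic with the roles of the variables swapped; a sketch in Python (data values are hypothetical, for illustration):

```python
def fit_x_on_y(xs, ys):
    """Least-squares estimates c, d for the line x-hat = c + d*y,
    using d = [n*Sum(xy) - Sum(x)*Sum(y)] / [n*Sum(y^2) - (Sum(y))^2]  (ii)(b)
    and   c = x-bar - d * y-bar                                        (iii)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    syy = sum(y * y for y in ys)
    d = (n * sxy - sx * sy) / (n * syy - sy ** 2)
    c = sx / n - d * (sy / n)
    return c, d

# Same illustrative data as before:
c, d = fit_x_on_y([1, 2, 3, 4], [2, 4, 5, 8])
```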

Example:

A sample of paired observations is given as below:

X    2    4    6    7    9    10    11
Y    1    2    4    7    10   12    14

Required:

(a)    Fit a line of regression to the data in the above table.

(b)   Construct a scatter diagram and graph the fitted line on the scatter diagram, and

(c)    Calculate error sum of squares.

Solution:

(a):

Regression Line of Y on X

  x      y      xy     x²        ŷ       y – ŷ    (y – ŷ)²
  2      1       2      4     –0.438     1.438      2.068
  4      2       8     16      2.594    –0.594      0.353
  6      4      24     36      5.626    –1.626      2.644
  7      7      49     49      7.142    –0.142      0.020
  9     10      90     81     10.174    –0.174      0.030
 10     12     120    100     11.690     0.310      0.096
 11     14     154    121     13.206     0.794      0.630
 49     50     447    407     49.994     0.006      5.841
                              (≈ 50)    (≈ 0)

ŷ = a + bx -------------------- (i)

b = [nΣxy – (Σx)(Σy)] / [nΣx² – (Σx)²] = [7(447) – (49)(50)] / [7(407) – (49)²] = 679/448 = 1.516 -------------------- (ii)

a = ȳ – bx̄ = 7.142 – 1.516(7) = 7.142 – 10.612 = –3.470 ------------------------- (iii)

Hence the fitted regression line is ŷ = –3.470 + 1.516x.

For     x = 2,   ŷ = –3.470 + 1.516(2)  = –0.438

        x = 4,   ŷ = –3.470 + 1.516(4)  = 2.594

        x = 6,   ŷ = –3.470 + 1.516(6)  = 5.626

        x = 7,   ŷ = –3.470 + 1.516(7)  = 7.142

        x = 9,   ŷ = –3.470 + 1.516(9)  = 10.174

        x = 10,  ŷ = –3.470 + 1.516(10) = 11.690

        x = 11,  ŷ = –3.470 + 1.516(11) = 13.206

(b): Scatter diagram of the data with the fitted line ŷ = –3.470 + 1.516x drawn through it (figure not reproduced).

(c) Error Sum of Squares (ESS):

ESS = Σ(yi – ŷi)² = 5.841
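The hand computation above can be checked in Python with the example data; the small differences from the table arise because the table rounds b and ȳ before computing the fitted values:

```python
xs = [2, 4, 6, 7, 9, 10, 11]
ys = [1, 2, 4, 7, 10, 12, 14]
n = len(xs)

sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # 679/448, about 1.516
a = sy / n - b * (sx / n)                       # about -3.467 (table rounds to -3.470)

# Error sum of squares about the fitted line:
ess = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # about 5.84
```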

Coefficient of Determination:

  1. A measure of variation in a sample of n values of y is given by the sample variance:

s² = Σ(y – ȳ)² / (n – 1)

It measures the variation in y about the sample mean ȳ.  The term Σ(y – ȳ)² is called the ‘Total Sum of Squares (TSS)’.

  2. Another measure of variation in a sample of n paired values is called the ‘variance of estimate’:

s²y.x = Σ(y – ŷ)² / (n – 2)

It measures the variation in y about the estimated regression line.  The term Σ(y – ŷ)² is called the ‘Error Sum of Squares (ESS)’.  Since the fitted line lies at least as close to the y-values as the horizontal line through ȳ:

ESS ≤ TSS

  3. The ‘Regression Sum of Squares (RSS)’ is the excess of TSS over ESS:

RSS = TSS – ESS

Therefore, the TSS is partitioned into two components, i.e., ESS and RSS:

TSS = RSS + ESS

  4. RSS is the variation in y reduced (or explained) by the regression equation, and ESS is the variation which remains (or is unexplained) in y when the regression line is fitted.  Thus the total variation is divided into two parts: explained variation and unexplained variation.
  5. RSS is used as a measure of the reliability of the estimates obtained from the fitted regression line.  For this purpose the proportion of variation explained by the regression equation, called the ‘Coefficient of Determination’ and denoted by r², is calculated as:

r² = RSS / TSS

Note that the minimum value of r² is zero (when RSS = 0 and ESS = TSS), and the maximum value of r² is +1 (when RSS = TSS and ESS = 0); therefore, r² lies between 0 and 1:

0 ≤ r² ≤ 1

  6. Another formula is:

r² = 1 – ESS/TSS

  7. The coefficient of determination can also be obtained from the coefficients of the two regression equations:

r² = b × d
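Both routes to r² can be checked against each other in Python, using the example data given earlier in the text:

```python
xs = [2, 4, 6, 7, 9, 10, 11]
ys = [1, 2, 4, 7, 10, 12, 14]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope of y on x
d = (n * sxy - sx * sy) / (n * syy - sy ** 2)   # slope of x on y
a = sy / n - b * (sx / n)

tss = syy - sy ** 2 / n                          # Total Sum of Squares
ess = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r2_from_ess = 1 - ess / tss                      # formula r^2 = 1 - ESS/TSS
r2_from_bd = b * d                               # formula r^2 = b * d
```

The two formulas agree algebraically, since both reduce to [Σ(x – x̄)(y – ȳ)]² / [Σ(x – x̄)² Σ(y – ȳ)²].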

Example:

Take the previous example, and calculate the coefficient of determination.

Solution:

Coefficient of Determination

  x      y      xy     x²     y²
  2      1       2      4      1
  4      2       8     16      4
  6      4      24     36     16
  7      7      49     49     49
  9     10      90     81    100
 10     12     120    100    144
 11     14     154    121    196
 49     50     447    407    510

TSS = Σy² – (Σy)²/n = 510 – (50)²/7 = 510 – 357.143 = 152.857

ESS = 5.841 (from part (c) of the previous example)

RSS = TSS – ESS = 152.857 – 5.841 = 147.016

r² = RSS/TSS = 147.016/152.857 = 0.962

Hence about 96.2% of the variation in y is explained by the regression on x.
