
Correlation coefficients. Multiple correlation coefficient and coefficient of determination


  1. Evaluate the quality of the constructed model. Has the quality of the model improved compared with the single-factor model? Assess the impact of the significant factors on the result using elasticity coefficients, β-coefficients and Δ-coefficients.
To assess the quality of the selected multiple model (6), similarly to paragraph 1.4 of this problem, we use the coefficient of determination R², the mean relative approximation error, and Fisher's F test.

The coefficient of determination R² is taken from the "Regression" output (the "Regression Statistics" table for model (6)).

Consequently, according to this equation, 76.77% of the variation (change) in the apartment price Y is explained by variation in the regional city X₁, the number of rooms in the apartment X₂, and the living area X₄.

We use the original data Yᵢ and the residuals εᵢ found by the Regression tool (the "RESIDUAL OUTPUT" table for model (6)). Let's calculate the relative errors and find their average value:

δᵢ = |εᵢ| / Yᵢ · 100%,  δ̄ = (1/n)·Σδᵢ.

RESIDUAL OUTPUT

| Observation | Predicted Y  | Residuals     | Rel. error, % |
|-------------|--------------|---------------|---------------|
| 1           | 45.95089273  | -7.95089273   | 20.92340192   |
| 2           | 86.10296493  | -23.90296493  | 38.42920407   |
| 3           | 94.84442678  | 30.15557322   | 24.12445858   |
| 4           | 84.17648426  | -23.07648426  | 37.76838667   |
| 5           | 40.2537216   | 26.7462784    | 39.91981851   |
| 6           | 68.70572376  | 24.29427624   | 26.12287768   |
| 7           | 143.7464899  | -25.7464899   | 21.81905923   |
| 8           | 106.0907598  | 25.90924022   | 19.62821228   |
| 9           | 135.357993   | -42.85799303  | 46.33296544   |
| 10          | 114.4792566  | -9.47925665   | 9.027863476   |
| 11          | 41.48765602  | 0.512343975   | 1.219866607   |
| 12          | 103.2329236  | 21.76707636   | 17.41366109   |
| 13          | 130.3567798  | 39.64322022   | 23.3195413    |
| 14          | 35.41901876  | 2.580981242   | 6.7920559     |
| 15          | 155.4129693  | -24.91296925  | 19.0903979    |
| 16          | 84.32108188  | 0.678918123   | 0.798727204   |
| 17          | 98.0552279   | -0.055227902  | 0.056355002   |
| 18          | 144.2104618  | -16.21046182  | 12.66442329   |
| 19          | 122.8677535  | -37.86775351  | 44.55029825   |
| 20          | 100.0221225  | 59.97787748   | 37.48617343   |
| 21          | 53.27196558  | 6.728034423   | 11.21339071   |
| 22          | 35.06605378  | 5.933946225   | 14.47303957   |
| 23          | 114.4792566  | -24.47925665  | 27.19917406   |
| 24          | 113.1343153  | -30.13431529  | 36.30640396   |
| 25          | 40.43190991  | 4.568090093   | 10.15131132   |
| 26          | 39.34427892  | -0.344278918  | 0.882766457   |
| 27          | 144.4794501  | -57.57945009  | 66.25943623   |
| 28          | 56.4827667   | -16.4827667   | 41.20691675   |
| 29          | 95.38240332  | -15.38240332  | 19.22800415   |
| 30          | 228.6988826  | -1.698882564  | 0.748406416   |
| 31          | 222.8067278  | 12.19327221   | 5.188626473   |
| 32          | 38.81483144  | 1.185168555   | 2.962921389   |
| 33          | 48.36325811  | 18.63674189   | 27.81603267   |
| 34          | 126.6080021  | -3.608002113  | 2.933335051   |
| 35          | 84.85052935  | 15.14947065   | 15.14947065   |
| 36          | 116.7991162  | -11.79911625  | 11.23725357   |
| 37          | 84.17648426  | -13.87648426  | 19.73895342   |
| 38          | 113.9412801  | -31.94128011  | 38.95278062   |
| 39          | 215.494184   | 64.50581599   | 23.03779142   |
| 40          | 141.7795953  | 58.22040472   | 29.11020236   |
| Average     | 101.2375     |               | 22.51770962   |

Using the column of relative errors, we find the average value δ̄ = 22.51% (the AVERAGE function).

The comparison shows that 22.51% > 7%. Consequently, the accuracy of the model is unsatisfactory.
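The same check can be reproduced outside Excel. A minimal Python sketch, using only the first five observations from the residual table above (the actual Yᵢ is recovered as predicted + residual):

```python
# Relative approximation error: delta_i = |e_i| / Y_i * 100%.
predicted = [45.9509, 86.1030, 94.8444, 84.1765, 40.2537]
residuals = [-7.9509, -23.9030, 30.1556, -23.0765, 26.7463]

y_actual = [p + e for p, e in zip(predicted, residuals)]
rel_errors = [abs(e) / y * 100 for e, y in zip(residuals, y_actual)]

print([round(d, 2) for d in rel_errors])  # [20.92, 38.43, 24.12, 37.77, 39.92]
print(round(sum(rel_errors) / len(rel_errors), 2))  # 32.23 here; 22.52 over all 40 rows
```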

Using Fisher's F test, let us check the significance of the model as a whole. From the "Regression" output (the ANOVA table for model (6)) we write down F = 39.6702.

Using the FINV function we find the critical value F_cr ≈ 2.87 for significance level α = 5% and degrees of freedom k₁ = m = 3, k₂ = n − m − 1 = 40 − 3 − 1 = 36.

Since F > F_cr, the equation of model (6) is significant and its use is advisable: the dependent variable Y is described quite well by the factor variables X₁, X₂ and X₄ included in model (6).
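The critical value that Excel's FINV returns can be reproduced with SciPy; a sketch assuming the degrees of freedom k₁ = 3, k₂ = 36 derived above:

```python
from scipy.stats import f

F_observed = 39.6702        # F from the ANOVA table for model (6)
k1, k2 = 3, 36              # m = 3 factors, n - m - 1 = 40 - 3 - 1

F_cr = f.ppf(0.95, k1, k2)  # Excel equivalent: =FINV(0.05, 3, 36), ≈ 2.87
print(round(F_cr, 3), F_observed > F_cr)  # True: the model is significant as a whole
```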

Additionally, using Student's t test, let us check the significance of the individual coefficients of the model.

The t-statistics for the coefficients of the regression equation are given in the "Regression" output. The following values were obtained for the selected model (6):


|             | Coefficients  | Standard error | t-statistic | P-value  | Lower 95% | Upper 95% |
|-------------|---------------|----------------|-------------|----------|-----------|-----------|
| Y-intercept | -5.643572321  | 12.07285417    | -0.46745966 | 0.642988 | -30.1285  | 18.84131  |
| X4          | 2.591405557   | 0.461440597    | 5.61590284  | 2.27E-06 | 1.655561  | 3.52725   |
| X1          | 6.85963077    | 9.185748512    | 0.74676884  | 0.460053 | -11.7699  | 25.48919  |
| X2          | -1.985156991  | 7.795346067    | -0.25465925 | 0.800435 | -17.7949  | 13.82454  |

The critical value t_cr is found for significance level α = 5% and degrees of freedom k = n − m − 1 = 40 − 3 − 1 = 36: t_cr ≈ 2.028 (the TINV function).

For the intercept a = −5.643 the statistic is |t| = 0.467 < t_cr; therefore, the intercept is not significant and can be excluded from the model.

For the regression coefficient β₁ = 6.859 the statistic is |t| = 0.747 < t_cr; therefore, β₁ is not significant, and it and the regional city factor can be removed from the model.

For the regression coefficient β₂ = −1.985 the statistic is |t| = 0.255 < t_cr; therefore, β₂ is not significant, and it and the number-of-rooms factor can be excluded from the model.

For the regression coefficient β₄ = 2.591 the statistic is t = 5.616 > t_cr; therefore, β₄ is significant, and it and the living-area factor should be retained in the model.

Conclusions about the significance of the model coefficients are made at the significance level α = 5%. Looking at the P-value column, we note that the intercept a can be considered significant only at the level 0.64 = 64%; the regression coefficient β₁ only at 0.46 = 46%; the regression coefficient β₂ only at 0.80 = 80%; while the regression coefficient β₄ is significant already at the level 2.27E-06 ≈ 0.0002%.
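A sketch reproducing t_cr and the two-sided P-values of the table with SciPy (the t statistics are taken from the Excel output above):

```python
from scipy.stats import t

df = 36                           # n - m - 1 = 40 - 3 - 1
t_cr = t.ppf(1 - 0.05 / 2, df)    # two-sided 5% critical value, ≈ 2.028

for name, t_stat in [("intercept", -0.4675), ("X4", 5.6159),
                     ("X1", 0.7468), ("X2", -0.2547)]:
    p_value = 2 * t.sf(abs(t_stat), df)          # two-sided P-value
    print(name, abs(t_stat) > t_cr, f"{p_value:.3g}")
```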

When new factor variables are added to the equation, the coefficient of determination R² automatically increases and the average approximation error decreases, although this does not always improve the quality of the model. Therefore, to compare the quality of model (3) and the selected multiple model (6), we use the adjusted (normalized) coefficients of determination, R̄² = 1 − (1 − R²)·(n − 1)/(n − k − 1).

Thus, when the factor "regional city" X₁ and the factor "number of rooms in the apartment" X₂ are added to the regression equation, the quality of the model deteriorates, which argues for removing the factors X₁ and X₂ from the model.

Let's carry out further calculations.

Average elasticity coefficients in the case of a linear model are determined by the formulas

Ēⱼ = bⱼ · x̄ⱼ / ȳ,

and the β-coefficients by βⱼ = bⱼ · S_xj / S_Y, where S_xj and S_Y are the standard deviations of the factor Xⱼ and of Y.

Using the AVERAGE and STDEV functions, we find β₄ ≈ 0.914: when the factor X₄ alone increases by one of its standard deviations, the price Y increases by 0.914 of its standard deviation S_Y.

Delta coefficients are determined by the formulas

Δⱼ = r_yxj · βⱼ / R²,

where r_yxj is the pair correlation coefficient between Y and the factor Xⱼ.

Let's find the pair correlation coefficients using the "Correlation" tool of the "Data Analysis" package in Excel.

|    | Y        | X1      | X2       | X4 |
|----|----------|---------|----------|----|
| Y  | 1        |         |          |    |
| X1 | -0.01126 | 1       |          |    |
| X2 | 0.751061 | -0.0341 | 1        |    |
| X4 | 0.874012 | -0.0798 | 0.868524 | 1  |

The coefficient of determination was determined earlier and is equal to 0.7677.

Let's calculate the delta coefficients:

Δ₁ = r_yx1·β₁/R²; Δ₂ = r_yx2·β₂/R²; Δ₄ = r_yx4·β₄/R² = 0.874·0.914/0.7677 ≈ 1.04.

Since Δ₁ < 0 and Δ₂ < 0, the factors X₁ and X₂ were selected incorrectly and need to be removed from the model. Thus, according to the equation of the resulting linear three-factor model, the change in the resulting variable Y (apartment price) is explained 104% by the influence of the factor X₄ (living area of the apartment), −4% by the factor X₂ (number of rooms) and −0.0859% by the factor X₁ (regional city).
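A small sketch of the Δ₄ computation; the inputs are the pair correlation r_yx4 from the matrix, the β₄-coefficient and the R² quoted above:

```python
# Delta coefficient: the share of the explained variation attributable to a factor.
def delta(r_yx: float, beta: float, r_squared: float) -> float:
    return r_yx * beta / r_squared

r_squared = 0.7677                                  # model (6)
print(round(delta(0.874012, 0.914, r_squared), 3))  # X4: ≈ 1.04, i.e. about 104%
```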

Regression analysis is a statistical research method that shows how a given parameter depends on one or more independent variables. In the pre-computer era its use was quite difficult, especially with large volumes of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in a couple of minutes. Below are specific examples from the field of economics.

Types of Regression

The concept itself was introduced into mathematics in 1886. Regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • logarithmic.

Example 1

Let's consider the problem of determining how the number of employees who quit depends on the average salary at six industrial enterprises.

Task. At six enterprises, the average monthly salary and the number of employees who quit of their own accord were analyzed. In tabular form we have:

| Number of people who quit | Salary        |
|---------------------------|---------------|
|                           | 30,000 rubles |
|                           | 35,000 rubles |
|                           | 40,000 rubles |
|                           | 45,000 rubles |
|                           | 50,000 rubles |
|                           | 55,000 rubles |
|                           | 60,000 rubles |

For the task of determining the dependence of the number of quitting workers on the average salary at six enterprises, the regression model has the form of the equation Y = a₀ + a₁x₁ + … + aₖxₖ, where xᵢ are the influencing variables, aᵢ are the regression coefficients, and k is the number of factors.

For this problem, Y is the indicator of quitting employees, and the influencing factor is salary, which we denote by X.

Using the capabilities of the Excel spreadsheet processor

Regression analysis in Excel should be preceded by applying the built-in functions to the existing tabular data. However, for these purposes it is better to use the very useful "Analysis ToolPak" add-in. To activate it you need to:

  • from the "File" tab, go to the "Options" section;
  • in the window that opens, select the "Add-ins" line;
  • click the "Go" button located at the bottom, to the right of the "Manage" line;
  • check the box next to "Analysis ToolPak" and confirm your actions by clicking "OK".

If everything is done correctly, the required button will appear on the right side of the “Data” tab, located above the Excel worksheet.

Regression analysis in Excel

Now that we have all the necessary virtual tools at hand to carry out econometric calculations, we can begin to solve our problem. For this:

  • Click on the “Data Analysis” button;
  • in the window that opens, click on the “Regression” button;
  • in the tab that appears, enter the range of values for Y (the number of quitting employees) and for X (their salaries);
  • confirm the actions by clicking the "OK" button.

As a result, the program will automatically fill a new sheet with the regression analysis data. Note! Excel allows you to manually set a preferred location for this output: for example, the same sheet where the Y and X values are located, or even a new workbook specifically designed to store such data.

Analysis of regression results for R-squared

The output Excel produces when processing the data of the example under consideration has the form:

First of all, pay attention to the R-squared value: the coefficient of determination. In this example R-squared = 0.755 (75.5%), i.e., the calculated parameters of the model explain 75.5% of the relationship between the parameters under consideration. The higher the coefficient of determination, the better the selected model fits the specific task. A model is considered to describe the real situation correctly when R-squared is above 0.8; if R-squared < 0.5, the regression analysis cannot be considered reasonable.

Coefficient Analysis

The number 64.1428 shows what the value of Y will be if all the variables xᵢ in the model under consideration are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors not described in the specific model.

The next coefficient, −0.16285, located in cell B18, shows the weight of the influence of variable X on Y: within the model under consideration, the average monthly salary affects the number of quitters with a weight of −0.16285, i.e., the degree of its influence is quite small. The "−" sign indicates a negative effect, which is intuitive: the higher the salary at the enterprise, the fewer people express a desire to terminate their employment contract.
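For illustration, a sketch that evaluates the fitted equation y = 64.1428 − 0.16285·x at a few salary levels; it assumes salary is entered in thousands of rubles, since the source table's quit counts are not reproduced here:

```python
# Fitted single-factor model from the Excel output.
def predicted_quitters(salary_thousands: float) -> float:
    return 64.1428 - 0.16285 * salary_thousands

for salary in (30, 45, 60):  # thousands of rubles, as in the task's table
    print(salary, round(predicted_quitters(salary), 1))
```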

Multiple regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x₁, x₂, …, xₘ) + ε, where y is the resultant characteristic (dependent variable), and x₁, x₂, …, xₘ are factor characteristics (independent variables).

Parameter Estimation

For multiple regression (MR), parameter estimation is carried out using the method of ordinary least squares (OLS). For linear equations of the form Y = a + b₁x₁ + … + bₘxₘ + ε, we construct a system of normal equations (see below).

To understand the principle of the method, consider the two-factor case, y = a + b₁x₁ + b₂x₂ + ε. The least squares method then gives the system of normal equations

Σy = n·a + b₁Σx₁ + b₂Σx₂;
Σx₁y = aΣx₁ + b₁Σx₁² + b₂Σx₁x₂;
Σx₂y = aΣx₂ + b₁Σx₁x₂ + b₂Σx₂².

Solving it, we obtain the coefficients, which can also be expressed through the standardized coefficients as bᵢ = βᵢ·(σ_y/σ_xi), where σ is the standard deviation of the corresponding attribute indicated in the index.

OLS is also applicable to the MR equation on a standardized scale. In this case we get the equation

t_y = β₁·t_x1 + β₂·t_x2 + … + βₘ·t_xm + ε,

in which t_y, t_x1, …, t_xm are standardized variables with mean 0 and standard deviation 1, and βᵢ are the standardized regression coefficients.

Note that all βᵢ in this case are standardized and centered, so comparing them with one another is correct and acceptable. In addition, it is customary to screen out factors by discarding those with the smallest |βᵢ| values.

Problem Using Linear Regression Equation

Suppose we have a table of price dynamics for a specific product N over the past 8 months. It is necessary to make a decision on the advisability of purchasing a batch of it at a price of 1850 rubles/t.

| Month number | Product N price    |
|--------------|--------------------|
| 1            | 1750 rubles per ton |
| 2            | 1755 rubles per ton |
| 3            | 1767 rubles per ton |
| 4            | 1760 rubles per ton |
| 5            | 1770 rubles per ton |
| 6            | 1790 rubles per ton |
| 7            | 1810 rubles per ton |
| 8            | 1840 rubles per ton |

To solve this problem in Excel, use the "Data Analysis" tool already known from the example above. Select the "Regression" section and set the parameters. Remember that the "Input interval Y" field must contain the range of values of the dependent variable (in this case, the product prices in specific months), and the "Input interval X" field the range of the independent variable (the month number). Confirm by clicking "OK". On a new sheet (if so specified) we obtain the regression output.

Using it, we construct a linear equation of the form y = ax + b, where a is the coefficient from the row named after the month number and b is the coefficient from the "Y-intercept" row of the regression results sheet. Thus, the linear regression (LR) equation for task 3 is written as:

Product price N = 11.714 · month number + 1727.54,

or in algebraic notation

y = 11.714 x + 1727.54
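The coefficients are easy to verify from the eight price points, and the same fit answers the purchase question. A sketch with NumPy (month 9 stands for the coming month; 1850 rubles/t is the offered price from the task statement):

```python
import numpy as np

months = np.arange(1, 9)
prices = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840])

slope, intercept = np.polyfit(months, prices, 1)
print(round(slope, 3), round(intercept, 2))  # 11.714, 1727.54

forecast = slope * 9 + intercept             # trend price for the next month
print(round(forecast, 2), forecast < 1850)   # ≈ 1832.96: the 1850 offer is above trend
```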

Analysis of results

To decide whether the resulting linear regression equation is adequate, the multiple correlation coefficient (MCC) and the coefficient of determination are used, as well as Fisher's test and Student's t test. In the Excel output they are called Multiple R, R-squared, F-statistic and t-statistic, respectively.

The multiple correlation coefficient R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value here indicates a fairly strong relationship between the variables "month number" and "price of product N in rubles per ton". However, the nature of this relationship remains unknown.

The coefficient of determination R² is a numerical characteristic of the proportion of the total scatter of the experimental data, i.e., of the values of the dependent variable, that is explained by the linear regression equation. In the problem under consideration this value equals 84.8%, i.e., the statistical data are described by the obtained LR equation with a high degree of accuracy.

F-statistics, also called Fisher's test, are used to evaluate the significance of a linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's test) helps to evaluate the significance of the coefficient of the unknown and of the free term of the linear relationship. If the value of the t-statistic > t_cr, the hypothesis that the free term of the linear equation is insignificant is rejected.

In the problem under consideration, Excel gives t = 169.20903 and p = 2.89E-12 for the free term, i.e., the probability of mistakenly rejecting the hypothesis of its insignificance is essentially zero. For the coefficient of the unknown, t = 5.79405 and p = 0.001158; in other words, the probability of mistakenly rejecting the hypothesis that this coefficient is insignificant is about 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the feasibility of purchasing a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Let's consider a specific application problem.

The management of the NNN company must decide on the advisability of purchasing a 20% stake in MMM JSC. The cost of the package (SP) is 70 million US dollars. NNN specialists have collected data on similar transactions. It was decided to evaluate the value of the shareholding according to such parameters, expressed in millions of US dollars, as:

  • accounts payable (VK);
  • annual turnover volume (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter "enterprise salary arrears" (VZP), expressed in thousands of US dollars, is used.

Solution using Excel spreadsheet processor

First of all, you need to create a table of the source data. Then:

  • call the “Data Analysis” window;
  • select the “Regression” section;
  • in the “Input interval Y” box, enter the range of values ​​of the dependent variables from column G;
  • click on the icon with the red arrow to the right of the "Input interval X" window and select the range of all values from columns B, C, D, F on the sheet.

Mark the “New worksheet” item and click “Ok”.

Obtain a regression analysis for a given problem.

Study of results and conclusions

We “collect” the regression equation from the rounded data presented above on the Excel spreadsheet:

SP = 0.103*SOF + 0.541*VO - 0.031*VK +0.405*VD +0.691*VZP - 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 +0.405*x4 +0.691*x5 - 265.844

Data for MMM JSC are presented in the table:

Substituting them into the regression equation, we get a figure of 64.72 million US dollars. This means that the shares of MMM JSC are not worth purchasing, since their asking price of 70 million US dollars is noticeably inflated.
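The same substitution, written as a function. A sketch: the actual MMM JSC figures come from the source's data table (not reproduced here), so no concrete call is shown:

```python
# Regression equation from the Excel output; sof, vo, vk, vd are in millions
# of USD, vzp (wage arrears) in thousands of USD.
def estimated_package_value(sof: float, vo: float, vk: float,
                            vd: float, vzp: float) -> float:
    return (0.103 * sof + 0.541 * vo - 0.031 * vk
            + 0.405 * vd + 0.691 * vzp - 265.844)

# Substituting the MMM JSC figures yields about 64.72 million USD,
# below the 70 million asking price.
```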

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The Excel examples discussed above will help you solve practical problems in the field of econometrics.

When studying complex phenomena, it is necessary to take into account more than two random factors. A correct understanding of the nature of the relationships between these factors can be obtained only if all of the random factors under consideration are examined at once. A joint study of three or more random factors allows the researcher to establish more or less reasonable assumptions about the causal dependencies between the phenomena being studied. The simplest form of multiple relationship is a linear relationship between three characteristics. The random factors are denoted X₁, X₂ and X₃. The pair correlation coefficient between X₁ and X₂ is denoted r₁₂, between X₁ and X₃ as r₁₃, and between X₂ and X₃ as r₂₃. As measures of the closeness of the linear relationship between three characteristics, multiple correlation coefficients, denoted R₁·₂₃, R₂·₁₃, R₃·₁₂, and partial correlation coefficients, denoted r₁₂·₃, r₁₃·₂, r₂₃·₁, are used.

The multiple correlation coefficient R₁·₂₃ of three factors is an indicator of the closeness of the linear relationship between one of the factors (the index before the dot) and the combination of the other two factors (the indices after the dot).

The values of the coefficient R always lie in the range from 0 to 1. As R approaches one, the degree of linear relationship between the three characteristics increases.

Between the multiple correlation coefficient, e.g. R₂·₁₃, and the two pair correlation coefficients r₁₂ and r₂₃ there is a relationship: neither of the pair coefficients can exceed R₂·₁₃ in absolute value.

The formulas for calculating multiple correlation coefficients from the known values of the pair correlation coefficients r₁₂, r₁₃ and r₂₃ have the form

R₁·₂₃ = √((r₁₂² + r₁₃² − 2·r₁₂·r₁₃·r₂₃) / (1 − r₂₃²)),

and similarly for R₂·₁₃ and R₃·₁₂, with the indices permuted.

The squared multiple correlation coefficient R² is called the coefficient of multiple determination. It shows the proportion of the variation of the dependent variable that is explained by the influence of the factors being studied.

The significance of a multiple correlation is assessed by the F test:

F = (R² / (1 − R²)) · ((n − k) / (k − 1)),

where n is the sample size and k is the number of factors; in our case k = 3.

The null hypothesis that the multiple correlation coefficient in the population equals zero (H₀: R = 0) is accepted if F_f < F_t and rejected if F_f ≥ F_t.

The theoretical value of the F criterion is determined for ν₁ = k − 1 and ν₂ = n − k degrees of freedom and the accepted significance level α (Appendix 1).
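A sketch of this check for the example below (R = 0.74, n = 15, k = 3):

```python
R, n, k = 0.74, 15, 3
F_actual = (R**2 / (1 - R**2)) * (n - k) / (k - 1)

print(round(F_actual, 2))  # ≈ 7.26; compare with F_0.05 = 3.89 and F_0.01 = 6.93
```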

Example of calculating the multiple correlation coefficient. When studying the relationship between factors, the following pair correlation coefficients were obtained (n = 15): r₁₂ = 0.6; r₁₃ = 0.3; r₂₃ = −0.2.

It is necessary to find out the dependence of the characteristic X₂ on the characteristics X₁ and X₃, i.e., to calculate the multiple correlation coefficient R₂·₁₃.

The table value of the F criterion for ν₁ = 2 and ν₂ = 15 − 3 = 12 degrees of freedom is F₀.₀₅ = 3.89 at α = 0.05 and F₀.₀₁ = 6.93 at α = 0.01.

Thus, the relationship between the characteristics, R₂·₁₃ = 0.74, is significant at the 1% significance level, since F_f > F₀.₀₁.

Judging by the coefficient of multiple determination R² = (0.74)² = 0.55, the variation of the characteristic X₂ is 55% associated with the effect of the factors being studied, while 45% of the variation (1 − R²) cannot be explained by the influence of these variables.

Partial linear correlation

The partial correlation coefficient is an indicator that measures the degree of association between two characteristics when the influence of the remaining characteristics is excluded.

Mathematical statistics makes it possible to establish the correlation between two characteristics for a constant value of the third without conducting a special experiment, by using the pair correlation coefficients r₁₂, r₁₃, r₂₃.

Partial correlation coefficients are calculated using the formulas

r₁₂·₃ = (r₁₂ − r₁₃·r₂₃) / √((1 − r₁₃²)·(1 − r₂₃²)),

and similarly for r₁₃·₂ and r₂₃·₁.

The numbers before the dot indicate between which characteristics the relationship is studied, and the number after the dot indicates which characteristic's influence is excluded (eliminated). The error and the significance test for a partial correlation are determined by the same formulas as for a pair correlation:

t = r·√(n − 2) / √(1 − r²).

The theoretical value of the t criterion is determined for ν = n − 2 degrees of freedom and the accepted significance level α (Appendix 1).

The null hypothesis that the partial correlation coefficient in the population equals zero (H₀: r = 0) is accepted if t_f < t_t and rejected if t_f ≥ t_t.

Partial coefficients can take values between −1 and +1. Partial coefficients of determination are found by squaring the partial correlation coefficients:

d₁₂·₃ = r²₁₂·₃; d₁₃·₂ = r²₁₃·₂; d₂₃·₁ = r²₂₃·₁.

Determining the degree of partial influence of individual factors on the resulting characteristic, while excluding (eliminating) their connection with other characteristics that distort this correlation, is often of great interest. Sometimes it happens that, for a constant value of the eliminated characteristic, no statistical influence of it on the variability of the other characteristics can be noticed. To understand the technique of calculating the partial correlation coefficient, consider an example. There are three variables: X, Y and Z. For a sample of size n = 180, the pair correlation coefficients were determined:

r_xy = 0.799; r_xz = 0.57; r_yz = 0.507.

Let us determine the partial correlation coefficients:

r_xy·z = 0.720; r_xz·y = 0.318; r_yz·x = 0.105.

The partial correlation coefficient between the parameters X and Y for a constant value of Z (r_xy·z = 0.720) shows that only a small part of the relationship between these characteristics in the overall correlation (r_xy = 0.799) is due to the influence of the third characteristic Z. A similar conclusion must be drawn regarding the partial correlation between X and Z for a constant value of Y (r_xz·y = 0.318 versus r_xz = 0.57). On the contrary, the partial correlation between Y and Z for a constant value of X, r_yz·x = 0.105, differs significantly from the overall correlation r_yz = 0.507. From this it is clear that if objects with the same value of the parameter X are selected, the relationship between Y and Z among them will be very weak, since a significant part of this relationship is due to variation in X.
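A sketch of the first-order partial correlation formula that reproduces the three coefficients of this example:

```python
from math import sqrt

def partial_r(r_ab: float, r_ac: float, r_bc: float) -> float:
    """First-order partial correlation r_ab.c, the influence of c excluded."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac**2) * (1 - r_bc**2))

r_xy, r_xz, r_yz = 0.799, 0.57, 0.507
print(f"{partial_r(r_xy, r_xz, r_yz):.3f}")  # r_xy.z = 0.720
print(f"{partial_r(r_xz, r_xy, r_yz):.3f}")  # r_xz.y = 0.318
print(f"{partial_r(r_yz, r_xy, r_xz):.3f}")  # r_yz.x = 0.104
```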

Under some circumstances, the partial correlation coefficient may be opposite in sign to the pair one.

For example, when studying the relationship between the characteristics X, Y and Z, the following pair correlation coefficients were obtained (n = 100): r_xy = 0.6; r_xz = 0.9; r_yz = 0.4.

The partial correlation coefficients excluding the influence of the third characteristic are then

r_yz·x = (0.4 − 0.6·0.9) / √((1 − 0.6²)·(1 − 0.9²)) ≈ −0.40.

From this example it is clear that the pair coefficient (r_yz = 0.4) and the partial correlation coefficient (r_yz·x ≈ −0.40) differ in sign.

The partial correlation method also makes it possible to calculate second-order partial correlation coefficients. Such a coefficient indicates the relationship between the first and second characteristics for constant values of the third and fourth. The second-order partial coefficient is determined from the first-order partial coefficients by the formula

r₁₂·₃₄ = (r₁₂·₄ − r₁₃·₄·r₂₃·₄) / √((1 − r²₁₃·₄)·(1 − r²₂₃·₄)),

where r₁₂·₄, r₁₃·₄, r₂₃·₄ are first-order partial coefficients, whose values are determined by the partial coefficient formula from the pair correlation coefficients r₁₂, r₁₃, r₁₄, r₂₃, r₂₄, r₃₄.

7.1. Linear regression analysis consists of fitting a curve to a set of observations using the least squares method. Regression analysis makes it possible to establish a functional dependence between a random variable Y and quantities X that influence Y. This dependence is called the regression equation. There are simple (y = m·x + b) and multiple (y = m₁x₁ + m₂x₂ + … + mₖxₖ + b) regressions of linear and nonlinear type.
To assess the degree of relationship between the quantities, the Pearson multiple correlation coefficient R (the correlation ratio) is used, which can take values from 0 to 1: R = 0 if there is no relationship between the quantities, and R = 1 if there is a functional relationship between them. In most cases R takes intermediate values between 0 and 1. The value R² is called the coefficient of determination.
The task of constructing a regression dependence is to find the vector of coefficients M of the multiple linear regression model for which the coefficient R takes its maximum value.
To assess the significance of R, Fisher's F test is applied, calculated by the formula

F = (R² / (1 − R²)) · ((n − k − 1) / k),

where n is the number of experiments and k is the number of model coefficients. If F exceeds the critical value for the given n and k at the accepted confidence probability, R is considered significant.

7.2. The Regression tool of the Analysis ToolPak allows you to calculate the following data:

· the coefficients of the linear regression function, by the least squares method; the form of the regression function is determined by the structure of the source data;

· the coefficient of determination and related quantities (the Regression Statistics table);

· the variance table and the test statistic for checking the significance of the regression (the ANOVA table);

· the standard deviation of each regression coefficient and its other statistical characteristics, which allow you to check the significance of the coefficient and build confidence intervals for it;

· the values of the regression function and the residuals, i.e. the differences between the initial values of the variable Y and the calculated values of the regression function (the RESIDUAL OUTPUT table);

· the probabilities corresponding to the values of the variable Y, ordered in ascending order (the PROBABILITY OUTPUT table).

7.3. Call the tool via Data > Data Analysis > Regression.

7.4. In the Input interval Y field, enter the address of the range containing the values of the dependent variable Y. The range must consist of one column.
In the Input interval X field, enter the address of the range containing the values of the variable X. The range may consist of one or more columns, but of no more than 16 columns. If the ranges specified in Input interval Y and Input interval X include column headers, check the Labels option box; these headers will then be used in the output tables generated by the Regression tool.
The Constant is Zero option box should be checked if the constant b of the regression equation is forced to equal zero.
The Confidence Level option is set when confidence intervals for the regression coefficients need to be built with a confidence level other than the default value of 0.95. After checking the Confidence Level box, an input field becomes available in which the new confidence level value is entered.
The Residuals area contains four options: Residuals, Standardized Residuals, Residual Plots and Line Fit Plots. If at least one of them is checked, the RESIDUAL OUTPUT table will appear in the output, showing the values of the regression function and the residuals, i.e. the differences between the initial values of the variable Y and the calculated values of the regression function. The Normal Probability area contains one option, Normal Probability Plots; checking it generates the PROBABILITY OUTPUT table in the output and leads to the construction of the corresponding chart.


7.5. Set the parameters according to the picture. Make sure that the Y value is the first variable (including the cell with its name), and the X value the other two variables (including the cells with their names). Click OK.

7.6. The Regression Statistics table provides the following data.

Multiple R: the square root of the coefficient of determination R² given in the next row. Other names for this indicator are the correlation index and the multiple correlation coefficient.

R-square: the coefficient of determination R², calculated as the ratio of the regression sum of squares (cell C12) to the total sum of squares (cell C14).

Normalized (adjusted) R-square: calculated by the formula

R̄² = 1 − (1 − R²)·(n − 1)/(n − k − 1),

where n is the number of values of the variable Y and k is the number of columns in the input interval of the variable X.

Standard error: the square root of the residual variance (cell D13).

Observations: the number of values of the variable Y.

7.7. In the ANOVA table, column SS gives the sums of squares, column df the numbers of degrees of freedom, and column MS the variances. In the Regression row, column F gives the value of the test statistic for checking the significance of the regression, calculated as the ratio of the regression variance to the residual variance (cells D12 and D13). Column Significance F gives the probability of the obtained value of the test statistic. If this probability is less than, for example, 0.05 (the chosen significance level), the hypothesis that the regression is insignificant (i.e., that all coefficients of the regression function equal zero) is rejected and the regression is considered significant. In this example the regression is not significant.

7.8. In the next table, the Coefficients column holds the calculated values of the coefficients of the regression function, with the value of the free term b in the Y-intercept row. The Standard Error column holds the standard deviations of the coefficients.
The t-statistic column holds the ratios of the coefficient values to their standard deviations; these are the test statistics for checking hypotheses about the significance of the regression coefficients.
The P-value column holds the significance levels corresponding to the test statistic values. If a calculated significance level is less than the chosen significance level (for example, 0.05), the hypothesis that the coefficient differs significantly from zero is accepted; otherwise, the hypothesis that the coefficient differs insignificantly from zero is accepted. In this example, only the coefficient b differs significantly from zero; the rest do not.
The Lower 95% and Upper 95% columns give the boundaries of the confidence intervals with confidence level 0.95. These boundaries are calculated by the formulas

Lower 95% = Coefficient − Standard Error · t_α;
Upper 95% = Coefficient + Standard Error · t_α.

Here t_α is the quantile of the Student t distribution with (n − k − 1) degrees of freedom corresponding to the confidence level α; in this case α = 0.95. The boundaries of the confidence intervals in the Lower 90.0% and Upper 90.0% columns are calculated in the same way.
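A sketch of the interval computation with SciPy, checked against the X4 coefficient of the apartment model above (df = 36):

```python
from scipy.stats import t

def confidence_interval(coef: float, std_err: float, df: int, level: float = 0.95):
    t_a = t.ppf(1 - (1 - level) / 2, df)   # two-sided Student quantile
    return coef - std_err * t_a, coef + std_err * t_a

low, high = confidence_interval(2.5914, 0.4614, 36)
print(round(low, 3), round(high, 3))       # ≈ 1.656, 3.527, matching the table
```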

7.9. Consider the RESIDUAL OUTPUT table. It appears in the output only when at least one option in the Residuals area of the Regression dialog box is checked.

The Observation column gives the serial numbers of the values of the variable Y.
The Predicted Y column gives the values yᵢ = f(xᵢ) of the regression function for the values of the variable X corresponding to serial number i in the Observation column.
The Residuals column contains the differences (residuals) εᵢ = Y − yᵢ, and the Standard Residuals column the normalized residuals, calculated as the ratios εᵢ/s_ε, where s_ε is the standard deviation of the residuals. The square of s_ε is calculated by the formula

s_ε² = Σ(εᵢ − ε̄)² / (n − 1),

where ε̄ is the average of the residuals. This value can also be calculated as the ratio of two values from the ANOVA table: the sum of squared residuals (cell C13) and the degrees of freedom in the Total row (cell B14).

7.10. From the values of the RESIDUAL OUTPUT table, two kinds of charts are built: residual plots and line fit plots (if the corresponding options are checked in the Residuals area of the Regression dialog box). They are built for each component of the variable X separately.

The residual plots display the residuals, i.e. the differences between the original Y values and those calculated from the regression function, for each value of the variable component X.

The line fit plots display both the original Y values and the calculated values of the regression function for each value of the variable component X.

7.11. The last table of the output is the PROBABILITY OUTPUT table. It appears if the Normal Probability Plots option is checked in the Regression dialog box.
The values of the Percentile column are calculated as follows: the step h = (1/n)·100% is computed; the first value is h/2, the last equals 100 − h/2, and starting from the second value each subsequent value equals the previous one plus the step h.
The Y column gives the values of the variable Y sorted in ascending order. From the data of this table the so-called normal probability plot is built; it allows you to visually assess the degree of linearity of the relationship between the variables X and Y.
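A sketch of how the Percentile column is built:

```python
def percentile_column(n: int) -> list[float]:
    h = 100.0 / n                 # step; first value h/2, last 100 - h/2
    return [h / 2 + i * h for i in range(n)]

print(percentile_column(5))       # [10.0, 30.0, 50.0, 70.0, 90.0]
```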


8. Analysis of variance

8.1. The Analysis ToolPak provides three types of analysis of variance. The choice of a specific tool is determined by the number of factors and the number of samples in the data set being studied.
One-way ANOVA is used to test the hypothesis that the means of two or more samples belonging to the same population are similar.
Two-way ANOVA with replication is a more complex variant of one-way analysis, including more than one sample for each group of data.
Two-way ANOVA without replication is a two-way analysis of variance that includes no more than one sample per group. It is used to test the hypothesis that the means of two or more samples are the same (the samples belong to the same population).

8.2. One-way ANOVA

8.2.1. Let's prepare the data for analysis. Create a new sheet and copy columns A, B, C, D to it. Remove the first two rows. The prepared data can be used for one-way analysis of variance.

8.2.2. Call the tool via Data > Data Analysis > One-way ANOVA. Fill it in according to the picture. Click OK.

8.2.3. Consider the SUMMARY table: Count is the number of repetitions, Sum is the sum of the indicator values by row, and Variance is the variance of the indicator.

8.2.4. The ANOVA table: the first column, Source of Variation, contains the names of the variance sources; SS is the sum of squared deviations, df the degrees of freedom, MS the mean square, and F the actual value of the F ratio. P-value is the probability that the variance reproduced by the equation equals the residual variance; it establishes the probability that the obtained quantitative measure of the relationship between the factors and the result can be considered random. F-critical is the theoretical F value with which the actual F is subsequently compared.

8.2.5. The null hypothesis of equality of the mathematical expectations of all samples is accepted if F < F-critical; otherwise this hypothesis is rejected, and in that case the average values of the samples differ significantly.

The construction of a linear regression and the evaluation of its parameters and their significance can be performed much faster using the Excel Analysis ToolPak (Regression). Let us consider the interpretation of the obtained results in the general case (k explanatory variables), based on example 3.6.

The Regression Statistics table gives the following values:

Multiple R: the multiple correlation coefficient;

R-square: the coefficient of determination R²;

Normalized R-square: the adjusted R², corrected for the number of degrees of freedom;

Standard error: the standard error of the regression, S;

Observations: the number of observations, n.

The ANOVA table gives:

1. Column df: the number of degrees of freedom, equal to

for the Regression row: df = k;

for the Residual row: df = n − k − 1;

for the Total row: df = n − 1.

2. Column SS: the sum of squared deviations, equal to

for the Regression row: SS = Σ(ŷᵢ − ȳ)²;

for the Residual row: SS = Σ(yᵢ − ŷᵢ)²;

for the Total row: SS = Σ(yᵢ − ȳ)².

3. Column MS: the variances, determined by the formula MS = SS/df:

for the Regression row, the factor (explained) variance;

for the Residual row, the residual variance.

4. Column F: the calculated value of the F criterion, computed by the formula

F = MS(regression) / MS(residual).

5. Column Significance F: the significance level corresponding to the calculated F statistic,

Significance F = FDIST(F statistic, df(regression), df(residual)).

If Significance F is less than the standard significance level, then R² is statistically significant.
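SciPy's survival function for the F distribution plays the role of Excel's FDIST here; a sketch checked against example 3.6 below (one factor, nine observations, where F = t² = 7.32² for the slope):

```python
from scipy.stats import f

def significance_f(F_stat: float, df_regression: int, df_residual: int) -> float:
    # Excel equivalent: =FDIST(F_stat, df_regression, df_residual)
    return f.sf(F_stat, df_regression, df_residual)

print(round(significance_f(7.32**2, 1, 7), 5))  # ≈ 0.00016, matching Significance F
```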

|             | Coefficients | Standard error | t-statistic | P-value | Lower 95% | Upper 95% |
|-------------|--------------|----------------|-------------|---------|-----------|-----------|
| Y-intercept | 65.92        | 11.74          | 5.61        | 0.00080 | 38.16     | 93.68     |
| x           | 0.107        | 0.014          | 7.32        | 0.00016 | 0.0728    | 0.142     |

This table shows:

1. Coefficients: the values of the coefficients a and b.

2. Standard error: the standard errors of the regression coefficients, S_a and S_b.

3. t-statistic: the calculated values of the t criterion, computed by the formula

t-statistic = Coefficient / Standard error.

4. P-value (significance of t): the significance level corresponding to the calculated t statistic,

P-value = TDIST(t statistic, df(residual)).

If the P-value is less than the standard significance level, the corresponding coefficient is statistically significant.

5. Lower 95% and Upper 95%: the lower and upper limits of the 95% confidence intervals for the coefficients of the theoretical linear regression equation.

RESIDUAL OUTPUT

| Observation | Predicted y | Residual e |
|-------------|-------------|------------|
| 1           | 72.70       | -29.70     |
| 2           | 82.91       | -20.91     |
| 3           | 94.53       | -4.53      |
| 4           | 105.72      | 5.27       |
| 5           | 117.56      | 12.44      |
| 6           | 129.70      | 19.29      |
| 7           | 144.22      | 20.77      |
| 8           | 166.49      | 24.50      |
| 9           | 268.13      | -27.13     |

The RESIDUAL OUTPUT table indicates:

in the Observation column, the observation number;

in the Predicted y column, the calculated values of the dependent variable;

in the Residuals e column, the differences between the observed and calculated values of the dependent variable.

Example 3.6. There are data (in conventional units) on food costs y and per capita income x for nine groups of families:

x
y

Using the results of the Excel Analysis ToolPak (Regression), let us analyze the dependence of food costs on per capita income.

The results of a regression analysis are usually written in the form

ŷ = 65.92 + 0.107·x
      (11.74)  (0.014)

where the standard errors of the regression coefficients are indicated in parentheses.

The regression coefficients are a = 65.92 and b = 0.107. The direction of the relationship between y and x is determined by the sign of the regression coefficient b = 0.107, i.e., the relationship is direct and positive. The coefficient b = 0.107 shows that when per capita income increases by 1 conventional unit, food costs increase by 0.107 conventional units.

Let us evaluate the significance of the coefficients of the resulting model. The significance of the coefficients (a, b) is checked by the t test:

P-value (a) = 0.00080 < 0.01 < 0.05,

P-value (b) = 0.00016 < 0.01 < 0.05;

therefore, the coefficients (a, b) are significant at the 1% level, and all the more so at the 5% significance level. Thus, the regression coefficients are significant and the model is adequate to the original data.

The regression estimation results are consistent not only with the obtained values of the regression coefficients but with a whole set of them (a confidence interval). With 95% probability, the confidence intervals for the coefficients are (38.16; 93.68) for a and (0.0728; 0.142) for b.

The quality of the model is assessed by the coefficient of determination R².

The value R² = 0.884 means that the per capita income factor explains 88.4% of the variation (scatter) in food costs.

The significance of R² is checked by the F test: Significance F = 0.00016 < 0.01 < 0.05; therefore, R² is significant at the 1% level, and all the more so at the 5% significance level.

In the case of pairwise linear regression, the correlation coefficient can be defined as r_xy = √R² = √0.884 ≈ 0.94. The obtained value of the correlation coefficient indicates that the relationship between food costs and per capita income is very close.


