
Conduct regression analysis. Methods of mathematical statistics

Regression and correlation analysis are statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using specific practical examples, we will look at these two analyses, which are very popular among economists, and give an example of obtaining results when combining them.

Regression Analysis in Excel

Regression analysis shows the influence of some values (independent variables) on a dependent variable. For example: how does the economically active population depend on the number of enterprises, the level of wages, and other parameters? Or: how do foreign investment, energy prices, etc. affect the level of GDP?

The result of the analysis allows you to identify priorities and, based on the main factors, to predict and plan the development of priority areas and make management decisions.

Regression can be:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx²);
  • exponential (y = a · exp(bx));
  • power (y = a · x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b · ln(x) + a);
  • exponential with an arbitrary base (y = a · b^x).

Let's look at an example of building a regression model in Excel and interpreting the results. Let's take the linear type of regression.

Task. At 6 enterprises, the average monthly salary and the number of quitting employees were analyzed. It is necessary to determine the dependence of the number of quitting employees on the average salary.

A linear regression model has the following form:

Y = a₀ + a₁x₁ + … + a_k·x_k,

where the aᵢ are regression coefficients, the xᵢ are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of quitting employees. The influencing factor is wages (x).

Excel has built-in functions that can calculate the parameters of a linear regression model, but the Analysis ToolPak add-in (the "Analysis Package") does this faster.

We activate this analytical tool (File → Options → Add-ins → select "Excel Add-ins" in the Manage box → Go → check "Analysis ToolPak").

Once activated, the add-in is available on the Data tab.

Now let's run the regression analysis itself: on the Data tab, choose Data Analysis → Regression, specify the input ranges for Y and X, and click OK.

In the resulting summary output, we first pay attention to R-squared and the coefficients.

R-squared is the coefficient of determination; in our example it is 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model: above 0.8 is good, below 0.5 is bad (such an analysis can hardly be considered reasonable). In our example it is "not bad".

The coefficient 64.1428 shows what Y will be if all variables in the model under consideration are equal to 0. That is, the value of the analyzed parameter is also influenced by other factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, within this model, the average monthly salary affects the number of quitters with a weight of -0.16285 (a small degree of influence). The "-" sign indicates a negative impact: the higher the salary, the fewer people quit. Which is fair.
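For readers who want to check this kind of output outside Excel, here is a minimal sketch of the same calculation in Python. The six (salary, quits) pairs are hypothetical illustration data, not the worksheet figures, so the printed numbers will differ from the text above.

```python
# A minimal sketch of the Regression tool's key outputs, with made-up data.
import numpy as np

salary = np.array([300.0, 350.0, 400.0, 450.0, 500.0, 550.0])  # x
quits  = np.array([60.0, 55.0, 52.0, 50.0, 46.0, 45.0])        # y

b, a = np.polyfit(salary, quits, 1)          # slope and intercept of y = a + b*x
y_hat = a + b * salary
ss_res = np.sum((quits - y_hat) ** 2)        # residual sum of squares
ss_tot = np.sum((quits - quits.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                     # R-squared, as in the summary output
print(f"a = {a:.4f}, b = {b:.5f}, R^2 = {r2:.3f}")
```

np.polyfit returns the least-squares slope and intercept; R² is computed from the residual and total sums of squares, which is exactly what the Regression tool reports.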



Correlation Analysis in Excel

Correlation analysis helps determine whether there is a relationship between indicators in one or two samples. For example, between the operating time of a machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a relationship, does an increase in one parameter lead to an increase (positive correlation) or a decrease (negative correlation) of the other? Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted by r and varies from +1 to -1. The classification of correlation strength differs from field to field. When the coefficient is 0, there is no linear dependence between the samples.

Let's look at how to find the correlation coefficient using Excel.

To find paired coefficients, the CORREL function is used.

Objective: Determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Place the cursor in any cell and press the fx button.

  1. In the “Statistical” category, select the CORREL function.
  2. Argument “Array 1” is the first range of values, the machine operating time: A2:A14.
  3. Argument “Array 2” is the second range of values, the repair cost: B2:B14. Click OK.
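The same paired coefficient can be obtained outside Excel in one call. A minimal sketch, where the machine-hours and repair-cost series are hypothetical stand-ins for the ranges A2:A14 and B2:B14:

```python
# Hypothetical data standing in for the worksheet ranges.
import numpy as np

hours = np.array([3117, 3512, 3642, 3700, 3708, 3713, 3720,
                  3742, 3746, 3791, 3811, 3841, 4375], dtype=float)
cost = np.array([234, 284, 253, 256, 270, 261, 276,
                 273, 274, 275, 277, 279, 338], dtype=float)

r = np.corrcoef(hours, cost)[0, 1]   # the value CORREL(A2:A14, B2:B14) returns
print(f"r = {r:.3f}")
```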

To determine the type of relationship, look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use “Data Analysis” (the Analysis ToolPak add-in). Select Correlation from the list and designate the array. That is all.

The resulting coefficients will be displayed in the correlation matrix. Like this:

Correlation and regression analysis

In practice, these two techniques are often used together.

Example:


Now the regression analysis data has become visible.

The main purpose of regression analysis is to determine the analytical form of the relationship in which the change in the resultant characteristic is due to the influence of one or more factor characteristics, while all other factors that also influence the resultant characteristic are held constant at average values.
Regression Analysis Problems:
a) Establishing the form of dependence. Regarding the nature and form of the relationship between phenomena, a distinction is made between positive linear and nonlinear and negative linear and nonlinear regression.
b) Determining the regression function in the form of a mathematical equation of one type or another and establishing the influence of explanatory variables on the dependent variable.
c) Estimation of unknown values of the dependent variable. Using the regression function, you can reproduce the values of the dependent variable within the interval of specified values of the explanatory variables (i.e., solve the interpolation problem) or evaluate the course of the process outside the specified interval (i.e., solve the extrapolation problem). The result is an estimate of the value of the dependent variable.

Paired regression is an equation describing the relationship between two variables y and x: y = f(x), where y is the dependent variable (resultant attribute) and x is the independent, explanatory variable (factor attribute).

There are linear and nonlinear regressions.
Linear regression: y = a + bx + ε
Nonlinear regressions are divided into two classes: regressions that are nonlinear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, and regressions that are nonlinear with respect to the estimated parameters.
Regressions that are nonlinear in the explanatory variables (but linear in the estimated parameters) include, for example, polynomials of various degrees (y = a + b₁x + b₂x² + ε) and the equilateral hyperbola (y = a + b/x + ε). Regressions that are nonlinear in the estimated parameters include, for example, the power function y = a·x^b·ε and the exponential function y = a·b^x·ε.

The construction of a regression equation comes down to estimating its parameters. To estimate the parameters of regressions that are linear in the parameters, the method of least squares (OLS) is used. It makes it possible to obtain parameter estimates for which the sum of squared deviations of the actual values of the resultant characteristic y from the theoretical values ŷ is minimal:

Σ(y − ŷ)² → min.
For linear equations, and for nonlinear equations reducible to linear form, the following system is solved for a and b:

n·a + b·Σx = Σy;
a·Σx + b·Σx² = Σxy.

Ready-made formulas that follow from this system can also be used:

b = (mean(xy) − mean(x)·mean(y)) / (mean(x²) − mean(x)²) = cov(x, y) / σx²;   a = ȳ − b·x̄.

The closeness of the relationship between the phenomena being studied is assessed by the linear coefficient of pairwise correlation for linear regression:

r_xy = b · σx / σy,

and by the correlation index for nonlinear regression:

ρ = √(1 − Σ(y − ŷ)² / Σ(y − ȳ)²).
The quality of the constructed model will be assessed by the coefficient (index) of determination, as well as the average error of approximation.
The average approximation error is the average deviation of the calculated values from the actual ones:

Ā = (1/n) · Σ|(y − ŷ) / y| · 100%.

The permissible limit of its values is no more than 8-10%.
The average elasticity coefficient shows by what percentage, on average, the result y will change from its average value when the factor x changes by 1% from its average value:

Ē = f′(x) · x̄ / ȳ   (for linear regression, Ē = b · x̄ / ȳ).

The task of analysis of variance consists of analyzing the variance of the dependent variable:

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²,

where Σ(y − ȳ)² is the total sum of squared deviations; Σ(ŷ − ȳ)² is the sum of squared deviations due to regression (“explained” or “factorial”); Σ(y − ŷ)² is the residual sum of squared deviations.

The share of the variance explained by regression in the total variance of the resultant characteristic y is characterized by the coefficient (index) of determination R²:

R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)².

The coefficient of determination is the square of the correlation coefficient (or of the correlation index).

The F-test, assessing the quality of the regression equation, consists of testing the hypothesis H₀ that the regression equation and the indicator of closeness of the relationship are statistically insignificant. To do this, the actual value F_fact is compared with the critical (tabular) value F_table of the Fisher F-criterion. F_fact is determined from the ratio of the factor and residual variances, each calculated per degree of freedom:

F_fact = [Σ(ŷ − ȳ)² / m] / [Σ(y − ŷ)² / (n − m − 1)] = (R² / (1 − R²)) · (n − m − 1) / m,

where n is the number of population units and m is the number of parameters for the variables x.

F_table is the maximum possible value of the criterion under the influence of random factors for the given degrees of freedom and significance level α. The significance level α is the probability of rejecting the hypothesis given that it is true; usually α is taken as 0.05 or 0.01.

If F_table < F_fact, then H₀, the hypothesis about the random nature of the estimated characteristics, is rejected, and their statistical significance and reliability are recognized. If F_table > F_fact, then the hypothesis H₀ is not rejected and the regression equation is recognized as statistically insignificant and unreliable.
To assess the statistical significance of the regression and correlation coefficients, Student's t-test is calculated, along with confidence intervals for each indicator. The hypothesis H₀ of the random nature of the indicators, i.e., of their insignificant difference from zero, is put forward. The significance of the regression and correlation coefficients is assessed using Student's t-test by comparing their values with the magnitude of the random error:

t_a = a / m_a;   t_b = b / m_b;   t_r = r / m_r.

The random errors of the linear regression parameters and of the correlation coefficient are determined by the formulas:

m_a = S_res · √(Σx²) / (n · σx);   m_b = S_res / (σx · √n);   m_r = √((1 − r²) / (n − 2)),

where S_res = √(Σ(y − ŷ)² / (n − 2)) is the residual standard deviation.

Comparing the actual and critical (tabular) values of the t-statistics, t_table and t_fact, we accept or reject the hypothesis H₀.

The relationship between Fisher's F-test and Student's t-statistic is expressed by the equality F = t_b².
If t_table < t_fact, then H₀ is rejected, i.e., a, b and r do not differ from zero by chance but were formed under the influence of the systematically acting factor x. If t_table > t_fact, the hypothesis H₀ is not rejected, and the random nature of the formation of a, b or r is recognized.

To calculate the confidence intervals, we determine the maximum error Δ for each indicator:

Δ_a = t_table · m_a;   Δ_b = t_table · m_b.

The formulas for calculating the confidence intervals are as follows:

γ_a = a ± Δ_a, i.e., a_min = a − Δ_a, a_max = a + Δ_a;
γ_b = b ± Δ_b, i.e., b_min = b − Δ_b, b_max = b + Δ_b.

If zero falls within the confidence interval, i.e., the lower limit is negative and the upper limit is positive, the estimated parameter is taken to be zero, since it cannot simultaneously take both positive and negative values.
The forecast value is determined by substituting the corresponding (forecast) value x_p into the regression equation: ŷ_p = a + b·x_p. The average standard error of the forecast is calculated:

m_ŷ = S_res · √(1 + 1/n + (x_p − x̄)² / Σ(x − x̄)²),

where S_res = √(Σ(y − ŷ)² / (n − m − 1)),

and a confidence interval for the forecast is constructed:

γ_ŷ = ŷ_p ± Δ_ŷ,   where Δ_ŷ = t_table · m_ŷ.
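A sketch of the F-test, the t-test for b and the confidence interval for b, following the formulas above; scipy is used only to supply the critical (tabular) values:

```python
# Significance tests for paired linear regression (names are ours).
import numpy as np
from scipy import stats

def significance(x, y, alpha=0.05):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = ((x * y).mean() - x.mean() * y.mean()) / x.var()
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    s2 = np.sum(resid ** 2) / (n - 2)                # residual variance
    m_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # random error of b
    t_b = b / m_b
    F = t_b ** 2                                     # F = t_b^2 in paired regression
    t_tab = stats.t.ppf(1 - alpha / 2, n - 2)        # t_table
    F_tab = stats.f.ppf(1 - alpha, 1, n - 2)         # F_table
    delta_b = t_tab * m_b                            # maximum error for b
    return F, F_tab, t_b, t_tab, (b - delta_b, b + delta_b)
```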

Example solution

Task No. 1. For seven territories of the Ural region in 199X, the values of two characteristics are known (Table 1): y, expenditures on the purchase of food products as a share of total expenditures, %; x, the average daily wage of one worker, rub.
Required: 1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power (you must first perform the procedure of linearization of the variables by taking the logarithm of both parts);
c) exponential;
d) an equilateral hyperbola (you also need to figure out how to pre-linearize this model).
2. Evaluate each model using the average error of approximation and Fisher's F test.

Solution (Option No. 1)

1a. To calculate the parameters a and b of the linear regression y = a + b·x (the calculation can be done using a calculator), we solve the system of normal equations for a and b:

n·a + b·Σx = Σy;
a·Σx + b·Σx² = Σxy.

Based on the initial data, we calculate Σy, Σx, Σyx, Σx², Σy²:
№      y      x      y·x        x²         y²         ŷx      y − ŷx   Aᵢ, %
1      68.8   45.1   3102.88    2034.01    4733.44    61.3     7.5     10.9
2      61.2   59.0   3610.80    3481.00    3745.44    56.5     4.7      7.7
3      59.9   57.2   3426.28    3271.84    3588.01    57.1     2.8      4.7
4      56.7   61.8   3504.06    3819.24    3214.89    55.5     1.2      2.1
5      55.0   58.8   3234.00    3457.44    3025.00    56.5    −1.5      2.7
6      54.3   47.2   2562.96    2227.84    2948.49    60.5    −6.2     11.4
7      49.3   55.2   2721.36    3047.04    2430.49    57.8    −8.5     17.2
Total  405.2  384.3  22162.34   21338.41   23685.76   405.2    0.0     56.7
Mean   57.89  54.90  3166.05    3048.34    3383.68    X        X        8.1
σ       5.74   5.86  X          X          X          X        X        X
σ²     32.92  34.34  X          X          X          X        X        X


Using the table totals, we obtain b = (mean(yx) − ȳ·x̄) / σx² ≈ −0.35 and a = ȳ − b·x̄ ≈ 76.88.

Regression equation: ŷ = 76.88 − 0.35x. With an increase in the average daily wage by 1 rub., the share of expenditures on the purchase of food products decreases by an average of 0.35 percentage points.
Let's calculate the linear pairwise correlation coefficient:

r_xy = b · σx / σy = −0.35 · 5.86 / 5.74 ≈ −0.357.

The relationship is moderate and inverse.
Let's determine the coefficient of determination: r² = (−0.357)² ≈ 0.127. That is, 12.7% of the variation in the result is explained by the variation in the factor x. Substituting the actual values of x into the regression equation, we determine the theoretical (calculated) values ŷ. Let's find the value of the average approximation error:

Ā = (1/n) · Σ|(y − ŷ) / y| · 100% = 56.7 / 7 ≈ 8.1%.

On average, the calculated values deviate from the actual ones by 8.1%.
Let's calculate the F-criterion:

F = (r² / (1 − r²)) · (n − 2) = (0.127 / 0.873) · 5 ≈ 0.7.

Since F < 1, we consider the inverted ratio F⁻¹ ≈ 1.37, which is below the tabular value. The resulting value indicates the need to accept the hypothesis H₀ about the random nature of the identified dependence and the statistical insignificance of the parameters of the equation and of the indicator of closeness of the relationship.
1b. The construction of the power model ŷ = a·x^b is preceded by the procedure of linearization of the variables. In the example, linearization is performed by taking logarithms of both sides of the equation:

lg y = lg a + b·lg x,   or   Y = C + b·X,

where Y = lg(y), X = lg(x), C = lg(a).

For the calculations we use the data in Table 1.3.

Table 1.3

№      Y        X        Y·X      Y²       X²       ŷx      y − ŷx   (y − ŷx)²   Aᵢ, %
1      1.8376   1.6542   3.0398   3.3768   2.7364   61.0     7.8      60.8       11.3
2      1.7868   1.7709   3.1642   3.1927   3.1361   56.3     4.9      24.0        8.0
3      1.7774   1.7574   3.1236   3.1592   3.0885   56.8     3.1       9.6        5.2
4      1.7536   1.7910   3.1407   3.0751   3.2077   55.5     1.2       1.4        2.1
5      1.7404   1.7694   3.0795   3.0290   3.1308   56.3    −1.3       1.7        2.4
6      1.7348   1.6739   2.9039   3.0095   2.8019   60.2    −5.9      34.8       10.9
7      1.6928   1.7419   2.9487   2.8656   3.0342   57.4    −8.1      65.6       16.4
Total  12.3234  12.1587  21.4003  21.7078  21.1355  403.5    1.7     197.9       56.3
Mean    1.7605   1.7370   3.0572   3.1011   3.0194  X        X        28.27       8.0
σ       0.0425   0.0484   X        X        X       X        X        X           X
σ²      0.0018   0.0023   X        X        X       X        X        X           X

Let's calculate C and b:

b = (mean(YX) − Ȳ·X̄) / σX²;   C = Ȳ − b·X̄.

We obtain the linear equation Ŷ = C + b·X. Performing its potentiation (a = 10^C), we obtain the power model ŷ = a·x^b.
Substituting the actual values of x into this equation, we obtain the theoretical values of the result ŷ. Using them, we calculate the closeness-of-fit indicators: the correlation index ρ = √(1 − Σ(y − ŷ)² / Σ(y − ȳ)²) = √(1 − 197.9 / 230.4) ≈ 0.376 and the average approximation error Ā ≈ 8.0%.

The indicators of the power model show that it describes the relationship slightly better than the linear function.
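The linearization in 1b is mechanical enough to automate. A sketch, assuming base-10 logarithms as in the table above:

```python
# Fit a power model y = a * x**b by linearizing with base-10 logarithms.
import numpy as np

def fit_power(x, y):
    X, Y = np.log10(x), np.log10(y)
    b, C = np.polyfit(X, Y, 1)   # linear fit Y = C + b*X
    return 10 ** C, b            # potentiation: a = 10**C, model y = a * x**b
```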

1c. The construction of the exponential-curve equation ŷ = a·b^x is preceded by the procedure of linearizing the variables by taking logarithms of both sides of the equation:

lg y = lg a + x·lg b,   or   Y = C + B·x.

For the calculations we use the data in the table below.

№      Y        x      Y·x        Y²       x²         ŷx      y − ŷx   (y − ŷx)²   Aᵢ, %
1      1.8376   45.1    82.8758   3.3768   2034.01   60.7     8.1      65.61      11.8
2      1.7868   59.0   105.4212   3.1927   3481.00   56.4     4.8      23.04       7.8
3      1.7774   57.2   101.6673   3.1592   3271.84   56.9     3.0       9.00       5.0
4      1.7536   61.8   108.3725   3.0751   3819.24   55.5     1.2       1.44       2.1
5      1.7404   58.8   102.3355   3.0290   3457.44   56.4    −1.4       1.96       2.5
6      1.7348   47.2    81.8826   3.0095   2227.84   60.0    −5.7      32.49      10.5
7      1.6928   55.2    93.4426   2.8656   3047.04   57.5    −8.2      67.24      16.6
Total  12.3234  384.3  675.9974  21.7078  21338.41  403.4    −1.8     200.78      56.3
Mean    1.7605   54.9   96.5711   3.1011   3048.34   X        X        28.68       8.0
σ       0.0425    5.86  X         X        X         X        X        X           X
σ²      0.0018   34.339 X         X        X         X        X        X           X

The values of the regression parameters A and B are:

B = (mean(Yx) − Ȳ·x̄) / σx²;   A = Ȳ − B·x̄.

The resulting linear equation is Ŷ = A + B·x. Potentiating it, we write the model in the usual form ŷ = a·b^x.

We evaluate the closeness of the relationship through the correlation index:

ρ = √(1 − Σ(y − ŷ)² / Σ(y − ȳ)²) = √(1 − 200.78 / 230.4) ≈ 0.359.
During their studies, students very often encounter a variety of equations. One of them, the regression equation, is discussed in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. Such equalities are used in statistics and econometrics.

Definition of regression

In mathematics, regression means a quantity that describes the dependence of the average value of one set of data on the values of another quantity. The regression equation shows, as a function of a particular characteristic, the average value of another characteristic. The regression function has the form of a simple equation y = f(x), in which y acts as the dependent variable and x as the independent variable (factor attribute).

What are the types of relationships between variables?

In general, there are two opposing types of relationships: correlation and regression.

The first is characterized by equality of the conditional variables; in this case it is not known with certainty which variable depends on the other.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to construct a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

Today, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m·x + E. A hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. A logarithmically linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and nonlinear

Two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x₁, x₂, …, x_c) + E. In this situation, y acts as the dependent variable and the x's act as explanatory variables. The variable E is stochastic; it includes the influence of other factors in the equation. A nonlinear regression equation is a bit controversial: on the one hand, it is not linear relative to the indicators taken into account; on the other hand, in the role of evaluating indicators it is linear.

Inverse and paired types of regressions

An inverse function is a type of function that needs to be converted to linear form. In the most traditional application programs it has the form y = 1 / (c + m·x + E). A pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

Concept of correlation

This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value fluctuates within the interval [−1; +1]. A negative value indicates an inverse relationship, a positive one a direct relationship. If the coefficient takes a value equal to 0, there is no relationship. The closer the value is to 1, the stronger the relationship between the parameters; the closer to 0, the weaker it is.

Methods

Parametric methods of correlation analysis can assess the strength of the relationship. They are based on distribution estimates and are used to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence, the form of the regression function, and to evaluate the indicators of the selected relationship formula. The correlation field is used to identify a relationship: all known data are plotted in a rectangular two-dimensional coordinate system, with the values of the explanatory factor marked along the abscissa axis and the values of the dependent factor along the ordinate axis. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of a practically complete absence of a relationship. A value between 30% and 70% indicates a relationship of medium closeness. A 100% indicator is evidence of a functional relationship.

A nonlinear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation coefficient. It indicates how closely the presented set of indicators is related to the characteristic being studied, and it can also describe the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

In order to calculate the multiple correlation indicator, it is necessary to calculate its index.

Least squares method

This method is a way of estimating regression factors. Its essence lies in minimizing the sum of squared deviations of the actual values from those given by the fitted function.

A pairwise linear regression equation can be estimated using such a method. This type of equations is used when a paired linear relationship is detected between indicators.

Equation Parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator y when the variable x increases (or decreases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x is not zero, the factor c carries no economic meaning; the only thing that matters is the sign in front of it. A minus means the result changes more slowly than the factor; a plus indicates an accelerated change in the result.

Each parameter that changes the value of the regression equation can be expressed through an equation. For example, factor c has the form c = y - mx.

Grouped data

There are task conditions in which all information is grouped by the attribute x, and for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator depending on x changes, and the grouped information thus helps to find the regression equation, serving as a basis for analyzing the relationship. However, this method has its drawbacks: average indicators are often subject to external fluctuations that do not reflect the pattern of the relationship but merely mask its "noise", and averages show the pattern of the relationship much worse than a linear regression equation does. Nevertheless, they can be used as a basis for finding an equation. By multiplying the size of an individual group by its corresponding average, one obtains the sum of y within the group; adding up all these sums gives the overall total of y. It is a little more difficult to calculate the sum xy: if the intervals are small, the x indicator can conditionally be taken as the same for all units within a group and multiplied by the sum of y to find the sum of the products of x and y; all these sums are then added together to obtain the total sum xy.

Multiple regression equation: assessing the significance of the relationship

As discussed earlier, multiple regression is a function of the form y = f(x₁, x₂, …, x_m) + E. Most often, such an equation is used to solve problems of supply and demand for a product or of interest income on repurchased shares, and to study the causes and shape of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, but at the microeconomic level it is applied somewhat less often.

The main task of multiple regression is to build a model of data containing a huge amount of information in order to determine what influence each of the factors, individually and in their totality, has on the indicator being modeled and on its coefficients. The regression equation can take a wide variety of forms; to assess the relationship, two types of functions are usually used: linear and nonlinear.

The linear function is depicted as the following relationship: y = a₀ + a₁x₁ + a₂x₂ + … + a_m·x_m. Here a₁, a₂, …, a_m are considered "pure" regression coefficients. They characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, provided the values of the other indicators remain stable.

Nonlinear equations have, for example, the form of a power function: y = a·x₁^b₁ · x₂^b₂ · … · x_m^b_m. In this case, the indicators b₁, b₂, …, b_m are called elasticity coefficients; they show how the result will change (by what %) with an increase (decrease) of the corresponding indicator x by 1%, the other factors held stable.
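A multiple linear model of this kind is fitted by ordinary least squares; a multiple power model is fitted the same way after taking logarithms of y and of every factor. A minimal sketch with hypothetical illustration data:

```python
# Fit y = a0 + a1*x1 + a2*x2 by least squares (data are made up).
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.1, 1.9, 3.2, 4.8, 5.1, 6.3])
y  = np.array([5.2, 6.9, 9.8, 13.1, 14.9, 18.2])

X = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # "pure" regression coefficients
print(coef)                                      # [a0, a1, a2]
```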

What factors need to be taken into account when constructing multiple regression

In order to correctly build multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationships between economic factors and what is being modeled. Factors that will need to be included must meet the following criteria:

  • Must be subject to quantitative measurement. In order to use a factor that describes the quality of an object, in any case it should be given a quantitative form.
  • There should be no intercorrelation of the factors, nor a functional relationship between them. Otherwise the system of normal equations becomes ill-conditioned, which entails unreliable and unstable estimates.
  • In the case of a very high correlation indicator, there is no way to isolate the influence of individual factors on the final result, so the coefficients become uninterpretable.

Construction methods

There are a great many methods and techniques explaining how factors can be selected for the equation. However, all of them are based on selecting coefficients using a correlation indicator. Among them are:

  • Elimination method.
  • Switching method.
  • Stepwise regression analysis.

The first method involves filtering out all coefficients from the total set. The second method involves introducing many additional factors. Well, the third is the elimination of factors that were previously used for the equation. Each of these methods has a right to exist. They have their pros and cons, but they can all solve the issue of eliminating unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.

Multivariate analysis methods

Such methods for determining factors are based on consideration of individual combinations of interrelated characteristics. These include discriminant analysis, shape recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, but it appeared due to the development of the component method. All of them apply in certain circumstances, subject to certain conditions and factors.

When there is a correlation relationship between factor and resultant characteristics, doctors often have to establish by what amount the value of one characteristic may change when the other changes by a generally accepted unit of measurement or by a unit established by the researcher.

For example, how will the body weight of schoolchildren in grade 1 (girls or boys) change if their height increases by 1 cm? For these purposes, the method of regression analysis is used.

The regression analysis method is most often used to develop normative scales and standards of physical development.

  1. Definition of regression. Regression is a function that allows one to determine, from the average value of one characteristic, the average value of another characteristic correlated with the first.

    For this purpose, the regression coefficient and a number of other parameters are used. For example, one can calculate the average number of colds at certain values of the average monthly air temperature in the autumn-winter period.

  2. Determination of the regression coefficient. The regression coefficient is the absolute value by which, on average, the value of one characteristic changes when another associated characteristic changes by a specified unit of measurement.
  3. Regression coefficient formula. R_y/x = r_xy · (σ_y / σ_x),
    where R_y/x is the regression coefficient;
    r_xy is the correlation coefficient between characteristics x and y;
    σ_y and σ_x are the standard deviations of characteristics y and x.

    In our example:
    σ_x = 4.6 (standard deviation of the air temperature in the autumn-winter period);
    σ_y = 8.65 (standard deviation of the number of infectious and cold diseases);
    r_xy = −0.96.
    Thus, R_y/x = −0.96 · (8.65 / 4.6) = −1.8, i.e., when the average monthly air temperature (x) decreases by 1 degree, the average number of infectious and cold diseases (y) in the autumn-winter period will change by 1.8 cases.

  4. Regression equation. y = M_y + R_y/x · (x − M_x),
    where y is the expected average value of the characteristic to be determined when the average value of the other characteristic (x) changes;
    x is the known average value of the other characteristic;
    R_y/x is the regression coefficient;
    M_x, M_y are the known average values of characteristics x and y.

    For example, the average number of infectious and cold diseases (y) can be determined without special measurements for any average monthly air temperature (x). So, if x = −9°, R_y/x = −1.8 diseases per degree, M_x = −7° and M_y = 20 diseases, then y = 20 + (−1.8) × (−9 − (−7)) = 20 + 3.6 = 23.6 diseases.
    This equation applies in the case of a linear relationship between the two characteristics (x and y).
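The arithmetic of points 3-4 is easy to verify; a small sketch using the figures given above:

```python
# Verifying the worked example with the stated inputs.
sigma_x, sigma_y = 4.6, 8.65   # SD of temperature (x) and of disease counts (y)
r_xy = -0.96                   # correlation coefficient
M_x, M_y = -7.0, 20.0          # mean temperature and mean number of diseases

R_yx = r_xy * (sigma_y / sigma_x)   # regression coefficient, ~ -1.8 cases/degree
x = -9.0
y = M_y + R_yx * (x - M_x)          # 20 + (-1.8) * (-2) = 23.6
print(f"R_y/x = {R_yx:.1f}, y({x:.0f} deg) = {y:.1f} diseases")
```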

  5. Purpose of the regression equation. The regression equation is used to construct a regression line, which allows one, without special measurements, to determine any average value (y) of one characteristic for a given value (x) of the other. Based on these data, a graph, the regression line, is constructed; it can be used to determine the average number of colds at any value of the average monthly temperature within the range between the calculated values.
  6. Regression sigma (formula). σ_Ry/x = σ_y · √(1 − r_xy²),
    where σ_Ry/x is the sigma (standard deviation) of the regression;
    σ_y is the standard deviation of characteristic y;
    r_xy is the correlation coefficient between characteristics x and y.

    So, if σ_y = 8.65 (standard deviation of the number of colds) and r_xy = −0.96 (the correlation coefficient between the number of colds (y) and the average monthly air temperature in the autumn-winter period (x)), then σ_Ry/x = 8.65 × √(1 − 0.96²) ≈ 2.42.
  7. Purpose of the regression sigma. It characterizes the measure of diversity of the resulting characteristic (y).

    For example, it characterizes the diversity of the number of colds at a certain value of the average monthly air temperature in the autumn-winter period. Thus, the average number of colds at air temperature x₁ = −6° can range from 15.78 to 20.62 diseases.
    At x₂ = −9°, the average number of colds can range from 21.18 to 26.02 diseases, etc.

    The regression sigma is used to construct a regression scale, which reflects the deviation of the values of the resulting characteristic from its average value plotted on the regression line.

  8. Data required to calculate and plot the regression scale:
    • the regression coefficient, R_y/x;
    • the regression equation, y = M_y + R_y/x·(x − M_x);
    • the regression sigma, σ_Ry/x.
  9. Sequence of calculations and graphical representation of the regression scale.
    • Determine the regression coefficient using the formula (see paragraph 3). For example, determine how much body weight will change on average (at a certain age, depending on gender) if average height changes by 1 cm.
    • Using the regression equation formula (see paragraph 4), determine what, for example, body weight will be on average (y₁, y₂, y₃ …)* for certain height values (x₁, x₂, x₃ …).
      ________________
      * The value of "y" should be calculated for at least three known values of "x".

      At the same time, the average values of body weight and height (M_x and M_y) for a certain age and gender are known.

    • Calculate the regression sigma, knowing the corresponding values of σ_y and r_xy and substituting them into the formula (see paragraph 6).
    • Based on the known values x₁, x₂, x₃ and the corresponding average values y₁, y₂, y₃, as well as the smallest (y − σ_Ry/x) and largest (y + σ_Ry/x) values of y, construct the regression scale.

      To represent the regression scale graphically, the values x₁, x₂, x₃ are first marked along the abscissa axis and the corresponding values y₁, y₂, y₃ along the ordinate axis, i.e., a regression line is constructed; for example, the dependence of body weight (y) on height (x).

      Then, at the points corresponding to y₁, y₂, y₃, the numeric values of the regression sigma are marked, i.e., the smallest and largest values of y₁, y₂, y₃ are found on the graph.
  10. Practical use of the regression scale. Normative scales and standards are developed, in particular for physical development. Using a standard scale, one can give an individual assessment of a child's development. Physical development is assessed as harmonious if, for example, at a certain height the child's body weight lies within one regression sigma of the calculated average body weight (y) for that height (x), i.e., within y ± 1σ_Ry/x.

    Physical development is considered disharmonious in terms of body weight if the child's body weight for a certain height lies within the second regression sigma: y ± 2σ_Ry/x.

    Physical development is sharply disharmonious, due to either excess or insufficient body weight, if the body weight for a certain height lies within the third regression sigma: y ± 3σ_Ry/x.

According to the results of a statistical study of the physical development of 5-year-old boys, it is known that their average height (x) is 109 cm and their average body weight (y) is 19 kg. The correlation coefficient between height and body weight is +0.9; the standard deviations are presented in the table.

Required:

  • calculate the regression coefficient;
  • using the regression equation, determine what the expected body weight of 5-year-old boys will be with a height equal to x1 = 100 cm, x2 = 110 cm, x3 = 120 cm;
  • calculate the regression sigma, construct a regression scale, and present the results of its solution graphically;
  • draw appropriate conclusions.

The conditions of the problem and the results of its solution are presented in the summary table.

Table 1

Conditions of the problem:
  Height (x): M = 109 cm, σ = ±4.4 cm
  Body mass (y): M = 19 kg, σ = ±0.8 kg
  r_xy = +0.9

Results of solving the problem:
  Regression coefficient: R_y/x = 0.16
  Regression sigma: σ_Ry/x = ±0.35 kg

Regression scale (expected body weight, kg):
  x = 100 cm:  y = 17.56 kg,  y − σ_Ry/x = 17.21 kg,  y + σ_Ry/x = 17.91 kg
  x = 110 cm:  y = 19.16 kg,  y − σ_Ry/x = 18.81 kg,  y + σ_Ry/x = 19.51 kg
  x = 120 cm:  y = 20.76 kg,  y − σ_Ry/x = 20.41 kg,  y + σ_Ry/x = 21.11 kg

Solution.

Conclusion. Thus, the regression scale, within the calculated values of body weight, allows one to determine body weight at any other value of height or to assess a child's individual development. To do this, a perpendicular is restored to the regression line.
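The whole of Table 1 can be reproduced from its four inputs; a sketch (R_y/x is rounded to 0.16 before use, as in the table itself, so the printed values match the table):

```python
# Rebuild the regression scale of Table 1 from its inputs.
import numpy as np

M_x, sigma_x = 109.0, 4.4   # mean height, cm, and its standard deviation
M_y, sigma_y = 19.0, 0.8    # mean body weight, kg, and its standard deviation
r_xy = 0.9                  # correlation between height and weight

R_yx = round(r_xy * sigma_y / sigma_x, 2)    # regression coefficient: 0.16 kg/cm
sigma_R = sigma_y * np.sqrt(1 - r_xy ** 2)   # regression sigma: ~0.35 kg
for x in (100, 110, 120):
    y = M_y + R_yx * (x - M_x)               # expected weight at height x
    print(f"x = {x} cm: y = {y:.2f} kg, scale {y - sigma_R:.2f}..{y + sigma_R:.2f} kg")
```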


Regression analysis underlies the creation of most econometric models, including cost estimation models. To build valuation models, this method can be used if the number of analogues (comparable objects) and the number of cost factors (comparison elements) are related as n > (5-10) × k, i.e., there should be 5-10 times more analogues than cost factors. The same requirement on the ratio of the amount of data to the number of factors applies to other tasks: establishing a relationship between the cost and the consumer parameters of an object; justifying a procedure for calculating corrective indices; identifying price trends; establishing a relationship between wear and changes in influencing factors; obtaining dependencies for calculating cost standards, etc. Compliance with this requirement is necessary to reduce the likelihood of working with a data sample that does not satisfy the requirement of normal distribution of random variables.

The regression relationship reflects only the average trend of change in the resulting variable (for example, cost) in response to changes in one or more factor variables (for example, location, number of rooms, area, floor). This is the difference between a regression relationship and a functional one, in which the value of the resulting variable is strictly defined for a given value of the factor variables.

The presence of a regression relationship between the resulting variable y and the factor variables x₁, …, x_k indicates that this relationship is determined not only by the influence of the selected factor variables, but also by the influence of variables some of which are generally unknown and others of which cannot be assessed and taken into account:

y = f(x₁, …, x_k) + ε.

The influence of the unaccounted variables is represented by the second term of this equation, ε, which is called the approximation error.

The following types of regression dependencies are distinguished:

  • paired regression: a relationship between two variables (resultant and factor);
  • multiple regression: a relationship between one resultant variable and two or more factor variables included in the study.

The main task of regression analysis is the quantitative estimation of the closeness of the relationship between the variables (in paired regression) or between the resultant variable and multiple factor variables (in multiple regression). The closeness of the relationship is quantitatively expressed by the correlation coefficient.

The use of regression analysis makes it possible to establish the pattern of influence of the main factors (hedonic characteristics) on the indicator being studied, both in their entirety and for each factor separately. With the help of regression analysis as a method of mathematical statistics it is possible, first, to find and describe the form of the analytical dependence of the resulting (sought) variable on the factor variables and, second, to evaluate the closeness of this dependence.

By solving the first problem, a mathematical regression model is obtained, with the help of which the desired indicator is then calculated for given values ​​of the factors. Solving the second problem allows us to establish the reliability of the calculated result.

Thus, regression analysis can be defined as a set of formal (mathematical) procedures designed to measure the closeness, direction and analytical expression of the form of the relationship between the resultant and factor variables, i.e., the output of such an analysis should be a structurally and quantitatively defined statistical model of the form:

y = f(x₁, …, x_k),

where y is the average value of the resulting variable (the sought indicator, for example, cost, rent, or capitalization rate) over its n observations; xᵢ is the value of the i-th factor variable (the i-th cost factor); k is the number of factor variables.

The function f(x₁, …, x_k), which describes the dependence of the resulting variable on the factor variables, is called the regression equation (function). The term "regression" (Latin regressio, retreat or return to something) is associated with the specifics of one of the particular problems solved at the stage when the method was being formed, and currently does not reflect the entire essence of the method, but continues to be used.

Regression analysis in the general case includes the following steps:

  • forming a sample of homogeneous objects and collecting initial information about these objects;
  • selecting the main factors influencing the resulting variable;
  • checking the sample for normality using χ² or the binomial criterion;
  • accepting a hypothesis about the form of the relationship;
  • mathematical processing of the data;
  • obtaining a regression model;
  • assessing its statistical indicators;
  • verification calculations using the regression model;
  • analysis of the results.

The specified sequence of operations applies both when studying a paired relationship between a factor variable and one resultant variable, and when studying a multiple relationship between a resultant variable and several factor variables.

The use of regression analysis imposes certain requirements on the initial information:

  • the statistical sample of objects must be homogeneous in functional and structural-technological terms;
  • it must be quite numerous;
  • the cost indicator under study, i.e., the resulting variable (price, cost, expenses), must be brought to the same calculation conditions for all objects in the sample;
  • the factor variables must be measured accurately enough;
  • the factor variables must be independent or minimally dependent.

The requirements for homogeneity and completeness of the sample are in conflict: the stricter the selection of objects based on their homogeneity, the smaller the sample obtained, and, conversely, to enlarge the sample it is necessary to include objects that are not very similar to each other.

After data on a group of homogeneous objects has been collected, they are analyzed to establish the form of the relationship between the resulting and factor variables in the form of a theoretical regression line. The process of finding a theoretical regression line consists of a reasoned choice of the approximating curve and the calculation of the coefficients of its equation. A regression line is a smooth curve (in a particular case, a straight line) that describes, using a mathematical function, the general trend of the dependence under study and smooths out irregular, random outliers caused by side factors.

To describe paired regression dependencies in valuation tasks, the following functions are most often used: linear y = a₀ + a₁x + ε; power y = a₀·x^a₁ + ε; exponential y = a₀·a₁^x + ε; linear-exponential y = a₀ + a₁·a₂^x + ε. Here ε is the approximation error caused by the action of unaccounted random factors.

In these functions, y is the resulting variable; x is the factor variable (factor); a₀, a₁, a₂ are the parameters of the regression model (regression coefficients).

The linear-exponential model belongs to the class of so-called hybrid models of the form:

where xᵢ (i = 1, …, l) are the values of the factors;

bᵢ (i = 0, …, l) are the coefficients of the regression equation.

In this equation, the components A, B and Z correspond to the cost of individual components of the asset being valued (for example, the cost of the land plot and the cost of improvements), while the parameter Q is common: it adjusts the value of all components of the asset for a common factor of influence, such as location.

The values of the factors that appear in the exponent of the corresponding coefficients are binary variables (0 or 1). The factors at the base of the exponent are discrete or continuous variables.

Factors associated with multiplicative coefficients are also continuous or discrete.

Specification is carried out, as a rule, using an empirical approach and includes two stages:

  • plotting the points of the regression field on a graph;
  • graphical (visual) analysis of the type of a possible approximating curve.

The type of regression curve cannot always be selected immediately. To determine it, first plot the points of the regression field based on the original data. Then visually draw a line along the position of the points, trying to find out the qualitative pattern of the connection: uniform growth or uniform decline, growth (decrease) with an increase (decrease) in the rate of dynamics, smooth approach to a certain level.

This empirical approach is complemented by logical analysis, starting from already known ideas about the economic and physical nature of the factors under study and their mutual influence.

For example, it is known that the dependences of resulting variables, such as economic indicators (prices, rents), on a number of factor variables, the price-forming factors (distance from the center of the settlement, area, etc.), are nonlinear in nature and can be described quite strictly by power, exponential or quadratic functions. But for small ranges of factor variation, acceptable results can also be obtained with a linear function.

If, however, it is still impossible to immediately make a confident choice of any one function, then two or three functions are selected, their parameters are calculated, and then, using the appropriate criteria for the closeness of the connection, the function is finally selected.

In regression theory, the process of finding the shape of the curve is called specification of the model, and the calculation of its coefficients is called calibration of the model.

If it is found that the resulting variable y depends on several factor variables (factors) x₁, x₂, …, x_k, then a multiple regression model is built. Typically, three forms of multiple relationship are used: linear y = a₀ + a₁x₁ + a₂x₂ + … + a_k·x_k; exponential y = a₀·a₁^x₁·a₂^x₂·…·a_k^x_k; power y = a₀·x₁^a₁·x₂^a₂·…·x_k^a_k; or combinations thereof.

Exponential and power functions are more universal, since they approximate nonlinear relationships, which make up the majority of the dependencies studied in valuation. In addition, they can be used both in valuing objects by the statistical modeling method in mass appraisal and in the direct comparison method in individual appraisal when establishing correction factors.

At the calibration stage, the parameters of the regression model are calculated by the least squares method, the essence of which is that the sum of squared deviations of the calculated values of the resulting variable ŷᵢ, i.e., the values calculated using the selected relationship equation, from the actual values yᵢ should be minimal:

Q = Σ(yᵢ − ŷᵢ)² → min.

The values ŷᵢ and yᵢ are known, therefore Q is a function only of the coefficients of the equation. To find the minimum of Q, one takes the partial derivatives of Q with respect to the coefficients of the equation and equates them to zero.

As a result, we obtain a system of normal equations whose number equals the number of coefficients of the desired regression equation to be determined.

Suppose we need to find the coefficients of the linear equation y = a₀ + a₁x. The sum of squared deviations has the form:

Q = Σᵢ₌₁ⁿ (yᵢ − a₀ − a₁xᵢ)².

We differentiate the function Q with respect to the unknown coefficients a₀ and a₁ and equate the partial derivatives to zero. After the transformations we get:

Σy = n·a₀ + a₁·Σx;
Σxy = a₀·Σx + a₁·Σx²,

where n is the number of original actual values of y (the number of analogues).

The given procedure for calculating the coefficients of the regression equation is also applicable to nonlinear dependencies, if these dependencies can be linearized, i.e., reduced to a linear form by a change of variables. Power and exponential functions acquire a linear form after taking logarithms and an appropriate change of variables. For example, a power function after taking logarithms takes the form ln y = ln a₀ + a₁·ln x. After the change of variables Y = ln y, A₀ = ln a₀, X = ln x, we obtain the linear function

Y = A₀ + a₁X,

whose coefficients are found in the manner described above.

The least squares method is also used to calculate the coefficients of a multiple regression model. Thus, the system of normal equations for calculating a linear function with two variables x₁ and x₂, after a series of transformations, looks like this:

Σy = n·a₀ + a₁·Σx₁ + a₂·Σx₂;
Σyx₁ = a₀·Σx₁ + a₁·Σx₁² + a₂·Σx₁x₂;
Σyx₂ = a₀·Σx₂ + a₁·Σx₁x₂ + a₂·Σx₂².

Usually this system of equations is solved by methods of linear algebra. A multiple power function is reduced to linear form by taking logarithms and changing variables in the same way as a paired power function.
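In matrix form the same system is (XᵀX)·a = Xᵀy, which a linear-algebra routine solves directly. A sketch for two factor variables (names are ours):

```python
# Solve the system of normal equations (X'X) a = X'y directly.
import numpy as np

def normal_equations(x1, x2, y):
    x1, x2, y = (np.asarray(v, float) for v in (x1, x2, y))
    X = np.column_stack([np.ones_like(y), x1, x2])  # design matrix with intercept
    return np.linalg.solve(X.T @ X, X.T @ y)        # coefficients [a0, a1, a2]
```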

When using hybrid models, multiple regression coefficients are found using numerical procedures of the method of successive approximations.

To make a final choice among several regression equations, each equation must be tested for the closeness of the relationship, which is measured by the correlation coefficient, the variance and the coefficient of variation. Student's and Fisher's tests can also be used for the assessment. The greater the closeness of relationship a curve exhibits, the more preferable it is, all other things being equal.

If a problem of this class is being solved, where the dependence of a cost indicator on cost factors must be established, then the desire to take into account as many influencing factors as possible and thereby build a more accurate multiple regression model is understandable. However, expanding the number of factors is hampered by two objective limitations. First, building a multiple regression model requires a much larger sample of objects than building a paired model. It is generally accepted that the number of objects in the sample should exceed the number of factors k by at least 5-10 times. It follows that building a model with three influencing factors requires collecting a sample of approximately 20 objects with differing sets of factor values. Second, the factors selected for the model must be sufficiently independent of each other in their influence on the cost indicator. This is not easy to ensure, since the sample usually combines objects belonging to the same family, for which many factors change naturally from object to object.

The quality of regression models is usually checked using the following statistical indicators.

Standard deviation of the regression equation error (estimation error):

S_e = √( Σ(yᵢ − ŷᵢ)² / (n − k − 1) ),

where n is the sample size (number of analogues);
k is the number of factors (cost factors);
Σ(yᵢ − ŷᵢ)² is the error unexplained by the regression equation (Fig. 3.2);
yᵢ is the actual value of the resulting variable (for example, cost); ŷᵢ is the calculated value of the resulting variable.

This indicator is also called the standard error of estimation (RMS error). In the figure, the dots indicate specific sample values, the horizontal line indicates the sample average, and the sloping dash-dotted line is the regression line.

Fig. 3.2.

The standard deviation of the estimation error measures the amount of deviation of the actual values of y from the corresponding calculated values ŷ obtained with the regression model. If the sample on which the model is based obeys the normal distribution law, then it can be argued that 68% of the real values of y lie in the range ŷ ± S_e around the regression line, and 95% lie in the range ŷ ± 2S_e. This indicator is convenient because the units of S_e match the units of y. In this regard, it can be used to indicate the accuracy of the result obtained in the valuation process. For example, in a certificate of value one can indicate that the market value V obtained with the regression model lies, with 95% probability, in the range from (V − 2S_e) to (V + 2S_e).

Coefficient of variation of the resulting variable:

var = S_e / ȳ × 100%,

where ȳ is the average value of the resulting variable (Fig. 3.2).

In regression analysis, the coefficient of variation var is the standard deviation of the result expressed as a percentage of the average value of the resulting variable. The coefficient of variation can serve as a criterion of the predictive quality of the resulting regression model: the smaller the value of var, the higher the predictive quality of the model. The use of the coefficient of variation is preferable to S_e, since it is a relative indicator. When using this indicator in practice, it can be recommended not to use a model whose coefficient of variation exceeds 33%, since in that case it cannot be said that the sample data obey the normal distribution law.

Coefficient of determination (squared multiple correlation coefficient):

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)².

This indicator is used to analyze the overall quality of the resulting regression model. It indicates what percentage of the variation of the resulting variable is explained by the influence of all the factor variables included in the model. The coefficient of determination always lies in the range from zero to one. The closer its value is to one, the better the model describes the original data series. The coefficient of determination can be represented differently:

R² = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)²,

i.e., as the ratio of the error explained by the regression model to the total variation. From an economic point of view, this criterion makes it possible to judge what percentage of the price variation is explained by the regression equation.

It is impossible to specify an exact acceptability limit of R² for all cases. Both the sample size and the meaningful interpretation of the equation must be taken into account. As a rule, when studying data on objects of the same type obtained at approximately the same point in time, R² does not exceed 0.6-0.7. If all forecast errors are zero, i.e., when the relationship between the resultant and factor variables is functional, then R² = 1.

Adjusted coefficient of determination:

R²_adj = 1 − (1 − R²) · (n − 1) / (n − k − 1).

The need to introduce an adjusted coefficient of determination is explained by the fact that, as the number of factors k increases, the usual coefficient of determination almost always grows, while the number of degrees of freedom (n − k − 1) decreases. The adjustment always reduces the value of R², because (n − 1) > (n − k − 1). As a result, R²_adj may even become negative. This means that R² was close to zero before the adjustment and that the share of the variance of y explained by the regression equation is very small.

Of two regression model variants that differ in the adjusted coefficient of determination but have equally good other quality criteria, the variant with the larger adjusted coefficient of determination is preferable. The coefficient of determination is not adjusted if (n − k) : k > 20.

Fisher criterion:

F = [Σ(ŷᵢ − ȳ)² / k] / [Σ(yᵢ − ŷᵢ)² / (n − k − 1)].

This criterion is used to assess the significance of the coefficient of determination. The residual sum of squares Σ(yᵢ − ŷᵢ)² is a measure of the error of prediction by the regression of the known cost values yᵢ. Its comparison with the regression sum of squares shows how many times better the regression dependence predicts the result than the mean ȳ does. There is a table of critical values F_R of the Fisher criterion depending on the number of degrees of freedom of the numerator v₁ = k, of the denominator v₂ = n − k − 1, and on the significance level α. If the calculated value of the Fisher criterion F_R is greater than the tabular value, then the hypothesis that the coefficient of determination is insignificant (i.e., that the relationships embedded in the regression equation do not correspond to those actually existing) is rejected with probability p = 1 − α.

Average approximation error (average percentage deviation) is calculated as the average relative difference, expressed as a percentage, between the actual and calculated values of the resulting variable:

δ = (1/n) · Σ |(yᵢ − ŷᵢ) / yᵢ| × 100%.

The smaller the value of this indicator, the better the predictive quality of the model. When this indicator is no higher than 7%, the model is highly accurate; δ > 15% indicates unsatisfactory accuracy of the model.
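A sketch collecting the quality indicators described above into one function (names are ours, not from the source):

```python
# Quality indicators for a fitted regression model: estimation error S_e,
# coefficient of variation, R^2, adjusted R^2, Fisher criterion, delta.
import numpy as np

def quality(y, y_hat, k):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)           # unexplained error
    ss_tot = np.sum((y - y.mean()) ** 2)        # total variation
    s_e = np.sqrt(ss_res / (n - k - 1))         # standard error of estimation
    var = s_e / y.mean() * 100                  # coefficient of variation, %
    r2 = 1 - ss_res / ss_tot                    # coefficient of determination
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    F = (r2 / (1 - r2)) * (n - k - 1) / k       # Fisher criterion
    delta = np.mean(np.abs((y - y_hat) / y)) * 100  # avg approximation error, %
    return s_e, var, r2, r2_adj, F, delta
```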

Standard error of a regression coefficient:

S_aᵢ = S_e · √( (XᵀX)⁻¹ᵢᵢ ),

where (XᵀX)⁻¹ᵢᵢ is the i-th diagonal element of the matrix (XᵀX)⁻¹; k is the number of factors;

X is the matrix of factor variable values;

Xᵀ is the transposed matrix of factor variable values;

(XᵀX)⁻¹ is the matrix inverse to XᵀX.

The smaller these indicators are for each regression coefficient, the more reliable the estimate of the corresponding regression coefficient.

Student's test (t-statistic):

tᵢ = aᵢ / S_aᵢ.

This criterion makes it possible to measure the degree of reliability (significance) of the relationship determined by a given regression coefficient. If the calculated value tᵢ is greater than the tabular value t_α,v, where v = n − k − 1 is the number of degrees of freedom, then the hypothesis that this coefficient is statistically insignificant is rejected with probability (100 − α)%. There are special tables of the t-distribution that allow one, for a given significance level α and number of degrees of freedom v, to determine the critical value of the criterion. The most commonly used value of α is 5%.

Multicollinearity, i.e., the effect of mutual relationships between the factor variables, leads to the need to be content with a limited number of them. If this is not taken into account, one can end up with an illogical regression model. To avoid the negative effect of multicollinearity, before building a multiple regression model the pairwise correlation coefficients r_xixj between the selected variables xᵢ and xⱼ are calculated:

r_xixj = ( mean(xᵢxⱼ) − mean(xᵢ)·mean(xⱼ) ) / (σ_xi · σ_xj),

where mean(xᵢxⱼ) is the average value of the product of the two factor variables; mean(xᵢ)·mean(xⱼ) is the product of the average values of the two factor variables; σ²_xi is an estimate of the variance of the factor variable xᵢ.

Two variables are considered regression-related (i.e., collinear) if their pairwise correlation coefficient is strictly greater than 0.8 in absolute value. In this case, one of these variables must be excluded from consideration.
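A sketch of this multicollinearity screen: compute the pairwise correlation matrix of the factors and flag any pair whose coefficient exceeds 0.8 in absolute value (function name is ours):

```python
# Flag collinear factor pairs before building a multiple regression model.
import numpy as np

def collinear_pairs(factors, threshold=0.8):
    """factors: 2-D array-like, one column per factor variable."""
    R = np.corrcoef(np.asarray(factors, float), rowvar=False)
    k = R.shape[0]
    return [(i, j, R[i, j])
            for i in range(k) for j in range(i + 1, k)
            if abs(R[i, j]) > threshold]
```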

To expand the possibilities of economic analysis of the resulting regression models, average elasticity coefficients are used, determined by the formula:

Eᵢ = aᵢ · x̄ᵢ / ȳ,

where x̄ᵢ is the average value of the corresponding factor variable;

ȳ is the average value of the resulting variable; aᵢ is the regression coefficient for the corresponding factor variable.

The elasticity coefficient shows by what percentage, on average, the value of the resulting variable will change when the factor variable changes by 1%, i.e., how the resulting variable reacts to a change in the factor variable: for example, how the price of a square meter of apartment area reacts to the distance from the city center.

From the point of view of analyzing the significance of a particular regression coefficient, it is useful to estimate the partial coefficient of determination. It shows what percentage of the variation of the resulting variable is explained by the variation of the i-th factor variable included in the regression equation, relative to an estimate of the variance of the resulting variable.

  • Hedonic characteristics are understood as characteristics of an object that reflect its useful (valuable) properties from the point of view of buyers and sellers.

