
What are the requirements in a regression analysis model? Methods of mathematical statistics


The purpose of regression analysis is to measure the relationship between a dependent variable and one (paired regression analysis) or several (multiple regression analysis) independent variables. Independent variables are also called factor, explanatory, determining, regressor, or predictor variables.

The dependent variable is sometimes called the determined, explained, or "response" variable. The extremely widespread use of regression analysis in empirical research is not only due to the fact that it is a convenient tool for testing hypotheses; regression, especially multiple regression, is also an effective method of modeling and forecasting.

Let's start explaining the principles of regression analysis with the simpler of the two - the paired method.

Paired Regression Analysis

The first steps in regression analysis are almost identical to those we took when calculating the correlation coefficient. The three main conditions for the effectiveness of Pearson correlation analysis - normal distribution of the variables, interval measurement of the variables, and a linear relationship between the variables - are also relevant for multiple regression. Accordingly, at the first stage scatterplots are constructed, a statistical and descriptive analysis of the variables is carried out, and a regression line is calculated. As in correlation analysis, regression lines are fitted using the least squares method.

To illustrate the differences between the two methods of data analysis more clearly, let us turn to the example already discussed with the variables "SPS support" and "rural population share". The source data are identical. The difference in the scatterplots is that in regression analysis it is correct to plot the dependent variable - in our case "SPS support" - on the Y-axis, whereas in correlation analysis this does not matter. After removing outliers, the scatterplot looks like this:

The fundamental idea of regression analysis is that, having identified the general trend of the variables - in the form of a regression line - you can predict the value of the dependent variable from the values of the independent one.

Let's imagine the usual mathematical linear function. Any straight line in Euclidean space can be described by the formula:

Y = a + b*X,

where a is a constant that specifies the displacement along the ordinate axis; b is a coefficient that determines the angle of inclination of the line.

Knowing the slope and constant, you can calculate (predict) the value of y for any x.

This simplest function formed the basis of the regression analysis model, with the caveat that we will not predict the value of y exactly, but within a certain confidence interval, i.e. approximately.

The constant a is the point of intersection of the regression line with the Y-axis (the Y-intercept, usually labeled "intercept" in statistical packages). In our example with voting for the Union of Right Forces (SPS), its rounded value is 10.55. The slope coefficient b is approximately -0.1 (as in correlation analysis, the sign shows the type of relationship - direct or inverse). Thus, the resulting model has the form: SPS support = -0.1 × rural population share + 10.55.

For example, for the Republic of Adygea, where the rural population share is 47: SPS support = -0.10 × 47 + 10.55 = 5.63 (the value 5.63 is obtained with the unrounded coefficient estimates).

The difference between the original and predicted values is called the residual (we have already encountered this term, which is fundamental for statistics, when analyzing contingency tables). Thus, for the Republic of Adygea the residual equals 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the less accurate the prediction.

We calculate the predicted values and residuals for all cases:

Case                         Rural pop.   SPS (original)   SPS (predicted)   Residual
Republic of Adygea           47           3.92             5.63              -1.71
Altai Republic               76           5.4              2.59              2.81
Republic of Bashkortostan    36           6.04             6.78              -0.74
The Republic of Buryatia     41           8.36             6.25              2.11
The Republic of Dagestan     59           1.22             4.37              -3.15
The Republic of Ingushetia   59           0.38             4.37              -3.99
Etc.
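To make the calculation above concrete, here is a minimal Python sketch that reproduces the predicted values and residuals using the rounded model SPS support = -0.1 × rural population share + 10.55 (the case names and figures come from the table above; since the coefficients are rounded, the results differ slightly from the values computed with the exact estimates):

# Reproduce predicted values and residuals for the paired regression
# SPS_support = a + b * rural_share, using the rounded estimates from the text.
a, b = 10.55, -0.1  # intercept and slope (rounded values reported above)

cases = {  # case name: (rural population share, observed SPS support)
    "Republic of Adygea": (47, 3.92),
    "Altai Republic": (76, 5.40),
    "Republic of Bashkortostan": (36, 6.04),
}

for name, (rural, observed) in cases.items():
    predicted = a + b * rural          # value on the regression line
    residual = observed - predicted    # observed minus predicted
    print(f"{name}: predicted = {predicted:.2f}, residual = {residual:.2f}")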

Analysis of the correspondence between the original and predicted values serves to assess the quality of the resulting model and its predictive ability. One of the main indicators of regression statistics is the multiple correlation coefficient R - the correlation coefficient between the original and predicted values of the dependent variable. In paired regression analysis it equals the ordinary Pearson correlation coefficient between the dependent and independent variables, in our case 0.63. To interpret multiple R meaningfully, it must be converted into the coefficient of determination. This is done in the same way as in correlation analysis - by squaring. The coefficient of determination R-squared (R²) shows the proportion of variation in the dependent variable that is explained by the independent variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable "rural population share" explains approximately 40% of the variation in the variable "SPS support". The larger the coefficient of determination, the higher the quality of the model.

Another indicator of model quality is the standard error of estimate. It is a measure of how widely the points are "scattered" around the regression line. The measure of spread for interval variables is the standard deviation; accordingly, the standard error of estimate is the standard deviation of the distribution of residuals. The higher its value, the greater the scatter and the worse the model. In our case the standard error is 2.18: by this amount, our model "errs on average" when predicting the value of the "SPS support" variable.
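The three quality measures just described can be computed directly from the observed and predicted values; a minimal sketch (the arrays below are an illustrative fragment, not the full dataset of the example):

import numpy as np

y_obs = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])   # observed values (illustrative)
y_pred = np.array([5.63, 2.59, 6.78, 6.25, 4.37, 4.37])  # predicted values (illustrative)

residuals = y_obs - y_pred
multiple_r = np.corrcoef(y_obs, y_pred)[0, 1]   # correlation of observed vs. predicted
r_squared = multiple_r ** 2                     # coefficient of determination
# Standard error of estimate: sqrt(residual sum of squares / (n - 2)),
# since two parameters (the constant a and the slope b) are estimated.
std_error = np.sqrt(np.sum(residuals ** 2) / (len(y_obs) - 2))

print(f"Multiple R = {multiple_r:.2f}, R^2 = {r_squared:.2f}, standard error = {std_error:.2f}")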

Regression statistics also include analysis of variance. With its help we find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable is accounted for by the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). Variance statistics are especially important for sample studies: they show how likely it is that a relationship between the independent and dependent variables exists in the population. However, even for complete (non-sample) studies, as in our example, examining the results of the analysis of variance is not without value. In this case, we check whether the identified statistical pattern is caused by a coincidence of random circumstances and how typical it is for the set of conditions in which the population under study finds itself; that is, what is established is not the truth of the result for some larger general population, but the degree of its regularity and freedom from random influences.

In our case, the ANOVA statistics are as follows:

             SS        df    MS        F        Significance F
Regression   258.77    1     258.77    54.29    0.000000001
Residual     395.59    83    4.77
Total        654.36

The F-ratio of 54.29 is significant at the 0.0000000001 level. Accordingly, we can confidently reject the null hypothesis (that the relationship we discovered is due to chance).

The t criterion performs a similar function, but in relation to the regression coefficients (the slope and the Y-intercept). Using the t criterion, we test the hypothesis that in the general population the regression coefficients are equal to zero. In our case, we can again confidently reject the null hypothesis.
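For reference, a paired regression together with these significance checks can be obtained in a single call with SciPy; a sketch on illustrative data (linregress reports the slope, the intercept, the correlation, the p-value of the t test for the null hypothesis of a zero slope, and the standard error of the slope):

import numpy as np
from scipy import stats

# Illustrative data: rural population share (x) and SPS support (y) for a few regions.
x = np.array([47, 76, 36, 41, 59, 59], dtype=float)
y = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])

res = stats.linregress(x, y)
print(f"slope b = {res.slope:.3f}, intercept a = {res.intercept:.3f}")
print(f"Pearson r = {res.rvalue:.3f}, R^2 = {res.rvalue ** 2:.3f}")
# p-value of the t test of the null hypothesis "the slope equals zero in the population"
print(f"p-value (slope) = {res.pvalue:.4f}, standard error of slope = {res.stderr:.3f}")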

Multiple regression analysis

The multiple regression model is almost identical to the paired regression model; the only difference is that several independent variables are included in the linear function:

Y = b1X1 + b2X2 + …+ bpXp + a.

If there are more than two independent variables, we cannot get a visual picture of their relationship; in this respect, multiple regression is less "visual" than paired regression. When there are two independent variables, it can be useful to display the data in a three-dimensional scatterplot. Professional statistical packages (for example, Statistica) offer the option of rotating the three-dimensional chart, which helps to visualize the structure of the data.

When working with multiple regression, unlike paired regression, it is necessary to choose an analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The stepwise algorithm involves the sequential inclusion (or exclusion) of independent variables based on their explanatory "weight". The stepwise method is good when there are many independent variables: it "cleans" the model of frankly weak predictors, making it more compact and concise.

An additional condition for the correctness of multiple regression (along with interval measurement, normality and linearity) is the absence of multicollinearity, i.e. of strong correlations between the independent variables.

The interpretation of multiple regression statistics includes all the elements we considered for the case of pairwise regression. In addition, there are other important components to the statistics of multiple regression analysis.

We will illustrate the work with multiple regression using the example of testing hypotheses that explain differences in the level of electoral activity across Russian regions. Specific empirical studies have suggested that voter turnout levels are influenced by:

National factor (variable “Russian population”; operationalized as the share of the Russian population in the constituent entities of the Russian Federation). It is assumed that an increase in the share of the Russian population leads to a decrease in voter turnout;

The urbanization factor (variable "urban population", operationalized as the share of the urban population in the constituent entities of the Russian Federation; we have already worked with this factor in the correlation analysis). It is assumed that an increase in the share of the urban population also leads to a decrease in voter turnout.

The dependent variable, "intensity of electoral activity" ("activity"), is operationalized through average turnout data by region in federal elections from 1995 to 2003. The initial data table for the two independent variables and one dependent variable is as follows:

Case                          Activity   Urban pop.   Russian pop.
Republic of Adygea            64.92      53           68
Altai Republic                68.60      24           60
The Republic of Buryatia      60.75      59           70
The Republic of Dagestan      79.92      41           9
The Republic of Ingushetia    75.05      41           23
Republic of Kalmykia          68.52      39           37
Karachay-Cherkess Republic    66.68      44           42
Republic of Karelia           61.70      73           73
Komi Republic                 59.60      74           57
Mari El Republic              65.19      62           47

Etc. (after removing outliers, 83 of the 88 cases remain)
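A minimal sketch of such a multiple regression fit with statsmodels, using the ten rows listed above as an illustrative subset (the coefficients obtained from this fragment will, of course, differ from those reported below for the full 83-case dataset):

import numpy as np
import statsmodels.api as sm

# Illustrative subset of the data table above: turnout, urban pop. share, Russian pop. share.
activity = np.array([64.92, 68.60, 60.75, 79.92, 75.05, 68.52, 66.68, 61.70, 59.60, 65.19])
urban = np.array([53, 24, 59, 41, 41, 39, 44, 73, 74, 62], dtype=float)
russian = np.array([68, 60, 70, 9, 23, 37, 42, 73, 57, 47], dtype=float)

X = sm.add_constant(np.column_stack([urban, russian]))  # adds the intercept column
model = sm.OLS(activity, X).fit()

print(model.params)     # constant, b(urban), b(russian)
print(model.rsquared)   # coefficient of determination R^2
print(model.summary())  # full report: t statistics, F ratio, standard errors, etc.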

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-square = 0.38. Consequently, the national factor and the urbanization factor together explain about 38% of the variation in the "electoral activity" variable.

2. The standard error of estimate is 3.38. This is how much the constructed model "errs on average" when predicting the level of turnout.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis that the identified relationships are random is rejected.

4. The t criterion for the constant and for the regression coefficients of the variables "urban population" and "Russian population" is significant at the levels of 0.0000001, 0.00005 and 0.007, respectively. The null hypothesis that the coefficients are random is rejected.

Additional statistics useful for analyzing the relationship between the original and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of a case (it shows how much the combination of values of all the independent variables for a given case deviates from the means of all the independent variables simultaneously). The second is a measure of the influence of a case. Different observations affect the slope of the regression line differently, and Cook's distance allows them to be compared on this indicator. This can be useful when cleaning out outliers (an outlier can be thought of as an overly influential case).

In our example, unique and influential cases include Dagestan.

Case                         Original value   Predicted value   Residual   Mahalanobis distance   Cook's distance
Adygea                       64.92            66.33             -1.40      0.69                   0.00
Altai Republic               68.60            69.91             -1.31      6.80                   0.01
The Republic of Buryatia     60.75            65.56             -4.81      0.23                   0.01
The Republic of Dagestan     79.92            71.01             8.91       10.57                  0.44
The Republic of Ingushetia   75.05            70.21             4.84       6.73                   0.08
Republic of Kalmykia         68.52            69.59             -1.07      4.20                   0.00

The regression model itself has the following parameters: Y-intercept (constant) = 75.99; b (urban pop.) = -0.1; b (Russian pop.) = -0.06. The final formula is therefore: Activity = 75.99 - 0.1 × urban pop. - 0.06 × Russian pop.
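Continuing the statsmodels sketch above, both influence measures can be computed as follows (Cook's distance comes directly from the influence object; the Mahalanobis distance is computed here by hand from the predictor matrix - note that packages may report either the distance or its square):

import numpy as np

# Reuses `model`, `urban` and `russian` from the previous sketch.
influence = model.get_influence()
cooks_d, _ = influence.cooks_distance   # Cook's distance for every observation

# Squared Mahalanobis distance of each case from the centroid of the predictors.
Z = np.column_stack([urban, russian])
diff = Z - Z.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(Z, rowvar=False))
mahalanobis_sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

for i, (m, c) in enumerate(zip(mahalanobis_sq, cooks_d)):
    print(f"case {i}: Mahalanobis^2 = {m:.2f}, Cook's D = {c:.2f}")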

Characteristics of causal dependencies

Cause-and-effect relationships are connections between phenomena and processes in which a change in one of them - the cause - leads to a change in the other - the effect.

According to their significance for studying the relationship, characteristics are divided into two classes.

Characteristics that cause changes in other, related characteristics are called factor characteristics (or factors).

Characteristics that change under the influence of factor characteristics are called resultant characteristics.

Two forms of relationship are distinguished: functional and stochastic. A functional relationship is one in which a given value of the factor characteristic corresponds to one and only one value of the resultant characteristic. A functional relationship manifests itself in all cases of observation and for every specific unit of the population under study.

The functional relationship can be represented by the following equation:
y i = f(x i), where y i is the resultant characteristic; f(x i) is a known function of the relationship between the resultant and factor characteristics; x i is the factor characteristic.
In real nature there are no functional connections. They are only abstractions, useful in analyzing phenomena, but simplifying reality.

A stochastic (statistical, or random) relationship is a relationship between quantities in which one of them reacts to a change in another quantity or quantities by a change in its distribution law. In other words, under such a relationship, different values of one variable correspond to different distributions of the other variable. This is because the dependent variable, in addition to the independent variables under consideration, is influenced by a number of unaccounted-for or uncontrolled random factors, as well as by inevitable errors in the measurement of the variables. Since the values of the dependent variable are subject to random scatter, they cannot be predicted with complete accuracy but can only be indicated with a certain probability.

Because of the ambiguity of the stochastic dependence between Y and X, what is of particular interest is the dependence averaged over x, i.e. the pattern in the change of the average value - the conditional mathematical expectation Mx(Y) (the mathematical expectation of the random variable Y, given that the variable X takes the value x) - as a function of x.

A special case of the stochastic relationship is the correlation relationship. Correlation (from the Latin correlatio - interrelation) is, by direct definition, a stochastic, probable, possible relationship between two (paired) or several (multiple) random variables.

A correlation dependence between two variables is a statistical relationship in which each value of one variable corresponds to a certain average value - a conditional mathematical expectation - of the other. Correlation dependence is a special case of stochastic dependence, in which a change in the values of the factor characteristics (x 1, x 2, ..., x n) entails a change in the average value of the resultant characteristic.



It is customary to distinguish the following types of correlation:

1. Pair correlation - a relationship between two characteristics (resultant and factor, or two factor characteristics).

2. Partial correlation - the dependence between the resultant characteristic and one factor characteristic with the values of the other factor characteristics included in the study held fixed.

3. Multiple correlation - the dependence of the resultant characteristic on two or more factor characteristics included in the study.

Purpose of Regression Analysis

Regression models are the analytical form of representing cause-and-effect relationships. The scientific validity and popularity of regression analysis make it one of the main mathematical tools for modeling the phenomenon under study. This method is used to smooth experimental data and to obtain quantitative estimates of the comparative influence of various factors on the outcome variable.

Regression analysis consists in determining the analytical expression of a relationship in which a change in one quantity (the dependent variable, or resultant characteristic) is due to the influence of one or more independent quantities (factors, or predictors), while the set of all other factors that also influence the dependent quantity is treated as constant, at its average values.

Goals of regression analysis:

Assessment of the functional dependence of the conditional average value of the resultant characteristic y on the factor characteristics (x 1, x 2, ..., x n);

Predicting the value of a dependent variable using the independent variable(s).

Determining the contribution of individual independent variables to the variation of the dependent variable.

Regression analysis cannot be used to determine whether there is a relationship between variables, since the presence of such a relationship is a prerequisite for applying the analysis.

In regression analysis, it is assumed in advance that there are cause-and-effect relationships between the resultant characteristic (Y) and the factor characteristics x 1, x 2, ..., x n.

The function describing the dependence of the indicator on the parameters is called the regression equation (regression function). The regression equation shows the expected value of the dependent variable given particular values of the independent variables.
Depending on the number of factors X included in the model, models are divided into one-factor models (paired regression models) and multi-factor models (multiple regression models). Depending on the type of the function f, models are divided into linear and nonlinear.

Paired regression model

Due to the influence of unaccounted-for random factors and causes, individual observations of y will deviate to a greater or lesser extent from the regression function f(x). In this case, the equation of the relationship between the two variables (the paired regression model) can be written as:

Y=f(X) + ɛ,

where ɛ is a random variable characterizing the deviation from the regression function, called the disturbance (also the residual or error). Thus, in the regression model the dependent variable Y is some function f(X) up to a random disturbance ɛ.

Let's consider the classical linear paired regression model (CLPRM). It has the form

y i =β 0 +β 1 x i +ɛ i (i=1,2, …, n),(1)

where y i is the explained (resultant, dependent, endogenous) variable; x i is the explanatory (predictor, factor, exogenous) variable; β 0, β 1 are numerical coefficients; ɛ i is the random (stochastic) component, or error.

Basic conditions (prerequisites, hypotheses) of the CLPRM:

1) x i is a deterministic (non-random) quantity, and it is assumed that not all of the values x i are the same.

2) The expected value (mean) of the disturbance ɛ i is zero:

M[ɛ i ] = 0 (i = 1, 2, …, n).

3) The variance of the disturbance is constant for all i (homoscedasticity condition):

D[ɛ i ]=σ 2 (i=1,2, …, n).

4) Disturbances for different observations are uncorrelated:

cov[ɛ i , ɛ j ] = M[ɛ i ɛ j ] = 0 for i ≠ j,

where cov[ɛ i , ɛ j ] is the covariance (correlation moment).

5) The disturbances are normally distributed random variables with zero mean and variance σ 2:

ɛ i ~ N(0, σ 2).

To obtain the regression equation, the first four premises are sufficient. The fifth premise is required in order to assess the accuracy of the regression equation and of its parameters.
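A small simulation sketch of this model: data are generated so as to satisfy premises 1-5 (fixed x values, independent normal disturbances with zero mean and constant variance), and a least squares line is then fitted; the "true" β values are arbitrary illustrations:

import numpy as np

rng = np.random.default_rng(0)

# Premise 1: deterministic, non-identical x values.
x = np.linspace(1, 10, 50)
beta0, beta1, sigma = 2.0, 0.5, 1.0   # illustrative "true" parameters

# Premises 2-5: independent normal disturbances with zero mean and constant variance.
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = beta0 + beta1 * x + eps           # y_i = beta0 + beta1*x_i + eps_i

# Least squares estimates a0, a1 (np.polyfit returns the highest degree first).
a1, a0 = np.polyfit(x, y, deg=1)
print(f"true: beta0 = {beta0}, beta1 = {beta1};  estimated: a0 = {a0:.3f}, a1 = {a1:.3f}")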

Comment: The focus on linear relationships is explained by the limited variation of the variables and by the fact that in most cases nonlinear forms of relationship are converted into a linear form (by taking logarithms or by substituting variables) in order to perform the calculations.
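For example, a power-type relationship y = c·x^b can be reduced to a linear one by taking logarithms: ln y = ln c + b·ln x. A sketch of this substitution on made-up data:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 20, 40)
y = 3.0 * x ** 0.7 * np.exp(rng.normal(0, 0.05, x.size))  # noisy power-law data (illustrative)

# Substitution of variables: fit ln(y) = ln(c) + b*ln(x) by ordinary least squares.
b, ln_c = np.polyfit(np.log(x), np.log(y), deg=1)
print(f"estimated exponent b = {b:.3f}, estimated constant c = {np.exp(ln_c):.3f}")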

Traditional least squares method (LS)

The model estimate from the sample is the equation

ŷ i = a 0 + a 1 x i(i=1,2, …, n), (2)

where ŷ i – theoretical (approximating) values ​​of the dependent variable obtained from the regression equation; a 0 , a 1 - coefficients (parameters) of the regression equation (sample estimates of the coefficients β 0, β 1, respectively).

According to the least squares method, the unknown parameters a 0, a 1 are chosen so that the sum of squared deviations of the values ŷ i from the empirical values y i (the residual sum of squares) is minimal:

Q e =∑e i 2 = ∑(y i – ŷ i) 2 = ∑(yi – (a 0 + a 1 x i)) 2 → min, (3)

where e i = y i - ŷ i – sample estimate of disturbance ɛ i, or regression residual.

The problem comes down to finding the values of the parameters a 0 and a 1 at which the function Q e takes its smallest value. Note that the function Q e = Q e (a 0, a 1) is a function of the two variables a 0 and a 1 until we have found and fixed their "best" (in the sense of the least squares method) values, while x i, y i are constant numbers found experimentally.

The necessary conditions for an extremum of (3) are obtained by setting the partial derivatives of this function of two variables equal to zero. As a result, we obtain a system of two linear equations, which is called the system of normal equations:

n·a 0 + a 1 ∑x i = ∑y i,
a 0 ∑x i + a 1 ∑x i 2 = ∑x i y i. (4)

The coefficient a 1 is the sample regression coefficient of y on x; it shows by how many units, on average, the variable y changes when the variable x changes by one unit of its measurement, i.e. the variation in y per unit of variation in x. The sign of a 1 indicates the direction of this change. The coefficient a 0 is the shift; according to (2) it equals the value ŷ i at x = 0 and may have no meaningful interpretation.
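Solving system (4) gives the usual closed-form estimates a 1 = ∑(x i − x̄)(y i − ȳ) / ∑(x i − x̄) 2 and a 0 = ȳ − a 1 x̄. A short sketch that computes them directly and cross-checks the result with np.polyfit (the data are made up):

import numpy as np

# Made-up sample data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.9, 5.3, 7.4, 9.8, 12.1, 14.2])

x_mean, y_mean = x.mean(), y.mean()
a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
a0 = y_mean - a1 * x_mean                                             # shift (intercept)

# Cross-check with the library routine (polyfit minimizes the same sum of squares).
a1_check, a0_check = np.polyfit(x, y, deg=1)
print(a0, a1, "vs", a0_check, a1_check)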

Statistical properties of regression coefficient estimates:

The coefficient estimates a 0 , a 1 are unbiased;

The variances of estimates a 0 , a 1 decrease (the accuracy of estimates increases) with increasing sample size n;

The variance of the estimate of the slope a 1 decreases as the spread of the values x i around their mean increases; therefore, it is advisable to choose the x i so that their scatter around the average value is large;

For x̄ > 0 (which is of greatest interest), there is a negative statistical relationship between a 0 and a 1 (an increase in a 1 leads to a decrease in a 0).

The main feature of regression analysis: with its help, you can obtain specific information about what form and nature the relationship between the variables under study has.

Sequence of stages of regression analysis

Let us briefly consider the stages of regression analysis.

    Problem formulation. At this stage, preliminary hypotheses about the dependence of the phenomena under study are formed.

    Definition of dependent and independent (explanatory) variables.

    Collection of statistical data. Data must be collected for each of the variables included in the regression model.

    Formulation of a hypothesis about the form of connection (simple or multiple, linear or nonlinear).

    Determination of the regression function (consists in calculating the numerical values of the parameters of the regression equation).

    Assessing the accuracy of regression analysis.

    Interpretation of the results obtained. The obtained results of regression analysis are compared with preliminary hypotheses. The correctness and credibility of the results obtained are assessed.

    Prediction of unknown values of the dependent variable.

Regression analysis can be used to solve forecasting and classification problems. Predicted values are calculated by substituting the values of the explanatory variables into the regression equation. The classification problem is solved as follows: the regression line divides the entire set of objects into two classes; the part of the set where the function value is greater than zero belongs to one class, and the part where it is less than zero belongs to the other.

Regression Analysis Problems

Let's consider the main tasks of regression analysis: establishing the form of the dependence, determining the regression function, and estimating unknown values of the dependent variable.

Establishing the form of dependence.

The nature and form of the relationship between variables can form the following types of regression:

    positive linear regression (expressed in uniform growth of the function);

    positive uniformly accelerated increasing regression;

    positive uniformly decelerated increasing regression;

    negative linear regression (expressed as a uniform decline of the function);

    negative uniformly accelerated decreasing regression;

    negative uniformly decelerated decreasing regression.

However, the varieties described are usually found not in pure form but in combination with one another; in this case we speak of combined forms of regression.

Definition of the regression function.

The second task comes down to identifying the effect of the main factors or causes on the dependent variable, other things being equal and with the influence of random elements on the dependent variable excluded. The regression function is defined in the form of a mathematical equation of one type or another.

Estimation of unknown values ​​of the dependent variable.

The solution to this problem comes down to solving a problem of one of the following types:

    Estimation of the values ​​of the dependent variable within the considered interval of the initial data, i.e. missing values; in this case, the interpolation problem is solved.

    Estimation of future values ​​of the dependent variable, i.e. finding values ​​outside the specified interval of the source data; in this case, the problem of extrapolation is solved.

Both problems are solved by substituting the found parameter estimates for the values ​​of independent variables into the regression equation. The result of solving the equation is an estimate of the value of the target (dependent) variable.

Let's look at some of the assumptions that regression analysis relies on.

The linearity assumption: the relationship between the variables under consideration is assumed to be linear. In our example we plotted a scatterplot and were able to see a clear linear relationship. If the scatterplot of the variables shows a clear absence of a linear relationship, i.e. a nonlinear relationship is present, nonlinear methods of analysis should be used.

The assumption of normality of residuals: it is assumed that the distribution of the differences between the predicted and observed values is normal. To assess the nature of this distribution visually, histograms of the residuals can be used.
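A short sketch of both visual checks with matplotlib (the x and y arrays below are placeholders for the variables being analyzed; resid holds the regression residuals):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: replace with the analyzed variables and the fitted residuals.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 80)
y = 2.7 + 2.3 * x + rng.normal(0, 0.4, x.size)
a1, a0 = np.polyfit(x, y, deg=1)
resid = y - (a0 + a1 * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.scatter(x, y, s=12)                 # linearity check: scatterplot of x against y
ax1.plot(x, a0 + a1 * x, color="red")   # fitted regression line
ax1.set_title("Scatterplot and regression line")
ax2.hist(resid, bins=15)                # normality check: histogram of residuals
ax2.set_title("Histogram of residuals")
plt.tight_layout()
plt.show()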

When using regression analysis, its main limitation should be kept in mind: it allows us to detect only dependencies, not the connections underlying those dependencies.

Regression analysis allows you to estimate the strength of the relationship between variables by calculating the estimated value of a variable based on several known values.

Regression equation.

The regression equation looks like this: Y=a+b*X

Using this equation, the variable Y is expressed in terms of the constant a and the slope of the line b multiplied by the value of the variable X. The constant a is also called the intercept, and the slope is called the regression coefficient, or b-coefficient.

In most cases (if not always) there is a certain scatter of observations relative to the regression line.

A residual is the deviation of an individual point (observation) from the regression line (from the predicted value).

To solve a regression analysis problem in MS Excel, choose "Data Analysis" from the Tools menu (the Analysis ToolPak add-in) and select the "Regression" tool. Then specify the input ranges X and Y. The input range Y is the range of the dependent data being analyzed; it must consist of a single column. The input range X is the range of the independent data to be analyzed. The number of input ranges must not exceed 16.

At the output of the procedure, in the output range, we obtain the report given in tables 8.3a-8.3c.

SUMMARY OUTPUT

Table 8.3a. Regression statistics

Regression statistics
Multiple R           0.998364
R-square             0.99673
Adjusted R-square    0.996321
Standard error       0.42405
Observations         10

Let's first look at the top part of the calculations, presented in table 8.3a - the regression statistics.

The R-square value, also called the measure of certainty, characterizes the quality of the resulting regression line, i.e. the degree of correspondence between the source data and the regression model (the calculated data). The measure of certainty always lies within the interval [0; 1].

In most cases, the R-square value falls strictly between these extreme values, i.e. between zero and one.

If the R-square value is close to one, this means that the constructed model explains almost all of the variability of the variables in question. Conversely, an R-square value close to zero means that the quality of the constructed model is poor.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R - the multiple correlation coefficient R - expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination; this quantity takes values in the range from zero to one.

In simple linear regression analysis, multiple R is equal to the Pearson correlation coefficient. Indeed, multiple R in our case equals the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients*

                Coefficients    Standard error    t-statistic
Y-intercept     2.694545455     0.33176878        8.121757129
Variable X 1    2.305454545     0.04668634        49.38177965

* A truncated version of the calculations is provided

Now let's consider the middle part of the calculations, presented in table 8.3b. Here the regression coefficient b (2.305454545) and the shift along the ordinate axis, i.e. the constant a (2.694545455), are given.

Based on the calculations, we can write the regression equation as follows:

Y = 2.305454545*X + 2.694545455

The direction of the relationship between variables is determined based on the signs (negative or positive) of the regression coefficients (coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive. In our case, the sign of the regression coefficient is positive, therefore, the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the residual output. For these results to appear in the report, you must activate the "Residuals" checkbox when running the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals

Observation    Predicted Y     Residuals       Standard residuals
1              9.610909091     -0.610909091    -1.528044662
2              7.305454545     -0.305454545    -0.764022331
3              11.91636364     0.083636364     0.209196591
4              14.22181818     0.778181818     1.946437843
5              16.52727273     0.472727273     1.182415512
6              18.83272727     0.167272727     0.418393181
7              21.13818182     -0.138181818    -0.34562915
8              23.44363636     -0.043636364    -0.109146047
9              25.74909091     -0.149090909    -0.372915662
10             28.05454545     -0.254545455    -0.636685276

Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest residual in absolute value in our case is 0.778, the smallest 0.043. To interpret these data better, we use the plot of the original data and the constructed regression line presented in Fig. 8.3. As you can see, the regression line fits the values of the original data quite accurately.

It should be kept in mind that the example under consideration is quite simple, and it is by no means always possible to construct a linear regression line of such quality.

Fig. 8.3. Source data and regression line

One problem remains unconsidered: estimating unknown future values of the dependent variable from known values of the independent variable, i.e. the forecasting problem.

Given the regression equation, the forecasting problem reduces to evaluating the equation Y = 2.305454545*X + 2.694545455 for known values of x. The results of forecasting the dependent variable Y six steps ahead are presented in table 8.4.

Table 8.4. Forecast results for the variable Y

Y (predicted)
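Since neither the forecast values nor the future x values are reproduced in this copy of the report, here is only a sketch of how such a six-step forecast is computed from the fitted equation (the x_future values are hypothetical):

# Forecast with the fitted paired regression Y = 2.305454545*X + 2.694545455.
a, b = 2.694545455, 2.305454545

x_future = [12, 13, 14, 15, 16, 17]   # hypothetical future values of the independent variable
for x in x_future:
    print(f"x = {x}: predicted Y = {a + b * x:.3f}")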

Thus, as a result of using regression analysis in Microsoft Excel, we:

    built a regression equation;

    established the form of dependence and direction of connection between variables - positive linear regression, which is expressed in uniform growth of the function;

    established the direction of the relationship between the variables;

    assessed the quality of the resulting regression line;

    were able to see deviations of the calculated data from the data of the original set;

    predicted future values ​​of the dependent variable.

If the regression function is defined, interpreted and justified, and the assessment of the accuracy of the regression analysis meets the requirements, then the constructed model and the predicted values can be considered sufficiently reliable.

The predicted values ​​obtained in this way are the average values ​​that can be expected.

In this work we reviewed the main characteristics of descriptive statistics, including such concepts as the mean, median, maximum, minimum and other characteristics of data variation.

The concept of outliers was also briefly discussed. The characteristics considered relate to so-called exploratory data analysis; its conclusions may apply not to the general population but only to the data sample. Exploratory data analysis is used to obtain preliminary conclusions and to form hypotheses about the population.

The basics of correlation and regression analysis, their tasks and possibilities for practical use were also discussed.

The regression analysis method is used, for products belonging to a specific parametric series, to construct and align price relationships based on their technical and economic parameters. This method is used to analyze and justify the level and ratios of prices for products characterized by one or more technical and economic parameters that reflect the main consumer properties. Regression analysis allows us to find an empirical formula describing the dependence of price on the technical and economic parameters of the products:

P = f(X1, X2, ..., Xn),

where P is the unit price of the product, in rubles; X1, X2, ..., Xn are the technical and economic parameters of the products.

The regression analysis method - the most advanced of the normative-parametric methods in use - is effective when the calculations are carried out using modern information technologies and systems. Its application includes the following main steps:

  • determination of classification parametric groups of products;
  • selection of parameters that most influence the price of the product;
  • selection and justification of the form of connection between price changes when parameters change;
  • construction of a system of normal equations and calculation of regression coefficients.

The basic classification group of products whose prices are subject to equalization is the parametric series; within it, products can be grouped into different designs depending on their application, operating conditions, requirements, etc. When forming parametric series, automatic classification methods can be used, which make it possible to single out homogeneous groups of products. The selection of technical and economic parameters is made on the basis of the following basic requirements:

  • the selected parameters include the parameters recorded in the standards and technical conditions; in addition to technical parameters (power, load capacity, speed, etc.), indicators of product serialization, complexity coefficients, unification, etc. are used;
  • the set of selected parameters should sufficiently fully characterize the design, technological and operational properties of the products included in the series, and have a fairly close correlation with price;
  • parameters should not be interdependent.

To select the technical and economic parameters that significantly affect the price, a matrix of pairwise correlation coefficients is calculated. The magnitude of the correlation coefficients between the parameters indicates the closeness of their relationship; a correlation close to zero shows an insignificant influence of the parameter on the price. The final selection of technical and economic parameters is carried out in the course of stepwise regression analysis using computers and the corresponding standard programs.

In pricing practice, the following set of functions is used:

linear

P = a0 + a1X1 + ... + anXn;

linear-power

P = a0 + a1X1 + ... + anXn + an+1X1² + ... + a2nXn²;

inverse logarithmic

P = a0 + a1/ln X1 + ... + an/ln Xn;

power

P = a0·(X1^a1)·(X2^a2)·...·(Xn^an);

exponential

P = e^(a0 + a1X1 + ... + anXn);

hyperbolic

P = a0 + a1/X1 + a2/X2 + ... + an/Xn,

where P is the equalized price; X1, X2, ..., Xn are the values of the technical and economic parameters of the products in the series; a0, a1, ..., an are the estimated coefficients of the regression equation.

In practical pricing work, depending on the form of the relationship between prices and technical and economic parameters, other regression equations can also be used. The type of the function linking price to the set of technical and economic parameters can be specified in advance or selected automatically during computer processing. The closeness of the correlation between the price and the set of parameters is assessed by the value of the multiple correlation coefficient; its proximity to one indicates a close relationship. Using the regression equation, equalized (calculated) price values for the products of a given parametric series are obtained. To evaluate the results of the equalization, the relative deviations of the calculated price values from the actual ones are computed:

CR = (Pf - Pr) / Pf × 100,

where Pf and Pr are the actual and calculated prices, respectively.

The value of CR should not exceed 8-10%. In case of significant deviations of calculated values ​​from actual ones, it is necessary to investigate:

  • the correctness of the formation of the parametric series: it may contain products that differ sharply in their parameters from the other products in the series; such products must be excluded;
  • the correctness of the selection of technical and economic parameters: a set of parameters may have been chosen that is weakly correlated with price; in this case, the search for and selection of parameters must continue.

The procedure and methodology for conducting the regression analysis, finding the unknown parameters of the equation and economically evaluating the results obtained follow the requirements of mathematical statistics.
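As a rough illustration of the workflow described above, the sketch below fits a linear price model to made-up parametric-series data with numpy and computes the relative deviations CR of the actual prices from the calculated ones (all figures are hypothetical):

import numpy as np

# Made-up parametric series: two technical-economic parameters and the actual prices.
X = np.array([[10.0, 1.2], [12.0, 1.5], [15.0, 1.4], [18.0, 2.0], [22.0, 2.3], [25.0, 2.1]])
P_actual = np.array([105.0, 122.0, 140.0, 171.0, 205.0, 218.0])

# Linear form P = a0 + a1*X1 + a2*X2, estimated by least squares.
A = np.column_stack([np.ones(len(P_actual)), X])
coef, *_ = np.linalg.lstsq(A, P_actual, rcond=None)
P_calc = A @ coef                              # equalized (calculated) prices

CR = (P_actual - P_calc) / P_actual * 100      # relative deviation, %
print("coefficients a0, a1, a2:", np.round(coef, 3))
print("CR, %:", np.round(CR, 2))               # should stay within roughly 8-10%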


