
Examples of solving multiple regression problems. Introduction to Multiple Regression

The purpose of multiple regression is to analyze the relationship between one dependent and several independent variables.

Example: There is data on the cost of one workstation (when purchasing 50 workstations) for various PDM systems. Required: evaluate the relationship between the price of a PDM system workstation and the number of characteristics implemented in it, given in Table 2.

Table 2 - Characteristics of PDM systems

Order number PDM system Price Product Configuration Management Product models Teamwork Product change management Document flow Archives Search documents Project planning Product manufacturing management
iMAN Yes Yes
PartYPlus Yes Yes
PDM STEP Suite Yes Yes
Search Yes Yes
Windchill Yes Yes
Compass Manager Yes Yes
T-Flex Docs Yes Yes
TechnoPro No No

The numerical value of each characteristic (except for “Cost”, “Product models” and “Teamwork”) is the number of requirements of that characteristic implemented in the system.

Let's create and fill out a spreadsheet with the initial data (Figure 27).

The value “1” of the variables “Product models” and “Teamwork” corresponds to “Yes” in the source data, and the value “0” corresponds to “No”.

Let's build a regression between the dependent variable “Cost” and the independent variables “Configuration management”, “Product models”, “Teamwork”, “Change management”, “Document flow”, “Archives”, “Search”, “Planning” and “Manufacturing management”.

To start statistical analysis of the source data, call the “Multiple Regression” module (Figure 22).

In the dialog box that appears (Figure 23), indicate the variables for which statistical analysis will be performed.

Figure 27 - Initial data

To do this, click the “Variables” button and, in the dialog box that appears (Figure 28), select “1-Cost” in the dependent-variable part (Dependent var.) and all other variables in the independent-variable part (Independent variable list). Several variables can be selected from the list using the “Ctrl” or “Shift” keys, or by typing the variable numbers (or a range of numbers) in the corresponding field.



Figure 28 - Dialog box for setting variables for statistical analysis

After the variables are selected, click “OK” in the dialog box for setting the parameters of the “Multiple Regression” module. In the window that appears with the message “No of indep. vars. >=(N-1); cannot invert corr. matrix.” (Figure 29), press “OK”.

This message appears when the system cannot build a regression for all declared independent variables, because the number of variables is greater than or equal to the number of cases minus 1.

In the window that appears (Figure 30) on the “Advanced” tab, you can change the method for constructing the regression equation.

Figure 29 - Error message

To do this, in the “Method” field, select “Forward stepwise” (step-by-step with inclusion).

Figure 30 - Window for selecting a method and setting parameters for constructing a regression equation

The stepwise regression method adds or removes one independent variable to or from the model at each step. In this way a subset of the most “significant” variables is selected, which reduces the number of variables needed to describe the dependence.

Stepwise analysis with elimination (“Backward stepwise”). In this case, all variables are first included in the model, and then at each step the variable that contributes least to the predictions is eliminated. As a result of a successful analysis, only the “important” variables are retained in the model, that is, the variables whose contribution to the prediction is greater than that of the others.

Stepwise analysis with inclusion (“Forward stepwise”). With this method, independent variables are added to the regression equation one at a time until the equation describes the original data satisfactorily. Inclusion of a variable is decided using the F-test: at each step all remaining variables are examined, the one that makes the greatest contribution is found, that variable is included in the model, and the procedure moves on to the next step.
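Outside STATISTICA, the same idea can be sketched in Python. Below is a minimal sketch of forward stepwise selection, not the package's exact implementation; the DataFrame name `data`, the column name "Cost" and the F-to-enter threshold are assumptions made only for illustration.

import statsmodels.api as sm

def forward_stepwise(data, target, candidates, f_to_enter=1.0):
    # data: pandas DataFrame; target: name of the dependent column;
    # candidates: list of independent-variable names to consider.
    selected = []
    while True:
        best_var, best_f = None, f_to_enter
        for var in candidates:
            if var in selected:
                continue
            X = sm.add_constant(data[selected + [var]])
            fit = sm.OLS(data[target], X).fit()
            f_value = fit.tvalues[var] ** 2   # partial F of the added variable equals its t^2
            if f_value > best_f:
                best_var, best_f = var, f_value
        if best_var is None:
            break                             # no remaining variable passes F-to-enter
        selected.append(best_var)
    return selected

# e.g. forward_stepwise(data, "Cost", ["Change management", "Document flow", "Planning"])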

In the “Intercept” field (the free term of the regression), you can choose whether to include it in the equation (“Include in model”) or to treat it as zero (“Set to zero”).

The “Tolerance” parameter is the tolerance of a variable. It is defined as 1 minus the squared coefficient of multiple correlation of that variable with all the other independent variables in the regression equation. Therefore, the lower the tolerance of a variable, the more redundant its contribution to the regression equation. If the tolerance of any variable in the regression equation is equal to or close to zero, the regression equation cannot be estimated, so it is advisable to set the tolerance parameter to 0.05 or 0.1.
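As a rough illustration of this definition (a sketch, not the STATISTICA computation), the tolerance of each predictor can be computed as 1 - R^2 of that predictor regressed on all the others; a pandas DataFrame `X` holding only the independent variables is assumed:

from sklearn.linear_model import LinearRegression

def tolerances(X):
    # X: pandas DataFrame containing only the independent variables
    result = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        result[col] = 1.0 - r2        # a value near 0 means the predictor is redundant
    return result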

The parameter “Ridge regression; lambda:” is used when the independent variables are highly intercorrelated and robust estimates of the regression coefficients cannot be obtained by the least-squares method. The specified constant (lambda) is added to the diagonal of the correlation matrix, which is then re-standardized (so that all diagonal elements equal 1.0). In other words, this parameter artificially reduces the correlation coefficients so that more robust (although biased) estimates of the regression parameters can be calculated. In our case this parameter is not used.
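The same shrinkage idea is available outside STATISTICA, for example as ridge regression in scikit-learn, where the penalty `alpha` plays a role analogous to lambda. This is only a hedged sketch, not an exact equivalent of the dialog option; `X` and `y` are assumed to hold the predictors and the dependent variable.

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.1))   # alpha plays the role of lambda
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_)   # shrunken (biased but more stable) coefficients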

The “Batch processing/printing” parameter is used when it is necessary to prepare several report tables at once, reflecting the results and the course of the regression analysis. This option is very useful when you need to print or analyze the results of a stepwise regression analysis at each step.

On the “Stepwise” tab (Figure 31), you can set parameters for the conditions for inclusion (“F to enter”) or exclusion (“F to remove”) of variables when constructing a regression equation, as well as the number of steps for constructing the equation (“Number of steps”).

Figure 31 - “Stepwise” tab of the window for selecting a method and setting parameters for constructing the regression equation

F is the value of the F-test statistic.

If, during step-by-step analysis with inclusion, it is necessary that all or almost all variables enter the regression equation, then the “F to enter” value must be set to the minimum (0.0001), and the “F to remove” value must also be set to the minimum.

If, during step-by-step analysis with exclusion, it is necessary to remove all variables (one at a time) from the regression equation, then it is necessary to set the “F to enter” value very large, for example 999, and set the “F to remove” value close to “F to enter”.

It should be remembered that the value of the “F to remove” parameter should always be less than “F to enter”.

The “Display results” setting has two options:

1) Summary only - display only the final results of the analysis;

2) At each step - display the analysis results at each step.

After clicking the “OK” button in the window for selecting regression analysis methods, the analysis results window will appear (Figure 32).

Figure 32 - Analysis results window

Figure 33 - Brief results of regression analysis

According to the results of the analysis, the coefficient of determination is R² = 0.99987. This means that the constructed regression explains 99.987% of the spread of values relative to the mean, i.e. it explains almost all of the variability of the variables.

The large value of the F statistic and its significance level show that the constructed regression is highly significant.

To view the summary regression results, click the “Summary: Regression result” button. A spreadsheet with the analysis results will appear on the screen (Figure 33).

The third column (“B”) displays estimates of the unknown parameters of the model, i.e. regression equation coefficients.

Thus, the desired regression looks like:

A qualitatively constructed regression equation can be interpreted as follows:

1) The cost of a PDM system increases with the increase in the number of implemented functions for change management, document flow and planning, and also if the system includes a product model support function;

2) The cost of a PDM system decreases with increasing configuration management functions implemented and with increasing search capabilities.

The objective of multiple linear regression is to construct a linear model of the relationship between a set of continuous predictors and a continuous dependent variable. The following regression equation is often used:

Y = b0 + b1·X1 + b2·X2 + ... + bk·Xk + e

Here the bi (i = 1, ..., k) are the regression coefficients, b0 is the free term (if it is used), and e is the error term, about which various assumptions are made; most often they come down to a normal distribution with zero mean and a given correlation matrix.

This linear model describes well many problems in various subject areas, for example, economics, industry, medicine. This is because some problems are linear in nature.

Let's give a simple example. Suppose you need to predict the cost of laying a road based on its known parameters. At the same time, we have data on roads already laid, indicating the length, depth of pavement, amount of working material, number of workers, and so on.

It is clear that the cost of the road will ultimately be the sum of the costs of all these factors taken separately. You will need a certain amount of crushed stone, with a known cost per ton, and a certain amount of asphalt, also with a known cost.

It may be necessary to cut down forests for installation, which will also lead to additional costs. All this together will give the cost of creating the road.

In this case, the model will include a free term which, for example, accounts for organizational expenses (approximately the same for all construction and installation work of a given level) or tax deductions.

The error will include factors that we did not take into account when building the model (for example, weather during construction - it is impossible to take it into account at all).

Example: Multiple Regression Analysis

For this example, several possible predictors of the poverty rate, that is, the percentage of families below the poverty line, will be analyzed. Therefore, we will treat the variable characterizing the percentage of families below the poverty line as the dependent variable, and the remaining variables as continuous predictors.

Regression coefficients

To find out which of the independent variables contributes more to predicting the poverty level, we examine the standardized (Beta) regression coefficients.

Fig. 1. Estimates of the regression coefficients.

Beta coefficients are the coefficients you would obtain if you standardized all variables to a mean of 0 and a standard deviation of 1. Therefore, the magnitudes of these Beta coefficients allow you to compare the relative contribution of each independent variable to the dependent variable. As can be seen from the table shown above, the variables population change since 1960 (Pop_Chng), percentage of the population living in rural areas (Pt_Rural) and number of people employed in agriculture (N_Empld) are the most important predictors of the poverty level, because only they are statistically significant (their 95% confidence interval does not include 0). The regression coefficient for population change since 1960 (Pop_Chng) is negative, so the less the population grows, the more families live below the poverty line in the respective county. The regression coefficient for the percentage of the population living in a village (Pt_Rural) is positive, i.e. the higher the percentage of rural residents, the higher the poverty level.
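For readers who want to reproduce this kind of comparison outside the package, a minimal sketch in Python is shown below: beta coefficients are obtained by z-scoring every variable before fitting. The DataFrame `df` and the dependent column name "PT_POOR" are assumptions used only for illustration.

import pandas as pd
from sklearn.linear_model import LinearRegression

z = (df - df.mean()) / df.std()              # every variable: mean 0, standard deviation 1
X = z.drop(columns="PT_POOR")
betas = LinearRegression().fit(X, z["PT_POOR"]).coef_
print(pd.Series(betas, index=X.columns).sort_values())   # standardized (Beta) coefficients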

Significance of predictor effects

Let's look at the table with significance criteria.

Fig. 2. Simultaneous results for each given variable.

As this table shows, only the effects of 2 variables are statistically significant: population change since 1960 (Pop_Chng) and the percentage of the population living in a village (Pt_Rural), p < .05.

Residual analysis. After fitting a regression equation, you should almost always check the predicted values and the residuals. For example, large outliers can greatly distort the results and lead to erroneous conclusions.

Casewise plot of outliers

It is usually necessary to check the raw or standardized residuals for large outliers.

Fig. 3. Observation numbers and residuals.

The vertical axis of this plot is scaled in units of sigma, i.e. the standard deviation of the residuals. If one or more observations do not fall within the ±3 sigma interval, it may be worth excluding those observations (this is easily done through case selection conditions) and running the analysis again to make sure that the results are not affected by these outliers.
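A minimal sketch of this check in Python, assuming `y` holds the observed values and `y_hat` the model predictions (both numpy arrays):

import numpy as np

residuals = y - y_hat
std_resid = residuals / residuals.std(ddof=1)          # residuals in units of sigma
outliers = np.where(np.abs(std_resid) > 3)[0]          # cases outside the +/- 3 sigma band
print("observations outside +/- 3 sigma:", outliers)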

Mahalanobis distances

Most statistics textbooks spend a lot of time on outliers and residuals relative to the dependent variable. However, the role of outliers in predictors often remains unidentified. On the predictor variable side there is a list of variables that participate with various weights (regression coefficients) in predicting the dependent variable. You can think of independent variables as a multidimensional space in which any observation can be plotted. For example, if you had two independent variables with equal regression coefficients, you could plot a scatter plot of the two variables and place each observation on that plot. You could then mark the average value on this graph and calculate the distances from each observation to this average (the so-called center of gravity) in two-dimensional space. This is the main idea behind calculating the Mahalanobis distance. Now let's look at the histogram of the population change variable since 1960.

Fig. 4. Histogram of the Mahalanobis distance distribution.

It follows from the graph that there is one outlier at the Mahalanobis distances.
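The distances themselves are easy to compute outside the package. Below is a sketch for a 2-D numpy array `X` of predictor values (rows are observations); it illustrates the idea described above rather than reproducing the package's output.

import numpy as np

center = X.mean(axis=0)                                # the "center of gravity" of the predictors
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)     # squared Mahalanobis distances
mahalanobis = np.sqrt(d2)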

Fig. 5. Observed, predicted and residual values.

Notice that Shelby County (in the first row) stands out from the rest of the counties. If you look at the raw data, you will find that Shelby County actually has the highest number of people employed in agriculture (variable N_Empld). It might be reasonable to express it as a percentage rather than an absolute number, in which case Shelby County's Mahalanobis distance would likely not be as large compared to other counties. Clearly Shelby County is an outlier.

Deleted residuals

Another very important statistic that helps assess the severity of the outlier problem is the deleted residuals. These are the standardized residuals of the corresponding observations obtained when each observation is removed from the analysis. Remember that the multiple regression procedure fits a regression surface to show the relationship between the dependent variable and the predictors. If one observation is an outlier (like Shelby County), the regression surface tends to be “pulled” toward that outlier. As a result, if the corresponding observation is removed, a different surface (and different Beta coefficients) will be obtained. Therefore, if the deleted residuals are very different from the standardized residuals, you have reason to believe that the regression analysis is seriously biased by the corresponding observation. In this example, the deleted residuals for Shelby County show that it is an outlier which seriously biases the analysis. The scatterplot clearly shows the outlier.

Fig. 6. Raw residuals and deleted residuals for the variable indicating the percentage of families living below the poverty line.
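A deleted residual can also be computed directly by refitting the model without the observation in question; statistics packages usually provide it ready-made, but the idea fits in a few lines. A sketch, assuming `X` is a 2-D numpy array of predictors and `y` the dependent variable:

import numpy as np
from sklearn.linear_model import LinearRegression

deleted = np.empty(len(y))
for i in range(len(y)):
    mask = np.arange(len(y)) != i                    # leave observation i out
    fit = LinearRegression().fit(X[mask], y[mask])
    deleted[i] = y[i] - fit.predict(X[i:i + 1])[0]   # residual of the left-out case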

Most of them have a more or less clear interpretation; however, let us turn to normal probability plots.

As already mentioned, multiple regression assumes that there is a linear relationship between the variables in the equation and that the residuals are normally distributed. If these assumptions are violated, the conclusion may be inaccurate. A normal probability plot of the residuals will tell you whether there are serious violations of these assumptions or not.

Fig. 7. Normal probability plot; raw residuals.

This plot is constructed as follows. First, the standardized residuals are ranked. From these ranks, z values (i.e. standard values of the normal distribution) are calculated under the assumption that the data follow a normal distribution. These z values are plotted on the y axis of the plot.

If the observed residuals (plotted on the x axis) are normally distributed, then all values fall on a straight line on the plot; in our plot all the points lie very close to the line. If the residuals are not normally distributed, they deviate from this line. Outliers also become noticeable in this plot.
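Such a plot can be produced, for example, with scipy; a sketch is given below (note that probplot puts the theoretical quantiles on the x axis and the ordered residuals on the y axis, i.e. the axes are swapped relative to the description above). The array `residuals` is assumed.

import matplotlib.pyplot as plt
from scipy import stats

stats.probplot(residuals, dist="norm", plot=plt)   # theoretical quantiles vs ordered residuals
plt.title("Normal probability plot of residuals")
plt.show()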

If there is a lack of fit and the data appear to form a clear curve (e.g. an S shape) about the line, then the dependent variable can be transformed in some way (e.g. a logarithmic transformation to “shrink” the tail of the distribution, etc.). A discussion of this method is beyond the scope of this example (Neter, Wasserman, and Kutner, 1985, pp. 134-141, present a discussion of transformations that remove non-normality and nonlinearity in the data). However, researchers very often simply perform the analysis directly, without testing the underlying assumptions, which leads to erroneous conclusions.

Suppose a developer is assessing the value of a group of small office buildings in a traditional business district.

A developer can use multiple regression analysis to estimate the price of an office building in this area based on the following variables.

y is the estimated price of the office building;

x1 - total area in square meters;

x2 - number of offices;

x3 - number of entrances (0.5 means an entrance for mail delivery only);

x4 - age of the building in years.

This example assumes that there is a linear dependence between each independent variable (x1, x2, x3 and x4) and the dependent variable (y), that is, the price of an office building in the given area. The source data are shown in the figure.

The settings for solving the problem are shown in the “Regression” dialog window. The calculation results are placed on a separate sheet in three tables.

As a result we got the following mathematical model:

y = 52318 + 27.64*x1 + 12530*x2 + 2553*x3 - 234.24*x4.

Now the developer can determine the estimated value of an office building in the same area. If this building has an area of ​​2500 square meters, three offices, two entrances and a service life of 25 years, you can estimate its value using the following formula:

y = 27.64*2500 + 12530*3 + 2553*2 - 234.24*25 + 52318 = 158,261 c.u.

In regression analysis, the most important results are:

  • the coefficients of the variables and the Y-intercept, which are the required parameters of the model;
  • multiple R, characterizing the accuracy of the model for the available source data;
  • Fisher's F test (in the example considered it significantly exceeds the critical value of 4.06);
  • t-statistics, values characterizing the degree of significance of the individual coefficients of the model.

The t-statistics deserve special attention. Very often, when building a regression model, it is not known whether a particular factor x affects y. Including factors that do not affect the output value in the model degrades the quality of the model. Calculating the t-statistics helps detect such factors. An approximate rule is as follows: if, for n >> k, the absolute value of the t-statistic is significantly greater than three, the corresponding coefficient should be considered significant and the factor should be included in the model; otherwise it is excluded from the model. Thus, we can propose a technology for constructing a regression model that consists of two stages:

1) process all available data with the “Regression” package and analyze the t-statistic values;

2) remove from the source data table the columns of those factors whose coefficients are insignificant and process the new table with the “Regression” package (a sketch of this procedure is given below).
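The same two-stage procedure can be sketched outside the spreadsheet package, for example with statsmodels. This is only an illustrative sketch; the DataFrame `df`, its dependent column "y" and the threshold of 3 follow the rule of thumb above rather than any fixed standard.

import statsmodels.api as sm

X = sm.add_constant(df.drop(columns="y"))
fit = sm.OLS(df["y"], X).fit()
print(fit.tvalues)                                    # stage 1: examine the t-statistics

weak = [c for c in fit.tvalues.index
        if c != "const" and abs(fit.tvalues[c]) < 3]  # factors with insignificant coefficients
fit2 = sm.OLS(df["y"], X.drop(columns=weak)).fit()    # stage 2: refit without them
print(fit2.params)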

Good afternoon, dear readers.
In previous articles, using practical examples, I showed ways of solving classification problems (a credit scoring problem) and the basics of text analysis (a passport problem). Today I would like to touch on another class of problems, namely regression estimation. Problems of this class are usually used in forecasting.
For an example of solving a forecasting problem, I took the Energy efficiency data set from the largest UCI repository. Traditionally, we will use Python with the analytical packages pandas and scikit-learn as tools.

Description of the data set and problem statement

Given a data set that describes the following attributes of a room:

  • X1 - relative compactness;
  • X2 - surface area;
  • X3 - wall area;
  • X4 - roof area;
  • X5 - overall height;
  • X6 - orientation;
  • X7 - glazing area;
  • X8 - glazing area distribution;
  • Y1 - heating load;
  • Y2 - cooling load.

It contains the characteristics of the room on the basis of which the analysis will be carried out, and the load values (Y1 and Y2) that need to be predicted.

Preliminary data analysis

First, let's download our data and look at it:

from pandas import read_csv, DataFrame
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions

dataset = read_csv("EnergyEfficiency/ENB2012_data.csv", ";")
dataset.head()

X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
0 0.98 514.5 294.0 110.25 7 2 0 0 15.55 21.33
1 0.98 514.5 294.0 110.25 7 3 0 0 15.55 21.33
2 0.98 514.5 294.0 110.25 7 4 0 0 15.55 21.33
3 0.98 514.5 294.0 110.25 7 5 0 0 15.55 21.33
4 0.90 563.5 318.5 122.50 7 2 0 0 20.84 28.28

Now let's see if any attributes are related to each other. This can be done by calculating the correlation coefficients for all columns. How to do this was described in the previous article:

dataset.corr()

X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
X1 1.000000e+00 -9.919015e-01 -2.037817e-01 -8.688234e-01 8.277473e-01 0.000000 1.283986e-17 1.764620e-17 0.622272 0.634339
X2 -9.919015e-01 1.000000e+00 1.955016e-01 8.807195e-01 -8.581477e-01 0.000000 1.318356e-16 -3.558613e-16 -0.658120 -0.672999
X3 -2.037817e-01 1.955016e-01 1.000000e+00 -2.923165e-01 2.809757e-01 0.000000 -7.969726e-19 0.000000e+00 0.455671 0.427117
X4 -8.688234e-01 8.807195e-01 -2.923165e-01 1.000000e+00 -9.725122e-01 0.000000 -1.381805e-16 -1.079129e-16 -0.861828 -0.862547
X5 8.277473e-01 -8.581477e-01 2.809757e-01 -9.725122e-01 1.000000e+00 0.000000 1.861418e-18 0.000000e+00 0.889431 0.895785
X6 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000 0.000000e+00 0.000000e+00 -0.002587 0.014290
X7 1.283986e-17 1.318356e-16 -7.969726e-19 -1.381805e-16 1.861418e-18 0.000000 1.000000e+00 2.129642e-01 0.269841 0.207505
X8 1.764620e-17 -3.558613e-16 0.000000e+00 -1.079129e-16 0.000000e+00 0.000000 2.129642e-01 1.000000e+00 0.087368 0.050525
Y1 6.222722e-01 -6.581202e-01 4.556712e-01 -8.618283e-01 8.894307e-01 -0.002587 2.698410e-01 8.736759e-02 1.000000 0.975862
Y2 6.343391e-01 -6.729989e-01 4.271170e-01 -8.625466e-01 8.957852e-01 0.014290 2.075050e-01 5.052512e-02 0.975862 1.000000

As you can see from our matrix, the following columns correlate with each other (the value of the correlation coefficient is greater than 95%):
  • y1 --> y2
  • x1 --> x2
  • x4 --> x5
Now let's choose which column of each pair we can remove from our sample. To do this, in each pair we keep the column that has a greater impact on the predicted values Y1 and Y2 and delete the other one.
As the correlation matrix shows, X2 and X5 are more strongly related to Y1 and Y2 than X1 and X4, so we can remove those last two columns.

dataset = dataset.drop(["X1","X4"], axis=1)
dataset.head()

In addition, you can notice that the fields Y1 and Y2 correlate very closely with each other. But since we need to predict both values, we leave them “as is”.

Model selection

Let's separate the predicted values ​​from our sample:

trg = dataset[["Y1","Y2"]]
trn = dataset.drop(["Y1","Y2"], axis=1)
After processing the data, you can proceed to building a model. To build the model we will use the following methods:

  • linear regression (LinearRegression);
  • random forest (RandomForestRegressor);
  • k-nearest neighbors (KNeighborsRegressor);
  • support vector regression (SVR);
  • logistic regression (LogisticRegression).

The theory about these methods can be read in the course of lectures by K.V. Vorontsov on machine learning.
We will make the assessment using the coefficient of determination (R-squared). This coefficient is determined as follows:

R^2 = 1 - D[y|x] / D[y],

where D[y|x] is the conditional (residual) variance of the dependent variable y given the factor x and D[y] is its total variance. The coefficient takes values in the interval [0, 1]; the closer it is to 1, the stronger the dependence.
Well, now you can go directly to building a model and choosing a model. Let's put all our models in one list for ease of further analysis:

models = [LinearRegression(),        # least squares
          RandomForestRegressor(),   # random forest
          KNeighborsRegressor(),     # k-nearest neighbors
          SVR(),                     # support vector regression
          LogisticRegression()       # logistic regression
          ]
So the models are ready; now we will split our initial data into two subsamples: a test one and a training one. Those who have read my previous articles know that this can be done using the train_test_split() function from the scikit-learn package:

Xtrn, Xtest, Ytrn, Ytest = train_test_split(trn, trg, test_size=0.4)
Now, since we need to predict 2 parameters, we need to build a regression for each of them. In addition, for further analysis, you can record the results obtained in a temporary DataFrame. You can do it like this:

#create temporary structures
TestModels = DataFrame()
tmp = {}
#for each model from the list
for model in models:
    #get the model name
    m = str(model)
    tmp["Model"] = m[:m.index("(")]
    #for each column of the result set
    for i in xrange(Ytrn.shape[1]):
        #train the model
        model.fit(Xtrn, Ytrn[:,i])
        #calculate the coefficient of determination
        tmp["R2_Y%s" % str(i+1)] = r2_score(Ytest[:,i], model.predict(Xtest))
    #record the data in the final DataFrame
    TestModels = TestModels.append([tmp])
#make an index by the model name
TestModels.set_index("Model", inplace=True)
As you can see from the code above, the r2_score() function is used to calculate the coefficient.
So, the data for analysis has been received. Let's now plot the graphs and see which model showed the best result:

import matplotlib.pyplot as plt
fig, axes = plt.subplots(ncols=2, figsize=(10,4))
TestModels.R2_Y1.plot(ax=axes[0], kind="bar", title="R2_Y1")
TestModels.R2_Y2.plot(ax=axes[1], kind="bar", color="green", title="R2_Y2")

Analysis of results and conclusions

From the graphs above, we can conclude that the RandomForest (random forest) method coped with the task better than the others: its coefficients of determination are higher than those of the other models for both variables.
For further analysis, let's retrain our model:

model = models[1]   # the RandomForestRegressor from the list of models
model.fit(Xtrn, Ytrn)
On closer examination, the question may arise why last time we split the dependent sample Ytrn into separate variables (by columns), but now we do not do that.
The point is that some methods, such as RandomForestRegressor, can work with several predicted variables at once, while others (for example SVR) can only work with one variable. Therefore, in the previous training step we split the data by columns to avoid errors while building some of the models.
Choosing a model is, of course, good, but it would also be nice to have information on how each factor affects the predicted value. For this purpose, the model has the property feature_importances_.
Using it, you can see the weight of each factor in the final models:

model.feature_importances_
array([ 0.40717901, 0.11394948, 0.34984766, 0.00751686, 0.09158358,
0.02992342])

In our case, it can be seen that the overall height and area affect the heating and cooling load the most. Their total contribution to the forecast model is about 72%.
It should also be noted that, using the diagram above, you can see the influence of each factor separately on heating and separately on cooling, but since these factors are very closely correlated with each other, we drew a general conclusion about both of them, as written above.

Conclusion

In this article I tried to show the main stages of regression analysis of data using Python and the analytical packages pandas and scikit-learn.
It should be noted that the data set was deliberately chosen so that it is as well formalized as possible and the preliminary processing of the input data is minimal. In my opinion, the article will be useful to those who are just starting their journey in data analysis, as well as to those who have a good theoretical basis but are choosing tools for their work.

Questions:

4. Estimation of parameters of a linear multiple regression model.

5. Assessing the quality of multiple linear regression.

6. Analysis and forecasting based on multifactor models.

Multiple regression is a generalization of pairwise regression. It is used to describe the relationship between the explained (dependent) variable Y and the explanatory (independent) variables X1, X2, ..., Xk. Multiple regression can be either linear or nonlinear, but linear multiple regression is the most widespread in economics.

The theoretical linear multiple regression model has the form:

Y = β0 + β1·X1 + β2·X2 + ... + βk·Xk + ε.    (1)

The corresponding sample regression is denoted by:

ŷ = a0 + a1·X1 + a2·X2 + ... + ak·Xk.    (2)

As in pairwise regression, the random term ε must satisfy the basic assumptions of regression analysis; then OLS gives the best unbiased and efficient estimates of the theoretical regression parameters. In addition, the variables X1, X2, ..., Xk must be uncorrelated (linearly independent) with each other. In order to write down the formulas for estimating the coefficients of regression (2) obtained by least squares, we introduce the following notation:

Y = (y1, ..., yn)^T is the vector of observations of the dependent variable, A = (a0, a1, ..., ak)^T is the vector of estimated coefficients, and X is the n x (k+1) matrix of observations of the independent variables, whose first column consists of ones. Then the theoretical model can be written in vector-matrix form:

Y = X·β + ε

and the sample regression as:

Ŷ = X·A.

OLS leads to the following formula for estimating the vector of sample regression coefficients:

A = (X^T X)^(-1) X^T Y.    (3)

To estimate the coefficients of a multiple linear regression with two independent variables, ŷ = a0 + a1·x1 + a2·x2, we can instead solve the system of normal equations:

Σy = n·a0 + a1·Σx1 + a2·Σx2
Σx1y = a0·Σx1 + a1·Σx1² + a2·Σx1x2    (4)
Σx2y = a0·Σx2 + a1·Σx1x2 + a2·Σx2²

As in paired linear regression, the standard regression error S is calculated for multiple regression:

S = sqrt( Σ(yi - ŷi)² / (n - k - 1) )    (5)

and the standard errors of the regression coefficients:

S(aj) = S · sqrt( [(X^T X)^(-1)]jj ),  j = 0, 1, ..., k.    (6)

The significance of the coefficients is checked using the t-test

t = aj / S(aj),

which has a Student's t distribution with ν = n - k - 1 degrees of freedom.
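Formulas (3), (5), (6) and these t-statistics can be computed directly, for example with numpy; the sketch below assumes that `X` is the n x (k+1) observation matrix whose first column is ones and `y` is the vector of observed values of the dependent variable.

import numpy as np

n, p = X.shape                              # p = k + 1 parameters, including the intercept
XtX_inv = np.linalg.inv(X.T @ X)
a = XtX_inv @ X.T @ y                       # formula (3): coefficient estimates
resid = y - X @ a
S = np.sqrt(resid @ resid / (n - p))        # formula (5): standard regression error
se = S * np.sqrt(np.diag(XtX_inv))          # formula (6): standard errors of the coefficients
t = a / se                                  # t-statistics with n - k - 1 degrees of freedom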

To assess the quality of the regression, the coefficient (index) of determination is used:

R² = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²;    (8)

the closer R² is to 1, the higher the quality of the regression.

To check the significance of the coefficient of determination, Fisher's test (the F-statistic) is used:

F = (R² / k) / ((1 - R²) / (n - k - 1))    (9)

with ν1 = k and ν2 = n - k - 1 degrees of freedom.

In multivariate regression, adding explanatory variables increases the coefficient of determination. To compensate for this increase, an adjusted (or normalized) coefficient of determination is introduced:

adjusted R² = 1 - (1 - R²)·(n - 1)/(n - k - 1).    (10)

If the increase in the share of explained variation when a new variable is added is small, the adjusted R² may decrease; this means that adding the new variable is inappropriate.
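Continuing the numpy sketch above (the arrays `X`, `y` and the coefficient vector `a` are assumed to be already defined), formulas (8)-(10) look as follows:

import numpy as np

n, p = X.shape
k = p - 1
resid = y - X @ a
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
R2 = 1 - ss_res / ss_tot                           # formula (8)
F = (R2 / k) / ((1 - R2) / (n - k - 1))            # formula (9)
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)      # formula (10)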

Example 4:

Let us consider the dependence of an enterprise's profit on the costs of new equipment and technology and on the costs of improving workers' qualifications. Statistical data were collected for 6 similar enterprises. The data, in millions of monetary units, are given in Table 1.

Table 1

Build a two-factor linear regression and evaluate its significance. Let us introduce the following notation:

We transpose the matrix X:

Inversion of this matrix:

Thus, the dependence of profit on the costs of new equipment and machinery and on the costs of improving the skills of workers can be described by the following regression:

Using formula (5), where k=2, we calculate the standard regression error S=0.636.

We calculate the standard errors of the regression coefficients using formula (6):

Likewise:

Let's check the significance of the regression coefficients a1 and a2. Let's calculate t_calc.

Let's choose the significance level, the number of degrees of freedom

hence coefficient a1 is significant.

Let's evaluate the significance of coefficient a2:

Coefficient a2 is insignificant.

Let's calculate the coefficient of determination using formula (7). The enterprise's profit depends 96% on the costs of new equipment and technology and on advanced training, and 4% on other, random factors. Let's check the significance of the coefficient of determination. Let's calculate F_calc:

Thus, the coefficient of determination is significant and the regression equation is significant.

Of great importance in analysis based on multivariate regression is the comparison of the influence of the factors on the dependent indicator y. Regression coefficients are not used for this purpose because of differences in the units of measurement and in the degree of variability of the factors. Elasticity coefficients are free of these shortcomings:

Ej = aj · x̄j / ȳ.

Elasticity shows by what percentage, on average, the dependent indicator y changes when the variable xj changes by 1%, provided that the values of the other variables remain unchanged. The larger Ej is, the greater the influence of the corresponding variable. As in pairwise regression, multiple regression distinguishes between a point forecast and an interval forecast. The point forecast (a number) is obtained by substituting the predicted values of the independent variables into the multiple regression equation.

Let us denote by

Xp = (1, x1p, x2p, ..., xkp)^T    (12)

the vector of predicted values of the independent variables; then the point forecast is ŷp = Xp^T · A.

The standard error of prediction in the case of multiple regression is determined as follows:

S(ŷp) = S · sqrt( 1 + Xp^T (X^T X)^(-1) Xp ).    (15)

Let us choose the significance level α. From the Student distribution table, for significance level α and ν = n - k - 1 degrees of freedom, we find t_cr. Then the true value yp falls with probability 1 - α into the interval:

(ŷp - t_cr · S(ŷp); ŷp + t_cr · S(ŷp)).
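A small numpy/scipy sketch of this forecast, continuing the notation of the sketches above (`X`, `y`, `a`, `S` and `XtX_inv` are assumed to be defined there; `x_p` is the vector of predicted factor values with a leading 1):

import numpy as np
from scipy import stats

n, p = X.shape
y_p = x_p @ a                                      # point forecast
se_p = S * np.sqrt(1 + x_p @ XtX_inv @ x_p)        # standard error of prediction (15)
t_cr = stats.t.ppf(1 - 0.05 / 2, df=n - p)         # alpha = 0.05 assumed
interval = (y_p - t_cr * se_p, y_p + t_cr * se_p)  # interval covering y_p with probability 1 - alpha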


Topic 5:

Time series.

Questions:

4. Basic concepts of time series.

5. The main development trend is a trend.

6. Building an additive model.

Time series represent a set of values ​​of any indicator for several consecutive moments or periods of time.

The moment (or period) of time is denoted by t, and the value of the indicator at that moment is denoted by y(t) and is called the level of the series.

Each level of the time series is formed under the influence of a large number of factors, which can be divided into 3 groups:

Long-term, constantly operating factors that have a decisive influence on the phenomenon being studied and form the main trend of the series - the trend T(t).

Short-term periodic factors that form seasonal fluctuations in the S(t) series.

Random factors that form random changes in the levels of the series ε(t).

An additive time series model is a model in which each level of the series is represented by the sum of the trend, seasonal and random components:

Y(t) = T(t) + S(t) + ε(t).

A multiplicative model is a model in which each level of the series is the product of the listed components:

Y(t) = T(t) · S(t) · ε(t).

The choice between the two models is based on an analysis of the structure of the seasonal fluctuations. If the amplitude of the fluctuations is approximately constant, an additive model is built; if the amplitude increases, a multiplicative one.

The main task of econometric analysis is to identify each of the listed components.

The main development trend (trend) is a smooth and stable change in the levels of a series over time, free from random and seasonal fluctuations.

The task of identifying the main development trends is called time series alignment .

Time series alignment methods include:

1) method of enlarging intervals,

2) the moving average method,

3) analytical alignment.

1) The time periods to which the series levels refer are enlarged, and the levels of the series are summed over the enlarged intervals. Fluctuations in the levels caused by random factors cancel each other out, and the general trend emerges more clearly.

2) The average value is calculated for a chosen number of the first levels of the series. Then the average is calculated for the same number of levels starting from the second level, and so on: the average value slides along the series, moving forward by one term (point in time) at each step. The number of levels over which the average is calculated can be even or odd. For an odd number, the moving average is assigned to the middle of the sliding period. For an even number, the average cannot be matched to a particular moment t, so a centering procedure is used, i.e. the average of two consecutive moving averages is calculated.

3) Construction of an analytical function characterizing the dependence of the level of the series on time. The following functions are used to build trends:

Trend parameters are determined using least squares. The best function is selected on the basis of the coefficient R².
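For the linear case, a minimal Python sketch of such a fit (assuming `y` is a 1-D numpy array of the series levels and t = 1, ..., n):

import numpy as np

t = np.arange(1, len(y) + 1)
b, a = np.polyfit(t, y, deg=1)            # slope and intercept of the trend y(t) = a + b*t
trend = a + b * t
R2 = 1 - ((y - trend) ** 2).sum() / ((y - y.mean()) ** 2).sum()   # quality of the fit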

We will build an additive model using an example.

Example 7:

There are quarterly data on the volume of electricity consumption in a certain area for 4 years. The data, in millions of kWh, are given in Table 1.

Table 1

Build a time series model.

In this example, we consider the quarter number as the independent variable, and electricity consumption for the quarter as the dependent variable y(t).

From the scatterplot you can see that the trend is linear. One can also see the presence of seasonal fluctuations (period = 4) of the same amplitude, so we will build an additive model.

Model construction includes the following steps:

1. Let's align the original series using the moving average method for 4 quarters and perform centering:

1.1. Let's sum up the levels of the series sequentially for every 4 quarters with a shift of 1 point in time.

1.2. Dividing the resulting amounts by 4 we find the moving averages.

1.3. We bring these values ​​into correspondence with actual points in time, for which we find the average value of two consecutive moving averages - centered moving averages.

2. Let's calculate the seasonal variation. Seasonal variation (t) = y(t) – centered moving average. Let's build table 2.

table 2

End-to-end block number t Electricity consumption Y(t) 4 quarter moving average Centered moving average Estimation of seasonal variation
6,0 - - -
4,4 6,1 - -
5,0 6,4 6,25 -1,25
9,0 6,5 6,45 2,55
7,2 6,75 6,625 0,575
: : : : :
6,6 8,35 8,375 -1,775
7,0 - - -
10,8 - - -

3. Based on the estimates of seasonal variation, the seasonal component is calculated in Table 3.

Indicators Year Number of quarter in the year I II III IV
- - -1,250 2,550
0,575 -2,075 -1,100 2,700
0,550 -2,025 -1,475 2,875
0,675 -1,775 - -
Total 1,8 -5,875 -3,825 8,125 Sum
Average 0,6 -1,958 -1,275 2,708 0,075
Seasonal component 0,581 -1,977 -1,294 2,690

4. Eliminate the seasonal component from the initial levels of the series:
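Steps 1-4 can also be reproduced, for example, with pandas; the sketch below assumes `y` is a pandas Series holding the 16 quarterly consumption values indexed 0..15.

import pandas as pd

ma4 = y.rolling(window=4).mean()                     # steps 1.1-1.2: 4-quarter moving averages
centered = (ma4.shift(-1) + ma4.shift(-2)) / 2       # step 1.3: centered moving averages
seasonal_var = y - centered                          # step 2: estimates of seasonal variation
raw = seasonal_var.groupby(seasonal_var.index % 4).mean()
seasonal = raw - raw.mean()                          # step 3: components adjusted to sum to zero
deseasonalized = y - seasonal[y.index % 4].values    # step 4: series without the seasonal component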

Conclusion:

The additive model explains 98.4% of the total variation in the levels of the original time series.


