
What is the least squares method? Approximation of experimental data

Approximation of experimental data is a method based on replacing experimentally obtained data with an analytical function that passes as closely as possible to, or coincides with, the original values at the nodal points (the data obtained during an experiment). There are currently two ways to define such an analytical function:

By constructing an interpolation polynomial of degree n that passes directly through all points of the given data array. In this case the approximating function is represented as an interpolation polynomial in Lagrange form or an interpolation polynomial in Newton form.

By constructing an approximating polynomial of degree n that passes in the immediate vicinity of the points of the given data array. The approximating function thus smooths out the random noise (or errors) that may arise during the experiment: the measured values depend on random factors that fluctuate according to their own random laws (measurement or instrument errors, experimental inaccuracies). In this case the approximating function is determined using the method of least squares.

The least squares method (in the English literature, Ordinary Least Squares, OLS) is a mathematical method for determining an approximating function constructed in the closest proximity to the points of a given array of experimental data. The closeness of the original and approximating functions F(x) is measured numerically: the sum of the squared deviations of the experimental data from the approximating curve F(x) should be the smallest.

Approximating curve constructed using the least squares method

The least squares method is used:

To solve overdetermined systems of equations, when the number of equations exceeds the number of unknowns;

To find a solution to ordinary (not overdetermined) systems of nonlinear equations;

To approximate point values ​​with some approximating function.

The approximating function in the least squares method is determined from the condition of the minimum of the sum of squared deviations of the calculated approximating function from the given array of experimental data. This criterion of the least squares method is written as the following expression:

S = Σ_(i=1..N) (F(x_i) − y_i)² → min,

where F(x_i) are the values of the calculated approximating function at the nodal points, and y_i is the given array of experimental data at the nodal points.

The quadratic criterion has a number of "good" properties, such as differentiability and the fact that it provides a unique solution to the approximation problem with polynomial approximating functions.

Depending on the conditions of the problem, the approximating function is a polynomial of degree m:

F(x) = c_0 + c_1·x + … + c_m·x^m.

The degree of the approximating polynomial does not depend on the number of nodal points, but it must always be less than the size (number of points, N) of the given experimental data array.

∙ If the degree of the approximating function is m = 1, then we approximate the tabular function with a straight line (linear regression).

∙ If the degree of the approximating function is m = 2, then we approximate the tabular function with a quadratic parabola (quadratic approximation).

∙ If the degree of the approximating function is m = 3, then we approximate the tabular function with a cubic parabola (cubic approximation).

In the general case, when an approximating polynomial of degree m must be constructed for the given tabular values, the condition for the minimum of the sum of squared deviations over all nodal points is rewritten in the following form:

S = Σ_(i=1..N) (c_0 + c_1·x_i + … + c_m·x_i^m − y_i)² → min,

where c_0, c_1, …, c_m are the unknown coefficients of the approximating polynomial of degree m, and N is the number of tabular values given.

A necessary condition for the existence of a minimum of a function is that its partial derivatives with respect to the unknown variables c_0, c_1, …, c_m equal zero. As a result we obtain the following system of equations:

∂S/∂c_k = 2·Σ_(i=1..N) (c_0 + c_1·x_i + … + c_m·x_i^m − y_i)·x_i^k = 0,  k = 0, 1, …, m.

Let us transform the resulting system of linear equations: open the brackets and move the free terms to the right-hand side of each expression. The resulting system of linear algebraic equations is written in the following form:

c_0·Σx_i^k + c_1·Σx_i^(k+1) + … + c_m·Σx_i^(k+m) = Σy_i·x_i^k,  k = 0, 1, …, m.

This system of linear algebraic equations can be rewritten in matrix form:

The result is a system of linear equations of dimension m+1 in m+1 unknowns. This system can be solved by any method for solving systems of linear algebraic equations (for example, the Gaussian method). As a result of the solution, the unknown parameters of the approximating function are found that provide the minimum sum of squared deviations of the approximating function from the original data, i.e., the best possible quadratic approximation. It should be remembered that if even one value of the source data changes, all the coefficients change, since they are completely determined by the source data.
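As an illustration, the scheme just described (assemble the power sums into the normal system, then solve it by the Gaussian method) can be sketched in plain Python; the function names here are illustrative, not part of the article:

```python
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):
        # pivot: bring the largest |entry| of column k to row k
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def lsm_poly(xs, ys, m):
    """Least-squares polynomial of degree m via the normal equations:
    A[k][j] = sum(x_i^(k+j)),  b[k] = sum(y_i * x_i^k)."""
    A = [[sum(x ** (k + j) for x in xs) for j in range(m + 1)]
         for k in range(m + 1)]
    b = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(m + 1)]
    return gauss_solve(A, b)   # coefficients c0, c1, ..., cm

# Points lying exactly on y = 1 + 2x are recovered by the m = 1 fit:
c = lsm_poly([0, 1, 2, 3], [1, 3, 5, 7], m=1)
```

For exactly collinear input points the degree-1 fit reproduces the line, which makes a convenient sanity check of the assembled system.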

Approximation of source data by linear dependence

(linear regression)

As an example, consider the technique for determining the approximating function when it is given as a linear dependence y = a·x + b. In accordance with the least squares method, the condition for the minimum of the sum of squared deviations is written in the following form:

S = Σ_(i=1..N) (a·x_i + b − y_i)² → min,

where x_i, y_i are the coordinates of the table nodes, and a, b are the unknown coefficients of the approximating function, which is specified as a linear dependence.

A necessary condition for the existence of a minimum of a function is that its partial derivatives with respect to the unknown variables equal zero. As a result, we obtain the following system of equations:

∂S/∂a = 2·Σ(a·x_i + b − y_i)·x_i = 0,  ∂S/∂b = 2·Σ(a·x_i + b − y_i) = 0.

Let us transform the resulting system of linear equations:

a·Σx_i² + b·Σx_i = Σx_i·y_i,  a·Σx_i + b·N = Σy_i.

We solve the resulting system of linear equations. The coefficients of the approximating function are determined in analytical form as follows (Cramer's method):

a = (N·Σx_i·y_i − Σx_i·Σy_i) / (N·Σx_i² − (Σx_i)²),  b = (Σy_i − a·Σx_i) / N.

These coefficients ensure the construction of a linear approximating function in accordance with the criterion of minimizing the sum of squared deviations of the approximating function from the given tabular values (experimental data).
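A minimal sketch of these closed-form expressions in Python (the function name is illustrative; the sums are written out explicitly, mirroring the formulas above):

```python
def linear_lsm(x, y):
    """Least-squares line y = a*x + b via Cramer's rule on the 2x2
    normal system:
        a = (N*Sxy - Sx*Sy) / (N*Sxx - Sx^2),  b = (Sy - a*Sx) / N
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Noisy points scattered around y = 2x:
a, b = linear_lsm([1, 2, 3, 4], [2.1, 3.9, 6.1, 7.9])
```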

Algorithm for implementing the least squares method

1. Initial data:

An array of experimental data with the number of measurements N is specified

The degree of the approximating polynomial (m) is specified

2. Calculation algorithm:

2.1. The coefficients for constructing the system of equations of dimension (m+1)×(m+1) are determined:

the coefficients of the system of equations (left-hand sides of the equations): A_kj = Σx_i^(k+j), where j is the column index of the square matrix of the system;

the free terms of the system of linear equations (right-hand sides of the equations): b_k = Σy_i·x_i^k, where k is the row index of the square matrix of the system.

2.2. The system of linear equations of dimension (m+1)×(m+1) is formed.

2.3. Solving a system of linear equations to determine the unknown coefficients of an approximating polynomial of degree m.

2.4. The sum of squared deviations of the approximating polynomial from the original values at all nodal points is determined.

The found value of the sum of squared deviations is the minimum possible.
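Step 2.4 of the algorithm can be sketched as follows (plain Python; the helper name and the sample numbers are illustrative):

```python
def poly_value(coeffs, x):
    """Evaluate c0 + c1*x + ... + cm*x^m (Horner's scheme)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def sum_squared_deviations(coeffs, xs, ys):
    """Step 2.4: sum of squared deviations at all nodal points."""
    return sum((poly_value(coeffs, xi) - yi) ** 2
               for xi, yi in zip(xs, ys))

# For the line y = 2x + 1 and data deviating by +/-0.5 at two points:
s = sum_squared_deviations([1.0, 2.0], [0, 1, 2], [1.5, 3.0, 4.5])
```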

Approximation using other functions

It should be noted that when approximating source data by the least squares method, a logarithmic, exponential, or power function is sometimes used as the approximating function.

Logarithmic approximation

Let us consider the case when the approximating function is given by a logarithmic function of the form y = a·ln(x) + b. The substitution t = ln(x) reduces this case to the linear one considered above.
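A sketch of this reduction in Python (the function name is hypothetical; the linear-LSM sums are inlined):

```python
import math

def log_lsm(x, y):
    """Fit y = a*ln(x) + b: substitute t = ln(x), then apply the
    linear least-squares formulas to the pairs (t_i, y_i)."""
    t = [math.log(xi) for xi in x]
    n = len(t)
    st, sy = sum(t), sum(y)
    stt = sum(ti * ti for ti in t)
    sty = sum(ti * yi for ti, yi in zip(t, y))
    a = (n * sty - st * sy) / (n * stt - st * st)
    b = (sy - a * st) / n
    return a, b

# Points generated from y = 3*ln(x) + 1 are recovered:
a, b = log_lsm([1.0, math.e, math.e ** 2], [1.0, 4.0, 7.0])
```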

The essence of the least squares method lies in finding the parameters of a trend model that best describes the tendency of development of some random phenomenon in time or space (a trend is a line characterizing that tendency). The task of the least squares method (LSM) comes down to finding not just some trend model, but the best, optimal model. The model is optimal if the sum of squared deviations between the observed actual values and the corresponding calculated trend values is minimal (smallest):

where e_i is the deviation between the observed actual value y_i and the corresponding calculated trend value ŷ_i; y_i is the actual (observed) value of the phenomenon being studied; ŷ_i is the calculated value of the trend model; and n is the number of observations of the phenomenon being studied.

LSM is used quite rarely on its own. As a rule, it most often serves only as a necessary technical device in correlation studies. It should be remembered that the information basis of LSM can only be a reliable statistical series, and the number of observations should not be less than 4; otherwise, the LSM smoothing procedures may lose their meaning.

The LSM toolkit boils down to the following procedures:

First procedure. It is established whether there is any tendency at all for the resultant attribute to change when the selected factor-argument changes or, in other words, whether there is a relationship between "y" and "x".

Second procedure. It is determined which line (trajectory) can best describe or characterize this trend.

Third procedure. The parameters of the regression equation characterizing that line are calculated.

Example. Let's say we have information about the average sunflower yield for the farm under study (Table 9.1).

Table 9.1 (columns: Observation number; Productivity, c/ha)

Since the level of technology in sunflower production in our country has remained virtually unchanged over the past 10 years, fluctuations in yield during the analyzed period apparently depended very much on fluctuations in weather and climatic conditions. Is this really true?

First OLS procedure. The hypothesis about the existence of a trend in sunflower yield changes depending on changes in weather and climatic conditions over the analyzed 10 years is tested.

In this example, it is advisable to take the sunflower yield as "y" and the number of the observed year in the analyzed period as "x". The hypothesis about the existence of any relationship between "x" and "y" can be tested in two ways: manually and with computer programs. Of course, if computer equipment is available, the problem solves itself. But in order to better understand the LSM tools, it is advisable to test the hypothesis about the relationship between "x" and "y" manually, when only a pen and an ordinary calculator are at hand. In such cases, the hypothesis about the existence of a trend is best checked visually, from the graphical representation of the analyzed time series, the correlation field:

The correlation field in our example is located around a slowly increasing line. This in itself indicates the existence of a certain trend in sunflower yield changes. It is impossible to speak of the presence of any tendency only when the correlation field looks like a circle, a strictly vertical or strictly horizontal cloud, or consists of chaotically scattered points. In all other cases, the hypothesis of a relationship between "x" and "y" is accepted, and the research continues.

Second OLS procedure. It is determined which line (trajectory) can best describe or characterize the trend of changes in sunflower yield over the analyzed period.

If computer technology is available, the optimal trend is selected automatically. When processing manually, the optimal function is chosen, as a rule, visually, from the location of the correlation field. That is, based on the appearance of the graph, the equation of the line that best fits the empirical trend (the actual trajectory) is selected.

As is known, in nature there is a huge variety of functional dependencies, so it is extremely difficult to visually analyze even a small part of them. Fortunately, in real economic practice, most relationships can be described quite accurately either by a parabola, or a hyperbola, or a straight line. In this regard, with the “manual” option of selecting the best function, you can limit yourself to only these three models.

Straight line: y = a0 + a1·x

Hyperbola: y = a0 + a1/x

Second-order parabola: y = a0 + a1·x + a2·x²

It is easy to see that in our example the trend of sunflower yield changes over the analyzed 10 years is best characterized by a straight line, so the regression equation will be the equation of a straight line.

Third procedure. The parameters of the regression equation characterizing the given line are calculated; in other words, an analytical formula describing the best trend model is determined.

Finding the values of the parameters of the regression equation (in our case, the parameters a0 and a1) is the core of LSM. The process comes down to solving a system of normal equations.

n·a0 + a1·Σx = Σy,
a0·Σx + a1·Σx² = Σxy.   (9.2)

This system of equations can be solved quite easily by the Gauss method. Recall that as a result of the solution, the values of the parameters a0 and a1 are found in our example. Thus, the found regression equation will have the following form:
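As a sketch of this step, the 2×2 system (9.2) can be solved by Gaussian elimination in a few lines of Python; the yield figures below are hypothetical, since Table 9.1 is not reproduced here:

```python
def solve_normal_system(n, sx, sxx, sy, sxy):
    """Solve the normal system for a linear trend y = a0 + a1*x:
        n*a0  + sx*a1  = sy
        sx*a0 + sxx*a1 = sxy
    by Gaussian elimination on the 2x2 system."""
    factor = sx / n                      # eliminate a0 from equation 2
    a1 = (sxy - factor * sy) / (sxx - factor * sx)
    a0 = (sy - sx * a1) / n              # back-substitute
    return a0, a1

# Hypothetical 5-year yield series (illustration only):
t = [1, 2, 3, 4, 5]
y = [10.0, 12.0, 11.0, 13.0, 14.0]
a0, a1 = solve_normal_system(len(t), sum(t),
                             sum(ti * ti for ti in t),
                             sum(y),
                             sum(ti * yi for ti, yi in zip(t, y)))
```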

The least squares method has many applications, since it allows an approximate representation of a given function by other, simpler ones. LSM can be extremely useful in processing observations, and it is actively used to estimate some quantities from the results of measurements of others containing random errors. In this article, you will learn how to implement least squares calculations in Excel.

Statement of the problem using a specific example

Suppose there are two indicators, X and Y, with Y depending on X. Since OLS interests us from the point of view of regression analysis (in Excel its methods are implemented using built-in functions), we should move directly to a specific problem.

So, let X be the retail space of a grocery store, measured in square meters, and Y be the annual turnover, measured in millions of rubles.

It is required to make a forecast of what turnover (Y) the store will have if it has this or that retail space. Obviously, the function Y = f (X) is increasing, since the hypermarket sells more goods than the stall.

A few words about the correctness of the initial data used for prediction

Let's say we have a table built using data for n stores.

According to mathematical statistics, the results will be more or less correct if data on at least 5–6 objects are examined. In addition, "anomalous" results cannot be used. In particular, a small elite boutique can have a turnover several times greater than the turnover of large "mass-market" retail outlets.

The essence of the method

The table data can be depicted on the Cartesian plane as points M1(x1, y1), …, Mn(xn, yn). Now the solution of the problem reduces to selecting an approximating function y = f(x) whose graph passes as close as possible to the points M1, M2, …, Mn.

Of course, you can use a high-degree polynomial, but this option is not only difficult to implement, but also simply incorrect, since it will not reflect the main trend that needs to be detected. The most reasonable solution is to search for the straight line y = ax + b, which best approximates the experimental data, or more precisely, the coefficients a and b.

Accuracy assessment

With any approximation, assessing its accuracy is of particular importance. Let us denote by e_i the difference (deviation) between the functional and experimental values at the point x_i, i.e., e_i = y_i − f(x_i).

Obviously, to assess the accuracy of the approximation one could use the sum of the deviations: when choosing a straight line for an approximate representation of the dependence of Y on X, preference would go to the line with the smallest value of the sum of the e_i over all considered points. However, not everything is so simple, since along with positive deviations there will also be negative ones.

The issue can be resolved using the absolute values of the deviations or their squares. The latter approach is the most widely used. It is applied in many areas, including regression analysis (implemented in Excel by two built-in functions), and has long proven its effectiveness.

Least square method

Excel, as you know, has a built-in AutoSum feature that computes the sum of the values in a selected range. Thus, nothing prevents us from calculating the value of the expression (e1² + e2² + e3² + … + en²).

In mathematical notation this looks like:

S = Σ e_i² = Σ (y_i − f(x_i))².

Since the decision was initially made to approximate using a straight line f(x) = a·x + b, the task of finding the line that best describes the specific dependence of the quantities X and Y comes down to calculating the minimum of a function of two variables:

S(a, b) = Σ (y_i − (a·x_i + b))² → min.

To do this, equate the partial derivatives with respect to the variables a and b to zero and solve an elementary system of two equations in two unknowns of the form:

∂S/∂a = −2·Σx_i·(y_i − a·x_i − b) = 0,  ∂S/∂b = −2·Σ(y_i − a·x_i − b) = 0.

After simple transformations, including division by 2 and manipulation of the sums, we get:

a·Σx_i² + b·Σx_i = Σx_i·y_i,  a·Σx_i + b·n = Σy_i.

Solving it, for example by Cramer's method, we obtain a stationary point with certain coefficients a* and b*. This is the minimum; that is, to predict what turnover a store will have for a given area, the straight line y = a*·x + b* is suitable, which is the regression model for the example in question. Of course, it will not give the exact result, but it will help assess whether buying particular premises on store credit will pay off.
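The reasoning above can be checked numerically. In the sketch below (plain Python) the (area, turnover) figures are hypothetical, and the in-code assertions confirm that perturbing a* or b* only increases the sum of squares, i.e., the stationary point is indeed a minimum:

```python
def sse(a, b, xs, ys):
    """S(a, b) = sum of squared deviations for the line y = a*x + b."""
    return sum((a * xi + b - yi) ** 2 for xi, yi in zip(xs, ys))

# Hypothetical (area in m^2, turnover in mln rub.) data, illustration only:
xs = [20, 40, 60, 80]
ys = [1.5, 2.5, 3.0, 4.0]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
a_star = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b_star = (sy - a_star * sx) / n

# Any perturbation of the stationary point increases S:
s0 = sse(a_star, b_star, xs, ys)
assert s0 <= sse(a_star + 0.01, b_star, xs, ys)
assert s0 <= sse(a_star, b_star + 0.1, xs, ys)
```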

How to Implement Least Squares in Excel

Excel has a function for calculating values using the least squares method. It has the form TREND(known Y values; known X values; new X values; constant). Let us apply the OLS calculation formula in Excel to our table.

To do this, enter the "=" sign in the cell in which the result of the least squares calculation should be displayed and select the TREND function. In the window that opens, fill in the appropriate fields, specifying:

  • the range of known Y values (in this case, the turnover data);
  • the range of known X values x1, …, xn, i.e., the sizes of the retail space;
  • the new x values for which the size of the turnover must be found (for their location on the worksheet, see below).

In addition, the formula contains the logical variable Const. If 0 (FALSE) is entered in the corresponding field, the calculation is carried out assuming that b = 0; if 1 (TRUE), the constant b is calculated in the usual way.

If you need a forecast for more than one x value, then after entering the formula you should not press "Enter" but the combination "Ctrl" + "Shift" + "Enter", which enters it as an array formula.

Some features

Regression analysis can be accessible even to beginners. The Excel formula for predicting the values of an array of unknown variables, TREND, can be used even by those who have never heard of the least squares method. It is enough to know some features of how it works. In particular:

  • If the range of known values of the variable y is arranged in one row or column, then each row (column) of known x values is treated by the program as a separate variable.
  • If a range of known x values is not indicated in the TREND window, the function treats it as an array of consecutive integers whose count matches the range of given y values.
  • To output an array of "predicted" values, the expression for calculating the trend must be entered as an array formula.
  • If new x values are not specified, the TREND function considers them equal to the known ones; if those are not specified either, the array 1; 2; 3; 4; … is taken, commensurate with the range of the given y values.
  • The range containing the new x values must have the same number or more rows or columns as the range containing the given y values. In other words, it must be commensurate with the independent variables.
  • An array of known x values can contain several variables. If only one is involved, the ranges of the given x and y values must be commensurate; in the case of several variables, the range of given y values must fit in one column or one row.

FORECAST function

Forecasting in Excel is implemented with several functions. One of them is FORECAST. It is similar to TREND, i.e., it returns the result of a least squares calculation, but only for a single X for which the Y value is unknown.

Now you know formulas in Excel for dummies that allow you to predict the future value of a particular indicator according to a linear trend.

Example.

Experimental data on the values of the variables x and y are given in the table.

As a result of their alignment, the function is obtained.

Using the least squares method, approximate these data by the linear dependence y = ax + b (find the parameters a and b). Find out which of the two lines better (in the sense of the least squares method) aligns the experimental data. Make a drawing.

The essence of the least squares method (LSM).

The task is to find the coefficients a and b of the linear dependence at which the function of two variables

F(a, b) = Σ (y_i − (a·x_i + b))²

takes the smallest value. That is, with these a and b, the sum of squared deviations of the experimental data from the found straight line will be the smallest. This is the whole point of the least squares method.

Thus, solving the example comes down to finding the extremum of a function of two variables.

Deriving formulas for finding coefficients.

A system of two equations in two unknowns is compiled and solved. We find the partial derivatives of the function F(a, b) with respect to the variables a and b and equate these derivatives to zero.

We solve the resulting system of equations by any method (for example, by substitution or by Cramer's method) and obtain the formulas for finding the coefficients by the least squares method (LSM):

a = (n·Σx_i·y_i − Σx_i·Σy_i) / (n·Σx_i² − (Σx_i)²),  b = (Σy_i − a·Σx_i) / n.

With these a and b, the function F(a, b) takes the smallest value. A proof of this fact is given below, at the end of the page.

That is the whole least squares method. The formula for finding the parameter a contains the sums Σx_i, Σy_i, Σx_i·y_i, Σx_i², and the parameter n, the amount of experimental data. We recommend calculating these sums separately. The coefficient b is found after calculating a.

It's time to remember the original example.

Solution.

In our example, n = 5. We fill in the table for the convenience of calculating the sums that enter the formulas for the required coefficients.

The values ​​in the fourth row of the table are obtained by multiplying the values ​​of the 2nd row by the values ​​of the 3rd row for each number i.

The values ​​in the fifth row of the table are obtained by squaring the values ​​in the 2nd row for each number i.

The values ​​in the last column of the table are the sums of the values ​​across the rows.

We use the least squares formulas to find the coefficients a and b, substituting the corresponding values from the last column of the table into them:

Hence, y = 0.165x + 2.184 is the desired approximating straight line.

It remains to find out which of the two lines, y = 0.165x + 2.184 or the function obtained earlier, better approximates the original data, i.e., to make an estimate using the least squares method.

Error estimation of the least squares method.

To do this, calculate the sum of squared deviations of the original data from each of the two lines; the smaller value corresponds to the line that better approximates the original data in the sense of the least squares method.

Since its sum of squared deviations is smaller, the straight line y = 0.165x + 2.184 approximates the original data better.

Graphic illustration of the least squares (LS) method.

Everything is clearly visible on the graph: the red line is the found straight line y = 0.165x + 2.184, the blue line is the function obtained earlier, and the pink dots are the original data.

In practice, when modeling various processes (in particular, economic, physical, technical, and social ones), one or another method of calculating approximate values of functions from their known values at certain fixed points is widely used.

This kind of function approximation problem often arises:

    when constructing approximate formulas for calculating the values of characteristic quantities of the process under study from tabular data obtained as a result of an experiment;

    in numerical integration, differentiation, the solution of differential equations, etc.;

    when it is necessary to calculate the values of functions at intermediate points of the interval considered;

    when determining the values of characteristic quantities of a process outside the interval considered, in particular when forecasting.

If, to model a certain process specified by a table, we construct a function that approximately describes the process on the basis of the least squares method, it is called an approximating function (regression), and the task of constructing approximating functions is called an approximation problem.

This article discusses the capabilities of the MS Excel package for solving this type of problem, in addition, it provides methods and techniques for constructing (creating) regressions for tabulated functions (which is the basis of regression analysis).

Excel has two options for building regressions.

    Adding selected regressions (trend lines) to a chart built from the data table for the process characteristic under study (available only if a chart has been constructed);

    Using the built-in statistical functions of the Excel worksheet, allowing you to obtain regressions (trend lines) directly from the source data table.

Adding trend lines to a chart

For a table of data that describes a process and is represented by a diagram, Excel has an effective regression analysis tool that allows you to:

    build on the basis of the least squares method and add five types of regressions to the diagram, which model the process under study with varying degrees of accuracy;

    add the constructed regression equation to the diagram;

    determine the degree of correspondence of the selected regression to the data displayed on the chart.

Based on chart data, Excel allows you to obtain linear, polynomial, logarithmic, power, exponential types of regressions, which are specified by the equation:

y = y(x)

where x is the independent variable, which often takes the values of a sequence of natural numbers (1; 2; 3; …) and represents, for example, the time count of the process under study (the characteristic).

1 . Linear regression is good for modeling characteristics whose values ​​increase or decrease at a constant rate. This is the simplest model to construct for the process under study. It is constructed in accordance with the equation:

y = mx + b

where m is the tangent of the angle of inclination of the regression line to the abscissa axis, and b is the ordinate of the point where the regression line crosses the ordinate axis.

2 . A polynomial trend line is useful for describing characteristics that have several distinct extremes (maxima and minima). The choice of polynomial degree is determined by the number of extrema of the characteristic under study. Thus, a second-degree polynomial can well describe a process that has only one maximum or minimum; polynomial of the third degree - no more than two extrema; polynomial of the fourth degree - no more than three extrema, etc.

In this case, the trend line is constructed in accordance with the equation:

y = c0 + c1·x + c2·x² + c3·x³ + c4·x⁴ + c5·x⁵ + c6·x⁶

where coefficients c0, c1, c2,... c6 are constants whose values ​​are determined during construction.

3 . The logarithmic trend line is successfully used when modeling characteristics whose values ​​initially change rapidly and then gradually stabilize.

y = c ln(x) + b

4 . A power trend line gives good results if the values of the relationship under study are characterized by a constant change in the growth rate. An example of such a dependence is the graph of the uniformly accelerated motion of a car. If the data contain zero or negative values, a power trend line cannot be used.

Constructed in accordance with the equation:

y = c·x^b

where coefficients b, c are constants.

5 . An exponential trend line should be used when the rate of change in the data is continuously increasing. For data containing zero or negative values, this type of approximation is also not applicable.

Constructed in accordance with the equation:

y = c·e^(b·x)

where coefficients b, c are constants.
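Outside Excel, an exponential trend of this form is commonly fitted by linearizing with logarithms, which is also how spreadsheet tools typically handle it; a sketch in Python (illustrative names, synthetic data):

```python
import math

def exp_trend(x, y):
    """Fit y = c * exp(b*x) by taking logs: ln(y) = ln(c) + b*x,
    then applying linear least squares to the pairs (x, ln(y)).
    Requires strictly positive y values, as noted in the text."""
    ly = [math.log(yi) for yi in y]
    n = len(x)
    sx, sl = sum(x), sum(ly)
    sxx = sum(xi * xi for xi in x)
    sxl = sum(xi * li for xi, li in zip(x, ly))
    b = (n * sxl - sx * sl) / (n * sxx - sx * sx)
    c = math.exp((sl - b * sx) / n)
    return c, b

# Data generated from y = 2*e^(0.5x) is recovered (up to rounding):
c, b = exp_trend([0, 1, 2, 3], [2 * math.exp(0.5 * i) for i in range(4)])
```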

When a trend line is selected, Excel automatically calculates the value of R², which characterizes the reliability of the approximation: the closer R² is to unity, the more reliably the trend line approximates the process under study. If necessary, the R² value can always be displayed on the chart.

It is determined by the formula:

R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²,

where y_i are the source data, ŷ_i the calculated trend values, and ȳ the mean of the source data.

To add a trend line to a data series:

    activate the chart built on the data series, i.e., click within the chart area: the Chart item will appear in the main menu;

    after clicking on this item, a menu will appear in which you should select the Add Trendline command.

The same actions are easily performed by hovering the mouse pointer over the graph of one of the data series and right-clicking; in the context menu that appears, select the Add Trendline command. The Trendline dialog box will appear with the Type tab open (Fig. 1).

After this you need:

Select the required trendline type on the Type tab (Linear is selected by default). For the Polynomial type, specify the degree of the chosen polynomial in the Degree field.

1 . The Built on series field lists all the data series of the chart in question. To add a trendline to a specific data series, select its name in the Built on series field.

If necessary, by going to the Parameters tab (Fig. 2), you can set the following parameters for the trend line:

    change the name of the trend line in the Name of the approximating (smoothed) curve field;

    set the number of periods (forward or backward) for the forecast in the Forecast field;

    display the trendline equation in the chart area by enabling the show equation on chart checkbox;

    display the approximation reliability value R² in the chart area by enabling the Place the approximation reliability value (R^2) on the chart checkbox;

    set the point of intersection of the trend line with the Y axis by enabling the corresponding checkbox;

    click the OK button to close the dialog box.

In order to start editing an already drawn trend line, there are three ways:

    use the Selected trend line command from the Format menu, having previously selected the trend line;

    select the Format trend line command from the context menu, which is called up by right-clicking on the trend line;

    double click on the trend line.

The Format Trendline dialog box will appear on the screen (Fig. 3), containing three tabs: View, Type, and Parameters; the contents of the last two completely coincide with the similar tabs of the Trendline dialog box (Figs. 1 and 2). On the View tab, you can set the line type, its color, and its thickness.

To delete a trend line that has already been drawn, select the trend line to be deleted and press the Delete key.

The advantages of the considered regression analysis tool are:

    the relative ease of constructing a trend line on charts without creating a data table for it;

    a fairly wide list of types of proposed trend lines, and this list includes the most commonly used types of regression;

    the ability to predict the behavior of the process under study by an arbitrary (within the limits of common sense) number of steps forward and also backward;

    the ability to obtain the trend line equation in analytical form;

    the possibility, if necessary, of obtaining an assessment of the reliability of the approximation.

The disadvantages include the following:

    the construction of a trend line is carried out only if there is a diagram built on a series of data;

    the process of generating data series for the characteristic under study from the trendline equations obtained for it is somewhat cluttered: the regression equations are updated with each change in the values of the original data series, but only within the chart area, whereas a data series generated from the old trendline equation remains unchanged;

    In PivotChart reports, changing the view of a chart or of the associated PivotTable report does not preserve existing trend lines, so before drawing trend lines or otherwise formatting a PivotChart report, you should make sure that the report layout is final.

Trend lines can be used to supplement data series presented on charts such as line charts, histograms, flat non-stacked area charts, bar charts, scatter charts, bubble charts, and stock charts.

You cannot add trend lines to data series in 3-D, stacked, radar, pie, and doughnut charts.

Using Excel's built-in functions

Excel also has a regression analysis tool for plotting trend lines outside the chart area. There are a number of statistical worksheet functions you can use for this purpose, but all of them only allow you to build linear or exponential regressions.

Excel has several functions for constructing linear regression, in particular:

    TREND;

    LINEST;

    SLOPE and INTERCEPT;

as well as several functions for constructing an exponential trend line, in particular:

    GROWTH;

    LGRFPRIBL (LOGEST in the English version of Excel).

It should be noted that the techniques for constructing regressions with the TREND and GROWTH functions are almost identical, and the same is true of the pair LINEST and LGRFPRIBL. For these four functions, creating a table of values relies on Excel array formulas, which somewhat complicates the process of building regressions. Note also that linear regression is, in our opinion, most easily constructed with the SLOPE and INTERCEPT functions, where the first determines the slope of the regression line and the second the segment it intercepts on the y-axis.
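To make the roles of SLOPE and INTERCEPT concrete, here is a minimal Python sketch of the same least-squares formulas those worksheet functions compute; the sample data are invented for illustration:

```python
def slope(ys, xs):
    """Least-squares slope m, analogous to Excel's SLOPE(known_ys, known_xs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def intercept(ys, xs):
    """Least-squares intercept b, analogous to Excel's INTERCEPT(known_ys, known_xs)."""
    return sum(ys) / len(ys) - slope(ys, xs) * sum(xs) / len(xs)

# Hypothetical data: x = period number, y = measured value
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, b = slope(ys, xs), intercept(ys, xs)
print(f"y = {m:.3f}*x + {b:.3f}")
```

Filling a column with `m * x + b` for each x then plays the same role as copying the formula =SLOPE(...)*A4+INTERCEPT(...) down a worksheet range.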

The advantages of the built-in functions tool for regression analysis are:

    a fairly simple, uniform process of generating data series of the characteristic under study for all built-in statistical functions that define trend lines;

    standard methodology for constructing trend lines based on generated data series;

    the ability to predict the behavior of the process under study by the required number of steps forward or backward.

The disadvantages include the fact that Excel does not have built-in functions for creating other (except linear and exponential) types of trend lines. This circumstance often does not allow choosing a sufficiently accurate model of the process under study, as well as obtaining forecasts that are close to reality. In addition, when using the TREND and GROWTH functions, the equations of the trend lines are not known.

It should be noted that the authors did not set out to present regression analysis with any degree of completeness. Their main task is to show, using specific examples, the capabilities of the Excel package in solving approximation problems, to demonstrate what effective tools Excel has for building regressions and forecasting, and to illustrate how such problems can be solved relatively easily even by a user without extensive knowledge of regression analysis.

Examples of solving specific problems

Let's look at solving specific problems using the listed Excel tools.

Problem 1

With a table of data on the profit of a motor transport enterprise for 1995-2002, you need to do the following:

    Build a diagram.

    Add linear and polynomial (quadratic and cubic) trend lines to the chart.

    Using the trend line equations, obtain tabular data on enterprise profits for each trend line for 1995-2004.

    Make a forecast for the enterprise's profit for 2003 and 2004.

The solution of the problem

    In the range of cells A4:C11 of the Excel worksheet, we enter the source data shown in Fig. 4.

    Having selected the range of cells B4:C11, we build a diagram.

    We activate the constructed diagram and, following the method described above, after selecting the trend line type in the Trend Line dialog box (see Fig. 1), alternately add linear, quadratic and cubic trend lines to the diagram. In the same dialog box, open the Parameters tab (see Fig. 2), enter the name of the added trend in the Name of the approximating (smoothed) curve field, and set the value 2 in the Forecast forward for: periods field, since the profit forecast is planned two years ahead. To display the regression equation and the approximation reliability value R2 in the diagram area, enable the checkboxes show equation on the chart and place the approximation reliability value (R^2) on the chart. For better visual perception, we change the type, color and thickness of the constructed trend lines using the View tab of the Trend Line Format dialog box (see Fig. 3). The resulting diagram with the added trend lines is shown in Fig. 5.

    To obtain tabular data on the enterprise's profit for each trend line for 1995-2004, let's use the trend line equations presented in Fig. 5. To do this, in the cells of the range D3:F3 we enter text information about the type of the selected trend line: Linear trend, Quadratic trend, Cubic trend. Next, we enter the linear regression formula in cell D4 and, using the fill handle, copy this formula with relative references to the cell range D5:D13. Note that each cell with the linear regression formula in the range D4:D13 has as its argument the corresponding cell from the range A4:A13. Similarly, for the quadratic regression we fill the range of cells E4:E13, and for the cubic regression the range F4:F13. Thus, a forecast of the enterprise's profit for 2003 and 2004 has been obtained using three trends. The resulting table of values is shown in Fig. 6.
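The same three polynomial fits can be reproduced outside Excel. The following Python sketch fits polynomials of degree 1 to 3 by the normal equations and forecasts two periods ahead; the profit figures here are invented for illustration, and the years are centered before fitting to keep the normal equations numerically stable:

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ..., cd] for
    y = c0 + c1*x + ... + cd*x^d, found by solving the normal equations."""
    n = degree + 1
    # Moment matrix A and right-hand side v of the normal equations
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    v = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        coeffs[i] = (v[i] - sum(A[i][j] * coeffs[j] for j in range(i + 1, n))) / A[i][i]
    return coeffs

def peval(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Hypothetical profit series for 1995-2002 (eight observations)
years = list(range(1995, 2003))
profit = [85, 81, 93, 101, 107, 112, 124, 130]
ts = [y - 1995 for y in years]  # center x values for numerical stability
for deg, name in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    c = polyfit(ts, profit, deg)
    forecast = [round(peval(c, y - 1995), 1) for y in (2003, 2004)]
    print(name, "forecast for 2003-2004:", forecast)
```

Excel's chart trendlines use the raw year values as x, which is why the displayed polynomial coefficients can look very large or very small; centering the x values avoids that without changing the fitted curve.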

Problem 2

With the table of data on the profit of a motor transport enterprise for 1995-2002, given in Problem 1, perform the following steps.

    Build a diagram.

    Add logarithmic, power and exponential trend lines to the chart.

    Derive the equations of the obtained trend lines, as well as the reliability values ​​of the approximation R2 for each of them.

    Using the trend line equations, obtain tabular data on the enterprise's profit for each trend line for 1995-2002.

    Make a forecast of the company's profit for 2003 and 2004 using these trend lines.

The solution of the problem

Following the methodology given in solving Problem 1, we obtain a diagram with logarithmic, power and exponential trend lines added to it (Fig. 7). Next, using the obtained trend line equations, we fill out a table of values of the enterprise's profit, including the predicted values for 2003 and 2004 (Fig. 8).

Figures 5 and 7 show that the model with the logarithmic trend has the lowest approximation reliability value, R2 = 0.8659.

The highest values of R2 correspond to the models with a polynomial trend: quadratic (R2 = 0.9263) and cubic (R2 = 0.933).

Problem 3

With the table of data on the profit of a motor transport enterprise for 1995-2002, given in Problem 1, perform the following steps.

    Obtain data series for the linear and exponential trend lines using the TREND and GROWTH functions.

    Using the TREND and GROWTH functions, make a forecast of the enterprise’s profit for 2003 and 2004.

    Construct a diagram for the original data and the resulting data series.

The solution of the problem

Let's use the worksheet for Problem 1 (see Fig. 4). Let's start with the TREND function:

    select the range of cells D4:D11, which should be filled with the values ​​of the TREND function corresponding to the known data on the profit of the enterprise;

    Call the Function command from the Insert menu. In the Function Wizard dialog box that appears, select the TREND function from the Statistical category, and then click the OK button. The same operation can be performed by clicking the Insert Function button on the standard toolbar.

    In the Function Arguments dialog box that appears, enter the range of cells C4:C11 in the Known_values_y field; in the Known_values_x field - the range of cells B4:B11;

    To make the entered formula become an array formula, use the key combination Ctrl + Shift + Enter.

The formula we entered in the formula bar will look like: {=TREND(C4:C11,B4:B11)}.

As a result, the range of cells D4:D11 is filled with the corresponding values ​​of the TREND function (Fig. 9).

To make a forecast of the enterprise's profit for 2003 and 2004, it is necessary to:

    select the range of cells D12:D13 where the values ​​predicted by the TREND function will be entered.

    call the TREND function and in the Function Arguments dialog box that appears, enter in the Known_values_y field - the range of cells C4:C11; in the Known_values_x field - the range of cells B4:B11; and in the New_values_x field - the range of cells B12:B13.

    turn this formula into an array formula using the key combination Ctrl + Shift + Enter.

    The entered formula will look like: {=TREND(C4:C11;B4:B11;B12:B13)}, and the range of cells D12:D13 will be filled with the predicted values of the TREND function (see Fig. 9).

The data series is similarly filled in using the GROWTH function, which is used in the analysis of nonlinear dependencies and works in exactly the same way as its linear counterpart TREND.
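The behavior of TREND (fit a least-squares line to the known data, then evaluate it at new x values) can be sketched in Python; the profit figures are the invented ones used for illustration here, not data from the article's worksheet:

```python
def trend(known_ys, known_xs, new_xs=None):
    """Rough Python analogue of Excel's TREND: fit y = m*x + b by least
    squares and return the fitted values at new_xs (or at known_xs)."""
    n = len(known_xs)
    sx, sy = sum(known_xs), sum(known_ys)
    sxx = sum(x * x for x in known_xs)
    sxy = sum(x * y for x, y in zip(known_xs, known_ys))
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    xs = new_xs if new_xs is not None else known_xs
    return [m * x + b for x in xs]

# Hypothetical profit data for 1995-2002; forecast two years ahead,
# as in the New_values_x argument of TREND
years = list(range(1995, 2003))
profit = [85, 81, 93, 101, 107, 112, 124, 130]
print(trend(profit, years, [2003, 2004]))
```

Passing no `new_xs` reproduces the first use of TREND above (filling D4:D11 with fitted values); passing the forecast years reproduces the second (filling D12:D13).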

Figure 10 shows the table in formula display mode.

For the initial data and the obtained data series, the diagram shown in Fig. 11 is constructed.

Problem 4

With the table of data on the receipt of applications for services by the dispatch service of a motor transport enterprise for the period from the 1st to the 11th of the current month, you must perform the following actions.

    Get data series for linear regression: using the SLOPE and INTERCEPT functions; using the LINEST function.

    Obtain a series of data for exponential regression using the LGRFPRIBL function.

    Using the above functions, make a forecast about the receipt of applications to the dispatch service for the period from the 12th to the 14th of the current month.

    Create a diagram for the original and received data series.

The solution of the problem

Note that, unlike the TREND and GROWTH functions, none of the functions listed above (SLOPE, INTERCEPT, LINEST, LGRFPRIBL) return the regression values themselves. These functions play only a supporting role, determining the necessary regression parameters.

For linear and exponential regressions built using the SLOPE, INTERCEPT, LINEST and LGRFPRIBL functions, the form of the regression equation is always known, in contrast to the linear and exponential regressions corresponding to the TREND and GROWTH functions.

1. Let's build a linear regression with the equation:

y = mx+b

using the SLOPE and INTERCEPT functions, with the regression slope m determined by the SLOPE function, and the free term b by the INTERCEPT function.

To do this, we carry out the following actions:

    enter the original table into the cell range A4:B14;

    the value of the parameter m will be determined in cell C19. Select the SLOPE function from the Statistical category; enter the range of cells B4:B14 in the known_values_y field and the range of cells A4:A14 in the known_values_x field. The formula entered in cell C19 will be: =SLOPE(B4:B14,A4:A14);

    Using a similar technique, the value of the parameter b is determined in cell D19, and its contents will look like: =INTERCEPT(B4:B14,A4:A14). Thus, the values of the parameters m and b required for constructing the linear regression will be stored in cells C19 and D19, respectively;

    Next, enter the linear regression formula in cell C4 in the form: =$C$19*A4+$D$19. In this formula, cells C19 and D19 are written with absolute references (the cell address must not change during copying). The absolute reference sign $ can be typed from the keyboard or with the F4 key after placing the cursor on the cell address. Using the fill handle, copy this formula into the range of cells C4:C17 and obtain the required data series (Fig. 12). Because the number of requests is an integer, set a number format with 0 decimal places on the Number tab of the Format Cells window.

2. Now let's build a linear regression given by the equation:

y = mx+b

using the LINEST function.

For this:

    Enter the LINEST function as an array formula in the cell range C20:D20: {=LINEST(B4:B14,A4:A14)}. As a result, we obtain the value of the parameter m in cell C20 and the value of the parameter b in cell D20;

    enter the formula in cell D4: =$C$20*A4+$D$20;

    copy this formula using the fill marker into the cell range D4:D17 and get the desired data series.

3. We build an exponential regression with the equation:

y = b·m^x

using the LGRFPRIBL function in a similar way:

    In the cell range C21:D21 we enter the LGRFPRIBL function as an array formula: {=LGRFPRIBL(B4:B14,A4:A14)}. In this case, the value of the parameter m will be determined in cell C21, and the value of the parameter b in cell D21;

    the formula =$D$21*$C$21^A4 is entered into cell E4;

    using the fill marker, this formula is copied to the range of cells E4:E17, where the data series for exponential regression will be located (see Fig. 12).
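The exponential fit y = b·m^x can be sketched in Python by taking logarithms, which turns it into an ordinary linear least-squares problem (Excel's LGRFPRIBL/LOGEST also works on the log-transformed data). The request counts below are invented for illustration:

```python
import math

def logest(known_ys, known_xs):
    """Rough analogue of Excel's LGRFPRIBL (LOGEST): fit y = b * m**x
    by ordinary least squares on log(y); all y values must be positive."""
    logy = [math.log(y) for y in known_ys]
    n = len(known_xs)
    sx, sy = sum(known_xs), sum(logy)
    sxx = sum(x * x for x in known_xs)
    sxy = sum(x * ly for x, ly in zip(known_xs, logy))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    inter = (sy - slope * sx) / n
    return math.exp(slope), math.exp(inter)  # (m, b)

# Hypothetical request counts for days 1-11
days = list(range(1, 12))
requests = [5, 6, 8, 9, 12, 14, 18, 21, 26, 31, 38]
m, b = logest(requests, days)
print(f"y = {b:.3f} * {m:.3f}**x")
print("forecast for days 12-14:", [round(b * m ** x) for x in (12, 13, 14)])
```

Evaluating `b * m**x` at each x plays the same role as the worksheet formula =$D$21*$C$21^A4 copied down the column.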

Figure 13 shows a table in which you can see the functions we used with the required cell ranges, as well as the formulas.

The quantity R2 is called the coefficient of determination.

The task of constructing a regression dependence is to find the vector of coefficients m of model (1) at which the coefficient R takes on the maximum value.

To assess the significance of R, Fisher's F test is used, calculated by the formula:

F = (R2 / (k − 1)) / ((1 − R2) / (n − k)),

where n is the sample size (the number of experiments) and k is the number of model coefficients.

If F exceeds the critical value for the given n and k at the accepted confidence probability, the value of R is considered significant. Tables of critical values of F are given in reference books on mathematical statistics.

Thus, the significance of R is determined not only by its value but also by the ratio between the number of experiments and the number of model coefficients (parameters). Indeed, the correlation ratio for n = 2 in a simple linear model equals 1 (a straight line can always be drawn through 2 points on a plane). However, if the experimental data are random variables, such a value of R should be trusted with great caution. Usually, to obtain a significant R and a reliable regression, one strives to ensure that the number of experiments significantly exceeds the number of model coefficients (n > k).
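This significance check is easy to sketch in Python. The snippet below uses the standard form F = (R2/(k − 1)) / ((1 − R2)/(n − k)), where k counts all model coefficients including the free term; the R2 value is the quadratic-trend figure quoted above, and the critical value is taken from standard F tables:

```python
def f_statistic(r2, n, k):
    """F statistic for the significance of R^2, with n observations and
    k model coefficients (intercept included)."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

# Quadratic trend from the text: R^2 = 0.9263, n = 8 observations,
# k = 3 coefficients; F tables give F_crit(2, 5) ≈ 5.79 at the 5% level
F = f_statistic(0.9263, 8, 3)
print(round(F, 1))  # 31.4 -- well above the critical value, so R is significant
```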

To build a linear regression model you need:

1) prepare a list of n rows and m columns containing the experimental data (the column containing the output value Y must be either first or last in the list). For example, let's take the data from the previous problem and add a column called "Period No.", numbering the periods from 1 to 12 (these will be the X values);

2) go to the menu Data/Data Analysis/Regression

If the "Data Analysis" item is missing from the menu, you should go to the "Add-Ins" item and enable the "Analysis package" (Analysis ToolPak) checkbox.

3) in the "Regression" dialog box, set:

· input interval Y;

· input interval X;

· output interval - the upper left cell of the interval in which the calculation results will be placed (it is recommended to place them on a new worksheet);

4) click "Ok" and analyze the results.

The least squares method is used to estimate the parameters of the regression equation.

One of the methods for studying stochastic relationships between characteristics is regression analysis.
Regression analysis is the derivation of a regression equation with the help of which the average value of a random variable (the result attribute) is found when the value of another variable or variables (factor attributes) is known. It includes the following steps:

  1. selection of the form of connection (type of analytical regression equation);
  2. estimation of equation parameters;
  3. assessment of the quality of the analytical regression equation.
Most often, a linear form is used to describe the statistical relationship between features. The focus on linear relationships is explained by the clear economic interpretation of their parameters, the limited variation of the variables, and the fact that in most cases nonlinear forms of relationship are converted (by taking logarithms or substituting variables) into a linear form to perform the calculations.
In the case of a linear pairwise relationship, the regression equation takes the form: y_i = a + b·x_i + u_i. The parameters a and b of this equation are estimated from the data of statistical observation of x and y. The result of such an estimation is the equation ŷ_i = â + b̂·x_i, where â and b̂ are estimates of the parameters a and b, and ŷ_i is the value of the result attribute (variable) obtained from the regression equation (the calculated value).

The least squares method (OLS) is most often used to estimate the parameters.
The least squares method provides the best (consistent, efficient and unbiased) estimates of the parameters of the regression equation, but only if certain assumptions about the random term (u) and the independent variable (x) are met (see the OLS assumptions).

The problem of estimating the parameters of a linear pair equation by the least squares method is as follows: to obtain estimates â, b̂ at which the sum of squared deviations of the actual values of the result attribute y_i from the calculated values ŷ_i is minimal.
Formally, the OLS criterion can be written as: S = Σ(y_i − ŷ_i)² → min.

Classification of least squares methods

  1. The least squares method itself (OLS).
  2. The maximum likelihood method (for the normal classical linear regression model, normality of the regression residuals is postulated).
  3. The generalized least squares method (GLS), used in the case of autocorrelation of errors and in the case of heteroscedasticity.
  4. The weighted least squares method (a special case of GLS with heteroscedastic residuals).

Let's illustrate the classical least squares method graphically. To do this, we construct a scatter plot from the observational data (x_i, y_i, i = 1..n) in a rectangular coordinate system (such a scatter plot is called a correlation field). Let's try to select the straight line closest to the points of the correlation field. According to the least squares method, the line is selected so that the sum of the squared vertical distances between the points of the correlation field and this line is minimal.

The mathematical notation of this problem: S(â, b̂) = Σ_{i=1..n} (y_i − â − b̂·x_i)² → min.
The values y_i and x_i, i = 1..n, are known to us: these are observational data. In the function S they are constants. The variables of this function are the required parameter estimates â and b̂. To find the minimum of a function of two variables, it is necessary to calculate the partial derivatives of this function with respect to each of the parameters and set them equal to zero: ∂S/∂â = 0, ∂S/∂b̂ = 0.
As a result, we obtain a system of 2 normal linear equations:

n·â + b̂·Σx_i = Σy_i,
â·Σx_i + b̂·Σx_i² = Σx_i·y_i.

Solving this system, we find the required parameter estimates:

b̂ = (n·Σx_i·y_i − Σx_i·Σy_i) / (n·Σx_i² − (Σx_i)²),
â = ȳ − b̂·x̄.

The correctness of the calculation of the parameters of the regression equation can be checked by comparing the sums Σy_i and Σŷ_i (some discrepancy is possible due to rounding in the calculations).
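The normal-equation solution and this check are straightforward to sketch in Python; the x and y series below are invented for illustration:

```python
def ols_pair(xs, ys):
    """Solve the two normal equations of pairwise linear regression
    y = a + b*x and return the estimates (a_hat, b_hat)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a_hat = (sy - b_hat * sx) / n
    return a_hat, b_hat

xs = [1, 2, 3, 4, 5]             # hypothetical factor attribute
ys = [3.0, 4.8, 7.1, 9.2, 10.9]  # hypothetical result attribute
a, b = ols_pair(xs, ys)
fitted = [a + b * x for x in xs]
# The check from the text: the sums of the actual and fitted y values coincide
print(abs(sum(ys) - sum(fitted)) < 1e-9)  # True
```

The equality of the two sums is an algebraic consequence of the first normal equation, so in exact arithmetic the discrepancy is zero; in a worksheet it only appears because of rounding.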
To calculate parameter estimates, you can build Table 1.
The sign of the regression coefficient b indicates the direction of the relationship (if b > 0, the relationship is direct; if b < 0, it is inverse). The value of b shows by how many units the result attribute y changes on average when the factor attribute x changes by 1 unit of its measurement.
Formally, the value of the parameter a is the average value of y when x equals zero. If the factor attribute does not and cannot have a zero value, then this interpretation of the parameter a makes no sense.

The closeness of the relationship between the characteristics is assessed using the linear pair correlation coefficient r_x,y. It can be calculated by the formula: r_x,y = cov(x, y) / (σ_x·σ_y). In addition, the linear pair correlation coefficient can be determined through the regression coefficient b: r_x,y = b·σ_x / σ_y.
The range of admissible values of the linear pair correlation coefficient is from −1 to +1. The sign of the correlation coefficient indicates the direction of the relationship: if r_x,y > 0, the relationship is direct; if r_x,y < 0, it is inverse.
If this coefficient is close to unity in absolute value, the relationship between the characteristics can be interpreted as a fairly close linear one. If its modulus equals one, |r_x,y| = 1, the relationship between the characteristics is functional linear. If the characteristics x and y are linearly independent, then r_x,y is close to 0.
To calculate r x,y, you can also use Table 1.

Table 1

No. of observation | x_i | y_i | x_i·y_i
1                  | x_1 | y_1 | x_1·y_1
2                  | x_2 | y_2 | x_2·y_2
...                | ... | ... | ...
n                  | x_n | y_n | x_n·y_n
Column sum         | ∑x  | ∑y  | ∑xy
Average value      | x̄   | ȳ   |
To assess the quality of the resulting regression equation, the theoretical coefficient of determination R²_yx is calculated:

R²_yx = d² / s²_y = 1 − e² / s²_y,

where d² is the variance of y explained by the regression equation;
e² is the residual (unexplained by the regression equation) variance of y;
s²_y is the total variance of y.
The coefficient of determination characterizes the proportion of the variation (dispersion) of the result attribute y explained by the regression (and hence by the factor x) in the total variation (dispersion) of y. The coefficient of determination R²_yx takes values from 0 to 1. Accordingly, the value 1 − R²_yx characterizes the proportion of the variance of y caused by the influence of other factors not taken into account in the model, and by specification errors.
With paired linear regression, R²_yx = r²_xy.
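The identity R² = r² for paired linear regression is easy to verify numerically; this Python sketch computes both quantities from their definitions on invented data:

```python
import math

def pearson_r(xs, ys):
    """Linear pair correlation coefficient r_xy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sdx * sdy)

def r_squared(xs, ys):
    """Coefficient of determination of the fitted pairwise linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.3, 9.8]
print(abs(r_squared(xs, ys) - pearson_r(xs, ys) ** 2) < 1e-9)  # True
```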
