3. CHECKING THE HYPOTHESIS ABOUT EQUALITY OF AVERAGES
This test is used to check whether the means of two indicators, each represented by a sample, differ significantly. There are three variants of the test: one for related (paired) samples, and two for unrelated samples (with equal and with unequal variances). If the samples are unrelated, you first need to test the hypothesis of equality of variances to determine which variant to use. Just as in the case of comparing variances, there are two ways to solve the problem, which we will consider using an example.
EXAMPLE 3. There are data on the number of sales of a product in two cities. At a significance level of 0.01, test the statistical hypothesis that the average number of sales differs between the cities.
23 | 25 | 23 | 22 | 23 | 24 | 28 | 16 | 18 | 23 | 29 | 26 | 31 | 19 |
22 | 28 | 26 | 26 | 35 | 20 | 27 | 28 | 28 | 26 | 22 | 29 |
We use the Data Analysis package. Depending on the situation, one of three tools is selected: “Paired two-sample t-test for means” for related samples, and “Two-sample t-test with equal variances” or “Two-sample t-test with unequal variances” for unrelated samples. Call the test with equal variances; in the window that opens, in the “Variable 1 Range” and “Variable 2 Range” fields, enter references to the data (A1-N1 and A2-L2, respectively); if the data have labels, check the “Labels” box (we have none, so the box is left unchecked). Next, enter the significance level 0.01 in the “Alpha” field. The “Hypothesized mean difference” field is left blank. In the “Output options” section, select “Output range” and, placing the cursor in the field that appears, click the left mouse button on cell B7; the results will be output starting from this cell. Clicking “OK” produces a table of results. Widen columns B, C and D by dragging the borders between columns B and C, C and D, D and E so that all the labels fit. The procedure displays the main sample characteristics, the t-statistic, its critical values, and the critical significance levels “P(T<=t) one-tailed” and “P(T<=t) two-tailed”. If the absolute value of the t-statistic is less than the critical value, the means are equal at the given significance level. In our case |−1.784242592| < 2.492159469, so the average number of sales does not differ significantly. Note that if the significance level α = 0.05 is taken instead, the conclusion of the study is quite different.
Two-sample t-test with equal variances |
||
Mean | 23,57142857 | 26,41666667 |
Variance | 17,34065934 | 15,35606061 |
Observations | 14 | 12 |
Pooled Variance | 16,43105159 | |
Hypothetical mean difference | 0 | |
df | 24 | |
t-statistic | -1,784242592 | |
P(T<=t) one-tailed | 0,043516846 |
t critical one-tailed | 2,492159469 |
P(T<=t) two-tailed | 0,087033692 |
t critical two-tailed | 2,796939498 |
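The same result can be reproduced outside Excel. Below is a minimal pure-Python sketch of the equal-variance (pooled) two-sample t-test for the data of Example 3, using only the standard library:

```python
from statistics import mean, variance

city1 = [23, 25, 23, 22, 23, 24, 28, 16, 18, 23, 29, 26, 31, 19]
city2 = [22, 28, 26, 26, 35, 20, 27, 28, 28, 26, 22, 29]
n1, n2 = len(city1), len(city2)

# Pooled variance, as in the "Pooled Variance" row of the output table
sp2 = ((n1 - 1) * variance(city1) + (n2 - 1) * variance(city2)) / (n1 + n2 - 2)

# t-statistic for H0: the two city means are equal
t = (mean(city1) - mean(city2)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5
```

This reproduces the table's values: sp2 ≈ 16.431 and t ≈ −1.7842; |t| is below the critical values shown above, so at α = 0.01 the means do not differ significantly.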
Laboratory work No. 3
PAIRED LINEAR REGRESSION
Goal: To master the methods of constructing a linear equation of paired regression using a computer, to learn how to obtain and analyze the main characteristics of the regression equation.
Let's consider the methodology for constructing a regression equation using an example.
EXAMPLE. Samples of factors x and y are given. Using these samples, find the linear regression equation ỹ = ax + b. Find the pair correlation coefficient. Check the regression model for adequacy at the significance level α = 0.05.
X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Y | 6,7 | 6,3 | 4,4 | 9,5 | 5,2 | 4,3 | 7,7 | 7,1 | 7,1 | 7,9 |
To find the coefficients a and b of the regression equation, use the SLOPE and INTERCEPT functions from the “Statistical” category. We enter the label “a=” in A5 and the SLOPE function in the adjacent cell B5; place the cursor in the “Known_y's” field and set a reference to cells B2-K2 by dragging over them with the mouse, and in the “Known_x's” field a reference to B1-K1. The result is 0.14303. Let us now find the coefficient b. We enter the label “b=” in A6, and in B6 the INTERCEPT function with the same arguments as the SLOPE function. The result is 5.976364. Therefore, the linear regression equation is y = 0.14303x + 5.976364.
Let's plot the regression line. To do this, in the third row of the table we enter the values of the regression function at the given points X (first row). To obtain these values, use the TREND function of the “Statistical” category. We enter the label “Y(X)” in A3 and, placing the cursor in B3, call the TREND function. In the “Known_y's” and “Known_x's” fields we give references to B2-K2 and B1-K1; in the “New_x's” field we also enter a reference to B1-K1. In the “Const” field enter 1 if the regression equation has the form y = ax + b, and 0 if y = ax; in our case we enter 1. TREND is an array function, so to display all its values, select the area B3-K3 and press F2 and Ctrl+Shift+Enter. The result is the values of the regression equation at the given points. Now we build the chart. Place the cursor in any free cell, call the chart wizard, select the “XY (Scatter)” category, the subtype “line without points” (in the lower right corner), click “Next”, enter the reference to B3-K3 in the “Range” field, go to the “Series” tab, enter the reference to B1-K1 in the “X values” field, and click “Finish”. The result is the regression line. Let's see how the plots of the experimental data and of the regression equation differ. Place the cursor in any free cell, call the chart wizard, category “Line”, subtype “line with points” (second from the top left), click “Next”, in the “Range” field enter a reference to the second and third rows B2-K3, go to the “Series” tab, in the “X-axis labels” field enter the reference to B1-K1, and click “Finish”. The result is two lines (blue for the original data, red for the regression equation), and it can be seen how they differ.
a= | 0,14303 |
b= | 5,976364 |
To calculate the correlation coefficient r, use the PEARSON function. We arrange the charts so that they are located above row 25, and in A25 we make the label “Correlation”; in B25 we call the PEARSON function, entering in its “Array 1” and “Array 2” fields references to the source data B1-K1 and B2-K2. The result is 0.265207. The coefficient of determination R² is the square of the correlation coefficient r. In A26 we write the label “Determination”, and in B26 the formula “=B25*B25”. The result is 0.070335.
However, Excel has one function that calculates all the basic characteristics of a linear regression at once: the LINEST function. Place the cursor in B28 and call the LINEST function, category “Statistical”. In the “Known_y's” and “Known_x's” fields we give references to B2-K2 and B1-K1. The “Const” field has the same meaning as in the TREND function; in our case it is equal to 1. The “Stats” field must contain 1 if complete regression statistics are to be displayed; in our case we put 1 there. The function returns an array of 2 columns and 5 rows. After entering it, select cells B28-C32 with the mouse and press F2 and Ctrl+Shift+Enter. The result is a table of values, the numbers in which have the following meaning:
Coefficient a | Coefficient b |
Standard error of a | Standard error of b |
Coefficient of determination R² | Standard error of y |
F-statistic | Degrees of freedom n−2 |
Regression sum of squares | Residual sum of squares |
0,14303 | 5,976364 |
0,183849 | 0,981484 |
0,070335 | 1,669889 |
0,60525 | 8 |
1,687758 | 22,30824 |
Analysis of the result: the first row contains the coefficients of the regression equation; compare them with the values computed by the SLOPE and INTERCEPT functions. The second row contains the standard errors of the coefficients; if one of them is greater in absolute value than the coefficient itself, the coefficient is considered insignificant (zero). The coefficient of determination characterizes the quality of the relationship between the factors. The resulting value of 0.070335 indicates a very weak relationship between the factors. The F-statistic tests the hypothesis of the adequacy of the regression model. This number must be compared with the critical value; to obtain it, we enter the label “F-critical” in E33, and in F33 the FINV function, whose arguments are, respectively, “0.05” (significance level), “1” (number of factors X) and “8” (degrees of freedom).
F-critical | 5,317655 |
It can be seen that the F-statistic is less than F-critical, which means the regression model is not adequate. The last row shows the regression sum of squares (1.687758) and the residual sum of squares (22.30824). It is important that the regression sum (explained by the regression) be much larger than the residual sum (not explained by the regression, caused by random factors). In our case this condition is not met, which indicates a poor regression.
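All of these characteristics can be cross-checked without Excel. A short pure-Python sketch computing the slope, intercept, R² and F-statistic for this sample:

```python
x = list(range(10))                      # x = 0..9
y = [6.7, 6.3, 4.4, 9.5, 5.2, 4.3, 7.7, 7.1, 7.1, 7.9]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

a = sxy / sxx                            # slope (cf. SLOPE)
b = my - a * mx                          # intercept (cf. INTERCEPT)

ss_tot = sum((yi - my) ** 2 for yi in y)
ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
r2 = 1 - ss_res / ss_tot                 # coefficient of determination
F = r2 / (1 - r2) * (n - 2)              # F-statistic with (1, n-2) df
```

It reproduces the LINEST table: a = 0.14303, b = 5.976364, R² = 0.070335 and F = 0.60525, and F < F-critical = 5.317655 confirms the inadequacy of the model.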
Conclusion: In the course of my work, I mastered the methods of constructing a linear equation of pair regression using a computer, learned to obtain and analyze the main characteristics of the regression equation.
Laboratory work No. 4
NONLINEAR REGRESSION
Goal: to master methods for constructing the main types of nonlinear paired regression equations using a computer (intrinsically linear models), and to learn to obtain and analyze quality indicators of regression equations.
Let's consider the case when nonlinear models can be reduced to linear ones by transforming the data (intrinsically linear models).
EXAMPLE. Construct a regression equation y = f(x) for the sample x_i, y_i (i = 1, 2, …, 10). As f(x), consider four types of functions: linear, power, exponential and hyperbolic:
y = Ax + B; y = Ax^B; y = Ae^(Bx); y = A/x + B.
It is necessary to find their coefficients A and B, and after comparing the quality indicators, select the function that best describes the dependence.
Profit Y | 0,3 | 1,2 | 2,8 | 5,2 | 8,1 | 11,0 | 16,8 | 16,9 | 24,7 | 29,4 |
Profit X | 0,25 | 0,50 | 0,75 | 1,00 | 1,25 | 1,50 | 1,75 | 2,00 | 2,25 | 2,50 |
Let's enter the data into the table along with the labels (cells A1-K2). Leave three rows free below the table for the transformed data; select the first five rows by dragging along the left gray border over the numbers 1 to 5 and choose a light color (yellow or pink) for the cell background. Next, starting from A6, we display the linear regression parameters. To do this, write “Linear” in cell A6 and enter the LINEST function in the adjacent cell B6. In the “Known_y's” and “Known_x's” fields we give references to B1-K1 and B2-K2 respectively; the next two fields take the value 1. Next, select the area extending 5 rows down and 2 columns across and press F2 and Ctrl+Shift+Enter. The result is a table of regression parameters, of which the coefficient of determination (first column, third from the top) is of greatest interest; in our case it equals R1 = 0.951262. The value of the F-statistic, which allows checking the adequacy of the model, is F1 = 156.1439 (fourth row, first column). The regression equation is y = 12.96x − 6.18 (coefficients a and b are in cells B6 and C6).
Linear | 12,96 | -6,18 |
1,037152 | 1,60884 | |
0,951262 | 2,355101 | |
156,1439 | 8 | |
866,052 | 44,372 |
Let us determine similar characteristics for the other regressions and, by comparing the coefficients of determination, find the best regression model. Consider hyperbolic regression. To obtain it, we transform the data: in the third row, in cell A3 we enter the label “1/x” and in cell B3 the formula “=1/B2”, and autofill this formula across B3-K3. Now obtain the characteristics of the regression model. In cell A12 we enter the label “Hyperbola”, and in the adjacent B12 the LINEST function. In the “Known_y's” and “Known_x's” fields we give references to B1-K1 and to the transformed argument data B3-K3; the next two fields take the value 1. Next, select the area 5 rows down and 2 columns across and press F2 and Ctrl+Shift+Enter. We get a table of regression parameters. The coefficient of determination in this case is R2 = 0.475661, which is much worse than for linear regression. The F-statistic is F2 = 7.257293. The regression equation is y = −6.25453/x + 18.96772.
Hyperbola | -6,25453 | 18,96772 |
2,321705 | 3,655951 | |
0,475661 | 7,724727 | |
7,257293 | 8 | |
433,0528 | 477,3712 |
Let's consider exponential regression y = a·e^(bx). Taking logarithms linearizes it to the equation ỹ = ãx + b̃, where ỹ = ln y, ã = b, b̃ = ln a. It can be seen that a data transformation is needed: replace y with ln y. Place the cursor in cell A4 and make the heading “ln y”. Place the cursor in B4 and enter the LN formula (category “Mathematical”), giving a reference to B1 as the argument. Using autofill, extend the formula along the fourth row to cells B4-K4. Next, in cell F6 we set the label “Exponential” and in the adjacent G6 we enter the LINEST function, whose arguments are the transformed data B4-K4 (in the “Known_y's” field), with the remaining fields the same as in the linear case (B2-K2, 1, 1). Next, select cells G6-H10 and press F2 and Ctrl+Shift+Enter. The result is R3 = 0.89079, F3 = 65.25304, which indicates a good regression. To find the coefficients of the regression equation (a = e^b̃, b = ã), put the cursor in J6 and make the heading “a=”, and in the neighboring K6 the formula “=EXP(H6)”; in J7 we give the heading “b=”, and in K7 the formula “=G6”. The regression equation is y = 0.511707·e^(1.824212x).
Exponential | 1,824212 | -0,67 | a= | 0,511707 | |
0,225827 | 0,350304 | b= | 1,824212 | ||
0,89079 | 0,512793 | ||||
65,25304 | 8 | ||||
17,15871 | 2,103652 |
Let's consider power regression y = a·x^b. Taking logarithms linearizes it to the equation ỹ = ãx̃ + b̃, where ỹ = ln y, x̃ = ln x, ã = b, b̃ = ln a. It can be seen that the data must be transformed: replace y with ln y and x with ln x. We already have the row with ln y; let's transform the x values. In cell A5 we write the label “ln x”, and in cell B5 we enter the LN formula (category “Mathematical”) with a reference to B2 as the argument. Using autofill, extend the formula along the fifth row to cells B5-K5. Next, in cell F12 we set the label “Power” and in the adjacent G12 we enter the LINEST function, whose arguments are the transformed data B4-K4 (in the “Known_y's” field) and B5-K5 (in the “Known_x's” field), with the remaining fields equal to 1. Next, select cells G12-H16 and press F2 and Ctrl+Shift+Enter. The result is R4 = 0.997716, F4 = 3494.117, which indicates a very good regression. To find the coefficients of the regression equation (a = e^b̃, b = ã), put the cursor in J12 and make the heading “a=”, and in the neighboring K12 the formula “=EXP(H12)”; in J13 we give the heading “b=”, and in K13 the formula “=G12”. The regression equation is y = 4.90767·x^1.993512.
Power | 1,993512 | 1,590799 | a= | 4,90767 | |
0,033725 | 0,023823 | b= | 1,993512 | ||
0,997716 | 0,074163 | ||||
3494,117 | 8 | ||||
19,21836 | 0,044002 |
Let's check whether all the equations describe the data adequately. To do this, we need to compare the F-statistic of each model with the critical value. To obtain it, we enter the label “F-critical” in A21, and in B21 the FINV function, whose arguments are, respectively, “0.05” (significance level), “1” (degrees of freedom 1 = the number of factors X) and “8” (degrees of freedom 2 = n − 2). The result is 5.317655. The F-statistic of the linear model, 156.1439, is greater than F-critical, which means the model is adequate. The remaining regressions are also adequate. To determine which model describes the data best, we compare the coefficients of determination R1, R2, R3, R4. The largest is R4 = 0.997716. This means that the experimental data are best described by the power model y = 4.90767·x^1.993512.
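All four fits can be cross-checked with a short pure-Python sketch: each nonlinear model is fitted by ordinary least squares on transformed data, exactly as LINEST is applied above, and R² is computed in the transformed coordinates:

```python
from math import log, exp

x = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5]
y = [0.3, 1.2, 2.8, 5.2, 8.1, 11.0, 16.8, 16.9, 24.7, 29.4]

def linfit(u, v):
    """Least-squares slope a, intercept b and R^2 for v ~ a*u + b."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((ui - mu) ** 2 for ui in u)
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    svv = sum((vi - mv) ** 2 for vi in v)
    a = suv / suu
    return a, mv - a * mu, suv * suv / (suu * svv)

models = {
    "linear":      linfit(x, y),                                  # y = a*x + b
    "hyperbola":   linfit([1 / xi for xi in x], y),               # y = a/x + b
    "exponential": linfit(x, [log(yi) for yi in y]),              # ln y = b*x + ln a
    "power":       linfit([log(xi) for xi in x],
                          [log(yi) for yi in y]),                 # ln y = b*ln x + ln a
}
best = max(models, key=lambda name: models[name][2])

# For the power model: A = e^intercept, B = slope of the log-log fit
A_pow, B_pow = exp(models["power"][1]), models["power"][0]
```

Note that in the log-linear fits the model coefficient b is the slope of the transformed fit itself (1.8242 for the exponential, 1.9935 for the power model), while a = e^intercept; the comparison of R² values again selects the power model.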
Conclusion: In the course of my work, I mastered methods for constructing the main types of nonlinear pairwise regression equations using a computer (internal linear models), learned to obtain and analyze quality indicators of regression equations.
Y | 0,3 | 1,2 | 2,8 | 5,2 | 8,1 | 11 | 16,8 | 16,9 | 24,7 | 29,4 |
X | 0,25 | 0,5 | 0,75 | 1 | 1,25 | 1,5 | 1,75 | 2 | 2,25 | 2,5 |
1/x | 4 | 2 | 1,333333 | 1 | 0,8 | 0,666667 | 0,571429 | 0,5 | 0,444444 | 0,4 |
ln y | -1,20397 | 0,182322 | 1,029619 | 1,648659 | 2,0918641 | 2,397895 | 2,821379 | 2,827314 | 3,206803 | 3,380995 |
ln x | -1,38629 | -0,69315 | -0,28768 | 0 | 0,2231436 | 0,405465 | 0,559616 | 0,693147 | 0,81093 | 0,916291 |
Linear | 12,96 | -6,18 | Exponential | 1,824212 | -0,67 | a= | 0,511707 |||
1,037152 | 1,60884 | 0,225827 | 0,350304 | b= | 1,824212 |||||
0,951262 | 2,355101 | 0,89079 | 0,512793 | |||||||
156,1439 | 8 | 65,25304 | 8 | |||||||
866,052 | 44,372 | 17,15871 | 2,103652 | |||||||
Hyperbola | -6,25453 | 18,96772 | Power | 1,993512 | 1,590799 | a= | 4,90767 | |||
2,321705 | 3,655951 | 0,033725 | 0,023823 | b= | 1,993512 |||||
0,475661 | 7,724727 | 0,997716 | 0,074163 | |||||||
7,257293 | 8 | 3494,117 | 8 | |||||||
433,0528 | 477,3712 | 19,21836 | 0,044002 | |||||||
F - critical | 5,317655 | |||||||||
Laboratory work No. 5
POLYNOMIAL REGRESSION
Purpose: Using experimental data, construct a regression equation of the form y = ax² + bx + c.
PROGRESS:
The dependence of the yield of a certain crop y on the amount of mineral fertilizers applied to the soil x is considered. It is assumed that this dependence is quadratic. It is necessary to find a regression equation of the form ỹ = ax² + bx + c.
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
y | 29,8 | 58,8 | 72,2 | 101,5 | 141 | 135,1 | 156,6 | 181,7 | 216,6 | 208,2 |
Let's enter this data into the spreadsheet along with labels in cells A1-K2 and build a graph. To do this, select the Y data (cells B2-K2), call the chart wizard, select the chart type “Line”, subtype “line with points” (second from the top left), click “Next”, go to the “Series” tab, in the “X-axis labels” field make a reference to B1-K1, and click “Finish”. The graph can be approximated by a polynomial of degree 2, y = ax² + bx + c. To find the coefficients a, b, c, we need to solve the system of normal equations:
a·Σx⁴ + b·Σx³ + c·Σx² = Σx²y
a·Σx³ + b·Σx² + c·Σx = Σxy
a·Σx² + b·Σx + c·n = Σy
Let's calculate the sums. To do this, enter the label “X^2” in cell A3 and the formula “=B1*B1” in cell B3, and extend it over the entire row B3-K3 using autofill. In cell A4 we enter the label “X^3”, and in B4 the formula “=B1*B3”, autofilled across the row. In cell A5 we enter “X^4”, and in B5 the formula “=B4*B1”, autofilled across the row. In cell A6 we enter “X*Y”, and in B6 the formula “=B2*B1”, autofilled across the row. In cell A7 we enter “X^2*Y”, and in B7 the formula “=B3*B2”, autofilled across the row. Now we compute the sums. Highlight column L with a different color by clicking on its header and selecting a color. Place the cursor in cell L1 and click the AutoSum button with the ∑ icon to compute the sum of the first row. Using autofill, extend the formula to cells L1-L7.
Now we solve the system of equations. To do this, we enter the main matrix of the system. In cell A13 we enter the label “A=”, and in the matrix cells B13-D15 we enter the references shown in the table (the bottom-right entry is n = 10, the number of data points):
B | C | D |
13 | =L5 | =L4 | =L3 |
14 | =L4 | =L3 | =L1 |
15 | =L3 | =L1 | 10 |
We also enter the right-hand side of the system of equations. In G13 we enter the label “B=”, and in H13-H15 we enter, respectively, references to cells “=L7”, “=L6”, “=L2”. We solve the system by the matrix method: from linear algebra it is known that the solution equals A⁻¹B. Find the inverse matrix. To do this, enter the label “A inv.” in cell J13 and, placing the cursor in K13, set the MINVERSE formula (category “Mathematical”). As the Array argument, give a reference to cells B13:D15. The result is a 3×3 matrix; to obtain it, select cells K13-M15 with the mouse and press F2 and Ctrl+Shift+Enter. The result is the matrix A⁻¹. Let us now find the product of this matrix and the column B (cells H13-H15). We enter the label “Coefficients” in cell A18 and in B18 set the MMULT function (category “Mathematical”). In the “Array 1” field give a reference to the matrix A⁻¹ (cells K13-M15), and in the “Array 2” field a reference to the column B (cells H13-H15). Next, select B18-B20 and press F2 and Ctrl+Shift+Enter. The resulting array contains the coefficients a, b, c of the regression equation. As a result, we obtain a regression equation of the form: y = 1.201082x² + 5.619177x + 78.48095.
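The matrix step can be cross-checked without Excel. Note that the bottom-right entry of the matrix A is n, the number of observations; with the ten data points of this example it equals 10, whereas the worksheet tables below carry the value 9, which distorts the resulting coefficients. A minimal pure-Python sketch, with Gaussian elimination standing in for MINVERSE/MMULT:

```python
# Cross-check of the normal equations for y = a*x^2 + b*x + c.
x = list(range(10))
y = [29.8, 58.8, 72.2, 101.5, 141, 135.1, 156.6, 181.7, 216.6, 208.2]

n = len(x)
s = lambda p: sum(xi ** p for xi in x)                 # sum of x^p
sy = sum(y)                                            # sum of y
sxy = sum(xi * yi for xi, yi in zip(x, y))             # sum of x*y
sx2y = sum(xi * xi * yi for xi, yi in zip(x, y))       # sum of x^2*y

# The 3x3 system A * [a, b, c]^T = rhs; bottom-right entry is n = 10
A = [[s(4), s(3), s(2)],
     [s(3), s(2), s(1)],
     [s(2), s(1), n]]
rhs = [sx2y, sxy, sy]

for i in range(3):                                     # forward elimination
    for j in range(i + 1, 3):
        f = A[j][i] / A[i][i]
        A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
        rhs[j] -= f * rhs[i]

coef = [0.0, 0.0, 0.0]
for i in (2, 1, 0):                                    # back substitution
    coef[i] = (rhs[i] - sum(A[i][k] * coef[k] for k in range(i + 1, 3))) / A[i][i]
a, b, c = coef
```

With n = 10 the least-squares solution comes out as a ≈ −0.5826, b ≈ 25.953, c ≈ 29.965, so the worksheet's coefficient and regression rows should be recomputed accordingly.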
Let's plot the original data and the values obtained from the regression equation. To do this, enter the label “Regression” in cell A8 and the formula “=$B$18*B3+$B$19*B1+$B$20” in B8. Using autofill, extend the formula to cells B8-K8. To build the chart, select cells B8-K8 and, holding down the Ctrl key, also select cells B2-K2. Call the chart wizard, select the chart type “Line”, subtype “line with points” (second from the top left), click “Next”, go to the “Series” tab, in the “X-axis labels” field make a reference to B1-K1, and click “Finish”. It can be seen how closely the two curves agree.
CONCLUSION: in the process of work, based on experimental data, I learned to construct a regression equation of the form y = ax 2 + bx + c.
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |||
y | 29,8 | 58,8 | 72,2 | 101,5 | 141 | 135,1 | 156,6 | 181,7 | 216,6 | 208,2 | |||
X^2 | 0 | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 81 | |||
X^3 | 0 | 1 | 8 | 27 | 64 | 125 | 216 | 343 | 512 | 729 | |||
X^4 | 0 | 1 | 16 | 81 | 256 | 625 | 1296 | 2401 | 4096 | 6561 | |||
X*Y | 0 | 58,8 | 144,4 | 304,5 | 564 | 675,5 | 939,6 | 1271,9 | 1732,8 | 1873,8 | |||
X^2*Y | 0 | 58,8 | 288,8 | 913,5 | 2256 | 3377,5 | 5637,6 | 8903,3 | 13862,4 | 16864,2 | |||
Regression. | 78,48095 | 85,30121 | 94,52364 | 106,1482 | 120,175 | 136,6039 | 155,435 | 176,6682 | 200,3036 | 226,3412 | |||
A= | 15333 | 2025 | 285 | B= | 52162,1 | A Arr. | 0,003247 | -0,03247 | 0,059524 | ||||
2025 | 285 | 45 | 7565,3 | -0,03247 | 0,341342 | -0,67857 | |||||||
285 | 45 | 9 | 1301,5 | 0,059524 | -0,67857 | 1,619048 | |||||||
Coefficients | 1,201082 | a |
| 5,619177 | b |
| 78,48095 | c |
November 5, 2012. Lecture 6. Comparing two samples. 6-1. Hypothesis of equality of means: paired samples. 6-2. Confidence interval for the difference in means: paired samples. 6-3. Hypothesis of equality of variances. 6-4. Hypothesis of equality of proportions. 6-5. Confidence interval for the difference in proportions.
Ivanov O.V., 2005. In this lecture... In the previous lecture we tested the hypothesis of equality of the means of two populations and constructed a confidence interval for the difference of means in the case of independent samples. Now we will consider the test of the hypothesis of equality of means and construct a confidence interval for the difference in means in the case of paired (dependent) samples. Then in section 6-3 the hypothesis of equality of variances will be tested, and in section 6-4 the hypothesis of equality of proportions. Finally, we construct a confidence interval for the difference in proportions.
Hypothesis of equality of means. Paired samples: statement of the problem, hypotheses and statistics, sequence of actions, example.
Paired samples. Description of the problem. What we have: 1. Two simple random samples obtained from two populations; the samples are paired (dependent). 2. Both samples have size n ≥ 30; if not, both samples are drawn from normally distributed populations. What we want: to test the hypothesis about the difference between the means of the two populations (H0: μd = 0).
Statistics for paired samples. To test the hypothesis, the statistic t = (d̄ − μd) / (sd / √n) is used, where d is the difference between the two values in one pair, μd is the population mean of the paired differences, d̄ is the sample mean of the paired differences, sd is the standard deviation of the differences in the sample, and n is the number of pairs.
Example. Training of students. A group of 15 students took a test before and after a training course; the test results are in a table (not reproduced here), giving the sums of the paired differences Σd = 21 and Σd² = 145. Let us test, at significance level 0.05, the hypothesis for paired samples that the training had no influence on the students' preparation. Solution: we work with the differences and their squares via these sums.
Solution. Step 1. State the main and alternative hypotheses: H0: μd = 0, H1: μd > 0. Step 2. The significance level α = 0.05 is set. Step 3. Using the table for df = 15 − 1 = 14, we find the critical value t = 2.145 and write the critical region: t > 2.145.
Step 4. The statistic takes the value t = d̄ / (sd / √n) ≈ 1.889. Step 5. Compare the obtained value with the critical region: 1.889 < 2.145, so the statistic does not fall into the critical region, and there is no reason to reject the main hypothesis.
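The computation can be reproduced from the two sums alone; a minimal sketch:

```python
from math import sqrt

n = 15
sum_d, sum_d2 = 21, 145                  # sums of d and d^2 from the slide's table

d_bar = sum_d / n                        # mean paired difference
s_d = sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))  # sample std of the differences
t = d_bar / (s_d / sqrt(n))              # paired t-statistic
```

This gives d̄ = 1.4, sd ≈ 2.874 and t ≈ 1.887, matching the slide's 1.889 up to rounding of sd; the value lies below the critical 2.145.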
Confidence interval for the difference in means. Paired samples: statement of the problem, method for constructing a confidence interval, example.
Description of the problem. What we have: two random paired (dependent) samples of size n from two populations. The populations have normal distributions with parameters μ1, σ1 and μ2, σ2, or the sizes of both samples are at least 30. What we want: to estimate the mean of the paired differences for the two populations. To do this, we construct a confidence interval for the mean of the form d̄ ± t(α, n−1)·sd/√n.
Hypothesis of equality of variances: statement of the problem, hypotheses and statistics, sequence of actions, example.
During a study, the researcher may need to check the assumption that the variances of the two populations being studied are equal. In the case where these populations have normal distributions, there is the F-test for this, also called Fisher's test. Unlike Student, Fisher did not work in a brewery.
Description of the problem. What we have: 1. Two simple random samples obtained from two normally distributed populations. 2. The samples are independent: there is no relationship between the subjects in the samples. What we want: to test the hypothesis of equality of the population variances, H0: σ1² = σ2².
Example. A medical researcher wants to check whether there is a difference between the heart rates (beats per minute) of smoking and non-smoking patients. The results of two randomly selected groups were recorded (the data table is not reproduced here). Using α = 0.05, find out whether the doctor is right.
Solution. Step 1. State the main and alternative hypotheses: H0: σ1² = σ2², H1: σ1² ≠ σ2². Step 2. The significance level α = 0.05 is set. Step 3. Using the table for 25 degrees of freedom in the numerator and 17 in the denominator, we find the critical value f = 2.19 and the critical region: f > 2.19. Step 4. Using the samples, we calculate the value of the statistic.
Hypothesis of equality of proportions: statement of the problem, hypotheses and statistics, sequence of actions, example.
Question. Of 100 randomly selected students of the sociology faculty, 43 attend special courses. Of 200 randomly selected economics students, 90 attend special courses. Does the proportion of students attending special courses differ between the sociology and economics faculties? It does not seem to differ significantly, but how can we check this? The proportion attending special courses is the proportion of a characteristic: 43 is the number of “successes”, and 43/100 is the proportion of successes. The terminology is the same as in the Bernoulli scheme.
Description of the problem. What we have: 1. Two simple random samples obtained from two populations; the samples are independent. 2. For each sample, np ≥ 5 and nq ≥ 5 hold: at least 5 elements of the sample have the studied characteristic and at least 5 do not. What we want: to test the hypothesis of equality of the proportions of the characteristic in the two populations, H0: p1 = p2.
Example. Special courses of two faculties. Of 100 randomly selected students of the sociology faculty, 43 attend special courses. Of 200 economics students, 90 attend special courses. At significance level α = 0.05, test the hypothesis that there is no difference between the proportions of students attending special courses in these two faculties. Solution. Step 1. State the main and alternative hypotheses: H0: p1 = p2, H1: p1 ≠ p2. Step 2. The significance level α = 0.05 is set. Step 3. Using the normal distribution table, we find the critical values z = −1.96 and z = 1.96 and construct the critical region: z < −1.96 or z > 1.96. Step 4. Based on the samples, we calculate the value of the statistic.
Step 5. Compare the obtained value with the critical region. The resulting value of the statistic did not fall into the critical region. Step 6. Formulate the conclusion: there is no reason to reject the main hypothesis; the proportions of those attending special courses do not differ statistically significantly.
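For this example the statistic is the standard two-proportion z with a pooled proportion; a minimal sketch:

```python
from math import sqrt

x1, n1 = 43, 100                          # sociology: successes, sample size
x2, n2 = 90, 200                          # economics: successes, sample size

p1, p2 = x1 / n1, x2 / n2
p = (x1 + x2) / (n1 + n2)                 # pooled proportion of successes
z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
```

Here z ≈ −0.33, well inside (−1.96, 1.96), which confirms the conclusion of the slide.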
Confidence interval for the difference in proportions: statement of the problem, method for constructing a confidence interval, example.
Consider two independent samples x1, x2, …, xn and y1, y2, …, ym, drawn from normal populations with equal variances, with sample sizes n and m respectively; the means μx, μy and the variance σ² are unknown. It is required to test the main hypothesis H0: μx = μy against the alternative H1: μx ≠ μy.
As is known, the sample means have the following properties: x̄ ~ N(μx, σ²/n), ȳ ~ N(μy, σ²/m).
Their difference is a normal random variable with mean μx − μy and variance σ²(1/n + 1/m), so
x̄ − ȳ ~ N(μx − μy, σ²(1/n + 1/m)). (23)
Let us assume for a moment that the main hypothesis H0 is correct: μx − μy = 0. Then, dividing the difference x̄ − ȳ by its standard deviation, we obtain a standard normal random variable ξ = (x̄ − ȳ) / (σ√(1/n + 1/m)) ~ N(0, 1).
It was previously noted that the quantity Qx/σ² = Σ(xi − x̄)²/σ² is distributed according to the χ² law with n − 1 degrees of freedom, and Qy/σ² = Σ(yi − ȳ)²/σ² according to the χ² law with m − 1 degrees of freedom. Taking into account the independence of these two sums, their total Q/σ² = (Qx + Qy)/σ² is distributed according to the χ² law with n + m − 2 degrees of freedom.
Recalling step 7, we see that the fraction Z = ξ / √(Q / (σ²(n + m − 2))) obeys the t-distribution (Student) with ν = n + m − 2 degrees of freedom. This fact holds only when the hypothesis H0 is true.
Replacing ξ and Q by their expressions, we obtain the expanded formula for Z (the unknown σ cancels):
Z = (x̄ − ȳ) / √( (Qx + Qy)/(n + m − 2) · (1/n + 1/m) ). (24)
This quantity Z, called the test statistic, allows one to make a decision by the following sequence of actions:
1. The region D = [−t(β,ν), +t(β,ν)] is established, containing a fraction β = 1 − α of the area under the t(ν) distribution curve (Table 10).
2. The experimental value Z_obs of the statistic Z is calculated by formula (24), substituting the values xi and yi of the specific samples, as well as their sample means x̄ and ȳ.
3. If Z_obs ∈ D, the hypothesis H0 is considered not to contradict the experimental data and is accepted.
If Z_obs ∉ D, the hypothesis H1 is accepted.
If the hypothesis H0 is correct, then Z obeys the known t(ν)-distribution with zero mean and falls into D, the region of acceptance of H0, with high probability β = 1 − α. So when the observed, experimental value Z_obs falls into D, we regard this as evidence in favor of the hypothesis H0.
When Z_obs lies outside D (as they say, lies in the critical region K), which is natural if the hypothesis H1 is true but unlikely if H0 is true, we can only reject the hypothesis H0 and accept H1.
Example 31.
Two grades of gasoline, A and B, are compared. On 11 vehicles of the same power, each grade was tested once on a circular route. One car broke down en route, so for it there is no data on gasoline B.
Gasoline consumption per 100 km
Table 12
i   | 1     | 2     | 3    | 4    | 5    | 6     | 7     | 8    | 9    | 10   | 11   |      |
X_i | 10,51 | 11,86 | 10,5 | 9,1  | 9,21 | 10,74 | 10,75 | 10,3 | 11,3 | 11,8 | 10,9 | n=11 |
Y_i | 13,22 | 13,0  | 11,5 | 10,4 | 11,8 | 11,6  | 10,64 | 12,3 | 11,1 | 11,6 | -    | m=10 |
The variance in the consumption of gasoline grades A and B is unknown and is assumed to be the same. Is it possible, at a significance level of α=0.05, to accept the hypothesis that the true average costs μ A and μ B of these types of gasoline are the same?
Solution. Testing the hypothesis H_0: μ_A − μ_B = 0 against the competing hypothesis H_1: μ_A ≠ μ_B, we do the following:
1. Find the sample means and the sum of squared deviations Q:
x̄ = 116.97/11 ≈ 10.63; ȳ = 117.16/10 ≈ 11.72;
Q = Q_A + Q_B ≈ 7.93 + 7.60 = 15.53.
2. Calculate the experimental value of the Z statistic by formula (24):
Z_obs = (x̄ − ȳ) / √( 15.53/19 · (1/11 + 1/10) ) ≈ −2.74.
3. From Table 10 of the t-distribution we find the limit t_β,ν for ν = m + n − 2 = 19 degrees of freedom and β = 1 − α = 0.95. Table 10 has t_0.95,20 = 2.09 and t_0.95,15 = 2.13, but not t_0.95,19. By interpolation we find t_0.95,19 = 2.09 + (2.13 − 2.09)·(20 − 19)/(20 − 15) ≈ 2.10.
4. Check which of the two regions, D or K, contains the number Z_obs. Here Z_obs ≈ −2.74 and D = [−2.10; +2.10], so Z_obs ∉ D.
Since the observed value Z_obs lies in the critical region K = R\D, we reject H_0 and accept the hypothesis H_1. In this case the difference between x̄ and ȳ is said to be significant. If, under all the conditions of this example, only Q had changed, say Q had doubled, then our conclusion would have changed: doubling Q decreases Z_obs by a factor of √2 (to about −1.94), and the number Z_obs would then fall into the admissible region D, so that the hypothesis H_0 would stand the test and be accepted. In that case the discrepancy between x̄ and ȳ would be explained by the natural scatter of the data, and not by the fact that μ_A ≠ μ_B.
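The arithmetic of steps 1–4 can be checked with a short script. This is a minimal sketch in plain Python: the data are from Table 12, formula (24) is the pooled two-sample t-statistic, and the variable names are ours.

```python
import math

# Gasoline consumption per 100 km (Table 12)
x = [10.51, 11.86, 10.5, 9.1, 9.21, 10.74, 10.75, 10.3, 11.3, 11.8, 10.9]
y = [13.22, 13.0, 11.5, 10.4, 11.8, 11.6, 10.64, 12.3, 11.1, 11.6]
n, m = len(x), len(y)

x_bar = sum(x) / n
y_bar = sum(y) / m
# Sums of squared deviations Q_A and Q_B
Q = sum((xi - x_bar) ** 2 for xi in x) + sum((yi - y_bar) ** 2 for yi in y)

# Formula (24): Z = (x_bar - y_bar) / sqrt(Q/(n+m-2) * (1/n + 1/m))
z_obs = (x_bar - y_bar) / math.sqrt(Q / (n + m - 2) * (1 / n + 1 / m))

print(round(x_bar, 2), round(y_bar, 2))  # sample means, about 10.63 and 11.72
print(round(z_obs, 2))                   # about -2.74
print(abs(z_obs) > 2.10)                 # True: Z_obs falls in the critical region
```

Running this reproduces the observed statistic of about −2.74, which lies outside D = [−2.10; +2.10], so H_0 is rejected exactly as in the text.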
The theory of hypothesis testing is very extensive; hypotheses can concern the type of distribution law, the homogeneity of samples, the independence of random variables, etc.
THE χ² CRITERION (PEARSON)
This is the most common criterion in practice for testing a simple hypothesis about a distribution law. It applies when the distribution law is unknown. Consider a random variable X over which n independent trials are performed, yielding the realization x_1, x_2, …, x_n. It is necessary to test the hypothesis about the distribution law of this random variable.
Let us consider the case of a simple hypothesis. A simple hypothesis tests the fit of the sample to a fully specified (known) distribution, for example a normal one. From the sample we build the variation series x_(1), x_(2), …, x_(n). We divide the interval of observed values into subintervals; let there be r such intervals D_i. Then we find the probability p_i that X, as a result of a trial, falls into the interval D_i, i = 1, …, r, if the hypothesis being tested is true.
The criterion does not check the truth of the probability density itself, but the truth of the numbers p_i = P(X ∈ D_i), i = 1, …, r.
With each interval D_i we associate a random event A_i, a hit in this interval (as a result of a trial, the realization of X lands in D_i). Let us introduce random variables: m_i is the number of trials out of the n conducted in which the event A_i occurred. The m_i are distributed according to the binomial law, and if the hypothesis is true,
Mm_i = np_i, Dm_i = np_i(1 − p_i).
The criterion χ² has the form
χ² = Σ_{i=1}^{r} (m_i − np_i)² / (np_i),
where
p_1 + p_2 + … + p_r = 1,
m_1 + m_2 + … + m_r = n.
If the hypothesis being tested is correct, then m_i represents the frequency of occurrence of an event that has probability p_i in each of the n trials; therefore, we can consider m_i as a random variable subject to the binomial law, centered at the point np_i. When n is large, we can assume that each frequency is asymptotically normally distributed with the same parameters. If the hypothesis is correct, we should expect the deviations m_i − np_i, i = 1, …, r, to be asymptotically normal,
interconnected by the relationship Σ_{i=1}^{r} (m_i − np_i) = 0.
As a measure of the discrepancy between the sample frequencies m_1, m_2, …, m_r and the theoretical ones np_1, np_2, …, np_r, consider the value
χ² = Σ_{i=1}^{r} (m_i − np_i)² / (np_i).
χ² is a sum of squares of asymptotically normal quantities connected by a linear dependence. We have previously encountered a similar case and know that the presence of one linear connection leads to a decrease in the number of degrees of freedom by one.
If the hypothesis being tested is correct, then the statistic χ² has a distribution that tends, as n → ∞, to the χ² distribution with r − 1 degrees of freedom.
Now assume that the hypothesis is false. Then the terms of the sum tend to increase, i.e. if the hypothesis is incorrect, the sum falls into a region of large values of χ². As the critical region we therefore take the region of large positive values of the statistic, K = {χ² > χ²_{α, r−1}}.
In the case of unknown distribution parameters, each parameter estimated from the sample reduces the number of degrees of freedom of the Pearson criterion by one: if k parameters are estimated, the number of degrees of freedom is r − 1 − k.
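As a minimal sketch of the criterion for a fully specified (simple) hypothesis, the following tests whether a die is fair. The observed counts are invented purely for illustration; all p_i = 1/6 are given in advance, so no parameters are estimated and the statistic has r − 1 = 5 degrees of freedom. The critical value χ²(0.05; 5) = 11.07 is the standard tabulated quantile.

```python
# Pearson chi-square goodness-of-fit test for a simple hypothesis: a fair die.
observed = [18, 22, 16, 25, 20, 19]   # m_1 ... m_r, sum = n (invented counts)
p = [1 / 6] * 6                       # hypothesized p_1 ... p_r, sum = 1
n = sum(observed)

# chi2 = sum over intervals of (m_i - n p_i)^2 / (n p_i)
chi2_obs = sum((m - n * pi) ** 2 / (n * pi) for m, pi in zip(observed, p))

chi2_crit = 11.07                     # tabulated chi^2 quantile, alpha = 0.05, 5 d.o.f.
print(round(chi2_obs, 2))             # about 2.5
print(chi2_obs < chi2_crit)           # True: the fairness hypothesis is not rejected
```

Since 2.5 < 11.07, the observed frequencies are compatible with the hypothesized probabilities at the 0.05 level.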
8.1. The concept of dependent and independent samples.
The choice of a criterion for testing a hypothesis
is determined primarily by whether the samples under consideration are dependent or independent. Let us introduce the corresponding definitions.
Def. Samples are called independent if the procedure for selecting units into the first sample is in no way connected with the procedure for selecting units into the second sample.
An example of two independent samples would be the samples discussed above of men and women working at the same enterprise (in the same industry, etc.).
Note that the independence of two samples by no means removes the requirement of a certain similarity between them (their homogeneity). Thus, when studying the income level of men and women, we would hardly allow a situation where the men are selected from among Moscow businessmen and the women from the aborigines of Australia. The women should also be Muscovites and, moreover, "businesswomen." But here we are talking not about the dependence of samples, but about the requirement of homogeneity of the studied population of objects, which must be satisfied both when collecting and when analyzing sociological data.
Def. Samples are called dependent, or paired, if each unit of one sample is "linked" to a specific unit of the second sample.
This last definition will probably become clearer if we give an example of dependent samples.
Suppose we want to find out whether the social status of the father is, on average, lower than the social status of the son (we assume that this complex and ambiguously understood social characteristic of a person can be measured). It seems obvious that in such a situation it is advisable to select pairs of respondents (father, son) and assume that each element of the first sample (one of the fathers) is "tied" to a certain element of the second sample (his son). These two samples will be called dependent.
8.2. Hypothesis testing for independent samples
For independent samples, the choice of criterion depends on whether we know the general variances σ_1² and σ_2² of the characteristic under consideration for the samples being studied. We will consider this problem solved, assuming that the sample variances coincide with the general ones. In this case the criterion is the value:
z = (x̄_1 − x̄_2) / √(σ_1²/n_1 + σ_2²/n_2). (8.1)
Before moving on to discussing the situation when the general variances (or at least one of them) are unknown to us, we note the following.
The logic for using criterion (8.1) is similar to that described when considering the "chi-square" criterion (7.2). There is only one fundamental difference. Speaking about the meaning of criterion (7.2), we considered an infinite number of samples of size n "drawn" from our population. Here, analyzing the meaning of criterion (8.1), we move on to considering an infinite number of pairs of samples of sizes n_1 and n_2. For each pair, a statistic of the form (8.1) is calculated. The totality of the obtained values of such statistics corresponds, in accordance with our notation, to a normal distribution (as we agreed, the letter z is used to denote a criterion to which the normal distribution corresponds).
So, if the general variances are unknown to us, then we are forced to use their sample estimates s_1² and s_2² instead. However, in this case the normal distribution should be replaced by the Student distribution: z should be replaced by t (as was done in a similar situation when constructing a confidence interval for the mathematical expectation). However, with sufficiently large sample sizes (n_1, n_2 ≥ 30), as we already know, the Student distribution practically coincides with the normal one. In other words, for large samples we can continue to use the criterion:
z = (x̄_1 − x̄_2) / √(s_1²/n_1 + s_2²/n_2). (8.2)
The situation is more complicated when the variances are unknown and the size of at least one sample is small. Then another factor comes into play. The type of criterion depends on whether we can consider the unknown variances of the characteristic under consideration in the two analyzed samples to be equal. To find out, we need to test the hypothesis:
H_0: σ_1² = σ_2². (8.3)
To test this hypothesis, the following criterion is used:
F = s_1²/s_2² (the larger of the two sample variances is placed in the numerator). (8.4)
We will discuss the specifics of using this criterion below; for now let us continue discussing the algorithm for selecting a criterion used to test hypotheses about the equality of mathematical expectations.
If hypothesis (8.3) is rejected, then the criterion of interest to us takes the form:
t = (x̄_1 − x̄_2) / √(s_1²/n_1 + s_2²/n_2) (8.5)
(i.e., it differs from criterion (8.2), which was used for large samples, in that the corresponding statistic has not a normal but a Student distribution). If hypothesis (8.3) is accepted, then the type of criterion used changes:
t = (x̄_1 − x̄_2) / ( √( ((n_1 − 1)s_1² + (n_2 − 1)s_2²) / (n_1 + n_2 − 2) ) · √(1/n_1 + 1/n_2) ) (8.6)
Let us summarize how a criterion is selected to test the hypothesis about the equality of general mathematical expectations based on the analysis of two independent samples:
- the general variances are known: criterion (8.1);
- the variances are unknown, the sample sizes are large: criterion (8.2);
- the variances are unknown, at least one sample is small, and the hypothesis H_0: σ_1² = σ_2² is rejected: criterion (8.5);
- the variances are unknown, at least one sample is small, and the hypothesis H_0: σ_1² = σ_2² is accepted: criterion (8.6).
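For small samples with unknown variances, the choice between forms (8.5) and (8.6) can be sketched as a small function. This is only an illustration: the function name and structure are ours, not from the text.

```python
import math

def two_sample_statistic(x, y, equal_variances):
    """Two-sample statistic for independent samples with unknown variances."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    s1_sq = sum((v - m1) ** 2 for v in x) / (n1 - 1)  # corrected variances
    s2_sq = sum((v - m2) ** 2 for v in y) / (n2 - 1)
    if equal_variances:
        # form (8.6): pooled variance, n1 + n2 - 2 degrees of freedom
        pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
        return (m1 - m2) / (math.sqrt(pooled) * math.sqrt(1 / n1 + 1 / n2))
    # form (8.5): the sample variances are kept separate
    return (m1 - m2) / math.sqrt(s1_sq / n1 + s2_sq / n2)

# Toy data, equal variances assumed
print(round(two_sample_statistic([1, 2, 3], [2, 3, 4], True), 3))  # about -1.225
```

The resulting value is then compared with the Student quantile for the corresponding number of degrees of freedom, exactly as in the decision rule above.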
8.3. Hypothesis testing for dependent samples
Let us move on to considering dependent samples. Let the sequences of numbers
X_1, X_2, …, X_n;
Y_1, Y_2, …, Y_n
be the values of the random variable under consideration for the elements of the two dependent samples. Let us introduce the notation:
D_i = X_i − Y_i, i = 1, …, n.
For dependent samples, the criterion that allows testing the hypothesis about the equality of the means looks as follows:
t_{n−1} = D̄ / (s_D / √n), where D̄ = (1/n)·Σ D_i and s_D = √( (Σ D_i² − (Σ D_i)²/n) / (n − 1) ).
Note that the expression just given for s_D is nothing more than a new form of the familiar formula for the standard deviation, in this case the standard deviation of the values D_i. Such a formula is often used in practice as a simpler method of calculating the variance (compared to the "head-on" computation of the sum of squared deviations of the values from their arithmetic mean).
If we compare the above formulas with those used when discussing the principles of constructing a confidence interval, it is easy to notice that testing the hypothesis of equality of means for dependent samples is essentially testing whether the mathematical expectation of the values D_i equals zero. The quantity
s_D / √n
is the standard deviation of D̄. Therefore, the value of the criterion t_{n−1} just described is essentially D̄ expressed as a fraction of its standard deviation. As we said above (when discussing methods for constructing confidence intervals), this indicator can be used to judge the probability of the observed value of D̄. The difference is that above we were talking about a simple, normally distributed arithmetic mean, while here we are talking about a mean of differences, and such means have a Student distribution. But the reasoning connecting the probability of a deviation of the sample mean from zero (when the mathematical expectation equals zero) with the number of units s that this deviation amounts to remains in force.
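The paired criterion can be sketched as follows. The "status score" numbers below are invented purely for illustration of the father/son setup; only the standard library is used.

```python
import math

# Invented paired observations: one score per father and per his son
x = [72, 68, 75, 80, 66, 77, 70]   # fathers' status scores
y = [74, 71, 74, 85, 70, 79, 73]   # paired sons' status scores
n = len(x)

d = [xi - yi for xi, yi in zip(x, y)]           # differences D_i
d_bar = sum(d) / n
s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))

# t with n - 1 degrees of freedom: mean difference in units of its standard error
t_obs = d_bar / (s_d / math.sqrt(n))
print(round(t_obs, 2))             # about -3.58
print(abs(t_obs) > 2.447)          # True: exceeds t(6; 0.025) = 2.447
```

Here |t_obs| exceeds the Student quantile for 6 degrees of freedom at α = 0.05, so for these invented data the hypothesis of equal means would be rejected.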
Example. The income of pharmacies in one of the city's microdistricts for a certain period amounted to 128; 192; 223; 398; 205; 266; 219; 260; 264; 98 (conventional units). In the neighboring microdistrict for the same time they were equal to 286; 240; 263; 266; 484; 223; 335.
For both samples, calculate the mean, the corrected variance, and the standard deviation. Find the range of variation, the average absolute (linear) deviation, the coefficient of variation, the linear coefficient of variation, and the oscillation coefficient.
Assuming that this random variable has a normal distribution, determine the confidence interval for the general mean (in both cases).
Using the Fisher criterion, check the hypothesis of equality of general variances. Using the Student's test, check the hypothesis about the equality of general means (the alternative hypothesis is about their inequality).
In all calculations, the significance level is α = 0.05.
We carry out the solution step by step, finishing with the test of the hypothesis of equality of variances.
1. Find the variation indicators for the first sample.
x | |x − x̄| | (x − x̄)² |
98 | 127.3 | 16205.29 |
128 | 97.3 | 9467.29 |
192 | 33.3 | 1108.89 |
205 | 20.3 | 412.09 |
219 | 6.3 | 39.69 |
223 | 2.3 | 5.29 |
260 | 34.7 | 1204.09 |
264 | 38.7 | 1497.69 |
266 | 40.7 | 1656.49 |
398 | 172.7 | 29825.29 |
2253 | 573.6 | 61422.1 |
Simple arithmetic mean:
x̄ = Σx/n = 2253/10 = 225.3.
Variation indicators.
Range of variation:
R = X_max − X_min
R = 398 − 98 = 300.
Average linear deviation:
d̄ = Σ|x − x̄|/n = 573.6/10 = 57.36.
Each value of the series differs from the mean of 225.3 by 57.36 on average.
Dispersion:
D = Σ(x − x̄)²/n = 61422.1/10 = 6142.21.
Unbiased variance estimate (corrected variance):
s² = Σ(x − x̄)²/(n − 1) = 61422.1/9 ≈ 6824.68.
Standard deviation:
σ = √D = √6142.21 ≈ 78.37.
Each value of the series differs from the mean of 225.3 by 78.37 on average.
Estimate of the standard deviation:
s = √s² ≈ 82.61.
Coefficient of variation:
v = σ/x̄ = 78.37/225.3 ≈ 34.8%.
Since v > 30% but v < 70%, the variation is moderate.
Linear coefficient of variation:
d̄/x̄ = 57.36/225.3 ≈ 25.5%.
Oscillation coefficient:
R/x̄ = 300/225.3 ≈ 133.2%.
Confidence interval for the general mean: x̄ ± t(n−1; α/2)·s/√n.
Using the Student's table we find:
T_table(n−1; α/2) = T_table(9; 0.025) = 2.262,
225.3 ± 2.262·82.61/√10 = 225.3 ± 59.09, i.e. (166.21; 284.39).
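The indicators for the first sample can be recomputed with a short plain-Python sketch (variable names are ours; the tabulated value t(9; 0.025) = 2.262 is taken as given):

```python
import math

x = [98, 128, 192, 205, 219, 223, 260, 264, 266, 398]   # first microdistrict
n = len(x)

mean = sum(x) / n                                  # 225.3
R = max(x) - min(x)                                # range of variation, 300
d_lin = sum(abs(v - mean) for v in x) / n          # average linear deviation
D = sum((v - mean) ** 2 for v in x) / n            # dispersion
s_sq = sum((v - mean) ** 2 for v in x) / (n - 1)   # corrected variance
sigma = math.sqrt(D)                               # standard deviation
s = math.sqrt(s_sq)                                # its corrected estimate

# 95% confidence interval for the mean, t(9; 0.025) = 2.262
half = 2.262 * s / math.sqrt(n)
print(round(mean, 1), R, round(d_lin, 2), round(sigma, 2))
print(round(mean - half, 2), round(mean + half, 2))   # about (166.21, 284.39)
```

The printed values reproduce x̄ = 225.3, R = 300, d̄ = 57.36, σ ≈ 78.37 and the interval (166.21; 284.39) obtained above; the second sample is handled identically.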
2. Find the variation indicators for the second sample.
Let us rank the series by sorting its values in ascending order.
Table for calculating indicators.
x | |x − x̄| | (x − x̄)² |
223 | 76.57 | 5863.18 |
240 | 59.57 | 3548.76 |
263 | 36.57 | 1337.47 |
266 | 33.57 | 1127.04 |
286 | 13.57 | 184.18 |
335 | 35.43 | 1255.18 |
484 | 184.43 | 34013.9 |
2097 | 439.71 | 47329.71 |
To evaluate the distribution series, we find the following indicators.
Indicators of the distribution center.
Simple arithmetic mean:
x̄ = Σx/n = 2097/7 ≈ 299.57.
Variation indicators.
Absolute variations.
The range of variation is the difference between the maximum and minimum values of the primary series characteristic.
R = X max - X min
R = 484 - 223 = 261
Average linear deviation (calculated in order to take into account the differences of all units of the population under study):
d̄ = Σ|x − x̄|/n = 439.71/7 ≈ 62.82.
Each value of the series differs from the mean of 299.57 by 62.82 on average.
Dispersion (characterizes the measure of scatter around the average value, i.e. deviation from the mean):
D = Σ(x − x̄)²/n = 47329.71/7 ≈ 6761.39.
Unbiased variance estimate (a consistent estimate of the variance, the corrected variance):
s² = Σ(x − x̄)²/(n − 1) = 47329.71/6 ≈ 7888.29.
Standard deviation:
σ = √D = √6761.39 ≈ 82.23.
Each value of the series differs from the mean of 299.57 by 82.23 on average.
Estimate of the standard deviation:
s = √s² ≈ 88.82.
Relative Variation Measures.
Relative indicators of variation include: coefficient of oscillation, linear coefficient of variation, relative linear deviation.
The coefficient of variation (a measure of the relative dispersion of the population values: it shows what share of the average value is taken up by the average scatter):
v = σ/x̄ = 82.23/299.57 ≈ 27.4%.
Since v ≤ 30%, the population is homogeneous and the variation is weak. The results obtained can be trusted.
Linear coefficient of variation, or relative linear deviation (characterizes the share of the average absolute deviation in the average value of the characteristic):
d̄/x̄ = 62.82/299.57 ≈ 20.97%.
Oscillation coefficient (reflects the relative fluctuation of the extreme values of the characteristic around the average):
R/x̄ = 261/299.57 ≈ 87.1%.
Interval estimation of the population center.
Confidence interval for the general mean: x̄ ± t(n−1; α/2)·s/√n.
Using the Student's table we find:
T_table(n−1; α/2) = T_table(6; 0.025) = 2.447,
299.57 ± 2.447·88.82/√7 = 299.57 ± 82.14, i.e. (217.43; 381.71).
With a probability of 0.95, it can be stated that the average value with a larger sample size will not fall outside the found interval.
We test the hypothesis of equality of variances:
H_0: D_x = D_y;
H_1: D_x ≠ D_y.
Let us find the observed value of the Fisher criterion, F_obs = s_b²/s_m² (the larger corrected variance over the smaller).
Since s_y² > s_x², we take s_b² = s_y² ≈ 7888.29 and s_m² = s_x² ≈ 6824.68, so F_obs ≈ 7888.29/6824.68 ≈ 1.16.
Number of degrees of freedom:
f 1 = n y – 1 = 7 – 1 = 6
f 2 = n x – 1 = 10 – 1 = 9
Using the table of critical points of the Fisher–Snedecor distribution at a significance level of α = 0.05 and given numbers of degrees of freedom, we find F cr (6;9) = 3.37
Since F_obs ≈ 1.16 < F_cr, the hypothesis of equality of the general variances is accepted.
We now test the hypothesis about the equality of the general means:
Let us find the experimental value of the Student's criterion using the pooled form (8.6):
t_obs = (x̄ − ȳ) / ( √( ((n_x − 1)s_x² + (n_y − 1)s_y²) / (n_x + n_y − 2) ) · √(1/n_x + 1/n_y) )
= (225.3 − 299.57) / ( √( (9·6824.68 + 6·7888.29)/15 ) · √(1/10 + 1/7) ) ≈ −1.77.
Number of degrees of freedom: f = n_x + n_y − 2 = 10 + 7 − 2 = 15.
Using the table of critical points of the Student distribution at a significance level of α = 0.05 and the given number of degrees of freedom, we find t_cr = T_table(15; 0.025) = 2.131.
Since |t_obs| ≈ 1.77 < t_cr = 2.131, the hypothesis of equality of the general means is accepted.
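Both tests for the pharmacy data can be verified numerically with a plain-Python sketch (variable names are ours; the tabulated critical values F_cr(6; 9) = 3.37 and t_cr(15; 0.025) = 2.131 are taken as given):

```python
import math

x = [98, 128, 192, 205, 219, 223, 260, 264, 266, 398]   # first microdistrict
y = [223, 240, 263, 266, 286, 335, 484]                 # second microdistrict
nx, ny = len(x), len(y)
mx, my = sum(x) / nx, sum(y) / ny

sx_sq = sum((v - mx) ** 2 for v in x) / (nx - 1)   # corrected variances
sy_sq = sum((v - my) ** 2 for v in y) / (ny - 1)

# Fisher criterion: larger corrected variance over the smaller
F_obs = max(sx_sq, sy_sq) / min(sx_sq, sy_sq)
print(round(F_obs, 2), F_obs < 3.37)               # about 1.16, True

# Pooled Student statistic, f = nx + ny - 2 = 15 degrees of freedom
pooled = ((nx - 1) * sx_sq + (ny - 1) * sy_sq) / (nx + ny - 2)
t_obs = (mx - my) / (math.sqrt(pooled) * math.sqrt(1 / nx + 1 / ny))
print(round(t_obs, 2), abs(t_obs) < 2.131)         # about -1.77, True
```

Both comparisons come out True, confirming the conclusions above: the variances do not differ significantly, and neither do the mean incomes.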