
Testing statistical hypotheses in MS EXCEL about the equality of the mean value of the distribution (variance is unknown). Testing the hypothesis about the equality of the means of two or more populations

3. TESTING THE HYPOTHESIS OF EQUALITY OF MEANS

This test is used to check whether the means of two indicators, each represented by a sample, differ significantly. There are three variants of the test: one for related (dependent) samples and two for unrelated samples (with equal and with unequal variances). If the samples are unrelated, you first need to test the hypothesis of equality of variances in order to decide which variant to use. As in the case of comparing variances, there are two ways to solve the problem, which we consider using an example.

EXAMPLE 3. There are data on the number of sales of a product in two cities. At a significance level of 0.01, test the statistical hypothesis that the average number of sales differs between the cities.

City 1: 23 25 23 22 23 24 28 16 18 23 29 26 31 19
City 2: 22 28 26 26 35 20 27 28 28 26 22 29

We use the Data Analysis package. Depending on the situation, one of three tools is selected: "Paired two-sample t-test for means" for related samples, and "Two-sample t-test with equal variances" or "Two-sample t-test with unequal variances" for unrelated samples. Call the test with equal variances; in the window that opens, enter references to the data in the "Variable Interval 1" and "Variable Interval 2" fields (A1-N1 and A2-L2, respectively). If the data have labels, check the "Labels" box (we have none, so the box is left unchecked). Next, enter the significance level 0.01 in the "Alpha" field. The "Hypothetical mean difference" field is left blank. In the output options, select "Output interval" and, placing the cursor in the field that appears, click the left mouse button on cell B7; the results will be output starting from this cell. After clicking "OK", a table of results appears. Widen columns B, C and D (drag the borders between columns B and C, C and D, D and E) so that all the labels fit. The procedure displays the main characteristics of the samples, the t-statistic, the critical values of this statistic, and the critical significance levels "P(T<=t) one-tail" and "P(T<=t) two-tail". If the absolute value of the t-statistic is less than the critical value, the means are considered equal at the given significance level. In our case |−1.784242592| < 2.492159469, so the average number of sales does not differ significantly. Note that if the significance level were taken as α = 0.05, the conclusions would be quite different.



Two-sample t-test with equal variances

                                   Variable 1      Variable 2
Mean                               23,57142857     26,41666667
Variance                           17,34065934     15,35606061
Observations                       14              12
Pooled variance                    16,43105159
Hypothesized mean difference       0
df                                 24
t-statistic                        -1,784242592
P(T<=t) one-tail                   0,043516846
t critical one-tail                2,492159469
P(T<=t) two-tail                   0,087033692
t critical two-tail                2,796939498
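For readers who want to cross-check the Excel output outside the spreadsheet, here is a minimal Python sketch using scipy; it assumes the two rows above are the sales samples for the first and second city.

```python
import numpy as np
from scipy import stats

city1 = np.array([23, 25, 23, 22, 23, 24, 28, 16, 18, 23, 29, 26, 31, 19])
city2 = np.array([22, 28, 26, 26, 35, 20, 27, 28, 28, 26, 22, 29])

# Two-sample t-test assuming equal variances (pooled variance), as in the Excel tool
t_stat, p_two_tail = stats.ttest_ind(city1, city2, equal_var=True)

alpha = 0.01
df = len(city1) + len(city2) - 2
t_crit_two_tail = stats.t.ppf(1 - alpha / 2, df)

print(f"t = {t_stat:.6f}, two-tail p = {p_two_tail:.6f}")
print(f"critical value (two-tail, alpha={alpha}) = {t_crit_two_tail:.6f}")
# |t| < critical value  ->  no significant difference between the city means
```

With α = 0.05 the one-tail critical value drops to about 1.71, which is below |t| = 1.78; this illustrates the remark above that the conclusion changes with the chosen significance level.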

Laboratory work No. 3

PAIRED LINEAR REGRESSION

Goal: To master the methods of constructing a linear equation of paired regression using a computer, to learn how to obtain and analyze the main characteristics of the regression equation.

Let's consider the methodology for constructing a regression equation using an example.

EXAMPLE. Samples of the factors x i and y i are given. Using these samples, find the linear regression equation ỹ = ax + b, find the pair correlation coefficient, and check the regression model for adequacy at the significance level α = 0.05.

X 0 1 2 3 4 5 6 7 8 9
Y 6,7 6,3 4,4 9,5 5,2 4,3 7,7 7,1 7,1 7,9

To find the coefficients a and b of the regression equation, use the SLOPE and INTERCEPT functions from the "Statistical" category. Enter the label "a=" in A5 and the SLOPE function in the adjacent cell B5: place the cursor in the "Known_y's" field and set a reference to cells B2-K2 by dragging over them with the mouse, and in the "Known_x's" field give a reference to B1-K1. The result is 0.14303. Now find the coefficient b. Enter the label "b=" in A6, and in B6 the INTERCEPT function with the same arguments as the SLOPE function. The result is 5.976364. Therefore, the linear regression equation is y = 0.14303x + 5.976364.

Let's plot the regression line. To do this, in the third row of the table we enter the values of the function at the given points X (first row) – ỹ(x i). To obtain these values, use the TREND function of the "Statistical" category. Enter the label "Y(X)" in A3 and, placing the cursor in B3, call the TREND function. In the "Known_y's" and "Known_x's" fields give references to B2-K2 and B1-K1; in the "New_x's" field also enter a reference to B1-K1. In the "Const" field enter 1 if the regression equation has the form y = ax + b, and 0 if y = ax; in our case we enter 1. TREND is an array function, so to display all its values select the area B3-K3 and press F2 and then Ctrl+Shift+Enter. The result is the values of the regression equation at the given points. Now build the chart. Place the cursor in any free cell, call the chart wizard, select the "Scatter" (XY) category, chart type – line without markers (in the lower right corner), click "Next", enter the reference B3-K3 in the "Range" field, go to the "Series" tab, enter the reference B1-K1 in the "X Values" field and click "Finish". The result is the regression line. Let's see how the experimental data and the regression line differ. Place the cursor in any free cell, call the chart wizard, category "Graph", chart type – broken line with markers (second from the top left), click "Next", in the "Range" field enter a reference to the second and third rows, B2-K3, go to the "Series" tab, enter the reference B1-K1 in the "X-axis labels" field and click "Finish". The result is two lines (blue – the original data, red – the regression equation). It can be seen that the lines differ little from each other.

a= 0,14303
b= 5,976364

To calculate the correlation coefficient r xy, use the PEARSON function. Arrange the charts so that they lie above row 25, enter the label "Correlation" in A25, and in B25 call the PEARSON function; in its "Array 1" and "Array 2" fields enter references to the source data B1-K1 and B2-K2. The result is 0.265207. The coefficient of determination R xy is the square of the correlation coefficient r xy. In A26 enter the label "Determination", and in B26 the formula "=B25*B25". The result is 0.070335.
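The SLOPE, INTERCEPT and PEARSON results can be cross-checked outside Excel. Below is a minimal Python sketch using scipy.stats.linregress, assuming the X and Y rows of this example.

```python
import numpy as np
from scipy import stats

x = np.arange(10)                     # 0 ... 9
y = np.array([6.7, 6.3, 4.4, 9.5, 5.2, 4.3, 7.7, 7.1, 7.1, 7.9])

res = stats.linregress(x, y)          # simple linear regression y = a*x + b

print(f"a (slope)     = {res.slope:.5f}")      # ~0.14303, matches SLOPE
print(f"b (intercept) = {res.intercept:.6f}")  # ~5.976364, matches INTERCEPT
print(f"r (Pearson)   = {res.rvalue:.6f}")     # correlation coefficient
print(f"R^2           = {res.rvalue**2:.6f}")  # coefficient of determination
```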

However, Excel has one function that calculates all the basic characteristics of a linear regression at once. This is the LINEST function. Place the cursor in B28 and call the LINEST function, category "Statistical". In the "Known_y's" and "Known_x's" fields give references to B2-K2 and B1-K1. The "Const" field has the same meaning as in the TREND function; in our case it is 1. The "Stats" field must contain 1 if complete regression statistics are to be displayed; in our case we put 1 there. The function returns an array of 2 columns and 5 rows. After entering it, select cells B28-C32 with the mouse and press F2 and then Ctrl+Shift+Enter. The result is a table of values whose entries have the following meaning:



Coefficient a                        Coefficient b
Standard error m_a                   Standard error m_b
Determination coefficient R xy       Standard deviation of the estimate
F-statistic                          Degrees of freedom n − 2
Regression sum of squares            Residual sum of squares

0,14303 5,976364
0,183849 0,981484
0,070335 1,669889
0,60525 8
1,687758 22,30824

Analysis of the result: the first row contains the coefficients of the regression equation; compare them with the values calculated by the SLOPE and INTERCEPT functions. The second row contains the standard errors of the coefficients. If one of them is greater in absolute value than the coefficient itself, that coefficient is considered insignificant (zero). The coefficient of determination characterizes the quality of the relationship between the factors; the obtained value of 0.070335 indicates a very weak relationship. The F-statistic tests the hypothesis of the adequacy of the regression model. This number must be compared with the critical value; to obtain it, enter the label "F-critical" in E33, and in F33 the FINV function, whose arguments are, respectively, 0.05 (significance level), 1 (number of factors X) and 8 (degrees of freedom).

F-critical 5,317655

It can be seen that the F-statistic is less than F-critical, which means that the regression model is not adequate. The last row shows the regression sum of squares and the residual sum of squares. It is important that the regression sum (explained by the regression) be much larger than the residual sum (not explained by the regression, caused by random factors). In our case this condition is not met, which indicates a poor regression.
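The adequacy check can also be reproduced programmatically. The sketch below (assuming the same data as above) computes the F-statistic from the coefficient of determination as F = R²/(1 − R²)·(n − 2) and compares it with the critical value of the F distribution, which is what FINV returns.

```python
import numpy as np
from scipy import stats

x = np.arange(10)
y = np.array([6.7, 6.3, 4.4, 9.5, 5.2, 4.3, 7.7, 7.1, 7.1, 7.9])

res = stats.linregress(x, y)
n = len(x)
r2 = res.rvalue ** 2                          # ~0.0703

# F-statistic for the adequacy of simple linear regression (1 factor, n-2 df)
F = r2 / (1 - r2) * (n - 2)                   # ~0.605, matches the LINEST output
F_crit = stats.f.ppf(0.95, 1, n - 2)          # ~5.32, matches FINV(0.05; 1; 8)

print(f"F = {F:.5f}, F-critical = {F_crit:.6f}")
print("model adequate" if F > F_crit else "model not adequate")
```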

Conclusion: In the course of my work, I mastered the methods of constructing a linear equation of pair regression using a computer, learned to obtain and analyze the main characteristics of the regression equation.


Laboratory work No. 4

NONLINEAR REGRESSION

Goal: to master methods for constructing the main types of nonlinear pair regression equations using a computer (intrinsically linear models), and to learn to obtain and analyze the quality indicators of regression equations.

Let's consider the case when nonlinear models can be reduced to linear ones by transforming the data (intrinsically linear models).

EXAMPLE. Construct a regression equation y = f(x) for the sample x i, y i (i = 1, 2, …, 10). As f(x), consider four types of functions – linear, power, exponential and hyperbolic:

y = Ax + B;  y = Ax^B;  y = Ae^(Bx);  y = A/x + B.

It is necessary to find their coefficients A and B, and after comparing the quality indicators, select the function that best describes the dependence.

Profit Y 0,3 1,2 2,8 5,2 8,1 11,0 16,8 16,9 24,7 29,4
Profit X 0,25 0,50 0,75 1,00 1,25 1,50 1,75 2,00 2,25 2,50

Enter the data into the table along with the labels (cells A1-K2). Leave three rows below the table free for the transformed data; select the first five rows by dragging along the left gray border over the row numbers 1 to 5 and choose a light colour (yellow or pink) for the cell background. Next, starting from A6, we output the linear regression parameters. To do this, write "Linear" in cell A6 and enter the LINEST function in the adjacent cell B6. In the "Known_y's" and "Known_x's" fields give references to B2-K2 and B1-K1; the next two fields take the value 1. Next, select the area of 5 rows by 2 columns starting from B6 and press F2 and then Ctrl+Shift+Enter. The result is a table with the regression parameters, of which the coefficient of determination, in the first column, third from the top, is of greatest interest. In our case it is R1 = 0.951262. The value of the F-criterion, which allows the adequacy of the model to be checked, is F1 = 156.1439

(fourth row, first column). The regression equation is

y = 12.96x − 6.18 (the coefficients a and b are in cells B6 and C6).

Linear 12,96 -6,18
1,037152 1,60884
0,951262 2,355101
156,1439 8
866,052 44,372

Let us determine similar characteristics for the other regressions and, by comparing the coefficients of determination, find the best regression model. Consider hyperbolic regression. To obtain it, we transform the data: in the third row, enter the label "1/x" in cell A3 and the formula "=1/B2" in cell B3, and autofill this cell across the area B3-K3. Now obtain the characteristics of the regression model. In cell A12 enter the label "Hyperbola", and in the adjacent cell the LINEST function. In the "Known_y's" and "Known_x's" fields give references to B1-K1 and to the transformed argument data – B3-K3; the next two fields take the value 1. Next, select the area of 5 rows by 2 columns starting from this cell and press F2 and then Ctrl+Shift+Enter. We obtain a table of regression parameters. The coefficient of determination in this case is R2 = 0.475661, which is much worse than for the linear regression. The F-statistic is F2 = 7.257293. The regression equation is y = −6.25453/x + 18.96772.

Hyperbola -6,25453 18,96772
2,321705 3,655951
0,475661 7,724727
7,257293 8
433,0528 477,3712

Let's consider exponential regression. To linearize it we take logarithms and obtain the equation ỹ = b̃x + ã, where ỹ = ln y, b̃ = b, ã = ln a. It can be seen that the data must be transformed: replace y with ln y. Place the cursor in cell A4 and type the heading "ln y". Place the cursor in B4 and enter the LN formula (category "Mathematical") with a reference to B1 as the argument. Using autofill, extend the formula along the fourth row to cells B4-K4. Next, in cell F6 enter the label "Exponential" and in the adjacent G6 the LINEST function, whose arguments are the transformed data B4-K4 (in the "Known_y's" field), while the remaining fields are the same as for linear regression (B2-K2, 1, 1). Then select cells G6-H10 and press F2 and Ctrl+Shift+Enter. The result is R3 = 0.89079, F3 = 65.25304, which indicates a very good regression. To find the coefficients of the regression equation (a = e^ã, b = b̃), put the cursor in J6 and type the heading "a=", and in the adjacent K6 the formula "=EXP(H6)"; in J7 type "b=", and in K7 the formula "=G6". The regression equation is y = 0.511707·e^(1.824212x) (equivalently, y = 0.511707·6.197909^x).

Exponential 1,824212 -0,67 a= 0,511707
0,225827 0,350304 b= 6,197909
0,89079 0,512793
65,25304 8
17,15871 2,103652

Let's consider power regression. To linearize it we take logarithms and obtain the equation ỹ = b̃x̃ + ã, where ỹ = ln y, x̃ = ln x, b̃ = b, ã = ln a. It can be seen that the data must be transformed: replace y with ln y and x with ln x. The row with ln y already exists, so we transform the x values. In cell A5 write the label "ln x", and in cell B5 enter the LN formula (category "Mathematical") with a reference to B2 as the argument. Using autofill, extend the formula along the fifth row to cells B5-K5. Next, in cell F12 enter the label "Power" and in the adjacent G12 the LINEST function, whose arguments are the transformed data B4-K4 (in the "Known_y's" field) and B5-K5 (in the "Known_x's" field); the remaining fields are 1. Then select cells G12-H16 and press F2 and Ctrl+Shift+Enter. The result is R4 = 0.997716, F4 = 3494.117, which indicates a very good regression. To find the coefficients of the regression equation (a = e^ã, b = b̃), put the cursor in J12 and type the heading "a=", and in the adjacent K12 the formula "=EXP(H12)"; in J13 type "b=", and in K13 the formula "=G12". The regression equation is y = 4.90767·x^1.993512.

Power 1,993512 1,590799 a= 4,90767
0,033725 0,023823 b= 7,341268
0,997716 0,074163
3494,117 8
19,21836 0,044002

Let's check whether all the equations adequately describe the data. To do this, the F-statistic of each model must be compared with the critical value. To obtain it, enter the label "F-critical" in A21, and in B21 the FINV function, whose arguments are, respectively, 0.05 (significance level), 1 (number of factors X, degrees of freedom 1) and 8 (degrees of freedom 2 = n − 2). The result is 5.317655. The F-statistic of the linear model (F1 = 156.1439) is greater than F-critical, which means the model is adequate. The remaining regressions are also adequate, since their F-statistics also exceed the critical value. To determine which model describes the data best, we compare the coefficients of determination R1, R2, R3, R4 of the models. The largest is R4 = 0.997716. This means that the experimental data are best described by the power model y = 4.90767·x^1.993512.
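As a cross-check of the whole comparison, here is a small Python sketch (assuming the Y and X rows of this example) that fits all four intrinsically linear models through the same transformations and compares their coefficients of determination, computed on the transformed scale just as LINEST does.

```python
import numpy as np

y = np.array([0.3, 1.2, 2.8, 5.2, 8.1, 11.0, 16.8, 16.9, 24.7, 29.4])
x = np.array([0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50])

def lin_fit(u, v):
    """Least-squares line v = slope*u + intercept and its R^2."""
    slope, intercept = np.polyfit(u, v, 1)
    r = np.corrcoef(u, v)[0, 1]
    return slope, intercept, r ** 2

# Linear:      y = b*x + a            (fit y on x)
# Hyperbola:   y = b*(1/x) + a        (fit y on 1/x)
# Exponential: ln y = b*x + ln a      (fit ln y on x)
# Power:       ln y = b*ln x + ln a   (fit ln y on ln x)
models = {
    "linear":      lin_fit(x, y),
    "hyperbola":   lin_fit(1 / x, y),
    "exponential": lin_fit(x, np.log(y)),
    "power":       lin_fit(np.log(x), np.log(y)),
}
for name, (b, a, r2) in models.items():
    print(f"{name:12s} slope={b:9.5f} intercept={a:9.5f} R^2={r2:.6f}")
# The power model gives the largest R^2 (~0.9977): y ~ 4.908 * x**1.9935
```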

Conclusion: In the course of this work, I mastered methods for constructing the main types of nonlinear pairwise regression equations using a computer (intrinsically linear models) and learned to obtain and analyze the quality indicators of regression equations.

Y 0,3 1,2 2,8 5,2 8,1 11 16,8 16,9 24,7 29,4
X 0,25 0,5 0,75 1 1,25 1,5 1,75 2 2,25 2,5
1/x 4 2 1,333333 1 0,8 0,666667 0,571429 0,5 0,444444 0,4
ln y -1,20397 0,182322 1,029619 1,648659 2,0918641 2,397895 2,821379 2,827314 3,206803 3,380995
ln x -1,38629 -0,69315 -0,28768 0 0,2231436 0,405465 0,559616 0,693147 0,81093 0,916291
Linear 12,96 -6,18 Exponential 1,824212 -0,67 a= 0,511707
1,037152 1,60884 0,225827 0,350304 b= 6,197909
0,951262 2,355101 0,89079 0,512793
156,1439 8 65,25304 8
866,052 44,372 17,15871 2,103652
Hyperbola -6,25453 18,96772 Power 1,993512 1,590799 a= 4,90767
2,321705 3,655951 0,033725 0,023823 b= 7,341268
0,475661 7,724727 0,997716 0,074163
7,257293 8 3494,117 8
433,0528 477,3712 19,21836 0,044002
F - critical 5,317655

Laboratory work No. 5

POLYNOMIAL REGRESSION

Purpose: Using experimental data, construct a regression equation of the form y = ax² + bx + c.

PROGRESS:

The dependence of the yield of a certain crop y i on the amount of mineral fertilizer applied to the soil x i is considered. It is assumed that this dependence is quadratic. It is necessary to find a regression equation of the form ỹ = ax² + bx + c.

x 0 1 2 3 4 5 6 7 8 9
y 29,8 58,8 72,2 101,5 141 135,1 156,6 181,7 216,6 208,2

Let's enter these data into the spreadsheet along with the labels in cells A1-K2. Let's build a graph. To do this, select the Y data (cells B2-K2), call the chart wizard, select the chart type "Graph", chart subtype – graph with points (second from the top left), click "Next", go to the "Series" tab, make a reference to B1-K1 in the "X-axis labels" field, and click "Finish". The graph can be approximated by a polynomial of degree 2, y = ax² + bx + c. To find the coefficients a, b, c, you need to solve the system of normal equations:

a·Σx⁴ + b·Σx³ + c·Σx² = Σx²y
a·Σx³ + b·Σx² + c·Σx = Σxy
a·Σx² + b·Σx + c·n = Σy

Let's calculate the sums. To do this, enter the label "X^2" in cell A3, enter the formula "=B1*B1" in cell B3 and extend it to the entire row B3-K3 using autofill. In cell A4 enter the label "X^3", and in B4 the formula "=B1*B3", and autofill it across the row B4-K4. In cell A5 enter "X^4", and in B5 the formula "=B4*B1", and autofill the row. In cell A6 enter "X*Y", and in B6 the formula "=B2*B1", and autofill the row. In cell A7 enter "X^2*Y", and in B7 the formula "=B3*B2", and autofill the row. Now we compute the sums. Highlight column L with a different colour by clicking on its header and selecting a colour. Place the cursor in cell L1 and click the autosum button with the Σ icon to calculate the sum of the first row. Using autofill, extend the formula to cells L1-L7.

Now we solve the system of equations. To do this, we introduce the main matrix of the system. In cell A13 we enter the signature “A=”, and in matrix cells B13-D15 we enter the links reflected in the table

B C D
13 =L5 =L4 =L3
14 =L3 =L2 =L1
15 =L2 =L1 =9

We also enter the right-hand sides of the system of equations. In G13 enter the label "B=", and in H13-H15 enter, respectively, references to the cells "=L7", "=L6", "=L2". We solve the system by the matrix method: as is known from higher mathematics, the solution equals A⁻¹B. Find the inverse matrix. To do this, enter the label "A inv." in cell J13 and, placing the cursor in K13, enter the MINVERSE function (category "Mathematical"). As the "Array" argument give a reference to cells B13:D15. The result is a 3x3 matrix; to obtain it, select cells K13-M15 with the mouse and press F2 and then Ctrl+Shift+Enter. The result is the matrix A⁻¹. Now find the product of this matrix and the column B (cells H13-H15). Enter the label "Coefficients" in cell A18 and in B18 the MMULT function (category "Mathematical"). In the "Array 1" field give a reference to the matrix A⁻¹ (cells K13-M15), and in the "Array 2" field a reference to the column B (cells H13-H15). Next, select B18-B20 and press F2 and Ctrl+Shift+Enter. The resulting array contains the coefficients a, b, c of the regression equation. As a result, we obtain a regression equation of the form: y = 1.201082x² + 5.619177x + 78.48095.

Let's build graphs of the original data and of the values obtained from the regression equation. To do this, enter the label "Regression" in cell A8 and the formula "=$B$18*B3+$B$19*B1+$B$20" in B8. Using autofill, extend the formula to cells B8-K8. To build the graph, select cells B8-K8 and, holding down the Ctrl key, also select cells B2-K2. Call the chart wizard, select the chart type "Graph", chart subtype – graph with points (second from the top left), click "Next", go to the "Series" tab, make a reference to B1-K1 in the "X-axis labels" field, and click "Finish". It can be seen that the curves almost coincide.
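For reference, the same quadratic fit can be obtained programmatically. The sketch below (assuming the x and y rows of this example) shows both the direct least-squares fit and the normal-equation route that mirrors the worksheet's matrix method.

```python
import numpy as np

x = np.arange(10, dtype=float)                      # 0 ... 9
y = np.array([29.8, 58.8, 72.2, 101.5, 141.0, 135.1,
              156.6, 181.7, 216.6, 208.2])

# Direct least-squares fit of y = a*x^2 + b*x + c
a, b, c = np.polyfit(x, y, 2)
print(f"polyfit:     a={a:.6f}  b={b:.6f}  c={c:.6f}")

# The same coefficients from the normal equations built from the worksheet sums:
#   [sum(x^4) sum(x^3) sum(x^2)] [a]   [sum(x^2*y)]
#   [sum(x^3) sum(x^2) sum(x)  ] [b] = [sum(x*y)  ]
#   [sum(x^2) sum(x)   n       ] [c]   [sum(y)    ]
A = np.array([[np.sum(x**4), np.sum(x**3), np.sum(x**2)],
              [np.sum(x**3), np.sum(x**2), np.sum(x)],
              [np.sum(x**2), np.sum(x),    len(x)]])
B = np.array([np.sum(x**2 * y), np.sum(x * y), np.sum(y)])
coeffs = np.linalg.solve(A, B)                      # equivalent of A^-1 * B
print(f"normal eqs:  a={coeffs[0]:.6f}  b={coeffs[1]:.6f}  c={coeffs[2]:.6f}")
```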

CONCLUSION: in the course of this work, based on experimental data, I learned to construct a regression equation of the form y = ax² + bx + c.






x 0 1 2 3 4 5 6 7 8 9
y 29,8 58,8 72,2 101,5 141 135,1 156,6 181,7 216,6 208,2
X^2 0 1 4 9 16 25 36 49 64 81
X^3 0 1 8 27 64 125 216 343 512 729
X^4 0 1 16 81 256 625 1296 2401 4096 6561
X*Y 0 58,8 144,4 304,5 564 675,5 939,6 1271,9 1732,8 1873,8
X^2*Y 0 58,8 288,8 913,5 2256 3377,5 5637,6 8903,3 13862,4 16864,2
Regression. 78,48095 85,30121 94,52364 106,1482 120,175 136,6039 155,435 176,6682 200,3036 226,3412
A =   15333   2025   285      B =   52162,1      A inv. =   0,003247   -0,03247    0,059524
       2025    285    45             7565,3                -0,03247    0,341342   -0,67857
        285     45     9             1301,5                 0,059524   -0,67857    1,619048
Coefficients:  a = 1,201082   b = 5,619177   c = 78,48095

November 5, 2012. Lecture 6. Comparing two samples. 6-1. Hypothesis of equality of means, paired samples. 6-2. Confidence interval for the difference in means, paired samples. 6-3. Hypothesis of equality of variances. 6-4. Hypothesis of equality of proportions. 6-5. Confidence interval for the difference in proportions.


2 Ivanov O.V., 2005. In this lecture... In the previous lecture we tested the hypothesis of equality of the means of two general populations and constructed a confidence interval for the difference of means for the case of independent samples. Now we consider the criterion for testing the hypothesis of equality of means and construct a confidence interval for the difference of means in the case of paired (dependent) samples. Then, in section 6-3, the hypothesis of equality of variances is tested, and in section 6-4 the hypothesis of equality of proportions. Finally, we construct a confidence interval for the difference in proportions.


November 5, 2012. Hypothesis of equality of means. Paired samples: statement of the problem, hypotheses and statistics, sequence of actions, example.


4 Ivanov O.V., 2005. Paired samples. Description of the problem. What we have: 1. Two simple random samples obtained from two general populations; the samples are paired (dependent). 2. Both samples have size n ≥ 30; if not, both samples are taken from normally distributed populations. What we want: to test the hypothesis about the difference between the means of the two populations:


5 Ivanov O.V., 2005. Statistics for paired samples. To test the hypothesis, the statistic t = (d̄ − μ_d)/(s_d/√n) is used, where d is the difference between the two values in one pair, μ_d is the population mean of the paired differences, d̄ is the sample mean of the paired differences, s_d is the standard deviation of the differences in the sample, and n is the number of pairs.


6 Ivanov O.V., 2005. Example. Training of students. A group of 15 students took a test before and after a training. The test results are in a table (not reproduced here); for the differences, the column sums are Σd = 21 and Σd² = 145. Let's test the hypothesis for paired samples that the training has no effect on the students' preparation, at a significance level of 0.05. Solution. We calculate the differences and their squares.


7 Ivanov O.V., 2005. Solution. Step 1. State the main and alternative hypotheses. Step 2. The significance level α = 0.05 is set. Step 3. Using the table for df = 15 − 1 = 14, we find the critical value t = 2.145 and write the critical region: t > 2.145.




9 Ivanov O.V., 2005. Solution. Step 4. The statistic takes the value t = 1.889. Step 5. Compare the obtained value with the critical region: 1.889 < 2.145, so the statistic does not fall into the critical region, and there is no reason to reject the hypothesis of no effect of the training.
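The slide's value t ≈ 1.89 can be recomputed from the summary sums alone; the sketch below assumes that Σd = 21 and Σd² = 145 are the sum and the sum of squares of the 15 before/after differences.

```python
import math
from scipy import stats

n = 15
sum_d, sum_d2 = 21.0, 145.0          # sums of the differences and their squares

d_mean = sum_d / n                                    # 1.4
s_d = math.sqrt((sum_d2 - n * d_mean**2) / (n - 1))   # sample std of the differences
t = d_mean / (s_d / math.sqrt(n))                     # ~1.89

t_crit = stats.t.ppf(0.975, n - 1)                    # 2.145, as used on the slide
print(f"t = {t:.3f}, critical value = {t_crit:.3f}")
print("reject H0" if t > t_crit else "no reason to reject H0")
```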


November 5, 2012. Confidence interval for the difference in means. Paired samples: problem statement, method for constructing the confidence interval, example.


11 Ivanov O.V., 2005. Description of the problem. What we have: two random paired (dependent) samples of size n from two general populations. The populations have a normal distribution with parameters μ1, σ1 and μ2, σ2, or the sizes of both samples are at least 30. What we want: to estimate the mean value of the paired differences for the two populations. To do this, we construct a confidence interval for this mean of the form:






November 5, 2012. Hypothesis of equality of variances: statement of the problem, hypotheses and statistics, sequence of actions, example.


15 Ivanov O.V., 2005. During the study... The researcher may need to check the assumption that the variances of the two populations being studied are equal. When these populations have a normal distribution, the F-test, also called Fisher's criterion, exists for this purpose. Unlike Student, Fisher did not work in a brewery.


16 Ivanov O.V., 2005 Description of the problem What we have 1. Two simple random samples obtained from two normally distributed populations. 2. The samples are independent. This means that there is no relationship between the sample subjects. What we want is to test the hypothesis of equality of population variances:














23 Ivanov O.V., 2005. Example. A medical researcher wants to check whether there is a difference between the heart rates (number of beats per minute) of smoking and non-smoking patients. The results of the two randomly selected groups were given in a table (smokers and non-smokers; not reproduced here). Using α = 0.05, find out whether the doctor is right.


24 Ivanov O.V., 2005. Solution. Step 1. State the main and alternative hypotheses. Step 2. The significance level α = 0.05 is set. Step 3. Using the table for 25 degrees of freedom of the numerator and 17 of the denominator, we find the critical value f = 2.19 and the critical region: f > 2.19. Step 4. Using the samples, we calculate the value of the statistic.




November 5, 2012. Hypothesis of equality of proportions: statement of the problem, hypotheses and statistics, sequence of actions, example.


27 Ivanov O.V., 2005 Question Out of 100 randomly selected students of the sociology faculty, 43 attend special courses. Out of 200 randomly selected economics students, 90 attend special courses. Does the proportion of students attending special courses differ between sociology and economics departments? It doesn't seem to be significantly different. How can I check this? The share of those attending special courses is the share of the attribute. 43 – number of “successes”. 43/100 – share of success. The terminology is the same as in Bernoulli's scheme.


28 Ivanov O.V., 2005. Description of the problem. What we have: 1. Two simple random samples obtained from two general populations; the samples are independent. 2. For each sample, np ≥ 5 and nq ≥ 5 hold, i.e. at least 5 elements of the sample have the studied characteristic and at least 5 do not. What we want: to test the hypothesis of equality of the proportions of the characteristic in the two populations:






31 Ivanov O.V., 2005. Example. Special courses at two faculties. Out of 100 randomly selected students of the sociology faculty, 43 attend special courses. Of 200 economics students, 90 attend special courses. At the significance level α = 0.05, test the hypothesis that there is no difference between the proportions of students attending special courses at these two faculties. 33 Ivanov O.V., 2005. Solution. Step 1. State the main and alternative hypotheses. Step 2. The significance level α = 0.05 is set. Step 3. Using the normal distribution table, we find the critical values z = −1.96 and z = 1.96 and construct the critical region: z < −1.96 or z > 1.96. Step 4. Using the samples, we calculate the value of the statistic.


34 Ivanov O.V., 2005 Solution Step 5. Compare the obtained value with the critical region. The resulting statistic value did not fall within the critical region. Step 6. Formulate the conclusion. There is no reason to reject the main hypothesis. The share of those attending special courses does not differ statistically significantly.
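The calculation on these slides can be reproduced with a short Python sketch of the two-proportion z-test; the pooled-proportion form used below is the standard construction of this statistic (the slide's own formula is not reproduced in the transcript).

```python
import math
from scipy import stats

x1, n1 = 43, 100      # sociology: attend special courses / sample size
x2, n2 = 90, 200      # economics

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                     # pooled proportion of "successes"
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

z_crit = stats.norm.ppf(0.975)                     # 1.96 for alpha = 0.05, two-sided
print(f"z = {z:.3f}, critical values = ±{z_crit:.2f}")
print("reject H0" if abs(z) > z_crit else "no reason to reject H0")
```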


November 5, 2012. Confidence interval for the difference in proportions: statement of the problem, method for constructing the confidence interval, example.





Consider two independent samples x 1, x 2, …, x n and y 1, y 2, …, y m, drawn from normal populations with equal variances, with sample sizes n and m respectively; the means μ x, μ y and the variance σ² are unknown. It is required to test the main hypothesis H 0: μ x = μ y against the alternative H 1: μ x ≠ μ y.

As is known, the sample means have the following properties: x̄ ~ N(μ x, σ²/n), ȳ ~ N(μ y, σ²/m).

Their difference is a normal random variable with mean μ x − μ y and variance σ²/n + σ²/m, so

x̄ − ȳ ~ N(μ x − μ y, σ²/n + σ²/m).   (23)

Let us assume for the moment that the main hypothesis H 0 is true: μ x − μ y = 0. Then, dividing the difference x̄ − ȳ by its standard deviation, we obtain a standard normal random variable ξ = (x̄ − ȳ)/(σ·√(1/n + 1/m)) ~ N(0, 1).

It was noted earlier that the quantity Q x/σ² = Σ(x i − x̄)²/σ² is distributed according to the χ² law with n − 1 degrees of freedom, and Q y/σ² according to the χ² law with m − 1 degrees of freedom. Taking into account the independence of these two sums, we find that their total sum (Q x + Q y)/σ² is distributed according to the χ² law with n + m − 2 degrees of freedom.

Recalling item 7, we see that the fraction ξ/√((Q x + Q y)/(σ²·(n + m − 2))) obeys the t-distribution (Student) with ν = m + n − 2 degrees of freedom: Z = t. This holds only when the hypothesis H 0 is true.

Replacing ξ and Q by their expressions, we obtain an expanded formula for Z:

Z = (x̄ − ȳ) / ( √((Q x + Q y)/(n + m − 2)) · √(1/n + 1/m) ).   (24)

This value Z, called the test statistic, allows a decision to be made using the following sequence of actions:

1. The region D = [−t β,ν, +t β,ν] is established, containing the fraction β = 1 − α of the area under the t ν distribution curve (Table 10).

2. The experimental value Z obs of the statistic Z is calculated using formula (24), substituting the values x i and y i of the specific samples, as well as their sample means x̄ and ȳ, instead of X i and Y i.

3. If Z obs ∈ D, then the hypothesis H 0 is considered not to contradict the experimental data and is accepted.

If Z obs ∉ D, then the hypothesis H 1 is accepted.

If the hypothesis H 0 is true, then Z obeys the known t ν distribution with zero mean and, with high probability β = 1 − α, falls into the region D of acceptance of the hypothesis H 0. When the observed, experimental value Z obs falls into D, we consider this as evidence in favour of the hypothesis H 0.

When Z obs lies outside D (as they say, lies in the critical region K), which is natural if the hypothesis H 1 is true but unlikely if H 0 is true, we can only reject the hypothesis H 0 and accept H 1.

Example 31.

Two grades of gasoline are compared: A and B. Gasoline of grades A and B was tested once each on 11 vehicles of the same power on a circular route. One car broke down en route, and for it there are no data for gasoline B.

Gasoline consumption per 100 km

Table 12

i     1      2      3     4     5     6      7      8     9     10    11
X i   10,51  11,86  10,5  9,1   9,21  10,74  10,75  10,3  11,3  11,8  10,9    n=11
Y i   13,22  13,0   11,5  10,4  11,8  11,6   10,64  12,3  11,1  11,6  –       m=10

The variance of the consumption of gasoline grades A and B is unknown and is assumed to be the same. Is it possible, at a significance level of α = 0.05, to accept the hypothesis that the true average consumptions μ A and μ B of these grades of gasoline are the same?

Solution. To test the hypothesis H 0: μ A − μ B = 0 against the alternative H 1: μ A ≠ μ B, we do the following:

1. Find the sample means x̄ = (1/n)Σx i ≈ 10.63 and ȳ = (1/m)Σy i ≈ 11.72 and the sums of squared deviations Q x = Σ(x i − x̄)², Q y = Σ(y i − ȳ)², Q = Q x + Q y.

2. Calculate the experimental value of the statistic Z using formula (24); the result is Z obs = −2.7.

3. From Table 10 of the t-distribution we find the limit t β,ν for the number of degrees of freedom ν = m + n − 2 = 19 and β = 1 − α = 0.95. Table 10 contains t 0.95,20 = 2.09 and t 0.95,15 = 2.13, but not t 0.95,19; by interpolation we find t 0.95,19 ≈ 2.09 + (2.13 − 2.09)·(20 − 19)/(20 − 15) ≈ 2.10.

4. Check which of the two regions, D or K, contains the number Z obs. Z obs = −2.7 ∉ D = [−2.10; +2.10].

Since the observed value Z obs lies in the critical region K = R\D, we reject H 0 and accept the hypothesis H 1; in this case the difference between x̄ and ȳ is said to be significant. If, with all the other conditions of this example unchanged, only Q had changed – say, Q had doubled – our conclusion would change. Doubling Q would decrease |Z obs| by a factor of √2, and the number Z obs would then fall into the acceptance region D, so that the hypothesis H 0 would withstand the test and be accepted. In that case the discrepancy between x̄ and ȳ would be explained by the natural scatter of the data, and not by the fact that μ A ≠ μ B.
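Example 31 can be verified with a short Python sketch; it implements statistic (24) directly and also calls the equivalent pooled two-sample t-test from scipy.

```python
import math
import numpy as np
from scipy import stats

x = np.array([10.51, 11.86, 10.5, 9.1, 9.21, 10.74, 10.75, 10.3, 11.3, 11.8, 10.9])
y = np.array([13.22, 13.0, 11.5, 10.4, 11.8, 11.6, 10.64, 12.3, 11.1, 11.6])
n, m = len(x), len(y)

# Statistic (24): pooled two-sample t with n + m - 2 degrees of freedom
Qx, Qy = np.sum((x - x.mean())**2), np.sum((y - y.mean())**2)
Z = (x.mean() - y.mean()) / (math.sqrt((Qx + Qy) / (n + m - 2)) * math.sqrt(1/n + 1/m))

t_crit = stats.t.ppf(0.975, n + m - 2)        # ~2.09 for nu = 19
print(f"Z_obs = {Z:.2f}, acceptance region D = [{-t_crit:.2f}; {t_crit:.2f}]")

# The same test in one call (equal variances assumed):
t_stat, p_value = stats.ttest_ind(x, y, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # |t| > t_crit -> reject H0
```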

The theory of hypothesis testing is very extensive; there are hypotheses about the form of the distribution law, about the homogeneity of samples, about the independence of random variables, and so on.

THE χ² (PEARSON) CRITERION

This is the criterion most often used in practice for testing a simple hypothesis. It is applied when the distribution law is unknown. Consider a random variable X on which n independent trials are performed, yielding the realization x 1, x 2, ..., x n. It is necessary to test the hypothesis about the distribution law of this random variable.

Let's consider the case of a simple hypothesis. A simple hypothesis tests the agreement of the sample with a population whose distribution is fully specified (known). From the sample we build the variation series x (1), x (2), ..., x (n). We divide the range of values into subintervals Δ i; let there be r of them. Then we find p i – the probability that X, as a result of a trial, falls into the interval Δ i, i = 1, ..., r, if the hypothesis being tested is true.

The criterion does not check the truth of the probability density itself, but the truth of the numbers p 1, ..., p r.

With each interval Δ i we associate the random event A i – a hit in this interval (i.e. the realization of X obtained in a trial falls into Δ i). Let us introduce the random variables m i – the number of trials, out of the n conducted, in which the event A i occurred. The m i are distributed according to the binomial law, and if the hypothesis is true,

M(m i) = np i,  D(m i) = np i(1 − p i).

The χ² criterion has the form

χ² = Σ (m i − np i)² / (np i),  summed over i = 1, ..., r,

where

p 1 + p 2 + ... + p r = 1,

m 1 + m 2 + ... + m r = n.

If the hypothesis being tested is true, then m i is the frequency of occurrence of an event that has probability p i in each of the n trials; therefore we can regard m i as a random variable subject to the binomial law, centred at the point np i. When n is large, the frequencies can be considered asymptotically normally distributed with the same parameters.

The quantities m 1, ..., m r are interconnected by the relationship m 1 + m 2 + ... + m r = n. As a measure of the discrepancy between the sample data m 1, ..., m r and the theoretical values np 1, ..., np r, we consider the value χ² defined above.

χ² is a sum of squares of asymptotically normal quantities connected by a linear dependence. We have encountered a similar case before and know that the presence of a linear relationship leads to a decrease in the number of degrees of freedom by one.

If the hypothesis being tested is true, then the distribution of the criterion χ² tends, as n → ∞, to the χ² distribution with r − 1 degrees of freedom.

Suppose now that the hypothesis is false. Then the terms of the sum tend to increase, i.e. if the hypothesis is incorrect, the sum falls into a region of large values of χ². As the critical region we therefore take the region of large positive values of the criterion: χ² > χ²α(r − 1).


If the distribution parameters are unknown and are estimated from the sample, each estimated parameter reduces the number of degrees of freedom of the Pearson criterion by one.
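A minimal illustration of the Pearson criterion in Python, on hypothetical data (a fair-die hypothesis with r = 6 intervals and no estimated parameters), looks as follows.

```python
import numpy as np
from scipy import stats

# Hypothetical example: observed counts m_i over r = 6 intervals (faces of a die)
# and the probabilities p_i under the tested hypothesis (a fair die).
m = np.array([18, 23, 16, 21, 25, 17])          # observed frequencies, sum = n
p = np.full(6, 1 / 6)                           # hypothesised probabilities, sum = 1
n = m.sum()

chi2 = np.sum((m - n * p) ** 2 / (n * p))       # the Pearson statistic
df = len(m) - 1                                 # r - 1 (no parameters estimated)
chi2_crit = stats.chi2.ppf(0.95, df)

print(f"chi2 = {chi2:.3f}, critical value = {chi2_crit:.3f}")
print("reject H0" if chi2 > chi2_crit else "no reason to reject H0")

# Equivalent one-liner: stats.chisquare(m, f_exp=n * p)
```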

8.1. The concept of dependent and independent samples.

The choice of a criterion for testing a hypothesis is determined primarily by whether the samples under consideration are dependent or independent. Let us introduce the corresponding definitions.

Def. The samples are called independent, if the procedure for selecting units in the first sample is in no way connected with the procedure for selecting units in the second sample.

An example of two independent samples would be the samples discussed above of men and women working at the same enterprise (in the same industry, etc.).

Note that the independence of two samples does not at all mean that there is no requirement for a certain kind of similarity of these samples (their homogeneity). Thus, when studying the income level of men and women, we are unlikely to allow a situation where men are selected from among Moscow businessmen, and women from the aborigines of Australia. Women should also be Muscovites and, moreover, “businesswomen.” But here we are not talking about the dependence of samples, but about the requirement of homogeneity of the studied population of objects, which must be satisfied both when collecting and when analyzing sociological data.

Def. The samples are called dependent, or paired, if each unit of one sample is “linked” to a specific unit of the second sample.

This last definition will probably become clearer if we give an example of dependent samples.

Suppose we want to find out whether the social status of the father is, on average, lower than the social status of the son (we assume that we can measure this complex and ambiguously understood social characteristic of a person). It seems obvious that in such a situation it is advisable to select pairs of respondents (father, son) and assume that each element of the first sample (one of the fathers) is "tied" to a certain element of the second sample (his son). These two samples will be called dependent.

8.2. Hypothesis testing for independent samples

For independent samples, the choice of criterion depends on whether we know the general variances σ1² and σ2² of the characteristic under consideration for the samples being studied. We will consider this problem solved, assuming that the sample variances coincide with the general ones. In this case the criterion is the value:

z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2).   (8.1)

Before moving on to discussing the situation when the general variances (or at least one of them) are unknown to us, we note the following.

The logic of using criterion (8.1) is similar to that described when considering the chi-square criterion (7.2). There is only one fundamental difference. Speaking about the meaning of criterion (7.2), we considered an infinite number of samples of size n "drawn" from our population. Here, analyzing the meaning of criterion (8.1), we pass to considering an infinite number of pairs of samples of sizes n1 and n2. For each pair, a statistic of the form (8.1) is calculated. The totality of the obtained values of such statistics, in accordance with our notation, follows a normal distribution (as we agreed, the letter z is used to denote a criterion to which the normal distribution corresponds).

So, if the general variances are unknown to us, we are forced to use their sample estimates s1² and s2² instead. However, in this case the normal distribution should be replaced by the Student distribution – z should be replaced by t (as was the case in the similar situation of constructing a confidence interval for the mathematical expectation). However, with sufficiently large sample sizes (n1, n2 ≥ 30), as we already know, the Student distribution practically coincides with the normal one. In other words, for large samples we can continue to use the criterion:

z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2).   (8.2)

The situation is more complicated when the variances are unknown and the size of at least one sample is small. Then another factor comes into play: the form of the criterion depends on whether we can consider the unknown variances of the characteristic in the two analyzed samples to be equal. To find out, we need to test the hypothesis:

H0: σ1² = σ2².   (8.3)

To test this hypothesis, the criterion

F = s1²/s2²   (8.4)

is used (the larger of the two sample variances is taken as s1²).

We will discuss the specifics of using this criterion below; for now let us continue the discussion of the algorithm for selecting a criterion for testing hypotheses about the equality of mathematical expectations.

If hypothesis (8.3) is rejected, then the criterion of interest takes the form:

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)   (8.5)

(i.e., it differs from criterion (8.2), which was used for large samples, in that the corresponding statistic has a Student rather than a normal distribution). If hypothesis (8.3) is accepted, the form of the criterion changes:

t = (x̄1 − x̄2) / ( √(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)) · √(1/n1 + 1/n2) ).   (8.6)

Let us summarize how a criterion is selected for testing the hypothesis about the equality of general mathematical expectations based on the analysis of two independent samples:

– general variances known: criterion (8.1);
– general variances unknown, sample sizes large: criterion (8.2);
– general variances unknown, at least one sample small, hypothesis H0: σ1² = σ2² rejected: criterion (8.5);
– general variances unknown, at least one sample small, hypothesis H0: σ1² = σ2² accepted: criterion (8.6).
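The selection rule for the case of unknown variances can be sketched in Python as follows. This is only a rough illustration: the F-test is taken one-sided with the larger sample variance in the numerator, scipy's ttest_ind switches between the pooled criterion (8.6) and Welch's criterion (8.5) via the equal_var flag, and the two sample arrays in the call are hypothetical.

```python
import numpy as np
from scipy import stats

def compare_means(sample1, sample2, alpha=0.05):
    """Sketch of the selection rule above for unknown variances."""
    s1, s2 = np.asarray(sample1, float), np.asarray(sample2, float)

    # Step 1: F-test of H0: sigma1^2 = sigma2^2 (criterion (8.4)),
    # larger sample variance in the numerator.
    v1, v2 = s1.var(ddof=1), s2.var(ddof=1)
    if v1 >= v2:
        F, dfn, dfd = v1 / v2, len(s1) - 1, len(s2) - 1
    else:
        F, dfn, dfd = v2 / v1, len(s2) - 1, len(s1) - 1
    equal_var = F < stats.f.ppf(1 - alpha, dfn, dfd)   # H0 not rejected

    # Step 2: pooled t-test (8.6) if variances may be equal, Welch's test (8.5) otherwise.
    t, p = stats.ttest_ind(s1, s2, equal_var=equal_var)
    return equal_var, t, p

# Hypothetical samples, only to show the call:
eq, t, p = compare_means([10.2, 9.8, 11.1, 10.5, 9.9, 10.7],
                         [11.0, 12.3, 10.8, 11.9, 12.1])
print(f"equal variances assumed: {eq}, t = {t:.3f}, p = {p:.3f}")
```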

8.3. Hypothesis testing for dependent samples

Let's move on to considering dependent samples. Let the sequences of numbers

X 1, X 2, …, X n;

Y 1 , Y 2 , … , Y n –

the values of the random variable under consideration for the elements of the two dependent samples. Let us introduce the notation:

D i = X i - Y i , i = 1, ... , n.

For dependent samples, the criterion that allows one to test this hypothesis (i.e. the hypothesis that the mathematical expectation of the differences D i equals zero) looks as follows:

t n-1 = D̄ / (s D / √n),  where  D̄ = (1/n)·Σ D i  and  s D = √( (Σ D i² − (Σ D i)²/n) / (n − 1) ).

Note that the expression just given for s D is nothing more than another form of the well-known formula for the standard deviation, in this case the standard deviation of the values D i. A similar formula is often used in practice as a simpler way of calculating the variance (compared with the "head-on" calculation of the sum of squared deviations of the values from their arithmetic mean).

If we compare the above formulas with those used when discussing the principles of constructing a confidence interval, it is easy to see that testing the hypothesis of equality of means for the case of dependent samples is essentially testing whether the mathematical expectation of the values D i equals zero. The quantity s D/√n is the standard error of D̄. Therefore, the value of the criterion t n-1 just described is essentially the value of D̄ expressed as a fraction of its standard error. As we said above (when discussing methods for constructing confidence intervals), this indicator can be used to judge how probable the observed value of D̄ is. The difference is that above we were talking about a simple arithmetic mean, which is normally distributed, while here we are talking about a mean of differences, and such means have a Student distribution. But the reasoning about the relationship between the probability of a deviation of the sample mean from zero (when the mathematical expectation equals zero) and the number of standard errors this deviation amounts to remains in force.

Example. The income of pharmacies in one of the city's microdistricts for a certain period amounted to 128; 192; 223; 398; 205; 266; 219; 260; 264; 98 (conventional units). In the neighboring microdistrict for the same time they were equal to 286; 240; 263; 266; 484; 223; 335.
For both samples, calculate the mean, the corrected (unbiased) variance and the standard deviation. Find the range of variation, the average absolute (linear) deviation, the coefficient of variation, the linear coefficient of variation and the oscillation coefficient.
Assuming that this random variable has a normal distribution, determine the confidence interval for the general mean (in both cases).
Using the Fisher criterion, check the hypothesis of equality of general variances. Using the Student's test, check the hypothesis about the equality of general means (the alternative hypothesis is about their inequality).
In all calculations, the significance level is α = 0.05.

We carry out the solution as follows.
1. Find the variation indicators for the first sample.

   x      |x − x̄|     (x − x̄)²
   98      127.3      16205.29
  128       97.3       9467.29
  192       33.3       1108.89
  205       20.3        412.09
  219        6.3         39.69
  223        2.3          5.29
  260       34.7       1204.09
  264       38.7       1497.69
  266       40.7       1656.49
  398      172.7      29825.29
Σ 2253      573.6      61422.1


Distribution center indicators.
Simple arithmetic mean: x̄ = Σx / n = 2253 / 10 = 225.3.

Variation indicators.
The range of variation is the difference between the maximum and minimum values of the characteristic:

R = X max - X min
R = 398 - 98 = 300
Average linear deviation: d = Σ|x − x̄| / n = 573.6 / 10 = 57.36.
Each value of the series deviates from the mean by 57.36 on average.
Variance: D = Σ(x − x̄)² / n = 61422.1 / 10 = 6142.21.
Unbiased (corrected) variance estimate: s² = Σ(x − x̄)² / (n − 1) = 61422.1 / 9 = 6824.68.
Standard deviation: σ = √D = √6142.21 ≈ 78.37.
Each value of the series deviates from the mean value 225.3 by 78.37 on average.
Estimate of the standard deviation: s = √s² = √6824.68 ≈ 82.61.

Coefficient of variation: v = σ / x̄ = 78.37 / 225.3 ≈ 34.8%.
Since v > 30% but v < 70%, the variation is moderate.
Linear coefficient of variation: d / x̄ = 57.36 / 225.3 ≈ 25.5%.
Oscillation coefficient: R / x̄ = 300 / 225.3 ≈ 133.2%.

Interval estimation of the population center.
Confidence interval for the general mean: x̄ ± t·s/√n.
Using the Student's table we find:
T table (n-1;α/2) = T table (9;0.025) = 2.262
225.3 ± 2.262·82.61/√10 = 225.3 ± 59.09, i.e. (166.21; 284.39).
With a probability of 0.95 it can be stated that with a larger sample size the mean value will not fall outside the found interval.

2. Find the variation indicators for the second sample.
Let's rank the row. To do this, we sort its values ​​in ascending order.
Table for calculating indicators.

   x      |x − x̄|     (x − x̄)²
  223       76.57      5863.18
  240       59.57      3548.76
  263       36.57      1337.47
  266       33.57      1127.04
  286       13.57       184.18
  335       35.43      1255.18
  484      184.43     34013.9
Σ 2097      439.71     47329.71

To evaluate the distribution series, we find the following indicators:
Distribution center indicators.
Simple arithmetic mean: x̄ = Σx / n = 2097 / 7 = 299.57.

Variation indicators.
Absolute indicators of variation.
The range of variation is the difference between the maximum and minimum values of the characteristic of the primary series:
R = X max - X min
R = 484 - 223 = 261
Average linear deviation (calculated in order to take into account the differences of all units of the population under study): d = Σ|x − x̄| / n = 439.71 / 7 = 62.82.
Each value of the series deviates from the mean by 62.82 on average.
Variance (characterizes the measure of dispersion around the mean value, i.e. the deviation from the mean): D = Σ(x − x̄)² / n = 47329.71 / 7 = 6761.39.
Unbiased variance estimate (consistent estimate of the variance, corrected variance): s² = Σ(x − x̄)² / (n − 1) = 47329.71 / 6 = 7888.29.
Standard deviation: σ = √D ≈ 82.23.
Each value of the series deviates from the mean value 299.57 by 82.23 on average.
Estimate of the standard deviation: s = √s² ≈ 88.82.

Relative variation measures.
The relative indicators of variation include: the oscillation coefficient, the linear coefficient of variation, the relative linear deviation.
Coefficient of variation (a measure of the relative dispersion of the population values: it shows what proportion of the mean value is made up by the average spread): v = σ / x̄ = 82.23 / 299.57 ≈ 27.4%.
Since v ≤ 30%, the population is homogeneous and the variation is weak. The results obtained can be trusted.
Linear coefficient of variation, or relative linear deviation (characterizes the share of the mean value taken up by the average absolute deviation): d / x̄ = 62.82 / 299.57 ≈ 21.0%.
Oscillation coefficient (reflects the relative fluctuation of the extreme values of the characteristic around the mean): R / x̄ = 261 / 299.57 ≈ 87.1%.

Interval estimation of the population center.
Confidence interval for the general mean: x̄ ± t·s/√n.
Using the Student's table we find:
T table (n-1;α/2) = T table (6;0.025) = 2.447
299.57 ± 2.447·88.82/√7 = 299.57 ± 82.14, i.e. (217.43; 381.71).
With a probability of 0.95, it can be stated that the average value with a larger sample size will not fall outside the found interval.
We test the hypothesis of equality of variances:
H 0: D x = D y
H 1: D x ≠ D y
Let's find the observed value of the Fisher criterion.
Since s y ² > s x ², we take s b ² = s y ² = 7888.29 and s m ² = s x ² = 6824.68, so
F obs = s b ² / s m ² = 7888.29 / 6824.68 ≈ 1.16.
Number of degrees of freedom:
f 1 = n y – 1 = 7 – 1 = 6
f 2 = n x – 1 = 10 – 1 = 9
Using the table of critical points of the Fisher–Snedecor distribution at a significance level of α = 0.05 and the given numbers of degrees of freedom, we find F cr (6;9) = 3.37.
Since F obs < F cr, there is no reason to reject the hypothesis of equality of the general variances.

We test the hypothesis about the equality of general means:
H 0: μ x = μ y
H 1: μ x ≠ μ y
Let's find the experimental value of the Student's criterion:
t obs = (x̄ − ȳ) / ( √(((n x − 1)s x ² + (n y − 1)s y ²)/(n x + n y − 2)) · √(1/n x + 1/n y) ) = (225.3 − 299.57) / ( √((61422.1 + 47329.71)/15) · √(1/10 + 1/7) ) ≈ −1.77.
Number of degrees of freedom: f = n x + n y – 2 = 10 + 7 – 2 = 15.
Using the table of critical points of the Student distribution at a significance level of α = 0.05 and the given number of degrees of freedom, we find t cr = T table (15;0.025) = 2.131.
Since |t obs | < t cr, there is no reason to reject the hypothesis of equality of the general means: the difference between the average incomes of the pharmacies in the two microdistricts is not statistically significant.
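The whole example can be checked with a few lines of Python. This is a sketch assuming the two income samples given above; note that the F-critical call uses 6 and 9 degrees of freedom because the larger corrected variance here belongs to the second sample.

```python
import numpy as np
from scipy import stats

x = np.array([128, 192, 223, 398, 205, 266, 219, 260, 264, 98], dtype=float)
y = np.array([286, 240, 263, 266, 484, 223, 335], dtype=float)
alpha = 0.05

# Fisher's F-test of equality of variances (larger corrected variance in the numerator)
sx2, sy2 = x.var(ddof=1), y.var(ddof=1)                 # ~6824.68 and ~7888.29
F = max(sx2, sy2) / min(sx2, sy2)                       # ~1.16
F_cr = stats.f.ppf(1 - alpha, len(y) - 1, len(x) - 1)   # F(6; 9) ~ 3.37
print(f"F = {F:.2f}, F_cr = {F_cr:.2f}")                # F < F_cr -> variances may be equal

# Student's t-test of equality of means with pooled variance
t, p = stats.ttest_ind(x, y, equal_var=True)            # t ~ -1.77
t_cr = stats.t.ppf(1 - alpha / 2, len(x) + len(y) - 2)  # ~2.131
print(f"t = {t:.2f}, t_cr = {t_cr:.3f}, p = {p:.3f}")   # |t| < t_cr -> means may be equal
```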


