The variation series consists of: Definition of variation series

As a result of mastering this chapter, the student must:

know

  • indicators of variation and the relationships between them;
  • the basic laws of distribution of characteristics;
  • the essence of goodness-of-fit criteria;

be able to

  • calculate indicators of variation and goodness-of-fit criteria;
  • determine distribution characteristics;
  • evaluate the basic numerical characteristics of statistical distribution series;

master

  • methods of statistical analysis of distribution series;
  • the basics of analysis of variance;
  • techniques for checking statistical distribution series for compliance with the basic laws of distribution.

Variation indicators

In the statistical study of various populations, particular interest attaches to the variation of a characteristic across individual units of the population and to the nature of the distribution of units by this characteristic. Variation is the difference in the individual values of a characteristic among the units of the population being studied. The study of variation is of great practical importance. By the degree of variation, one can judge the limits within which a characteristic varies, the homogeneity of the population with respect to that characteristic, the typicality of the average, and the relationship of the factors that determine the variation. Variation indicators are used to characterize and organize statistical populations.

The results of the summary and grouping of statistical observation materials, presented in the form of statistical distribution series, represent an ordered distribution of the units of the population under study into groups according to a grouping (varying) characteristic. If a qualitative characteristic is taken as the basis for the grouping, the distribution series is called attributive (distribution by profession, gender, color, etc.). If a distribution series is constructed on a quantitative characteristic, it is called variational (distribution by height, weight, salary, etc.). To construct a variation series means to order the quantitative distribution of population units by the values of the characteristic, count the number of population units with each value (the frequency), and arrange the results in a table.

Instead of the frequency of a variant, it is possible to use its ratio to the total number of observations, which is called the relative frequency.

There are two types of variation series: discrete and interval. A discrete series is a variation series based on characteristics that change discontinuously (discrete characteristics). These include the number of employees at an enterprise, the tariff category, the number of children in a family, etc. A discrete variation series is a table of two columns. The first column gives the specific value of the attribute, and the second gives the number of units in the population with that value.

If a characteristic changes continuously (amount of income, length of service, cost of the fixed assets of an enterprise, etc., which within certain limits can take any value), an interval variation series can be constructed for it. The table of an interval variation series also has two columns. The first gives the value of the attribute as an interval "from - to" (the variants), and the second gives the number of units that fall into the interval (the frequency). Frequency (repetition frequency) is the number of repetitions of a particular variant of the attribute values. Intervals can be closed or open. Closed intervals are bounded on both sides, i.e. they have both a lower ("from") and an upper ("to") boundary. Open intervals have only one boundary, either upper or lower. If the variants are arranged in ascending or descending order, the series is called ranked.

For variation series, two cumulative characteristics are used: the cumulative frequency and the cumulative relative frequency. The cumulative frequency shows in how many observations the characteristic took values not exceeding a given one. It is determined by summing the frequency of a given group with the frequencies of all preceding groups. The cumulative relative frequency characterizes the share of observation units whose characteristic values do not exceed the upper limit of the given group. Thus, it shows the proportion of variants in the population that have a value no greater than the given one. Frequency, relative frequency, absolute and relative densities, and the cumulative frequency and cumulative relative frequency are characteristics of the magnitude of a variant.
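As an illustration of the construction described above, the following minimal sketch builds a discrete variation series from raw observations. The sample data and all variable names are hypothetical and serve only to show the steps: ranking the variants, counting frequencies, and deriving relative and cumulative frequencies.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample: number of children in 10 families (illustrative data only).
observations = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

counts = Counter(observations)
variants = sorted(counts)                    # ranked variants x
freqs = [counts[x] for x in variants]        # frequencies f
n = sum(freqs)                               # total number of observations
rel_freqs = [f / n for f in freqs]           # relative frequencies f / n
cum_freqs = list(accumulate(freqs))          # cumulative frequencies
cum_rel_freqs = [c / n for c in cum_freqs]   # cumulative relative frequencies

# Print the series as rows: variant, frequency, relative, cumulative, cumulative relative.
for row in zip(variants, freqs, rel_freqs, cum_freqs, cum_rel_freqs):
    print(row)
```

The last cumulative frequency always equals the total number of observations, which is a convenient check on the table.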

The variation of a characteristic across the units of a population, as well as the nature of the distribution, is studied using the indicators and characteristics of the variation series. These include the average level of the series, the average linear deviation, the standard deviation, the variance, and the coefficients of oscillation, variation, asymmetry, kurtosis, etc.

Average values are used to characterize the center of the distribution. The average is a generalizing statistical characteristic that quantifies the typical level of a characteristic possessed by members of the population being studied. However, arithmetic means may coincide for different distribution patterns. Therefore the so-called structural means are also calculated as characteristics of variation series: the mode, the median, and the quantiles that divide the distribution series into equal parts (quartiles, deciles, percentiles, etc.).

The mode is the value of a characteristic that occurs in the distribution series more often than its other values. For discrete series, this is the variant with the highest frequency. In interval variation series, to determine the mode one must first find the interval in which it lies, the so-called modal interval. In a variation series with equal intervals the modal interval is identified by the highest frequency; in series with unequal intervals, by the highest distribution density. The mode in a series with equal intervals is then determined by the formula

$$Mo = x_{Mo} + h \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})},$$

where $Mo$ is the value of the mode; $x_{Mo}$ is the lower limit of the modal interval; $h$ is the width of the modal interval; $f_{Mo}$ is the frequency of the modal interval; $f_{Mo-1}$ is the frequency of the premodal interval; $f_{Mo+1}$ is the frequency of the postmodal interval. For a series with unequal intervals, the distribution densities of the same three intervals should be used in this formula instead of the frequencies $f_{Mo-1}$, $f_{Mo}$, $f_{Mo+1}$.
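The calculation of the mode in an interval series with equal intervals can be sketched as follows. The sketch assumes the usual interpolation formula, in which the lower limit of the modal interval is shifted by a share of the interval width determined by the frequency differences with the two neighboring intervals. The function name and the sample series are hypothetical.

```python
def interval_mode(lower, h, f_prev, f_mo, f_next):
    """Mode of an interval series with equal intervals.

    lower  - lower limit of the modal interval
    h      - width of the modal interval
    f_prev - frequency of the premodal interval
    f_mo   - frequency of the modal interval
    f_next - frequency of the postmodal interval
    """
    denominator = (f_mo - f_prev) + (f_mo - f_next)
    if denominator == 0:
        # All three frequencies are equal, so the formula does not single out a point.
        raise ValueError("mode is not defined by the interpolation formula")
    return lower + h * (f_mo - f_prev) / denominator


# Hypothetical series: intervals 10-20, 20-30, 30-40 with frequencies 4, 10, 6.
mode = interval_mode(lower=20, h=10, f_prev=4, f_mo=10, f_next=6)
print(mode)  # 26.0
```

For a series with unequal intervals, the same function could be called with distribution densities in place of the frequencies.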

If there is a single mode, the probability distribution of the random variable is called unimodal; if there is more than one mode, it is called multimodal (polymodal), and in the case of two modes, bimodal. As a rule, multimodality indicates that the distribution under study does not obey the normal distribution law. Homogeneous populations are, as a rule, characterized by single-peaked distributions. Multiple peaks also indicate the heterogeneity of the population being studied. The appearance of two or more peaks makes it necessary to regroup the data in order to identify more homogeneous groups.

In an interval variation series, the mode can be determined graphically using a histogram. To do this, draw two intersecting lines from the top points of the highest column of the histogram to the top points of two adjacent columns. Then, from the point of their intersection, a perpendicular is lowered onto the abscissa axis. The value of the feature on the x-axis corresponding to the perpendicular is the mode. In many cases, when characterizing a population as a generalized indicator, preference is given to the mode rather than the arithmetic mean.

The median is the central value of a characteristic: it is the value possessed by the central member of the ranked distribution series. In discrete series, to find the median, its ordinal number is determined first. If the number of units is odd, one is added to the sum of all frequencies and the result is divided by two. If the number of units in the series is even, there are two median units, and the median is defined as the average of their values. Thus, the median in a discrete variation series is the value that divides the series into two parts containing the same number of variants.

In interval series, after the ordinal number of the median has been determined, the median interval is found from the cumulative frequencies (or cumulative relative frequencies), and the value of the median itself is then determined by the formula

$$Me = x_{Me} + h \cdot \frac{\frac{1}{2}\sum f_i - S_{Me-1}}{f_{Me}},$$

where $Me$ is the value of the median; $x_{Me}$ is the lower limit of the median interval; $h$ is the width of the median interval; $\sum f_i$ is the sum of the frequencies of the distribution series; $S_{Me-1}$ is the cumulative frequency of the pre-median interval; $f_{Me}$ is the frequency of the median interval.
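The procedure for the median of an interval series can be sketched in code. The sketch assumes equal interval widths and locates the median interval as the first one whose cumulative frequency reaches half of the total; the function name and the sample series are hypothetical.

```python
from itertools import accumulate


def interval_median(lowers, h, freqs):
    """Median of an interval series with equal interval width h.

    lowers - lower limits of the intervals, in ascending order
    freqs  - frequencies of the intervals
    """
    half = sum(freqs) / 2
    cum = list(accumulate(freqs))
    # Median interval: the first interval whose cumulative frequency reaches half the total.
    k = next(i for i, c in enumerate(cum) if c >= half)
    before = cum[k - 1] if k > 0 else 0   # cumulative frequency of the pre-median interval
    return lowers[k] + h * (half - before) / freqs[k]


# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50.
median = interval_median(lowers=[10, 20, 30, 40], h=10, freqs=[4, 10, 6, 5])
print(median)  # 28.5
```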

The median can be found graphically using the cumulative curve (cumulate). To do this, a straight line parallel to the abscissa axis is drawn from the point on the scale of cumulative frequencies (or cumulative relative frequencies) that corresponds to the ordinal number of the median, until it intersects the cumulative curve. From the point of intersection, a perpendicular is then dropped to the abscissa axis. The value of the attribute on the x-axis at the foot of this perpendicular is the median.

The median is characterized by the following properties.

  • 1. It does not depend on those attribute values ​​that are located on either side of it.
  • 2. It has the property of minimality, which means that the sum of absolute deviations of the attribute values ​​from the median represents a minimum value compared to the deviation of the attribute values ​​from any other value.
  • 3. When combining two distributions with known medians, it is impossible to predict in advance the value of the median of the new distribution.

These properties of the median are widely used in choosing locations for public service points (queuing systems): schools, clinics, gas stations, water points, etc. For example, if a clinic is planned for a certain block of the city, it is more expedient to place it at the point that halves not the length of the block but the number of its residents.

The relationship between the mode, the median and the arithmetic mean indicates the nature of the distribution of the characteristic in the population and makes it possible to assess the symmetry of the distribution. If $Mo < Me < \bar{x}$, the series has right-sided asymmetry; if $\bar{x} < Me < Mo$, the asymmetry is left-sided. With a normal distribution, $\bar{x} = Me = Mo$.

K. Pearson, on the basis of fitting various types of curves, determined that for moderately asymmetric distributions the following approximate relationships between the arithmetic mean, the median and the mode are valid:

$$\bar{x} - Mo \approx 3(\bar{x} - Me), \quad \text{i.e.} \quad Mo \approx 3Me - 2\bar{x},$$

where $Me$ is the value of the median; $Mo$ is the value of the mode; $\bar{x}$ is the value of the arithmetic mean.

If the structure of the variation series needs to be studied in more detail, characteristic values similar to the median are calculated. Such values divide all units of the distribution into equal numbers; they are called quantiles (or gradients). Quantiles are subdivided into quartiles, deciles, percentiles, etc.

Quartiles divide the population into four equal parts. The first quartile is calculated similarly to the median, after the first quartile interval has been determined:

$$Q_1 = x_{Q_1} + h \cdot \frac{\frac{1}{4}\sum f_i - S_{Q_1-1}}{f_{Q_1}},$$

where $Q_1$ is the value of the first quartile; $x_{Q_1}$ is the lower limit of the first quartile interval; $h$ is the width of the first quartile interval; $f_i$ are the frequencies of the interval series; $S_{Q_1-1}$ is the cumulative frequency in the interval preceding the first quartile interval; $f_{Q_1}$ is the frequency of the first quartile interval.

The first quartile shows that 25% of the population units are less than its value, and 75% are more. The second quartile is equal to the median, i.e. $Q_2 = Me$.

By analogy, the third quartile is calculated, after the third quartile interval has been found:

$$Q_3 = x_{Q_3} + h \cdot \frac{\frac{3}{4}\sum f_i - S_{Q_3-1}}{f_{Q_3}},$$

where $x_{Q_3}$ is the lower limit of the third quartile interval; $h$ is the width of the third quartile interval; $f_i$ are the frequencies of the interval series; $S_{Q_3-1}$ is the cumulative frequency in the interval preceding the third quartile interval; $f_{Q_3}$ is the frequency of the third quartile interval.

The third quartile shows that 75% of the population units are less than its value, and 25% are more.

The difference between the third and first quartiles is the interquartile range:

$$\Delta Q = Q_3 - Q_1,$$

where $\Delta Q$ is the value of the interquartile range; $Q_3$ is the value of the third quartile; $Q_1$ is the value of the first quartile.
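The quartile calculations differ from the median only in the share of the total frequency that is sought, so they can be sketched with one general helper. This is a minimal sketch assuming linear interpolation within the quantile interval, as in the median formula; the function name and the data are hypothetical.

```python
from itertools import accumulate


def interval_quantile(lowers, widths, freqs, p):
    """Quantile of order p (0 < p < 1) for an interval series.

    lowers - lower limits of the intervals, in ascending order
    widths - widths of the intervals
    freqs  - frequencies of the intervals
    """
    target = p * sum(freqs)
    cum = list(accumulate(freqs))
    # Quantile interval: the first one whose cumulative frequency reaches the target.
    k = next(i for i, c in enumerate(cum) if c >= target)
    before = cum[k - 1] if k > 0 else 0
    return lowers[k] + widths[k] * (target - before) / freqs[k]


# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50.
lowers, widths, freqs = [10, 20, 30, 40], [10, 10, 10, 10], [4, 10, 8, 2]

q1 = interval_quantile(lowers, widths, freqs, 0.25)
q2 = interval_quantile(lowers, widths, freqs, 0.50)   # equals the median
q3 = interval_quantile(lowers, widths, freqs, 0.75)
iqr = q3 - q1                                         # interquartile range

print(q1, q2, q3, iqr)  # 22.0 28.0 35.0 13.0
```

Deciles and percentiles could be obtained from the same helper with p equal to 0.1, 0.9, 0.01 and so on.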

Deciles divide the population into 10 equal parts. A decile is a value of a characteristic in a distribution series that corresponds to tenths of the population size. By analogy with quartiles, the first decile shows that 10% of the population units are less than its value and 90% are greater, and the ninth decile shows that 90% of the population units are less than its value and 10% are greater. The ratio of the ninth decile to the first, i.e. the decile coefficient, is widely used in the study of income differentiation to measure the ratio of the income levels of the 10% most affluent and the 10% least affluent of the population. Percentiles divide the ranked population into 100 equal parts. The calculation, meaning and application of percentiles are similar to those of deciles.

Quartiles, deciles and other structural characteristics can be determined graphically from the cumulative curve, by analogy with the median.

To measure the size of variation, the following indicators are used: the range of variation, the average linear deviation, the standard deviation, and the variance. The magnitude of the range of variation depends entirely on the randomness of the distribution of the extreme members of the series. This indicator is of interest in cases where it is important to know the amplitude of fluctuations in the values of a characteristic:

$$R = x_{max} - x_{min},$$

where $R$ is the value of the range of variation; $x_{max}$ is the maximum value of the attribute; $x_{min}$ is the minimum value of the attribute.

The range of variation does not take into account the values of the vast majority of the members of the series, although variation is associated with every value in the series. Indicators that are averages of the deviations of individual values from their mean do not have this drawback: these are the average linear deviation and the standard deviation. There is a direct relationship between individual deviations from the average and the variability of a particular characteristic. The stronger the variability, the larger the absolute size of the deviations from the average.

The average linear deviation is the arithmetic mean of the absolute values of the deviations of individual variants from their average value.

Average linear deviation for ungrouped data:

$$\bar{l} = \frac{\sum |x_i - \bar{x}|}{n},$$

where $\bar{l}$ is the value of the average linear deviation; $x_i$ is the value of the attribute; $\bar{x}$ is the average value of the characteristic for the population being studied; $n$ is the number of units in the population.

Average linear deviation of a grouped series:

$$\bar{l} = \frac{\sum |x_i - \bar{x}| f_i}{\sum f_i},$$

where $\bar{l}$ is the value of the average linear deviation; $x_i$ is the value of the attribute; $\bar{x}$ is the average value of the characteristic for the population being studied; $f_i$ is the number of population units in a separate group.

The signs of the deviations are ignored in this case; otherwise the sum of all deviations would equal zero. Depending on whether the analyzed data are grouped, the average linear deviation is calculated by different formulas: for ungrouped and for grouped data. Because of its conventional nature, the average linear deviation is used relatively rarely in practice separately from other indicators of variation (in particular, to characterize the fulfillment of contractual obligations for uniformity of delivery, and in the analysis of foreign trade turnover, the composition of workers, the rhythm of production, and product quality with regard to the technological features of production, etc.).

The standard deviation characterizes how much, on average, the individual values of the characteristic being studied deviate from the average value of the population, and it is expressed in the units of measurement of that characteristic. The standard deviation, being one of the main measures of variation, is widely used in assessing the limits of variation of a characteristic in a homogeneous population, in determining the ordinates of a normal distribution curve, and in calculations related to the organization of sample observation and establishing the accuracy of sample characteristics. The standard deviation of ungrouped data is calculated by the following algorithm: each deviation from the mean is squared, all the squares are summed, the sum of squares is divided by the number of terms of the series, and the square root is taken of the quotient:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}},$$

where $\sigma$ is the value of the standard deviation; $x_i$ is the value of the attribute; $\bar{x}$ is the average value of the characteristic for the population being studied; $n$ is the number of units in the population.

For grouped data, the standard deviation is calculated using the weighted formula

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}},$$

where $\sigma$ is the value of the standard deviation; $x_i$ is the value of the attribute; $\bar{x}$ is the average value of the characteristic for the population being studied; $f_i$ is the number of population units in a particular group.
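The absolute indicators of variation for a grouped series can be computed together, as in the following minimal sketch. The data are hypothetical, and the weighted formulas are assumed in their usual form, with deviations taken from the weighted arithmetic mean.

```python
import math

# Hypothetical grouped data (illustrative only): variants x_i and their frequencies f_i.
values = [2, 4, 6, 8]
freqs = [1, 3, 4, 2]

n = sum(freqs)
mean = sum(x * f for x, f in zip(values, freqs)) / n

# Range of variation: difference between the extreme values of the attribute.
variation_range = max(values) - min(values)

# Average linear deviation: weighted mean of absolute deviations from the mean.
mean_linear_dev = sum(abs(x - mean) * f for x, f in zip(values, freqs)) / n

# Variance and standard deviation: weighted mean of squared deviations and its root.
variance = sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n
std_dev = math.sqrt(variance)

print(mean, variation_range, mean_linear_dev, variance, std_dev)
```

For these data the mean is 5.4, the average linear deviation is about 1.52 and the standard deviation is about 1.8, which illustrates that the standard deviation is never smaller than the average linear deviation.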

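The expression under the root in both cases is called the variance. Thus, the variance is calculated as the mean square of the deviations of the attribute values from their average value. For unweighted (simple) attribute values, the variance is determined as follows:

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}.$$

For weighted characteristic values:

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}.$$

There is also a simplified method for calculating the variance. In general form:

$$\sigma^2 = \overline{x^2} - (\bar{x})^2;$$

for unweighted (simple) characteristic values:

$$\sigma^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2;$$

for weighted characteristic values:

$$\sigma^2 = \frac{\sum x_i^2 f_i}{\sum f_i} - \left(\frac{\sum x_i f_i}{\sum f_i}\right)^2;$$

using the method of counting from a conventional zero:

$$\sigma^2 = \frac{\sum \left(\frac{x_i - A}{h}\right)^2 f_i}{\sum f_i} \cdot h^2 - (\bar{x} - A)^2,$$

where $\sigma^2$ is the value of the variance; $x_i$ is the value of the attribute; $\bar{x}$ is the average value of the characteristic; $h$ is the value of the group interval; $f_i$ is the weight; $A$ is the conventional zero, i.e. a conventionally chosen constant.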
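The equivalence of the direct and simplified methods can be checked numerically. The sketch below is an illustration on hypothetical data; the shortcut is taken as the mean of squares minus the square of the mean, and the conventional-zero variant assumes a constant A and an interval width h chosen arbitrarily for the example.

```python
import math

# Hypothetical grouped data (illustrative only).
values = [2, 4, 6, 8]
freqs = [1, 3, 4, 2]

n = sum(freqs)
mean = sum(x * f for x, f in zip(values, freqs)) / n

# Direct method: weighted mean of squared deviations from the mean.
direct = sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n

# Simplified method: mean of squares minus square of the mean.
shortcut = sum(x * x * f for x, f in zip(values, freqs)) / n - mean ** 2

# Conventional-zero method: deviations are counted from A and scaled by h (both assumed).
A, h = 6, 2
second_moment = sum(((x - A) / h) ** 2 * f for x, f in zip(values, freqs)) / n
conventional_zero = second_moment * h ** 2 - (mean - A) ** 2

print(direct, shortcut, conventional_zero)
```

All three values agree up to floating-point rounding, which is why the choice between the methods is a matter of computational convenience only.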

The variance has an independent meaning in statistics and is one of the most important indicators of variation. It is measured in units equal to the square of the units of measurement of the characteristic being studied.

The dispersion has the following properties.

  • 1. The variance of a constant value is zero.
  • 2. Reducing all values of a characteristic by the same value A does not change the variance. This means that the mean square of deviations can be calculated not from the given values of a characteristic but from their deviations from some constant number.
  • 3. Reducing all values of a characteristic by a factor of k reduces the variance by a factor of k² and the standard deviation by a factor of k. That is, all values of the attribute can be divided by some constant (say, the interval width of the series), the standard deviation calculated, and the result then multiplied by that constant.
  • 4. If the mean square of deviations is calculated from any value A that differs from the arithmetic mean, it will always be greater than the mean square of deviations calculated from the arithmetic mean. It is greater by a definite amount: the square of the difference between the mean and this conventionally chosen value.
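These properties can be verified on a small example. The sketch below uses `statistics.pvariance` from the Python standard library, which computes the population variance (division by the number of units, as in this chapter); the data and the constants A and k are hypothetical.

```python
import math
from statistics import mean, pvariance

data = [3, 5, 7, 9, 11]   # hypothetical ungrouped values
A, k = 4, 2               # arbitrary constants chosen for the illustration

base = pvariance(data)

# Property 1: the variance of a constant is zero.
constant_var = pvariance([5, 5, 5, 5])

# Property 2: subtracting the same value A from every variant leaves the variance unchanged.
shifted_var = pvariance([x - A for x in data])

# Property 3: dividing every variant by k divides the variance by k squared.
scaled_var = pvariance([x / k for x in data])

# Property 4: the mean square of deviations from A exceeds the variance by (mean - A) squared.
mean_sq_from_A = sum((x - A) ** 2 for x in data) / len(data)

print(base, constant_var, shifted_var, scaled_var, mean_sq_from_A)
```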

The variation of an alternative characteristic consists in the presence or absence of the studied property in the units of the population. Quantitatively, the variation of an alternative attribute is expressed by two values: the presence of the studied property in a unit is denoted by one (1), and its absence by zero (0). The proportion of units that have the property is denoted by $p$, and the proportion of units that do not have it is denoted by $q$. The variance of an alternative attribute is equal to the product of the proportion of units possessing the property ($p$) and the proportion of units not possessing it ($q$): $\sigma^2 = pq$. The greatest variation is achieved when 50% of the population has the characteristic and the other 50% does not. The variance then reaches its maximum value of 0.25: $p = 0.5$, $q = 1 - p = 1 - 0.5 = 0.5$ and $\sigma^2 = 0.5 \cdot 0.5 = 0.25$. The lower limit of this indicator is zero, which corresponds to a situation in which there is no variation in the population. In practice, the variance of an alternative characteristic is used to construct confidence intervals in sample observation.
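The variance of an alternative attribute can be illustrated with a short sketch. The function name and the 0/1 sample are hypothetical; the check against the direct calculation assumes the attribute is coded as 1 for presence and 0 for absence, as described above.

```python
import math


def alternative_variance(p):
    """Variance of an alternative (0/1) attribute with share p of units having the property."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    return p * (1 - p)


# Hypothetical population: 3 units have the property, 7 do not.
data = [1] * 3 + [0] * 7
p = sum(data) / len(data)

# Direct calculation: mean square of deviations of the 0/1 values from their mean p.
direct = sum((x - p) ** 2 for x in data) / len(data)

print(alternative_variance(p), direct)
print(alternative_variance(0.5))  # 0.25, the maximum possible value
```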

The smaller the variance and the standard deviation, the more homogeneous the population and the more typical the average. In statistical practice, there is often a need to compare the variation of different characteristics. For example, it is of interest to compare the variation in the age of workers and their qualifications, length of service and wages, cost and profit, length of service and labor productivity, etc. Indicators of absolute variability are unsuitable for such comparisons: the variability of work experience, expressed in years, cannot be compared with the variation of wages, expressed in rubles. For such comparisons, as well as for comparing the variability of the same characteristic in several populations with different arithmetic means, relative indicators of variation are used: the oscillation coefficient, the linear coefficient of variation, and the coefficient of variation, which show the extent to which values fluctuate around the average.

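Oscillation coefficient:

$$V_R = \frac{R}{\bar{x}} \cdot 100\%,$$

where $V_R$ is the value of the oscillation coefficient; $R$ is the value of the range of variation; $\bar{x}$ is the average value of the characteristic for the population being studied.

Linear coefficient of variation:

$$V_{\bar{l}} = \frac{\bar{l}}{\bar{x}} \cdot 100\%,$$

where $V_{\bar{l}}$ is the value of the linear coefficient of variation; $\bar{l}$ is the value of the average linear deviation; $\bar{x}$ is the average value of the characteristic for the population being studied.

Coefficient of variation:

$$V_\sigma = \frac{\sigma}{\bar{x}} \cdot 100\%,$$

where $V_\sigma$ is the value of the coefficient of variation; $\sigma$ is the value of the standard deviation; $\bar{x}$ is the average value of the characteristic for the population being studied.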

The coefficient of oscillation is the percentage ratio of the range of variation to the average value of the characteristic being studied, and the linear coefficient of variation is the percentage ratio of the average linear deviation to that average value. The coefficient of variation is the percentage ratio of the standard deviation to the average value of the characteristic being studied. As a relative value expressed as a percentage, the coefficient of variation is used to compare the degree of variation of different characteristics. The coefficient of variation is also used to assess the homogeneity of a statistical population. If it is less than 33%, the population under study is considered homogeneous and the variation weak. If it is more than 33%, the population is considered heterogeneous, the variation strong, and the average value atypical, so that it cannot be used as a general indicator of the population. In addition, coefficients of variation are used to compare the variability of one characteristic in different populations, for example to assess the variation in the length of service of workers at two enterprises. The higher the coefficient, the more significant the variation of the characteristic.
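The three relative indicators and the 33% homogeneity rule can be sketched as follows. The function name, the dictionary keys and the data are hypothetical, and the sketch assumes that every listed variant has a nonzero frequency, so that the extreme variants define the range.

```python
import math


def variation_coefficients(values, freqs):
    """Relative indicators of variation for a grouped series, in percent."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    value_range = max(values) - min(values)
    linear_dev = sum(abs(x - mean) * f for x, f in zip(values, freqs)) / n
    sigma = math.sqrt(sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n)
    return {
        "oscillation": 100 * value_range / mean,   # V_R
        "linear": 100 * linear_dev / mean,         # linear coefficient of variation
        "variation": 100 * sigma / mean,           # V_sigma
    }


# Hypothetical grouped data (illustrative only).
coeffs = variation_coefficients(values=[18, 20, 22, 24], freqs=[1, 3, 4, 2])

# Rule of thumb from the text: below 33% the population is treated as homogeneous.
homogeneous = coeffs["variation"] < 33

print(coeffs, homogeneous)
```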

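Based on the calculated quartiles, it is also possible to calculate the relative indicator of quartile variation using the formula

$$V_Q = \frac{Q_3 - Q_1}{2 Q_2} \cdot 100\%,$$

where $Q_2$ is the second quartile (the median) and $Q_1$ and $Q_3$ are the first and third quartiles.

The interquartile range is determined by the formula

$$\Delta Q = Q_3 - Q_1.$$

The quartile deviation is used instead of the range of variation to avoid the disadvantages associated with using extreme values:

$$Q = \frac{Q_3 - Q_1}{2}.$$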

For variation series with unequal intervals, the distribution density is also calculated. It is defined as the quotient of the corresponding frequency or relative frequency divided by the width of the interval. In series with unequal intervals, absolute and relative distribution densities are used. The absolute distribution density is the frequency per unit length of the interval. The relative distribution density is the relative frequency per unit length of the interval.
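A minimal sketch of the density calculation for a series with unequal intervals is given below. The interval bounds and frequencies are hypothetical; the modal interval is chosen by the highest absolute density rather than by the highest frequency, since the frequencies of unequal intervals are not directly comparable.

```python
# Hypothetical interval series with unequal intervals (illustrative data only).
bounds = [(0, 10), (10, 30), (30, 60), (60, 100)]
freqs = [5, 20, 15, 10]

n = sum(freqs)
widths = [upper - lower for lower, upper in bounds]

# Absolute density: frequency per unit of interval length.
abs_density = [f / h for f, h in zip(freqs, widths)]

# Relative density: relative frequency per unit of interval length.
rel_density = [(f / n) / h for f, h in zip(freqs, widths)]

# With unequal intervals the modal interval is the one with the highest density.
modal_index = max(range(len(bounds)), key=lambda i: abs_density[i])

print(abs_density)          # [0.5, 1.0, 0.5, 0.25]
print(bounds[modal_index])  # (10, 30)
```

The relative densities multiplied by the interval widths sum to one, which corresponds to the total area of a histogram built on relative densities.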

All of the above is true for distribution series whose distribution law is well described by the normal distribution law or is close to it.

The concept of a variation series. The first step in systematizing statistical observation materials is to count the number of units that have a particular characteristic. By arranging the units in ascending or descending order of their quantitative characteristic and counting the number of units with a specific value of the characteristic, we obtain a variation series. A variation series characterizes the distribution of units of a certain statistical population according to some quantitative characteristic.

The variation series consists of two columns. The left column contains the values of the varying characteristic, called variants and denoted (x), and the right column contains absolute numbers showing how many times each variant occurs. The indicators in this column are called frequencies and are denoted (f).

The variation series can be schematically presented in the form of Table 5.1:

Table 5.1. Type of variation series

Variants (x) | Frequencies (f)

In the right column, relative indicators can also be used, characterizing the share of the frequency of each variant in the total sum of frequencies. These relative indicators are called relative frequencies and are conventionally denoted by $w$, i.e. $w = f / \sum f$. The sum of all relative frequencies is equal to one. Relative frequencies can also be expressed as percentages, and then their sum is equal to 100%.

Varying characteristics may be of different kinds. The variants of some characteristics are expressed in integers, for example, the number of rooms in an apartment, the number of books published, etc. These characteristics are called discontinuous, or discrete. The variants of other characteristics can take any value within certain limits, such as the fulfillment of planned targets, wages, etc. These characteristics are called continuous.

Discrete variation series. If the variants of the variation series are expressed as discrete quantities, the series is called discrete; its appearance is shown in Table 5.2:

Table 5.2. Distribution of students according to exam grades

Grades (x) | Number of students (f) | In % of total

The nature of the distribution in discrete series is depicted graphically in the form of a distribution polygon, Fig. 5.1.

Fig. 5.1. Distribution of students according to grades obtained in the exam.

Interval variation series. For continuous characteristics, variation series are constructed as interval series, i.e. the values of the characteristic in them are expressed as intervals "from - to". The minimum value of the characteristic in such an interval is called the lower limit of the interval, and the maximum is called the upper limit of the interval.

Interval variation series are constructed not only for continuous characteristics but also for discontinuous (discrete) ones that vary over a wide range. Interval series can have equal or unequal intervals. In economic practice, unequal intervals that progressively increase or decrease are used most often. The need for them arises especially when the characteristic fluctuates unevenly and within wide limits.

Let us consider the form of an interval series with equal intervals, Table 5.3:

Table 5.3. Distribution of workers by output

Output, thousand rubles (x) | Number of workers (f) | Cumulative frequency (f´)

The interval distribution series is graphically depicted in the form of a histogram, Fig. 5.2.

Fig. 5.2. Distribution of workers by output

Accumulated (cumulative) frequency. In practice, there is a need to transform distribution series into cumulative series, built according to accumulated frequencies. With their help, you can determine structural averages that facilitate the analysis of distribution series data.

Cumulative frequencies are determined by successively adding to the frequency (or relative frequency) of the first group the corresponding indicators of the subsequent groups of the distribution series. Cumulative curves (cumulates) and ogives are used to illustrate distribution series. To construct them, the values of the discrete characteristic (or the ends of the intervals) are marked on the abscissa axis, and the cumulative totals of the frequencies are marked on the ordinate axis, Fig. 5.3.

Fig. 5.3. Cumulative distribution of workers by output

If the scales of frequencies and variants are swapped, i.e. the abscissa axis shows the cumulative frequencies and the ordinate axis shows the values of the variants, then the curve characterizing the change in frequencies from group to group is called the ogive of the distribution, Fig. 5.4.

Fig. 5.4. Ogive of the distribution of workers by output

Variation series with equal intervals satisfy one of the most important requirements for statistical distribution series: they ensure comparability in time and space.

Distribution density. The frequencies of individual intervals in series with unequal intervals, however, are not directly comparable. In such cases, to ensure the necessary comparability, the distribution density is calculated, i.e. it is determined how many units of each group fall on one unit of the interval width.

When constructing a graph of the distribution of a variation series with unequal intervals, the height of the rectangles is determined in proportion not to the frequencies, but to the density indicators of the distribution of the values ​​of the characteristic being studied in the corresponding intervals.

Drawing up a variation series and representing it graphically is the first step in processing the initial data and the first stage in the analysis of the population being studied. The next step in the analysis of variation series is the determination of the main generalizing indicators, called the characteristics of the series. These characteristics should give an idea of the average value of the characteristic among the population units.

Average value. The average value is a generalized characteristic of the characteristic being studied in the population under study, reflecting its typical level per unit of the population under specific conditions of place and time.

The average value is always named and has the same dimension as the characteristic of individual units of the population.

Before calculating average values, it is necessary to group the units of the population under study, identifying qualitatively homogeneous groups.

The average calculated for the population as a whole is called the overall average, and for each group - group averages.

There are two types of averages: power (arithmetic mean, harmonic mean, geometric mean, quadratic mean); structural (mode, median, quartiles, deciles).

The choice of average for calculation depends on the purpose.

Types of power averages and methods for their calculation. In the practice of statistical processing of collected material, various tasks arise that require different averages to solve.

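Mathematical statistics derives the various averages from the power mean formula:

$$\bar{x} = \sqrt[z]{\frac{\sum x^z}{n}},$$

where $\bar{x}$ is the average value; $x$ are the individual variants (values of the characteristic); $n$ is the number of variants; $z$ is the exponent (with $z = 1$, the arithmetic mean; $z = 0$, the geometric mean, obtained as the limiting case; $z = -1$, the harmonic mean; $z = 2$, the quadratic mean).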
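The family of power means can be sketched with one function. The sketch assumes positive values and treats the geometric mean (exponent zero) as the limiting case, computed through logarithms; the function name and the data are hypothetical.

```python
import math


def power_mean(xs, z):
    """Power mean of positive numbers xs with exponent z."""
    n = len(xs)
    if z == 0:
        # Limiting case z -> 0: the geometric mean, computed via logarithms.
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** z for x in xs) / n) ** (1 / z)


xs = [1, 2, 4]   # hypothetical positive values

harmonic = power_mean(xs, -1)
geometric = power_mean(xs, 0)
arithmetic = power_mean(xs, 1)
quadratic = power_mean(xs, 2)

print(harmonic, geometric, arithmetic, quadratic)
```

For the same data the means increase with the exponent: harmonic, then geometric, then arithmetic, then quadratic.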

However, the question of which type of average should be applied in each individual case is resolved by a specific analysis of the population being studied.

The most common type of average in statistics is arithmetic mean. It is calculated in cases where the volume of the averaged characteristic is formed as the sum of its values ​​for individual units of the statistical population being studied.

Depending on the nature of the source data, the arithmetic mean is determined in various ways:

If the data is ungrouped, the calculation is carried out using the simple average formula: x̄ = Σx / n.

Calculation of the arithmetic mean in discrete series uses the weighted average formula 3.4: x̄ = Σxf / Σf, where f is the frequency (weight) of each option.

Calculation of the arithmetic mean in an interval series. In an interval variation series, where the value of a characteristic in each group is conventionally taken to be the middle of the interval, the arithmetic mean may differ from the mean calculated from ungrouped data. Moreover, the larger the interval in the groups, the greater the possible deviations of the average calculated from grouped data from the average calculated from ungrouped data.

When calculating the average for an interval variation series, one first replaces the intervals with their midpoints and then applies the weighted arithmetic average formula.
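A minimal sketch of this calculation in Python, assuming a hypothetical interval series (the interval bounds and frequencies below are invented for illustration):

```python
# Hypothetical interval series: (lower bound, upper bound) and frequency of each group.
intervals = [(10, 20), (20, 30), (30, 40)]
freqs = [5, 10, 5]

# Step 1: replace each interval with its midpoint.
midpoints = [(lo + hi) / 2 for lo, hi in intervals]

# Step 2: weighted arithmetic mean of the midpoints.
mean = sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)

print(midpoints, mean)
```

The wider the intervals, the further this value may drift from the mean of the ungrouped data, because every unit in a group is treated as if it sat at the midpoint.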

Properties of the arithmetic mean. The arithmetic mean has some properties that make it possible to simplify calculations; let’s consider them.

1. The arithmetic mean of a constant is equal to that constant.

If x = a for every unit, then x̄ = Σaf / Σf = a.

2. If the weights of all options are changed proportionally, i.e. increased or decreased by the same factor, the arithmetic mean of the new series does not change.

If all weights f are reduced by a factor of k, then x̄′ = Σx(f/k) / Σ(f/k) = Σxf / Σf = x̄.

3. The sum of the positive and negative deviations of individual options from the average, multiplied by the weights, is equal to zero, i.e. Σ(x − x̄)f = 0.

Since x̄ = Σxf / Σf, we have Σxf = x̄Σf. Hence Σ(x − x̄)f = Σxf − x̄Σf = 0.

4. If all options are reduced or increased by the same number, the arithmetic mean of the new series decreases or increases by the same amount.

Let us reduce all options x by a, i.e. x′ = x − a.

Then x̄′ = Σ(x − a)f / Σf = Σxf / Σf − aΣf / Σf = x̄ − a.

The arithmetic mean of the original series can be obtained by adding the previously subtracted number a back to the reduced mean, i.e. x̄ = x̄′ + a.

5. If all options are reduced or increased by a factor of k, the arithmetic mean of the new series decreases or increases by the same factor k.

Let x′ = x / k; then x̄′ = Σ(x/k)f / Σf = x̄ / k.

Hence x̄ = k·x̄′, i.e. to obtain the average of the original series, the arithmetic average of the new series (with reduced options) must be multiplied by k.

Harmonic mean. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values. It is used when the statistical information does not contain the frequencies f for individual options but gives their products with the option values (M = xf). The harmonic mean is calculated using formula 3.5: x̄ = ΣM / Σ(M / x).

The practical application of the harmonic mean is to calculate some indices, in particular, the price index.
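As an illustration, the weighted harmonic mean can be sketched as follows. The prices and turnover figures are hypothetical, and the formula x̄ = ΣM / Σ(M / x) is the standard weighted harmonic form assumed here:

```python
# Hypothetical data: price per unit (x) and turnover (M = x * f) for two sellers.
prices = [20.0, 25.0]
turnover = [2000.0, 5000.0]

# Weighted harmonic mean: total turnover divided by total implied quantity.
harmonic_mean = sum(turnover) / sum(m / x for m, x in zip(turnover, prices))

# Cross-check: recover the frequencies f = M / x and apply the weighted arithmetic mean.
quantities = [m / x for m, x in zip(turnover, prices)]
arithmetic_check = sum(x * f for x, f in zip(prices, quantities)) / sum(quantities)

print(harmonic_mean, arithmetic_check)
```

Both calculations give the same average price, which is why the harmonic form is convenient when only the products M are known.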

Geometric mean. When the geometric mean is used, the individual values of a characteristic are, as a rule, relative values of dynamics constructed as chain values, i.e. as the ratio of each level in a time series to the previous level. The average thus characterizes the average growth rate.

The geometric mean is also used to determine the value that is equidistant (in ratio terms) from the maximum and minimum values of a characteristic. For example, an insurance company concludes contracts for auto insurance services. Depending on the specific insured event, the insurance payment can range from $10,000 to $100,000 per year. The average amount of insurance payments will be √(10,000 × 100,000) ≈ 31,623 USD.

The geometric mean is used as the average of ratios or in distribution series that form a geometric progression; it corresponds to z = 0 in the power mean formula. This average is convenient to use when attention is paid not to absolute differences but to the ratios of two numbers.

The formulas for calculation are as follows:

x̄ = (Πx)^(1/n) (simple) and x̄ = (Πx^f)^(1/Σf) (weighted),

where x are the variants of the characteristic being averaged; Π – product of options; n – number of options; f – frequency of options.

The geometric mean is used in calculations of average annual growth rates.
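Both uses of the geometric mean can be sketched in Python. The chain growth coefficients below are hypothetical; the payment limits are those of the insurance example in the text.

```python
import math

# Hypothetical chain growth coefficients (each level divided by the previous one).
growth = [1.10, 1.20, 1.05]

# Average growth rate: the n-th root of the product of the chain coefficients.
avg_growth = math.prod(growth) ** (1 / len(growth))

# Value equidistant (in ratio terms) from the minimum and maximum payments.
mid_payment = math.sqrt(10_000 * 100_000)

print(round(avg_growth, 4), round(mid_payment, 2))
```

Applying avg_growth three times in a row reproduces the overall growth of the series, which is the defining property of the average growth rate.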

Mean square. The mean square formula is used to measure the degree of fluctuation of individual values ​​of a characteristic around the arithmetic mean in the distribution series. Thus, when calculating variation indicators, the average is calculated from the squared deviations of individual values ​​of a characteristic from the arithmetic mean.

The root mean square value is calculated using the formula x̄ = √(Σx² / n) (simple) or x̄ = √(Σx²f / Σf) (weighted).

In economic research, a modified form of the mean square is widely used in calculating indicators of variation of a characteristic, such as dispersion (variance) and standard deviation.

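Majorance rule. The following relationship holds between power averages: the larger the exponent z, the greater the value of the average (Table 5.4).

Table 5.4

Relationship between averages

z value | −1 | 0 | 1 | 2
Type of average | harmonic | geometric | arithmetic | quadratic
Relationship between averages | x̄ harmonic ≤ x̄ geometric ≤ x̄ arithmetic ≤ x̄ quadratic

This relationship is called the majorance rule.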

Structural averages. To characterize the structure of the population, special indicators are used, which can be called structural averages. These indicators include mode, median, quartiles and deciles.

Mode. Mode (Mo) is the most frequently occurring value of a characteristic among population units. The mode is the value of the attribute that corresponds to the maximum point of the theoretical distribution curve.

The mode is widely used in commercial practice when studying consumer demand (for example, when determining the sizes of clothes and shoes that are in wide demand) and when recording prices. A population may have several modes.

Calculation of mode in a discrete series. In a discrete series, mode is the variant with the highest frequency. Let's consider finding a mode in a discrete series.

Calculation of mode in an interval series. In an interval variation series, the mode is approximately taken to be the central variant of the modal interval, i.e. the interval that has the highest frequency (or relative frequency). Within this interval, the value of the attribute that is the mode must be found. For an interval series, the mode is determined by the formula

Mo = x_Mo + h · (f_Mo − f_Mo−1) / ((f_Mo − f_Mo−1) + (f_Mo − f_Mo+1)),

where x_Mo is the lower limit of the modal interval; h – the width of the modal interval; f_Mo – frequency corresponding to the modal interval; f_Mo−1 – frequency of the interval preceding the modal one; f_Mo+1 – frequency of the interval following the modal one.
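A sketch of the interval-mode calculation, assuming the standard formula Mo = x0 + h · (f_Mo − f_prev) / ((f_Mo − f_prev) + (f_Mo − f_next)) with equal interval widths. The function name and the data are hypothetical.

```python
def interval_mode(lower_bounds, width, freqs):
    """Mode of an interval series with equal interval widths (standard textbook formula)."""
    k = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal interval
    f_mo = freqs[k]
    f_prev = freqs[k - 1] if k > 0 else 0               # frequency before the modal interval
    f_next = freqs[k + 1] if k < len(freqs) - 1 else 0  # frequency after the modal interval
    return lower_bounds[k] + width * (f_mo - f_prev) / ((f_mo - f_prev) + (f_mo - f_next))

# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50.
lower_bounds = [10, 20, 30, 40]
freqs = [4, 10, 6, 2]

mode = interval_mode(lower_bounds, 10, freqs)
print(mode)
```

The sketch does not handle a flat series in which the modal frequency equals both of its neighbours; in that case the denominator is zero and the formula is not applicable.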

Median. Median (Me) is the value of the attribute of the middle unit of the ranked series. A ranked series is a series in which the attribute values are written in ascending or descending order. Put differently, the median is the value that divides an ordered variation series into two equal parts: in one part the values of the varying characteristic are less than the median, and in the other they are greater.

To find the median, first determine its ordinal number. If the number of units is odd, one is added to the sum of all frequencies and the result is divided by two. If the number of units is even, the median is taken as the value of the attribute of the unit whose ordinal number equals the total sum of frequencies divided by two. Knowing the ordinal number of the median, it is easy to find its value using the accumulated frequencies.
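The procedure can be sketched as follows. The function name and the family data are hypothetical, and the position rule is the one described in the text ((n + 1) / 2 for an odd number of units, n / 2 for an even number).

```python
from itertools import accumulate

def median_discrete(variants, freqs):
    """Median of a discrete series; variants must be sorted in ascending order."""
    n = sum(freqs)
    # Ordinal number of the median, following the rule described in the text.
    position = (n + 1) // 2 if n % 2 else n // 2
    # The median is the first variant whose accumulated frequency reaches the position.
    for variant, cumulative in zip(variants, accumulate(freqs)):
        if cumulative >= position:
            return variant, position

# Hypothetical distribution of families by number of children.
children = [0, 1, 2, 3, 4]
families = [10, 30, 40, 15, 5]

median, position = median_discrete(children, families)
print(median, position)
```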

Calculation of the median in a discrete series. A sample survey yielded data on the distribution of families by number of children (Table 5.5). To determine the median, we first determine its ordinal number.

In these families the number of children is equal to 2, therefore Me = 2. Thus, in 50% of families the number of children does not exceed 2.

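Calculation of the median in an interval series. The median is determined by the formula Me = x_Me + h · (Σf/2 − S_Me−1) / f_Me, where x_Me is the lower limit of the median interval; h – the width of the median interval; Σf/2 – half the sum of frequencies; S_Me−1 – accumulated frequency preceding the median interval; f_Me – frequency of the median interval.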

Unlike the median, the arithmetic mean depends on every value in the series. On the one hand, this is a very positive property, because the effect of all causes affecting all units of the population under study is taken into account. On the other hand, even one observation included in the source data by chance can significantly distort the idea of the level of development of the trait being studied in the population under consideration (especially in short series).

Quartiles and deciles. By analogy with finding the median in variation series, you can find the value of a characteristic for any unit of the ranked series. So, in particular, you can find the value of the attribute for units dividing a series into 4 equal parts, into 10, etc.

Quartiles. The options that divide the ranked series into four equal parts are called quartiles.

In this case, they distinguish: the lower (or first) quartile (Q1) - the value of the attribute for a unit of the ranked series, dividing the population in the ratio of ¼ to ¾ and the upper (or third) quartile (Q3) - the value of the attribute for the unit of the ranked series, dividing the population in the ratio ¾ to ¼.

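For an interval series, the quartiles are calculated by analogy with the median: Q1 = x_Q1 + h · (Σf/4 − S_Q1−1) / f_Q1 and Q3 = x_Q3 + h · (3Σf/4 − S_Q3−1) / f_Q3, where x_Q1, x_Q3 are the lower limits of the quartile intervals; h – the interval width; S_Q1−1, S_Q3−1 – accumulated frequencies preceding the quartile intervals; f_Q1, f_Q3 – frequencies of the quartile intervals (lower and upper).

The intervals containing Q1 and Q3 are determined by the accumulated frequencies (or accumulated relative frequencies).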

Deciles. In addition to quartiles, deciles are calculated - options that divide the ranked series into 10 equal parts.

They are designated by D: the first decile, D1, divides the series in the ratio of 1/10 to 9/10, the second, D2, in the ratio of 2/10 to 8/10, etc. They are calculated according to the same scheme as the median and quartiles.

The median, quartiles, and deciles all belong to the so-called order statistics, i.e. options that occupy a certain ordinal place in the ranked series.

Variation series - a series in which options and their corresponding frequencies are set against each other (in ascending or descending order).

Options (variants) are individual quantitative expressions of a characteristic. They are denoted by the Latin letter V. In the classical understanding of the term “variant”, each unique value of a characteristic is called a variant, regardless of the number of repetitions.

For example, in the variation series of systolic blood pressure indicators measured in ten patients:

110, 120, 120, 130, 130, 130, 140, 140, 160, 170;

there are only 6 variants:

110, 120, 130, 140, 160, 170.

Frequency is a number indicating how many times a variant is repeated. It is denoted by the Latin letter P. The sum of all frequencies (which, of course, is equal to the total number of subjects studied) is denoted as n.

    In our example, the frequencies will take the following values:
  • for option 110 frequency P = 1 (value 110 occurs in one patient),
  • for option 120 frequency P = 2 (value 120 occurs in two patients),
  • for option 130 frequency P = 3 (value 130 occurs in three patients),
  • for option 140 frequency P = 2 (value 140 occurs in two patients),
  • for option 160 frequency P = 1 (value 160 occurs in one patient),
  • for option 170 frequency P = 1 (value 170 occurs in one patient).
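The same counting can be sketched in Python with the blood pressure data from the example above (the variable names are illustrative only):

```python
from collections import Counter

# Systolic blood pressure of the ten patients from the example.
pressures = [110, 120, 120, 130, 130, 130, 140, 140, 160, 170]

freq = Counter(pressures)  # variant V -> frequency P
variants = sorted(freq)    # unique values in ascending order
n = sum(freq.values())     # total number of observations

for v in variants:
    print(v, freq[v])
```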

Types of variation series:

  1. simple - a series in which each option occurs only once (all frequencies are equal to 1);
  2. weighted - a series in which one or more options occur more than once.

The variation series is used to describe large arrays of numbers; it is in this form that the collected data of most medical studies are initially presented. In order to characterize the variation series, special indicators are calculated, including average values, indicators of variability (the so-called dispersion), and indicators of the representativeness of sample data.

Variation series indicators

1) The arithmetic mean is a general indicator characterizing the size of the characteristic being studied. The arithmetic mean, denoted as M, is the most common type of average. It is calculated as the ratio of the sum of the indicator values of all observation units to the number of all subjects studied. The method for calculating the arithmetic mean differs for a simple and a weighted variation series.

Formula for calculating the simple arithmetic mean:

M = ΣV / n

Formula for calculating the weighted arithmetic mean:

M = Σ(V * P) / n

2) Mode is another average value of the variation series, corresponding to the most frequently repeated option, i.e. the option with the highest frequency. It is denoted as Mo. The mode is calculated only for weighted series, since in simple series none of the options is repeated and all frequencies are equal to one.

For example, in the variation series of heart rate values:

80, 84, 84, 86, 86, 86, 90, 94;

the mode value is 86, since this option occurs 3 times, therefore its frequency is the highest.
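A short Python sketch of the same lookup, using the heart rate values from the example (the variable names are illustrative):

```python
from collections import Counter

heart_rates = [80, 84, 84, 86, 86, 86, 90, 94]

# most_common(1) returns the (variant, frequency) pair with the highest frequency.
mode, frequency = Counter(heart_rates).most_common(1)[0]

print(mode, frequency)
```

If several variants share the highest frequency, most_common(1) reports only one of them, so a series with several modes needs a separate check.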

3) Median - the value of the option dividing the variation series in half: on both sides of it there is an equal number of options. The median, like the arithmetic mean and mode, is an average value. It is denoted as Me.

4) Standard deviation (synonyms: root-mean-square deviation, sigma deviation, sigma) - a measure of the variability of the variation series. It is an integral indicator that combines all deviations from the average. In effect, it answers the question of how far and how often variants spread from the arithmetic mean. It is denoted by the Greek letter σ ("sigma").

If the population size is more than 30 units, the standard deviation is calculated using the following formula:

σ = √(Σ(d² * P) / n), where d = V − M

For small populations - 30 observation units or less - the standard deviation is calculated using a different formula:

σ = √(Σ(d² * P) / (n − 1))
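The choice between the two denominators can be sketched as follows. The function name is hypothetical, the formula √(Σd²P / n) or √(Σd²P / (n − 1)) with d = V − M is the standard form assumed here, and the data are the blood pressure values from the earlier example.

```python
import math

def std_dev(variants, freqs):
    """Standard deviation of a weighted series: divide by n if n > 30, else by n - 1."""
    n = sum(freqs)
    mean = sum(v * p for v, p in zip(variants, freqs)) / n
    # Sum of squared deviations d = V - M, weighted by the frequencies.
    squares = sum(p * (v - mean) ** 2 for v, p in zip(variants, freqs))
    denominator = n if n > 30 else n - 1
    return math.sqrt(squares / denominator)

variants = [110, 120, 130, 140, 160, 170]
freqs = [1, 2, 3, 2, 1, 1]

sd = std_dev(variants, freqs)
print(round(sd, 1))
```

With only ten observations the function divides by n − 1, as the text prescribes for small populations.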

Variation series: definition, types, main characteristics. Calculation method for the mode, median and arithmetic mean in medical and statistical research (show with a conditional example).

A variation series is a series of numerical values ​​of the characteristic being studied, differing from each other in magnitude and arranged in a certain sequence (in ascending or descending order). Each numerical value of a series is called a variant (V), and the numbers showing how often a particular variant occurs in a given series are called frequency (p).

The total number of observation cases that make up the variation series is denoted by the letter n. The difference in the meaning of the characteristics being studied is called variation. If a varying characteristic does not have a quantitative measure, the variation is called qualitative, and the distribution series is called attributive (for example, distribution by disease outcome, health status, etc.).

If a varying characteristic has a quantitative expression, such variation is called quantitative, and the distribution series is called variational.

Variation series are divided into discontinuous and continuous - based on the nature of the quantitative characteristic; simple and weighted - based on the frequency of occurrence of the variant.

In a simple variation series each option occurs only once (p = 1); in a weighted series the same option occurs several times (p > 1). Examples of such series are discussed further in the text. If the quantitative characteristic is continuous, i.e. there are intermediate fractional quantities between integer quantities, the variation series is called continuous.

For example: 10.0 – 11.9, 14.0 – 15.9, etc.

If the quantitative characteristic is discontinuous, i.e. its individual values (variants) differ from each other by an integer and have no intermediate fractional values, the variation series is called discontinuous or discrete.

Using the heart rate data from the previous example for 21 students, we will construct a variation series (Table 1).

Table 1

Distribution of medical students by heart rate (bpm)

Thus, to construct a variation series means to systematize and organize the available numerical values (options), i.e. to arrange them in a certain sequence (in ascending or descending order) with their corresponding frequencies. In the example under consideration, the options are arranged in ascending order and expressed as integer (discrete) numbers, and each option occurs several times, i.e. we are dealing with a weighted, discontinuous (discrete) variation series.

As a rule, if the number of observations in the statistical population under study does not exceed 30, it is enough to arrange all the values of the characteristic in a variation series in ascending order, as in Table 1, or in descending order.

With a large number of observations (n>30), the number of occurring variants can be very large; in this case, an interval or grouped variation series is compiled, in which, to simplify subsequent processing and clarify the nature of the distribution, the variants are combined into groups.

Typically the number of groups ranges from 8 to 15.

There should be at least 5 of them, because otherwise the grouping will be too coarse: excessive enlargement distorts the overall picture of variation and greatly affects the accuracy of average values. When the number of groups is more than 20-25, the accuracy of calculating average values increases, but the characteristics of the variation are significantly distorted and mathematical processing becomes more complicated.

When compiling a grouped series, the following must be taken into account:

− option groups must be arranged in a certain order (ascending or descending);

− intervals in option groups must be the same;

− the values of the interval boundaries should not coincide, because otherwise it will be unclear to which group individual variants belong;

− the qualitative features of the collected material must be taken into account when setting interval limits (for example, when studying the weight of adults an interval of 3-4 kg is acceptable, while for children in the first months of life it should not exceed 100 g).

Let's construct a grouped (interval) series characterizing data on the pulse rate (beats per minute) of 55 medical students before the exam: 64, 66, 60, 62, 64, 68, 70, 66, 70, 68, 62, 68, 70, 72, 60, 70, 74, 62, 70, 72, 72, 64, 70, 72, 76, 76, 68, 70, 58, 76, 74, 76, 76, 82, 76, 72, 76, 74, 79, 78, 74, 78, 74, 78, 74, 74, 78, 76, 78, 76, 80, 80, 80, 78, 78.

To build a grouped series you need:

1. Determine the size of the interval;

2. Determine the middle, beginning and end of the groups of the variation series.

● The size of the interval (i) is determined by the number of supposed groups (r), the number of which is set depending on the number of observations (n) according to a special table

Number of groups depending on the number of observations:

In our case, for 55 students, you can create from 8 to 10 groups.

The value of the interval (i) is determined by the following formula:

i = (V max − V min) / r

In our example, the value of the interval is (82 − 58) / 8 = 3.

If the interval value is a fractional number, the result should be rounded to a whole number.
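The interval calculation and the grouping itself can be sketched in Python with the 55 pulse values listed above. The grouping scheme (closed integer intervals such as 58-60, starting at the minimum value) is an assumption made for illustration; under it the maximum value 82 opens a ninth group.

```python
pulse = [64, 66, 60, 62, 64, 68, 70, 66, 70, 68, 62, 68, 70, 72, 60, 70, 74, 62,
         70, 72, 72, 64, 70, 72, 76, 76, 68, 70, 58, 76, 74, 76, 76, 82, 76, 72,
         76, 74, 79, 78, 74, 78, 74, 78, 74, 74, 78, 76, 78, 76, 80, 80, 80, 78, 78]

r = 8  # supposed number of groups
v_min, v_max = min(pulse), max(pulse)

# Size of the interval, rounded to a whole number.
interval = round((v_max - v_min) / r)

# Assumed scheme: integer groups [v_min, v_min + i - 1], [v_min + i, ...], and so on.
n_groups = (v_max - v_min) // interval + 1
counts = [0] * n_groups
for v in pulse:
    counts[(v - v_min) // interval] += 1

groups = [(v_min + k * interval, v_min + (k + 1) * interval - 1) for k in range(n_groups)]
for (lo, hi), p in zip(groups, counts):
    print(f"{lo}-{hi}: {p}")
```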

There are several types of averages:

● arithmetic mean,

● geometric mean,

● harmonic mean,

● root mean square,

● progressive mean,

● median.

In medical statistics, arithmetic means are most often used.

The arithmetic mean (M) is a generalizing value that determines what is typical for the entire population. The main methods for calculating M are: the arithmetic mean method and the method of moments (conditional deviations).

The arithmetic mean method is used to calculate the simple arithmetic mean and the weighted arithmetic mean. The choice of method depends on the type of variation series. In the case of a simple variation series, in which each option occurs only once, the simple arithmetic mean is determined by the formula:

M = ΣV / n

where: M – arithmetic mean value;

V – value of the varying characteristic (variants);

Σ – indicates the action – summation;

n – total number of observations.

An example of calculating the simple arithmetic average. Respiratory rate (number of breathing movements per minute) in 9 men aged 35 years: 20, 22, 19, 15, 16, 21, 17, 23, 18.

To determine the average level of respiratory rate in men aged 35 years, it is necessary:

1. Construct a variation series, arranging all options in ascending or descending order: 15, 16, 17, 18, 19, 20, 21, 22, 23. This is a simple variation series, because each option value occurs only once.

2. Calculate the simple arithmetic mean: M = ∑V/n = 171/9 = 19 breaths per minute.

Conclusion. The respiratory rate in men aged 35 years is on average 19 breaths per minute.

If individual values ​​of a variant are repeated, there is no need to write down each variant in a line; it is enough to list the occurring sizes of the variant (V) and next to it indicate the number of their repetitions (p). Such a variation series, in which the options are, as it were, weighed by the number of frequencies corresponding to them, is called a weighted variation series, and the calculated average value is the weighted arithmetic mean.

The weighted arithmetic mean is determined by the formula: M= ∑Vp/n

where n is the number of observations, equal to the sum of the frequencies – Σp.

An example of calculating the arithmetic weighted average.

The duration of disability (in days) in 35 patients with acute respiratory infections (ARI) treated by a local doctor during the first quarter of the current year was: 6, 7, 5, 3, 9, 8, 7, 5, 6, 4, 9, 8, 7, 6, 6, 9, 6, 5, 10, 8, 7, 11, 13, 5, 6, 7, 12, 4, 3, 5, 2, 5, 6, 6, 7 days.

The method for determining the average duration of disability in patients with acute respiratory infections is as follows:

1. Let's construct a weighted variation series, because individual values of the option are repeated several times. To do this, arrange all options in ascending or descending order with their corresponding frequencies. In our case, the options are arranged in ascending order.

2. Calculate the arithmetic weighted average using the formula: M = ∑Vp/n = 233/35 = 6.7 days

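Distribution of patients with acute respiratory infections by duration of disability:

Duration of disability (V) | Number of patients (p) | Vp
2 | 1 | 2
3 | 2 | 6
4 | 2 | 8
5 | 6 | 30
6 | 8 | 48
7 | 6 | 42
8 | 3 | 24
9 | 3 | 27
10 | 1 | 10
11 | 1 | 11
12 | 1 | 12
13 | 1 | 13
Total | ∑p = n = 35 | ∑Vp = 233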

Conclusion. The duration of disability in patients with acute respiratory diseases averaged 6.7 days.
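The weighted calculation can be reproduced in Python from the 35 raw values given above (the variable names are illustrative):

```python
from collections import Counter

days = [6, 7, 5, 3, 9, 8, 7, 5, 6, 4, 9, 8, 7, 6, 6, 9, 6, 5, 10, 8, 7, 11, 13,
        5, 6, 7, 12, 4, 3, 5, 2, 5, 6, 6, 7]

freq = Counter(days)                          # weighted series: V -> p
n = sum(freq.values())                        # number of observations
total = sum(v * p for v, p in freq.items())   # sum of the products Vp
mean = total / n                              # weighted arithmetic mean

for v in sorted(freq):
    print(v, freq[v], v * freq[v])
print(n, total, round(mean, 1))
```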

Mode (Mo) is the most common option in the variation series. For the distribution presented in the table, the mode corresponds to an option equal to 10; it occurs more often than others - 6 times.

Distribution of patients by length of stay in a hospital bed (in days)

V
p

Sometimes it is difficult to determine the exact magnitude of a mode because there may be several “most common” observations in the data being studied.

Median (Me) is a nonparametric indicator that divides the variation series into two equal halves: the same number of variants is located on both sides of the median.

For example, for the distribution shown in the table, the median is 10, because there are 14 options on both sides of this value, i.e. the number 10 occupies the central position in this series and is its median.

Given that the number of observations in this example is even (n = 34), the ordinal number of the median can be determined as follows:

(2 + 3 + 4 + 5 + 6 + 5 + 4 + 3 + 2) / 2 = 34 / 2 = 17

This means that the middle of the series falls on the seventeenth option, which corresponds to a median equal to 10. For the distribution presented in the table, the arithmetic mean is equal to:

M = ∑Vp/n = 334/34 = 10.1

So, for the 34 observations from Table 8, we got: Mo = 10, Me = 10, and the arithmetic mean (M) is 10.1. In our example, all three indicators turned out to be equal or close to each other, although they are quite different in nature.

The arithmetic mean is the resultant of all influences; all options without exception, including extreme ones that are often atypical for a given phenomenon or population, take part in its formation.

The mode and median, unlike the arithmetic mean, do not depend on all individual values of the varying characteristic (on the values of the extreme variants or the degree of dispersion of the series). The arithmetic mean characterizes the entire mass of observations, while the mode and median characterize the bulk of them.

A special place in statistical analysis belongs to the determination of the average level of the characteristic or phenomenon being studied. The average level of a trait is measured by average values.

The average value characterizes the general quantitative level of the characteristic being studied and is a group property of the statistical population. It levels out, weakens random deviations of individual observations in one direction or another and highlights the main, typical property of the characteristic being studied.

Averages are widely used:

1. To assess the health status of the population: characteristics of physical development (height, weight, chest circumference, etc.), identifying the prevalence and duration of various diseases, analysis of demographic indicators (natural population movement, average life expectancy, population reproduction, average population size, etc.).

2. To study the activities of medical institutions and medical personnel and assess the quality of their work, and to plan and determine the population's needs for various types of medical care (average number of requests or visits per resident per year, average duration of a patient's stay in hospital, average duration of a patient examination, average availability of doctors, beds, etc.).

3. To characterize the sanitary and epidemiological state (average air dust content in the workshop, average area per person, average consumption of proteins, fats and carbohydrates, etc.).

4. To determine medical and physiological indicators in normal and pathological conditions, to process laboratory data, and to establish the reliability of the results of sample surveys in social and hygienic, clinical and experimental studies.

The calculation of average values is performed on the basis of variation series. A variation series is a qualitatively homogeneous statistical set, the individual units of which characterize the quantitative differences of the characteristic or phenomenon being studied.

Quantitative variation can be of two types: discontinuous (discrete) and continuous.

A discontinuous (discrete) attribute is expressed only as an integer and cannot have any intermediate values ​​(for example, the number of visits, the population of the site, the number of children in the family, the severity of the disease in points, etc.).

A continuous characteristic can take on any values within certain limits, including fractional ones, and is expressed only approximately (for example, weight, which for adults can be limited to kilograms and for newborns to grams; height, blood pressure, time spent seeing the patient, etc.).

The numerical value of each individual characteristic or phenomenon included in the variation series is called a variant and is designated by the letter V. Other notations are also found in the mathematical literature, for example x or y.

A variation series, where each option is indicated once, is called simple. Such series are used in most statistical problems in the case of computer data processing.

As the number of observations increases, repeated variant values tend to occur. In this case a grouped variation series is created, in which the number of repetitions is indicated (frequency, denoted by the letter “p”).

A ranked variation series consists of options arranged in ascending or descending order. Both simple and grouped series can be ranked.

An interval variation series is compiled in order to simplify subsequent calculations performed without a computer when the number of observation units is very large (more than 1000).

A continuous variation series includes option values that can be any value.

If in a variation series the values of a characteristic (variants) are given as individual specific numbers, the series is called discrete.

The general characteristics of the values of the characteristic reflected in the variation series are the average values. Among them, the most used are the arithmetic mean M, the mode Mo and the median Me. Each of these characteristics is unique. They cannot replace each other, and only together do they represent the features of the variation series fully and in a condensed form.

Mode (Mo) is the value of the most frequently occurring option.

Median (Me) is the value of the option dividing the ranked variation series in half (half of the options lie on each side of the median). In rare cases, when the variation series is symmetrical, the mode and median are equal to each other and coincide with the value of the arithmetic mean.

The most typical characteristic of option values is the arithmetic mean (M). In mathematical literature it is denoted x̄.

Arithmetic mean (M, x̄) is a general quantitative characteristic of a certain characteristic of the phenomena being studied, constituting a qualitatively homogeneous statistical population. There are simple and weighted arithmetic means. The simple arithmetic mean is calculated for a simple variation series by summing all the options and dividing this sum by the total number of options included in the variation series. Calculations are carried out according to the formula:

M = ΣV / n,

Where: M - simple arithmetic mean;

ΣV - the sum of the options;

n - number of observations.

In a grouped variation series, the weighted arithmetic mean is determined. The formula for calculating it:

M = ΣVp / n,

Where: M - weighted arithmetic mean;

ΣVp - the sum of the products of the options by their frequencies;

n - number of observations.

With a large number of observations, in the case of manual calculations, the method of moments can be used.

The arithmetic mean has the following properties:

· the sum of deviations from the average (Σd) is equal to zero (see Table 15);

· when multiplying (dividing) all options by the same factor (divisor), the arithmetic mean is multiplied (divided) by the same factor (divisor);

· if you add (subtract) the same number to all options, the arithmetic mean increases (decreases) by the same number.

Arithmetic means, taken by themselves, without taking into account the variability of the series from which they are calculated, may not fully reflect the properties of the variation series, especially when comparison with other averages is necessary. Averages close in value can be obtained from series with different degrees of scattering. The closer the individual options are to each other in value, the smaller the dispersion (oscillation, variability) of the series and the more typical its average.

The main parameters that allow us to assess the variability of a trait are:

· Range;

· Amplitude;

· Standard deviation;

· Coefficient of variation.

The variability of a trait can be approximately judged by the range and amplitude of the variation series. The range indicates the maximum (V max) and minimum (V min) options in the series. Amplitude (A m) is the difference between these options: A m = V max - V min.

The main, generally accepted measure of the variability of a variation series is dispersion, or variance (D). However, a more convenient parameter calculated from the dispersion is used most often - the standard deviation (σ). It takes into account the deviation (d) of each option of the variation series from its arithmetic mean (d = V − M).

Since deviations from the average can be positive and negative, their sum is zero (Σd = 0). To avoid this, the deviations (d) are squared and then averaged. Thus, the dispersion of a variation series is the mean square of the deviations of the options from the arithmetic mean and is calculated by the formula:

D = Σd² / n.

It is the most important characteristic of variability and is used to calculate many statistical tests.

Since dispersion is expressed as the square of deviations, its value cannot be compared directly with the arithmetic mean. For this purpose the standard deviation is used, designated by the sign “sigma” (σ). It characterizes the average deviation of all options of a variation series from the arithmetic mean in the same units as the mean itself, so the two can be used together.

The standard deviation is determined by the formula:

σ = √(Σd² / n)

This formula is applied when the number of observations (n) is more than 30. With a smaller n, the standard deviation will have an error associated with statistical bias (n − 1). A more accurate result can therefore be obtained by taking this bias into account in the formula for calculating the standard deviation:

s = √(Σd² / (n − 1))

The standard deviation (s) is an estimate of the root-mean-square deviation of a random variable X from its mathematical expectation, based on an unbiased estimate of its variance.

With values n > 30 standard deviation ( σ ) and standard deviation ( s ) will be the same ( σ =s ). Therefore, in most practical manuals these criteria are considered to have different meanings. IN Excel program calculating the standard deviation can be done with the function =STDEV(range). And in order to calculate the standard deviation, you need to create an appropriate formula.
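To see how much the two estimates differ, the sketch below uses Python's standard `statistics` module, where `pstdev` divides by n and `stdev` divides by n - 1. The data and the helper `ratio` are hypothetical and serve only as an illustration.

```python
import math
import statistics

variants = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, n = 8

sigma = statistics.pstdev(variants)  # divisor n
s = statistics.stdev(variants)       # divisor n - 1

def ratio(n):
    """s / sigma for the same data: sqrt(n / (n - 1))."""
    return math.sqrt(n / (n - 1))

print(f"sigma = {sigma:.3f}, s = {s:.3f}")
for n in (8, 31, 100):
    print(f"n = {n}: s / sigma = {ratio(n):.4f}")
```

The ratio depends only on n. It is about 1.07 for n = 8 and falls below 1.02 once n exceeds 30, which is why the two values are treated as practically equal for larger samples.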

The mean square, or standard, deviation shows how much the values of a characteristic may differ from the average value. Suppose there are two cities with the same average daily temperature in summer. One is located on the coast and the other inland. It is known that daytime temperatures vary less in coastal cities than in inland ones. Therefore, the standard deviation of daytime temperatures for the coastal city will be smaller than for the inland city. In practice, this means that the air temperature on any specific day will differ more from the average in the inland city than in the coastal one. In addition, the standard deviation makes it possible to estimate possible temperature deviations from the average with the required level of probability.

According to probability theory, in phenomena that obey the normal distribution law there is a strict relationship between the arithmetic mean, the standard deviation and the variants (the three sigma rule). For example, 68.3% of the values of a varying characteristic lie within M ± 1 σ, 95.5% within M ± 2 σ and 99.7% within M ± 3 σ.
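These percentages can be checked numerically. For a normal distribution, the share of values within M ± kσ equals erf(k / √2). The sketch below evaluates it with the standard library, and the function name `coverage` is a hypothetical helper.

```python
import math

def coverage(k):
    """Share of a normal distribution lying within M ± k·sigma."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"M ± {k} sigma: {coverage(k) * 100:.2f}%")
```

The computed shares are approximately 68.27%, 95.45% and 99.73%, which the text quotes in rounded form.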

The value of the standard deviation allows us to judge the homogeneity of the variation series and of the study group. A small standard deviation indicates fairly high homogeneity of the phenomenon being studied, and the arithmetic mean should then be considered quite characteristic of the variation series. However, too small a sigma raises the suspicion that the observations were selected artificially. With a very large sigma, the arithmetic mean characterizes the variation series to a lesser extent, which indicates significant variability of the characteristic or heterogeneity of the group under study. Standard deviations can be compared directly only for characteristics of the same dimension. Indeed, if we compare the diversity of weights of newborn children and of adults, we will always get higher sigma values in adults.

The variability of characteristics of different dimensions can be compared using the coefficient of variation. It expresses diversity as a percentage of the mean, which allows different traits to be compared. In the medical literature the coefficient of variation is denoted by "C", and in the mathematical literature by "v". It is calculated by the formula:

$$C = \frac{\sigma}{M} \cdot 100\%$$

Values of the coefficient of variation below 10% indicate small scattering, values from 10 to 20% indicate average scattering, and values above 20% indicate strong scattering around the arithmetic mean.
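A minimal sketch of this comparison is given below. It assumes the usual definition C = σ / M · 100%. The function names and the newborn and adult figures are hypothetical and are not data from the text.

```python
def coefficient_of_variation(sigma, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sigma / mean * 100

def scatter_level(c):
    """Classify C (in percent) using the thresholds quoted in the text."""
    if c < 10:
        return "small"
    if c <= 20:
        return "average"
    return "strong"

# Hypothetical summary statistics (weights in kg).
c_newborns = coefficient_of_variation(sigma=0.5, mean=3.4)
c_adults = coefficient_of_variation(sigma=10.0, mean=70.0)

print(f"newborns: {c_newborns:.1f}% ({scatter_level(c_newborns)})")
print(f"adults:   {c_adults:.1f}% ({scatter_level(c_adults)})")
```

In this made-up example sigma is twenty times larger for adults, yet the two coefficients are close, so the relative variability of the two groups is similar.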

The arithmetic mean is usually calculated based on data from a sample population. With repeated studies, under the influence of random phenomena, the arithmetic mean may change. This is due to the fact that, as a rule, only part of the possible units of observation is studied, that is, the sample population. Information about all possible units representing the phenomenon being studied can be obtained by studying the entire population, which is not always possible. At the same time, for the purpose of generalizing experimental data, the value of the average in the general population is of interest. Therefore, in order to formulate a general conclusion about the phenomenon being studied, the results obtained on the basis of the sample population must be transferred to the general population using statistical methods.

To determine the degree of agreement between a sample study and the general population, it is necessary to estimate the magnitude of the error that inevitably arises during sample observation. This error is called the "representativeness error" or the "average error of the arithmetic mean". It is the difference between the averages obtained from sample statistical observation and the values that would be obtained from a complete study of the same object, i.e. of the general population. Since the sample mean is a random variable, such a forecast is made with a level of probability acceptable to the researcher. In medical research it is at least 95%.

The representativeness error must not be confused with registration errors or errors of attention (slips, miscalculations, typos, etc.), which should be minimized by using adequate methods and tools during the experiment.

The magnitude of the representativeness error depends on both the sample size and the variability of the trait. The larger the number of observations, the closer the sample is to the population and the smaller the error. The more variable the trait, the greater the statistical error.

In practice, to determine the representativeness error in variation series, the following formula is used:

$$m = \frac{\sigma}{\sqrt{n}},$$

where: m – the representativeness error;

σ – the standard deviation;

n – the number of observations in the sample.

The formula shows that the average error is directly proportional to the standard deviation, i.e. to the variability of the trait being studied, and inversely proportional to the square root of the number of observations.
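The proportionality described here can be sketched as follows, assuming the formula m = σ / √n. The function name `mean_error` and the numbers are hypothetical.

```python
import math

def mean_error(sigma, n):
    """Representativeness error of the arithmetic mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical standard deviation
for n in (25, 100, 400):
    print(f"n = {n:3d}: m = {mean_error(sigma, n):.2f}")
```

Because of the square root, quadrupling the number of observations only halves the error.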

When performing statistical analysis based on calculating relative values, constructing a variation series is not necessary. In this case, the determination of the average error for relative indicators can be performed using a simplified formula:

$$m = \sqrt{\frac{P \cdot q}{n}},$$

where: P – the value of the relative indicator, expressed as a percentage, per mille, etc.;

q – the complement of P, expressed as (1 - P), (100 - P), (1000 - P), etc., depending on the base on which the indicator is calculated;

n – the number of observations in the sample population.

However, the specified formula for calculating the representativeness error for relative values ​​can only be applied when the value of the indicator is less than its base. In a number of cases of calculating intensive indicators, this condition is not met, and the indicator can be expressed as a number of more than 100% or 1000%. In such a situation, a variation series is constructed and the representativeness error is calculated using the formula for average values ​​based on the standard deviation.
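A sketch of the simplified calculation for relative indicators is shown below, assuming the formula m = √(P · q / n) with q equal to the base minus P. The function name, the `base` parameter and the numbers are hypothetical.

```python
import math

def relative_error(p, n, base=100):
    """Average error of a relative indicator P measured on the given base.

    Assumes the simplified formula sqrt(P * q / n), where q = base - P.
    """
    if p > base:
        # The simplified formula does not apply when the indicator
        # exceeds its base; a variation series is needed instead.
        raise ValueError("indicator exceeds its base")
    q = base - p
    return math.sqrt(p * q / n)

# Hypothetical example: an indicator of 20% observed in 100 cases.
print(relative_error(20, 100))
```

With P = 20, q = 80 and n = 100 the error is √16 = 4 percentage points.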

The value of the arithmetic mean in the population is forecast by indicating two values, the minimum and the maximum. These extreme values of possible deviations, within which the true population mean can fluctuate, are called "confidence limits".

Probability theory shows that, for a normally distributed characteristic, with a probability of 99.7% the deviation of the average will not exceed three times the representativeness error (M ± 3 m); with a probability of 95.5%, no more than twice the average error (M ± 2 m); and with a probability of 68.3%, no more than one average error (M ± 1 m) (Fig. 9).

Fig. 9. Probability density of the normal distribution.

Note that the above statement is only true for a feature that obeys the normal Gaussian distribution law.
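Under the normality assumption just mentioned, the confidence limits can be sketched as M ± t·m, where t = 1, 2 or 3 corresponds to a probability of 68.3%, 95.5% or 99.7%. The function name and the sample figures are hypothetical.

```python
import math

def confidence_limits(mean, sigma, n, t=2):
    """Lower and upper confidence limits M ± t·m, with m = sigma / sqrt(n)."""
    m = sigma / math.sqrt(n)
    return mean - t * m, mean + t * m

# Hypothetical sample: M = 120, sigma = 10, n = 100, so m = 1.
for t, prob in ((1, "68.3%"), (2, "95.5%"), (3, "99.7%")):
    low, high = confidence_limits(120, 10, 100, t)
    print(f"{prob}: {low:.1f} .. {high:.1f}")
```

A higher required probability widens the interval, while a larger sample narrows it.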

Most experimental research, including research in medicine, involves measurements whose results can take almost any value in a given interval, so as a rule they are described by a model of continuous random variables. For this reason, most statistical methods deal with continuous distributions. One of these distributions, which plays a fundamental role in mathematical statistics, is the normal, or Gaussian, distribution.

This is due to a number of reasons.

1. First of all, many experimental observations can be successfully described using the normal distribution. It should be noted at once that no distribution of empirical data is exactly normal, since a normally distributed random variable ranges from -∞ to +∞, which never occurs in practice. However, the normal distribution very often works well as an approximation.

Whether weight, height or other physiological parameters of the human body are being measured, the results are always influenced by a very large number of random factors (natural causes and measurement errors). As a rule, the effect of each of these factors is insignificant. Experience shows that in such cases the results will be approximately normally distributed.

2. Many distributions associated with random sampling approach the normal distribution as the sample size increases.

3. The normal distribution is well suited as an approximation of other continuous distributions (for example, skewed).

4. The normal distribution has a number of favorable mathematical properties, which largely account for its wide application in statistics.

At the same time, it should be noted that medical data include many experimental distributions that cannot be described by a normal distribution model. For such data, statistics has developed methods that are commonly called "nonparametric".

The choice of a statistical method suitable for processing the data of a particular experiment should depend on whether the data follow the normal distribution law. The hypothesis that a characteristic obeys the normal distribution law is tested using a frequency distribution histogram (graph), as well as a number of statistical criteria. Among them:

· Asymmetry (skewness) criterion ( b );

· Kurtosis criterion ( g );

· Shapiro–Wilk test ( W ).

An analysis of the nature of the data distribution (also called a test for normality of distribution) is carried out for each parameter. To confidently judge whether the distribution of a parameter corresponds to the normal law, a sufficiently large number of observation units (at least 30 values) is required.

For a normal distribution, the skewness and kurtosis criteria take the value 0. If the distribution is skewed to the right, b > 0 (positive asymmetry); if b < 0, the distribution graph is shifted to the left (negative asymmetry). The asymmetry criterion checks the shape of the distribution curve. For the normal law, g = 0. If g > 0, the distribution curve is sharper; if g < 0, the peak is flatter than that of the normal distribution.

To check for normality using the Shapiro–Wilk test, the critical value of the criterion is found from statistical tables for the required significance level and the number of observation units (see Appendix 1). The normality hypothesis is rejected at small values of this criterion, as a rule at W < 0.8.
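The skewness and kurtosis criteria can be sketched in plain Python. The text gives no formulas for b and g, so the moment-based definitions below (third and fourth central moments, with 3 subtracted for kurtosis) are an assumption, and the function names and data are hypothetical. The Shapiro–Wilk statistic is harder to compute by hand and is usually taken from a statistics package, so it is not reproduced here.

```python
def central_moment(values, order):
    """Central moment of the given order (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n

def skewness(values):
    """b: third central moment divided by sigma cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """g: fourth central moment divided by sigma^4, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical, symmetric about 3
right_tailed = [1, 1, 1, 2, 10]  # hypothetical, long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric series gives b = 0 and a negative g (a flatter shape than the normal curve), while the series with a long right tail gives b > 0. With so few values this only shows the direction of the indicators. As the text notes, at least 30 observations are needed for a confident judgement.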


