
Critical values of the Spearman rank correlation coefficient. Correlation analysis using the Spearman method (Spearman ranks)

The rank correlation coefficient proposed by C. Spearman is a nonparametric measure of the relationship between variables measured on a rank scale. Calculating this coefficient requires no assumptions about the distributions of the characteristics in the population. It determines the degree of closeness of the connection between ordinal characteristics, which in this case are the ranks of the compared quantities.

The Spearman correlation coefficient also lies in the range from −1 to +1. Like the Pearson coefficient, it can be positive or negative, characterizing the direction of the relationship between two characteristics measured on a rank scale.

In principle, the number of ranked features (qualities, traits, etc.) can be anything, but ranking more than 20 features is laborious. Perhaps this is why the table of critical values of the rank correlation coefficient was calculated only for up to forty ranked features (n ≤ 40; Table 20 of Appendix 6).

Spearman's rank correlation coefficient is calculated using the formula:

rs = 1 – 6·ΣD² / (n·(n² – 1)),

where n is the number of ranked features (indicators, subjects);

D is the difference between the ranks of the two variables for each subject;

ΣD² is the sum of the squared rank differences.
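Where the ranks are already assigned, the computation is a one-liner; here is a minimal sketch in Python (the function name and example ranks are illustrative, not from the source):

```python
# Minimal sketch of the formula r_s = 1 - 6*sum(D^2) / (n*(n^2 - 1)),
# assuming two equal-length lists of ranks with no ties.
def spearman_rs(ranks_x, ranks_y):
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(spearman_rs([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0  (identical rankings)
print(spearman_rs([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0 (reversed rankings)
```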

Let us consider the following example of using the rank correlation coefficient.

Example: A psychologist wants to find out how individual indicators of school readiness, obtained before the start of school from 11 first-graders, are related to their average performance at the end of the school year.

To solve this problem we ranked, first, the values of the school-readiness indicators obtained upon admission to school and, second, the final average academic performance of these same students at the end of the year. We present the results in Table 13.

Table 13 (columns: student no.; ranks of school-readiness indicators; ranks of average annual performance)

We substitute the obtained data into the formula and perform the calculation, obtaining rs = 0.76.

To find the significance level, we refer to Table 20 of Appendix 6, which contains the critical values of the rank correlation coefficient.

We emphasize that in Table 20 of Appendix 6, as in the table for the Pearson linear correlation, all values of the correlation coefficients are given in absolute value. Therefore, the sign of the correlation coefficient is taken into account only when interpreting it.

Significance levels in this table are found by the number n, i.e. by the number of subjects. In our case n = 11. For this number we find the critical values:

rcr = 0.61 for P ≤ 0.05;

rcr = 0.76 for P ≤ 0.01.

We construct the corresponding "significance axis":

The obtained correlation coefficient coincided with the critical value for the 1% significance level. Consequently, it can be argued that the indicators of school readiness and the final grades of first-graders are connected by a positive correlation: the higher the indicator of school readiness, the better the first-grader studies. In terms of statistical hypotheses, the psychologist must reject the null hypothesis of no relationship and accept the alternative hypothesis, which states that the relationship between indicators of school readiness and average academic performance is different from zero.

The case of identical (equal) ranks

If there are identical (tied) ranks, the formula for calculating the Spearman rank correlation coefficient is slightly different. In this case, new terms that take the tied ranks into account are added to the formula. They are called corrections for tied ranks and are added to ΣD² in the numerator of the calculation formula.

For a single group of tied ranks in each column the corrections are

D1 = (n³ – n)/12 and D2 = (k³ – k)/12,

where n is the number of identical ranks in the first column and k is the number of identical ranks in the second column.

If there are two groups of identical ranks in a column, the correction for that column becomes the sum of the corrections for both groups, where n is the number of identical ranks in the first group of the ranked column and k is the number of identical ranks in the second group. In the general case, the correction is summed over all groups of tied ranks, each group of t identical ranks contributing (t³ – t)/12, and the formula becomes:

rs = 1 – 6·(ΣD² + D1 + D2) / (n·(n² – 1)).
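A short sketch of these corrections in code, assuming (as the text describes) that each group of t tied ranks contributes (t³ – t)/12 and that the corrections are added to ΣD² in the numerator:

```python
from collections import Counter

# Sketch: tie corrections plus the corrected Spearman coefficient.
def tie_correction(ranks):
    return sum((t ** 3 - t) / 12 for t in Counter(ranks).values() if t > 1)

def spearman_rs_ties(ranks_x, ranks_y):
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    correction = tie_correction(ranks_x) + tie_correction(ranks_y)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

# Two tied ranks contribute (2**3 - 2)/12 = 0.5; three contribute 2.0.
print(tie_correction([1.0, 2.5, 2.5, 4.0]))  # 0.5
```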

Example: A psychologist uses a school mental development test (SHTUR) to study the intelligence of 12 students in grade 9. At the same time, he asks the literature and mathematics teachers to rank these same students by indicators of mental development. The task is to determine how the objective indicators of mental development (SHTUR data) relate to the teachers' expert assessments.

We present the experimental data of this problem and the additional columns needed to calculate the Spearman correlation coefficient in Table 14.

Table 14 (columns: student no.; SHTUR test ranks; expert assessments of the mathematics teacher; expert assessments of the literature teacher; D between the second and third columns; D between the second and fourth columns; D² between the second and third columns; D² between the second and fourth columns)

Since tied ranks occur in the ranking, it is necessary to check the correctness of the ranking in the second, third and fourth columns of the table. Summing each of these columns gives the same total: 78.

We check this against the calculated rank sum, Σ = n(n + 1)/2 = 12·13/2 = 78. The check agrees.

The fifth and sixth columns of the table show, for each student, the rank differences between the SHTUR test ranks and the teachers' expert assessments in mathematics and in literature, respectively. The sum of the rank differences must equal zero. Summing the D values in the fifth and sixth columns gives the required result, so the subtraction of ranks was carried out correctly. A similar check must be made every time a complex type of ranking is carried out.

Before starting the calculation using the formula, it is necessary to calculate the corrections for tied ranks for the second, third and fourth columns of the table.

In our case, the second column of the table contains two identical ranks, so by the formula the correction D1 = (2³ – 2)/12 = 0.5.

The third column contains three identical ranks, so by the formula the correction D2 = (3³ – 3)/12 = 2.

The fourth column of the table contains two groups of three identical ranks, so by the formula the correction D3 = 2·(3³ – 3)/12 = 4.

Before proceeding to the solution, recall that the psychologist is clarifying two questions: how the SHTUR test ranks relate to the expert assessments in mathematics and how they relate to the expert assessments in literature. That is why the calculation is carried out twice.

We calculate the first rank correlation coefficient with the tie corrections, using the formula. We get:

Now we calculate it without the corrections:

As we can see, the difference between the values of the correlation coefficients turned out to be very small.

We calculate the second rank correlation coefficient with the tie corrections, using the formula. We get:

Now without the corrections:

Again, the differences are very minor. Since the number of students is the same in both cases, we use Table 20 of Appendix 6 to find the critical values at n = 12 for both correlation coefficients at once:

rcr = 0.58 for P ≤ 0.05;

rcr = 0.73 for P ≤ 0.01.

We plot the first value on the "significance axis":

In the first case, the obtained rank correlation coefficient is in the zone of significance. Therefore, the psychologist must reject the null hypothesis that the correlation coefficient equals zero and accept the alternative hypothesis that it is significantly different from zero. In other words, the result suggests that the higher the students' SHTUR test ranks, the higher their expert assessments in mathematics.

We plot the second value on the "significance axis":

In the second case, the rank correlation coefficient is in the zone of uncertainty. Therefore, the psychologist can accept the null hypothesis that the correlation coefficient equals zero and reject the alternative hypothesis that it is significantly different from zero. In this case, the result suggests that the students' SHTUR test ranks are not related to the expert assessments in literature.

To apply the Spearman correlation coefficient, the following conditions must be met:

1. The variables being compared must be obtained on an ordinal (rank) scale; they may also be measured on an interval or ratio scale, in which case the values are converted to ranks.

2. The nature of the distribution of correlated quantities does not matter.

3. The number of varying characteristics in the compared variables X and Y must be the same.

Tables for determining the critical values of the Spearman correlation coefficient (Table 20, Appendix 6) are calculated for numbers of characteristics from n = 5 to n = 40; with a larger number of compared variables, the table for the Pearson correlation coefficient should be used (Table 19, Appendix 6). The critical values are found at k = n.
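For a quick cross-check, or when n exceeds the table's limit of 40, a library routine can be used; a sketch with scipy (the rank data below are made up for illustration):

```python
from scipy.stats import spearmanr

# spearmanr also returns an asymptotic p-value, useful beyond the n = 40 table.
readiness = [3, 5, 6, 1, 9, 7, 8, 2, 4, 11, 10]     # hypothetical ranks, 11 pupils
performance = [2, 6, 5, 1, 10, 8, 7, 3, 4, 11, 9]   # hypothetical ranks
rho, p = spearmanr(readiness, performance)
print(round(rho, 3), round(p, 4))
```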

Correlation analysis is a method that detects dependencies between random variables. Its purpose is to estimate the strength of the connections between such random variables, or between the signs characterizing certain real processes.


37. Spearman's rank correlation coefficient.


http://psystat.at.ua/publ/1-1-0-33

Spearman's rank correlation coefficient is used in cases where:
- the variables are measured on a ranking scale;
- the distribution of the data differs too much from normal, or is not known at all;
- the samples are small (N < 30).

The interpretation of the Spearman rank correlation coefficient does not differ from that of the Pearson coefficient, but its meaning is somewhat different. To understand the difference between these methods and logically justify their areas of application, let us compare their formulas.

Pearson correlation coefficient:

rxy = Σ (xi – Mx)·(yi – My) / ((n – 1)·sx·sy).

Spearman correlation coefficient:

rs = 1 – 6·Σd² / (n·(n² – 1)).

As you can see, the formulas differ significantly. Let us compare them.

The Pearson formula uses the arithmetic mean and standard deviation of the correlated series, while the Spearman formula does not. Thus, to obtain an adequate result with the Pearson formula, the correlated series must be close to the normal distribution (the mean and standard deviation are the parameters of the normal distribution). This is not relevant for the Spearman formula.

An element of the Pearson formula is the standardization of each series on the z-scale. As you can see, the conversion of variables to z-scores is present in the formula for the Pearson correlation coefficient. Accordingly, for the Pearson coefficient the scale of the data does not matter at all: for example, we can correlate two variables, one of which has min = 0 and max = 1 and the second min = 100 and max = 1000. However different the ranges of values, they will all be converted to standard z-values of the same scale.

Such normalization does not occur in the Spearman coefficient, therefore

A MANDATORY CONDITION FOR USING THE SPEARMAN COEFFICIENT IS THE EQUALITY OF THE RANGE OF THE TWO VARIABLES.

Before using the Spearman coefficient on data series with different ranges, they must be ranked. Ranking makes the values of these series acquire the same minimum of 1 (the minimum rank) and a maximum equal to the number of values (the maximum, last rank N, i.e. the number of cases in the sample).

In what cases can you do without ranking?

These are cases when the data are initially on a ranking scale, for example Rokeach's test of value orientations.

Also, these are cases when the number of value options is small and the sample contains a fixed minimum and maximum. For example, in a semantic differential, minimum = 1, maximum = 7.

Example of calculating Spearman's rank correlation coefficient

Rokeach's test of value orientations was administered to two samples X and Y. The objective: to find out how close the value hierarchies of these samples are (literally, how similar they are).

The resulting value r = 0.747 is checked against the table of critical values. According to the table, with N = 18 the obtained value is significant at the level p ≤ 0.005.
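As an alternative to the table lookup, the significance of rs is often checked via Student's t-approximation, t = r·√((n − 2)/(1 − r²)) with df = n − 2; a sketch applying it to the values above (this is a common approximation, not the table method used in the text):

```python
import math

# t-approximation for the significance of a correlation coefficient.
r, n = 0.747, 18
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # ~4.49 with df = 16, consistent with significance at p <= 0.005
```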

Spearman and Kendall rank correlation coefficients

For variables belonging to an ordinal scale, or for variables that do not follow a normal distribution (as well as for variables belonging to an interval scale), the Spearman rank correlation is calculated instead of the Pearson coefficient. To do this, individual variable values are assigned ranks, which are then processed using the appropriate formulas. To obtain a rank correlation, clear the default Pearson correlation check box in the Bivariate Correlations... dialog box and activate the Spearman correlation calculation instead. This calculation gives the following results: the rank correlation coefficients are very close to the corresponding values of the Pearson coefficients (the original variables have a normal distribution).

titkova-matmetody.pdf p. 45

Spearman's rank correlation method determines the tightness (strength) and direction of the correlation between two characteristics or two profiles (hierarchies) of characteristics.

To calculate a rank correlation, two series of values that can be ranked are needed. Such series of values could be:

1) two characteristics measured in the same group of subjects;

2) two individual hierarchies of characteristics identified in two subjects using the same set of characteristics;

3) two group hierarchies of characteristics;

4) an individual and a group hierarchy of characteristics.

First, the indicators are ranked separately for each characteristic. As a rule, a lower rank is assigned to a lower value of the characteristic.

In the first case (two characteristics), the individual values obtained by the different subjects are ranked for the first characteristic, and then the individual values are ranked for the second characteristic.

If two characteristics are positively related, then subjects with low ranks on one of them will have low ranks on the other, and subjects with high ranks on one characteristic will also have high ranks on the other. To calculate rs, the differences (d) between the ranks obtained by a given subject on the two characteristics must be determined. These d values are then transformed in a certain way and subtracted from 1. The smaller the differences between the ranks, the larger rs will be, the closer to +1.

If there is no correlation, all the ranks will be mixed and show no correspondence; the formula is designed so that in this case rs will be close to 0.

In the case of a negative correlation, low ranks of subjects on one characteristic will correspond to high ranks on the other, and vice versa. The greater the discrepancy between the subjects' ranks on the two variables, the closer rs is to −1.

In the second case (two individual profiles), the individual values obtained by each of the two subjects for a certain (identical for both) set of characteristics are ranked. The first rank goes to the characteristic with the lowest value; the second rank, to the characteristic with the next higher value, and so on. Obviously, all the characteristics must be measured in the same units, otherwise ranking is impossible. For example, the indicators of the Cattell Personality Inventory (16PF) cannot be ranked if they are expressed in "raw" points, since the ranges of values differ from factor to factor: from 0 to 13, from 0 to 20 and from 0 to 26. We cannot say which factor takes first place in expression until we bring all the values to a single scale (most often the sten scale).

If the individual hierarchies of two subjects are positively related, then characteristics with low ranks in one of them will have low ranks in the other, and vice versa. For example, if for one subject factor E (dominance) has the lowest rank, then for the other subject it should also have a low rank; if for one subject factor C (emotional stability) has the highest rank, then the other subject should also give this factor a high rank, and so on.

In the third case (two group profiles), the group average values obtained in two groups of subjects are ranked for a set of characteristics identical for both groups. The line of reasoning is then the same as in the previous two cases.

In the fourth case (individual and group profiles), the individual values of one subject and the group average values are ranked separately for the same set of characteristics; the group averages are obtained, as a rule, excluding this individual subject, since his individual profile will be compared with a group profile in which he does not participate. Rank correlation lets us check how consistent the individual and group profiles are.

In all four cases, the significance of the resulting correlation coefficient is determined by the number of ranked values N. In the first case this number coincides with the sample size n. In the second case it is the number of characteristics making up the hierarchy. In the third and fourth cases N is likewise the number of compared characteristics, not the number of subjects in the groups. Detailed explanations are given in the examples. If the absolute value of rs reaches or exceeds the critical value, the correlation is reliable.

Hypotheses.

There are two possible pairs of hypotheses. The first applies to case 1, the second to the other three cases.

First pair of hypotheses:

H0: The correlation between variables A and B does not differ from zero.

H1: The correlation between variables A and B differs significantly from zero.

Second pair of hypotheses:

H0: The correlation between hierarchies A and B does not differ from zero.

H1: The correlation between hierarchies A and B differs significantly from zero.

Limitations of the rank correlation coefficient

1. At least 5 observations must be presented for each variable. The upper sampling boundary is determined by the available tables of critical values.

2. With a large number of identical ranks for one or both compared variables, Spearman's rank correlation coefficient rs gives coarse values. Ideally, both correlated series should be two sequences of non-repeating values. If this condition is not met, a correction for tied ranks must be made.

In the absence of tied ranks, Spearman's rank correlation coefficient is calculated using the formula:

rs = 1 – 6·Σd² / (n·(n² – 1)).

If both compared rank series contain groups of identical ranks, then before calculating the rank correlation coefficient, corrections for tied ranks, Ta and Tb, must be made:

Ta = Σ (a³ – a)/12,

Tb = Σ (b³ – b)/12,

where a is the size of each group of identical ranks in rank series A, and b is the size of each group of identical ranks in rank series B.

To calculate the empirical value of rs in this case, the formula is:

rs = 1 – 6·(Σd² + Ta + Tb) / (n·(n² – 1)).

38. Point-biserial correlation coefficient.

About correlation in general, see question No. 36, p. 56 (64).

harchenko-korranaliz.pdf

Let variable X be measured on a metric (interval or ratio) scale, and variable Y on a dichotomous scale. The point-biserial correlation coefficient rpb is calculated using the formula:

rpb = ((x̄1 – x̄0) / sx) · √(n1·n0 / (n·(n – 1))),

where x̄1 is the mean of X over objects with a value of "one" on Y;

x̄0 is the mean of X over objects with a value of "zero" on Y;

sx is the standard deviation of all values of X;

n1 is the number of objects with "one" on Y, and n0 is the number of objects with "zero" on Y;

n = n1 + n0 is the sample size.

The point-biserial correlation coefficient can also be calculated from equivalent expressions that use the overall mean, for example:

rpb = ((x̄1 – x̄) / sx) · √(n·n1 / (n0·(n – 1))),

where x̄ is the overall mean of the variable X.

The point-biserial correlation coefficient rpb varies from –1 to +1. Its value is zero if the objects with a one on Y have the same mean on X as the objects with a zero on Y.

Testing the significance of the point-biserial correlation coefficient amounts to testing the null hypothesis H0 that the population correlation coefficient equals zero (ρ = 0), which is carried out using Student's t-test. The empirical value

t = rpb · √((n – 2) / (1 – rpb²))

is compared with the critical value ta(df) for df = n – 2 degrees of freedom.

If |t| ≤ ta(df), the null hypothesis ρ = 0 is not rejected. The point-biserial correlation coefficient differs significantly from zero if the empirical value |t| falls into the critical region, that is, if |t| > ta(n – 2). The reliability of a relationship calculated with the point-biserial correlation coefficient rpb can also be determined using the χ² criterion with df = 2 degrees of freedom.
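A sketch of this computation on made-up data, cross-checked against scipy's pointbiserialr (which computes the same Pearson-type coefficient); the formula used is the reconstructed one above:

```python
import statistics
from scipy.stats import pointbiserialr

x = [12.0, 15.0, 11.0, 14.0, 16.0, 10.0, 13.0, 17.0]  # metric variable (made up)
y = [0, 1, 0, 1, 1, 0, 0, 1]                          # dichotomous variable

x1 = [xi for xi, yi in zip(x, y) if yi == 1]
x0 = [xi for xi, yi in zip(x, y) if yi == 0]
n1, n0, n = len(x1), len(x0), len(x)
s_x = statistics.stdev(x)  # sample standard deviation (divisor n - 1)
r_pb = (statistics.mean(x1) - statistics.mean(x0)) / s_x * (n1 * n0 / (n * (n - 1))) ** 0.5

r_lib, p = pointbiserialr(y, x)         # library cross-check
print(round(r_pb, 4), round(r_lib, 4))  # the two values coincide
```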

Point-biserial correlation

A subsequent modification of the product-moment correlation coefficient is the point-biserial r. This statistic shows the relationship between two variables, one of which is assumed continuous and normally distributed while the other is discrete in the strict sense of the word. The point-biserial correlation coefficient is denoted rpbis. Since in rpbis the dichotomy reflects the true nature of the discrete variable rather than being artificial, as in the case of rbis, its sign is determined arbitrarily. Therefore, for all practical purposes, rpbis is considered in the range from 0.00 to +1.00.

There is also the case where two variables are assumed continuous and normally distributed but both are artificially dichotomized, as in biserial correlation. To assess the relationship between such variables, the tetrachoric correlation coefficient rtet is used, which was also derived by Pearson. The basic (exact) formulas and procedures for calculating rtet are quite complex, so in practice approximations of rtet obtained from shortened procedures and tables are used.

/on-line/dictionary/dictionary.php?term=511

THE POINT-BISERIAL COEFFICIENT is the correlation coefficient between two variables, one measured on a dichotomous scale and the other on an interval scale. It is used in classical and modern testing as an indicator of the quality of a test item: its reliability and consistency with the overall test score.

To correlate variables measured on a dichotomous and an interval scale, the point-biserial correlation coefficient is used.
The point-biserial correlation coefficient is a method of correlation analysis for the relationship between variables one of which is measured on a nominal scale and takes only two values (for example, men/women, correct/incorrect answer, feature present/absent), while the second is measured on a ratio or interval scale. The formula for calculating the point-biserial correlation coefficient:

rpb = ((m1 – m0) / σx) · √(n1·n0 / n²),

where:
m1 and m0 are the mean values of X for objects with a value of 1 or 0 on Y;
σx is the standard deviation of all values of X (computed with divisor n);
n1 and n0 are the numbers of X values with 1 or 0 on Y;
n is the total number of value pairs.

Most often this type of correlation coefficient is used to calculate the relationship between test items and the total scale score. This is one type of validity check.

39. Rank-biserial correlation coefficient.

About correlation in general, see question No. 36, p. 56 (64).

harchenko-korranaliz.pdf p. 28

The rank-biserial correlation coefficient, used when one of the variables (X) is presented on an ordinal scale and the other (Y) is dichotomous, is calculated by the formula

rrb = 2·(x̄1 – x̄0) / n,

where x̄1 is the average rank of objects with a one on Y, x̄0 is the average rank of objects with a zero on Y, and n is the sample size.

Testing the significance of the rank-biserial correlation coefficient is carried out in the same way as for the point-biserial coefficient, using Student's t-test with rpb replaced by rrb in the formulas.

When one variable is measured on a dichotomous scale (variable X) and the other on a rank scale (variable Y), the rank-biserial correlation coefficient is used. Recall that the variable X, measured on the dichotomous scale, takes only two values (codes), 0 and 1. We especially emphasize: although this coefficient varies in the range from –1 to +1, its sign does not matter for interpreting the results. This is another exception to the general rule.

This coefficient is calculated using the formula:

rrb = 2·(X̄1 – X̄0) / N,

where X̄1 is the average rank of those elements of variable Y that correspond to code (feature) 1 in variable X;

X̄0 is the average rank of those elements of variable Y that correspond to code (feature) 0 in variable X;

N is the total number of elements in variable X.

To apply the rank-biserial correlation coefficient, the following conditions must be met:

1. The variables being compared must be measured on different scales: one (X) on a dichotomous scale, the other (Y) on a ranking scale.

2. The number of varying characteristics in the compared variables X and Y must be the same.

3. To assess the reliability of the rank-biserial correlation coefficient, use formula (11.9) and the table of critical values for the Student test with k = n – 2.

http://psystat.at.ua/publ/drugie_vidy_koehfficienta_korreljacii/1-1-0-38

Cases where one of the variables is presented on a dichotomous scale and the other on a rank (ordinal) scale require the rank-biserial correlation coefficient:

rrb = (2 / n) · (m1 – m0),

where:
n is the number of measured objects;
m1 and m0 are the average ranks of objects with 1 or 0 on the second variable.
This coefficient is also used when checking the validity of tests.
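A minimal sketch of the rank-biserial formula on made-up ranks and codes:

```python
from statistics import mean

# r_rb = 2*(m1 - m0)/n, with m1 and m0 the mean ranks of objects coded 1 and 0.
ranks = [1, 2, 3, 4, 5, 6, 7, 8]   # variable Y, already ranked (made up)
codes = [0, 0, 1, 0, 1, 1, 0, 1]   # variable X, dichotomous (made up)
n = len(ranks)
m1 = mean([r for r, c in zip(ranks, codes) if c == 1])
m0 = mean([r for r, c in zip(ranks, codes) if c == 0])
r_rb = 2 * (m1 - m0) / n
print(r_rb)  # m1 = 5.5, m0 = 3.5 -> r_rb = 0.5
```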

40. Linear correlation coefficient.

For correlation in general (and linear correlation in particular), see question No. 36, p. 56 (64).

PEARSON'S COEFFICIENT r

r-Pearson (Pearson's r) is used to study the relationship between two metric variables measured on the same sample. There are many situations where its use is appropriate. Does intelligence affect academic performance in the senior university years? Is the size of an employee's salary related to his friendliness towards colleagues? Does a student's mood affect the success of solving a complex arithmetic problem? To answer such questions, the researcher must measure the two indicators of interest for each member of the sample. Data for studying the relationship are then tabulated, as in the example below.

EXAMPLE 6.1

The table shows an example of initial data for measuring two indicators of intelligence (verbal and nonverbal) for 20 8th grade students.

The relationship between these variables can be depicted using a scatterplot (see Figure 6.3). The diagram shows that there is some relationship between the measured indicators: the greater the value of verbal intelligence, the greater (mostly) the value of non-verbal intelligence.

Before giving the formula for the correlation coefficient, let us trace the logic of its derivation using the data of example 6.1. The position of each i-th point (the subject with number i) on the scatter diagram relative to the other points (Fig. 6.3) can be specified by the values and signs of the deviations of the corresponding variable values from their means: (xi – Mx) and (yi – My). If the signs of these deviations coincide, this indicates a positive relationship (larger values of x correspond to larger values of y, or smaller values of x correspond to smaller values of y).

For subject No. 1, the deviations from the mean on x and on y are both positive, while for subject No. 3 both deviations are negative. Consequently, the data of both indicate a positive relationship between the studied traits. Conversely, if the signs of the deviations from the mean on x and on y differ, this indicates a negative relationship between the characteristics: for subject No. 4 the deviation from the mean on x is negative and on y positive, and for subject No. 9, vice versa.

Thus, if the product of deviations (xi – Mx)·(yi – My) is positive, the data of the i-th subject indicate a direct (positive) relationship, and if negative, an inverse (negative) relationship. Accordingly, if x and y are generally related in direct proportion, most of the products of deviations will be positive, and if they are inversely related, most of the products will be negative. Therefore, a general indicator of the strength and direction of the relationship can be the sum of all products of deviations for a given sample:

Σ (xi – Mx)·(yi – My).

With a directly proportional relationship between the variables this value is large and positive: for most subjects the deviations coincide in sign (large values of one variable correspond to large values of the other, and vice versa). If x and y are inversely related, then for most subjects larger values of one variable correspond to smaller values of the other: the signs of the products will be negative, and the sum of the products as a whole will also be large in absolute value but negative in sign. If there is no systematic connection between the variables, the positive terms (products of deviations) are balanced by the negative ones, and the sum of all products of deviations is close to zero.

To ensure that the sum of the products does not depend on the sample size, it is enough to average it. But we are interested in the measure of interconnection not as a population parameter but as its calculated estimate, a statistic. Therefore, as in the variance formula, we divide the sum of the products of deviations not by N but by N – 1. The result is a measure of connection, widely used in physics and the technical sciences, called the covariance:

cov(x, y) = Σ (xi – Mx)·(yi – My) / (N – 1).

In psychology, unlike physics, most variables are measured on arbitrary scales, since psychologists are interested not in the absolute value of a characteristic but in the relative standing of the subjects in the group. Moreover, the covariance is very sensitive to the scale (the variance) on which the traits are measured. To make the measure of connection independent of the units of measurement of both characteristics, it is enough to divide the covariance by the corresponding standard deviations. This yields the formula for the Pearson correlation coefficient:

rxy = cov(x, y) / (sx·sy),

or, after substituting the expressions for sx and sy:

rxy = Σ (xi – Mx)·(yi – My) / √(Σ (xi – Mx)² · Σ (yi – My)²).

If the values of both variables are converted to z-scores using the formula

z = (x – Mx) / sx,

then the formula for the r-Pearson correlation coefficient looks simpler:

rxy = Σ (zx·zy) / (N – 1).
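A sketch verifying this simplified z-score form numerically against numpy's built-in Pearson correlation (the data are illustrative):

```python
import numpy as np

# With z-scores based on the (n - 1) standard deviation, r = sum(z_x*z_y)/(N - 1).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = float(np.sum(zx * zy) / (n - 1))
print(round(r, 6), round(float(np.corrcoef(x, y)[0, 1]), 6))  # identical values
```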

/dict/sociology/article/soc/soc-0525.htm

LINEAR CORRELATION is a statistical linear relationship of a non-causal nature between two quantitative variables x and y. It is measured using Pearson's linear correlation coefficient, which is the result of dividing the covariance by the standard deviations of both variables:

rxy = sxy / (sx·sy),

where sxy is the covariance between the variables x and y;

sx, sy are the standard deviations of the variables x and y;

xi, yi are the values of the variables x and y for the object with number i;

x̄, ȳ are the arithmetic means of the variables x and y.

The Pearson coefficient r can take values from the interval [–1; +1]. The value r = 0 means there is no linear relationship between x and y (but does not exclude a nonlinear statistical relationship). Positive values (r > 0) indicate a direct linear relationship; the closer the value is to +1, the stronger the direct linear relationship. Negative values (r < 0) indicate an inverse linear relationship; the closer the value is to –1, the stronger the inverse relationship. Values r = ±1 mean a complete linear relationship, direct or inverse. In the case of a complete relationship, all points with coordinates (xi, yi) lie on the straight line y = a + bx.

Pearson's linear correlation coefficient is also used to measure the strength of the connection in the linear pairwise regression model.

41. Correlation matrix and correlation graph.

About correlation in general, see question No. 36, p. 56 (64).

Correlation matrix. Often, correlation analysis includes the study of relationships not between two but between many variables measured on a quantitative scale in one sample. In this case, correlations are calculated for each pair of this set of variables. The calculations are usually carried out on a computer, and the result is a correlation matrix.

A correlation matrix (Correlation Matrix) is the result of calculating correlations of one type for each pair of the set of P variables measured on a quantitative scale in one sample.

EXAMPLE

Suppose we are studying the relationships between 5 variables (v1, v2, ..., v5; P = 5) measured on a sample of N = 30 people. Below are the table of source data and the correlation matrix.

Initial data:

Correlation matrix:


The correlation matrix is square: the number of rows and columns equals the number of variables. It is symmetric about the main diagonal, since the correlation of x with y equals the correlation of y with x. Ones lie on its main diagonal, since the correlation of a feature with itself is equal to one. Consequently, not all elements of the correlation matrix are subject to analysis, but only those above or below the main diagonal.

The number of correlation coefficients to be analyzed when studying the relationships of P features is determined by the formula P(P – 1)/2. In the above example, the number of such correlation coefficients is 5(5 – 1)/2 = 10.
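A sketch of computing such a matrix and extracting the P(P − 1)/2 unique coefficients (random data, for illustration only):

```python
import numpy as np

# Correlation matrix for P = 5 variables on N = 30 cases.
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 5))
R = np.corrcoef(data, rowvar=False)   # 5 x 5, symmetric, ones on the diagonal
unique = R[np.triu_indices(5, k=1)]   # the 10 unique off-diagonal coefficients
print(R.shape, unique.size)           # (5, 5) 10
```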

The main task of analyzing a correlation matrix is to identify the structure of the relationships among the many features. A visual analysis of correlation pleiades, a graphic image of the structure of the statistically significant connections, is possible if there are not very many such connections (up to 10-15). Another way is to use multivariate methods: multiple regression, factor or cluster analysis (see the section "Multivariate methods..."). Using factor or cluster analysis, it is possible to identify groupings of variables that are more closely related to each other than to the other variables. A combination of these methods is also very effective, for example when there are many features and they are not homogeneous.

Comparison of correlations is an additional task of analyzing a correlation matrix, and it has two variants. If correlations in one of the rows of the correlation matrix (for one of the variables) need to be compared, the comparison method for dependent samples is used (pp. 148-149). When comparing correlations of the same name calculated for different samples, the comparison method for independent samples is used (pp. 147-148).

Methods for comparing correlations on the diagonals of a correlation matrix (to assess the stationarity of a random process) and for comparing several correlation matrices obtained for different samples (for their homogeneity) are labor-intensive and beyond the scope of this book. You can get acquainted with these methods in the book by G. V. Sukhodolsky.

The problem of the statistical significance of correlations. The procedure of statistical hypothesis testing assumes a single test carried out on one sample. If the same method is applied repeatedly, even to different variables, the probability of obtaining a result purely by chance increases. In general, if we repeat the same hypothesis-testing method k times, in relation to different variables or samples, then with the established level α we are guaranteed to receive confirmation of the hypothesis in about α·k cases.

Suppose a correlation matrix for 15 variables is analyzed, that is, 15(15 – 1)/2 = 105 correlation coefficients are calculated. The level α = 0.05 is set for testing the hypotheses. Testing the hypothesis 105 times, we will receive confirmation about five times (!) regardless of whether the connection actually exists. Knowing this and having, say, 15 "statistically significant" correlation coefficients, can we tell which were obtained by chance and which reflect a real relationship?

Strictly speaking, to make a statistical decision we would need to reduce the level α by a factor equal to the number of hypotheses tested. But this is hardly advisable, since the probability of ignoring a really existing connection (making a Type II error) then increases unpredictably.

The correlation matrix alone is not a sufficient basis for statistical conclusions about the individual correlation coefficients included in it!

There is only one truly convincing way to solve this problem: divide the sample randomly into two parts and take into account only those correlations that are statistically significant in both parts of the sample. An alternative is the use of multivariate methods (factor, cluster or multiple regression analysis) to identify and then interpret groups of statistically significantly related variables.

The problem of missing values. If there are missing values in the data, two options are possible for calculating the correlation matrix: a) row-by-row removal of values (Exclude cases listwise); b) pairwise removal of values (Exclude cases pairwise). With row-by-row deletion, the entire row for an object (subject) that has at least one missing value on one of the variables is deleted. This method leads to a "correct" correlation matrix in the sense that all coefficients are calculated from the same set of objects. However, if the missing values are distributed randomly across the variables, this method can leave not a single object in the data set (each row may contain at least one missing value). To avoid this situation, another method, called pairwise removal, is used. It considers only the gaps in each selected pair of variables and ignores gaps in the other variables; the correlation for a pair of variables is calculated over the objects that have no gaps in that pair. In many situations, especially when the number of gaps is relatively small, say 10%, and the gaps are distributed fairly randomly, this method does not lead to serious errors. However, sometimes it does. For example, a systematic bias (shift) of the estimates may be hidden behind a systematic arrangement of omissions, which causes correlation coefficients built on different subsets (for example, on different subgroups of objects) to differ. Another problem with a correlation matrix calculated with pairwise removal of gaps arises when this matrix is used in other kinds of analysis (for example, multiple regression or factor analysis). These assume a "correct" correlation matrix with a certain level of consistency and "compliance" of the various coefficients. Using a matrix with "bad" (biased) estimates leads either to the program being unable to analyze such a matrix, or to erroneous results. Therefore, if the pairwise method of excluding missing data is used, it is necessary to check whether there are systematic patterns in the distribution of the gaps.

If pairwise deletion of missing data does not lead to any systematic shift in the means and variances (standard deviations), these statistics will be similar to those calculated by the row-by-row method of deleting missing data. If a significant difference is observed, there is reason to suspect a shift in the estimates. For example, if the mean (or standard deviation) of the values of variable A used in calculating its correlation with variable B differs markedly from the mean (or standard deviation) of the values of the same variable A used in calculating its correlation with variable C, then there is every reason to expect that these two correlations (A-B and A-C) are based on different subsets of the data. There will be a bias in the correlations caused by the non-random placement of the gaps in the values of the variables.
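A sketch contrasting the two strategies with pandas, whose DataFrame.corr() deletes missing values pairwise, while dropping incomplete rows first gives listwise deletion (the tiny data set is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [2.0, np.nan, 3.0, 5.0, 4.0],
    "C": [1.0, 2.0, 3.0, np.nan, 5.0],
})
pairwise = df.corr()           # each pair uses every row complete for that pair
listwise = df.dropna().corr()  # every pair uses only fully complete rows
print((pairwise - listwise).abs().max().max())  # nonzero: the strategies differ here
```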

Analysis of correlation pleiades. After the problem of the statistical significance of the elements of the correlation matrix is solved, the statistically significant correlations can be represented graphically in the form of a correlation pleiade (galaxy). A correlation pleiade is a figure consisting of vertices and the lines connecting them. The vertices correspond to the characteristics and are usually designated by numbers, the variable numbers. The lines correspond to the statistically significant connections and graphically express the sign, and sometimes the p-level of significance, of the connection.

A correlation pleiade can reflect all the statistically significant connections of the correlation matrix (it is then sometimes called a correlation graph) or only a meaningfully selected part of them (for example, those corresponding to one factor according to the results of a factor analysis).
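A sketch of building such a graph with networkx; the matrix and the critical value of 0.45 are made up for illustration:

```python
import networkx as nx
import numpy as np

# Vertices are variables; edges are correlations whose absolute value
# reaches the critical threshold.
R = np.array([
    [1.00, 0.62, 0.10, 0.55],
    [0.62, 1.00, 0.05, 0.48],
    [0.10, 0.05, 1.00, 0.12],
    [0.55, 0.48, 0.12, 1.00],
])
critical = 0.45
G = nx.Graph()
G.add_nodes_from(range(1, 5))            # variables numbered 1..4
for i in range(4):
    for j in range(i + 1, 4):
        if abs(R[i, j]) >= critical:
            G.add_edge(i + 1, j + 1, weight=R[i, j])

# The vertex with the highest degree is the "core" of the pleiade.
print(sorted(G.degree, key=lambda d: -d[1]))  # [(1, 2), (2, 2), (4, 2), (3, 0)]
```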

EXAMPLE OF CONSTRUCTING A CORRELATION PLEIADE



  • Spearman rank correlation (rank correlation). Spearman's rank correlation is the simplest way to determine the degree of relationship between factors. The name of the method indicates that the relationship is determined between ranks, that is, between series of obtained quantitative values ranked in descending or ascending order. Bear in mind, first, that rank correlation is not recommended when there are fewer than four or more than twenty pairs; second, that rank correlation also makes it possible to determine a relationship when the values are semi-quantitative in nature, that is, have no numerical expression but reflect a clear order of occurrence; third, that rank correlation is advisable when approximate data are sufficient. An example of calculating the rank correlation coefficient to answer the question: do questionnaires X and Y measure similar personal qualities of the subjects? Using two questionnaires (X and Y), which require alternative answers "yes" or "no", the primary results, the answers of the subjects (N = 10), were obtained. The results were presented as the sum of affirmative answers, separately for questionnaire X and for questionnaire Y. These results are summarized in Table 5.19.

    Table 5.19. Tabulation of primary results for calculating the Spearman rank correlation coefficient (ρ)*

    Analysis of the summary correlation matrix. Method of correlation galaxies.

    Example. Table 6.18 gives interpretations of the eleven variables tested using the Wechsler method. The data were obtained on a homogeneous sample aged 18 to 25 years (n = 800).

    Before stratification, it is advisable to rank the correlation matrix. To do this, the average values ​​of the correlation coefficients of each variable with all the others are calculated in the original matrix.

    Then, using Table 5.20, the permissible levels of stratification of the correlation matrix are determined for the given confidence probability of 0.95 and the given n.

    Table 6.20. Ranked correlation matrix

    Variables 1 2 3 4 5 6 7 8 9 10 11 M(rij) Rank
    1 1 0,637 0,488 0,623 0,282 0,647 0,371 0,485 0,371 0,365 0,336 0,454 1
    2 1 0,810 0,557 0,291 0,508 0,173 0,486 0,371 0,273 0,273 0,363 4
    3 1 0,346 0,291 0,406 0,360 0,818 0,346 0,291 0,282 0,336 7
    4 1 0,273 0,572 0,318 0,442 0,310 0,318 0,291 0,414 3
    5 1 0,354 0,254 0,216 0,236 0,207 0,149 0,264 11
    6 1 0,365 0,405 0,336 0,345 0,282 0,430 2
    7 1 0,310 0,388 0,264 0,266 0,310 9
    8 1 0,897 0,363 0,388 0,363 5
    9 1 0,388 0,430 0,846 6
    10 1 0,336 0,310 8
    11 1 0,300 10

    Designations: 1 - general awareness; 2 - conceptuality; 3 - attentiveness; 4 - ability to generalize; 5 - direct memorization (of numbers); 6 - level of mastery of the native language; 7 - speed of mastering sensorimotor skills (symbol coding); 8 - observation; 9 - combinatorial abilities (for analysis and synthesis); 10 - ability to organize parts into a meaningful whole; 11 - ability for heuristic synthesis; M(rij) - the average value of the correlation coefficients of the variable with the other variables (in our case n = 800); r(0) - the value of the zero "dissecting" plane, the minimum significant absolute value of the correlation coefficient (n = 120, r(0) = 0.236; n = 40, r(0) = 0.407); |Δr| - permissible stratification step (n = 40, |Δr| = 0.558); s - permissible number of stratification levels (n = 40, s = 1; n = 120, s = 2); r(1), r(2), ..., r(9) - absolute values of the cutting planes (n = 40, r(1) = 0.965).

    For n = 800, we find the value of r(0) and the boundaries of the r(i), after which we stratify the correlation matrix, highlighting correlation pleiades within the layers, or separate parts of the correlation matrix, drawing the associations of correlation pleiades for the overlying layers (Fig. 5.5).

    A meaningful analysis of the resulting pleiades goes beyond mathematical statistics. Two formal indicators help with the meaningful interpretation of pleiades. One significant indicator is the degree of a vertex, that is, the number of edges adjacent to the vertex. The variable with the largest number of edges is the "core" of the pleiade and can be considered an indicator of the remaining variables of this pleiade. Another significant indicator is the density of connections. A variable may have fewer but closer connections in one pleiade, and more but less close connections in another.

    Fig. 5.5. Correlation pleiades obtained by stratifying the matrix

    Predictions and estimates. The equation y = b1x + b0 is called the general equation of a straight line. It indicates that pairs of points (x, y) lying on a certain straight line are connected in such a way that for any value of x, the value of y paired with it can be found by multiplying x by a certain number b1 and adding the number b0 to that product.

    The regression coefficient allows you to determine the degree of change in the effect factor when the causal factor changes by one unit; in absolute values it characterizes the relationship between the variable factors. The regression coefficient is calculated using the formula:

    b1 = Σ (xi – x̄)(yi – ȳ) / Σ (xi – x̄)², with b0 = ȳ – b1·x̄.

    Design and analysis of experiments. Design and analysis of experiments is the third important branch of statistical methods developed to find and test causal relationships between variables.

    Recently, methods of mathematical experimental design have increasingly been used to study multifactorial dependencies.

    The ability to simultaneously vary all factors allows you to: a) reduce the number of experiments;

    b) reduce experimental error to a minimum;

    c) simplify the processing of received data;

    d) ensure clarity and ease of comparison of results.

    Each factor can acquire a certain corresponding number of different values, which are called levels and denoted -1, 0 and 1. A fixed set of factor levels determines the conditions of one of the possible experiments.

    The total number of all possible combinations is calculated as the product of the numbers of levels of the individual factors; with k factors each varied at p levels it equals p^k.

    A complete factorial experiment is an experiment in which all possible combinations of factor levels are implemented. Full factorial experiments can have the property of orthogonality. With orthogonal planning, the factors in the experiment are uncorrelated; the regression coefficients that are ultimately calculated are determined independently of each other.

    An important advantage of the method of mathematical experimental planning is its versatility and suitability in many areas of research.

    Let's consider an example of comparing the influence of some factors on the formation of the level of mental stress in color TV controllers.

    The experiment is based on an orthogonal 2³ design (three factors varied at two levels).

    The experiment was carried out with the complete 2³ plan, with three repetitions.

    Orthogonal planning is based on the construction of a regression equation. For three factors it looks like this:

    y = b0 + b1x1 + b2x2 + b3x3 + b12x1x2 + b13x1x3 + b23x2x3 + b123x1x2x3.

    Processing of the results in this example includes:

    a) construction of the orthogonal 2³ plan table for the calculation;

    b) calculation of regression coefficients;

    c) checking their significance;

    d) interpretation of the obtained data.

    To estimate the regression coefficients of the mentioned equation, N = 2³ = 8 runs had to be carried out so that the significance of the coefficients could be assessed; the number of repetitions K was 3.

    The matrix for planning the experiment looked like this:
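The actual plan matrix is not reproduced in the source, but a full factorial 2³ plan is easy to generate; a sketch with coded levels −1 and +1:

```python
from itertools import product

# All 8 runs of a full factorial 2^3 plan.
plan = list(product([-1, 1], repeat=3))
for run, levels in enumerate(plan, start=1):
    print(run, levels)

# Orthogonality check: any two factor columns have a zero dot product.
cols = list(zip(*plan))
print(sum(a * b for a, b in zip(cols[0], cols[1])))  # 0
```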

    In cases where the characteristics under study are measured on an order scale, or the form of the relationship differs from linear, the relationship between two random variables is studied using rank correlation coefficients. Consider the Spearman rank correlation coefficient. Calculating it requires ranking (ordering) the sample values. Ranking is the grouping of experimental data in a certain order, either ascending or descending.

    The ranking operation is carried out according to the following algorithm:

    1. A lower value is assigned a lower rank; the smallest value is assigned rank 1. The highest value is assigned a rank corresponding to the number of ranked values: for example, if n = 7, the highest value receives rank 7, except as provided by the second rule.

    2. If several values are equal, they are assigned a rank that is the average of the ranks they would receive if they were not equal. As an example, consider an ascending-ordered sample of 7 elements: 22, 23, 25, 25, 25, 28, 30. The values 22 and 23 each appear once, so their ranks are R22 = 1 and R23 = 2. The value 25 appears 3 times. If these values were not repeated, their ranks would be 3, 4, 5; therefore their rank R25 equals the arithmetic mean of 3, 4 and 5: (3 + 4 + 5)/3 = 4. The values 28 and 30 are not repeated, so their ranks are R28 = 6 and R30 = 7. Finally we have the following correspondence:

    3. The total sum of the ranks must coincide with the calculated sum, which is determined by the formula:

    Σ Ri = n(n + 1)/2,

    where n is the total number of ranked values.

    A discrepancy between the actual and calculated rank sums indicates an error made in calculating or summing the ranks. In that case, find and correct the error.
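A sketch of this ranking algorithm using scipy's rankdata, which averages tied ranks exactly as in the example above, together with the rank-sum check:

```python
from scipy.stats import rankdata

sample = [22, 23, 25, 25, 25, 28, 30]
ranks = rankdata(sample, method="average")
print(ranks)                            # [1. 2. 4. 4. 4. 6. 7.]
n = len(sample)
assert ranks.sum() == n * (n + 1) / 2   # 28.0 -> the ranking is consistent
```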

    Spearman's rank correlation coefficient is a method that allows one to determine the strength and direction of the relationship between two traits or two hierarchies of traits. The use of the rank correlation coefficient has a number of limitations:

    • a) The assumed correlation must be monotonic.
    • b) The size of each sample must be at least 5. To determine the upper limit of the sample, tables of critical values are used (Appendix, Table 3). The maximum value of n in the table is 40.
    • c) A large number of identical (tied) ranks is likely to arise during the analysis. In this case a correction must be made. The most favorable case is when both samples under study represent two sequences of non-coinciding values.

    To conduct a correlation analysis, the researcher must have two samples that can be ranked, for example:

    • - two characteristics measured in the same group of subjects;
    • - two individual hierarchies of traits identified in two subjects using the same set of traits;
    • - two group hierarchies of characteristics;
    • - individual and group hierarchies of characteristics.

    We begin the calculation by ranking the studied indicators separately for each of the characteristics.

    Let us analyze a case with two signs measured in the same group of subjects. First, the individual values ​​obtained by different subjects are ranked according to the first characteristic, and then the individual values ​​are ranked according to the second characteristic. If lower ranks of one indicator correspond to lower ranks of another indicator, and higher ranks of one indicator correspond to greater ranks of another indicator, then the two characteristics are positively related. If higher ranks of one indicator correspond to lower ranks of another indicator, then the two characteristics are negatively related. To find rs, we determine the differences between the ranks (d) for each subject. The smaller the difference between the ranks, the closer the rank correlation coefficient rs will be to “+1”. If there is no relationship, then there will be no correspondence between them, hence rs will be close to zero. The greater the difference between the ranks of subjects on two variables, the closer to “-1” the value of the rs coefficient will be. Thus, the Spearman rank correlation coefficient is a measure of any monotonic relationship between the two characteristics under study.

    Let us consider the case with two individual hierarchies of traits identified in two subjects using the same set of traits. In this situation, the individual values obtained by each of the two subjects are ranked for a certain set of characteristics. The feature with the lowest value is assigned the first rank; the feature with the next higher value, the second rank, and so on. Special attention should be paid to ensuring that all characteristics are measured in the same units. For example, indicators cannot be ranked if they are expressed in points of different "price", since it is impossible to determine which factor takes first place in severity until all values are brought to a single scale. If features that have low ranks in one of the subjects also have low ranks in the other, and vice versa, then the individual hierarchies are positively related.

    In the case of two group hierarchies of characteristics, the average group values ​​obtained in two groups of subjects are ranked according to the same set of characteristics for the studied groups. Next, we follow the algorithm given in previous cases.

    Let us analyze a case with an individual and group hierarchy of characteristics. They begin by ranking separately the individual values ​​of the subject and the average group values ​​according to the same set of characteristics that were obtained, excluding the subject who does not participate in the average group hierarchy, since his individual hierarchy will be compared with it. Rank correlation allows us to assess the degree of consistency of the individual and group hierarchy of traits.

    Let us consider how the significance of the correlation coefficient is determined in the cases listed above. In the case of two characteristics, it will be determined by the sample size. In the case of two individual feature hierarchies, the significance depends on the number of features included in the hierarchy. In the last two cases, significance is determined by the number of characteristics being studied, and not by the number of groups. Thus, the significance of rs in all cases is determined by the number of ranked values ​​n.

    When checking the statistical significance of rs, tables of critical values of the rank correlation coefficient are used, compiled for various numbers of ranked values and different significance levels. If the absolute value of rs reaches or exceeds the critical value, the correlation is reliable.

    When considering the first option (a case with two signs measured in the same group of subjects), the following hypotheses are possible.

    H0: The correlation between variables x and y is not different from zero.

    H1: The correlation between variables x and y is significantly different from zero.

    If we work with any of the three remaining cases, then it is necessary to put forward another pair of hypotheses:

    H0: The correlation between hierarchies x and y is not different from zero.

    H1: The correlation between hierarchies x and y is significantly different from zero.

    The sequence of actions when calculating the Spearman rank correlation coefficient rs is as follows.

    • - Determine which two features or two hierarchies of features will participate in the comparison as variables x and y.
    • - Rank the values of variable x, assigning rank 1 to the lowest value, in accordance with the ranking rules. Place the ranks in the first column of the table, in order by subject or characteristic.
    • - Rank the values ​​of the variable y. Place the ranks in the second column of the table in order of test subjects or characteristics.
    • - Calculate the differences d between the ranks x and y for each row of the table. Place the results in the next column of the table.
    • - Calculate the squared differences (d2). Place the resulting values ​​in the fourth column of the table.
    • - Calculate the sum of the squared differences, Σd².
    • - If tied ranks occur, calculate the corrections:

    Tx = Σ (tx³ – tx)/12,

    Ty = Σ (ty³ – ty)/12,

    where tx is the size of each group of identical ranks in sample x, and ty is the size of each group of identical ranks in sample y.

    Calculate the rank correlation coefficient rs depending on the presence or absence of tied ranks. If there are no tied ranks, use the formula:

    rs = 1 – 6·Σd² / (n·(n² – 1)).

    If there are tied ranks, use the formula:

    rs = 1 – 6·(Σd² + Tx + Ty) / (n·(n² – 1)),

    where Σd² is the sum of the squared differences between ranks;

    Tx and Ty - corrections for equal ranks;

    n is the number of subjects or features participating in the ranking.

    Determine the critical values of rs from Appendix Table 3 for the given number of subjects n. The correlation coefficient differs significantly from zero provided that rs is not less than the critical value.


