Business leaders can answer a wide variety of questions by analyzing listening event results in the Perceptyx Platform. A few common examples include the following:
- Are females more engaged than males?
- Did Engagement improve since last quarter?
- Are Diversity and Inclusion questions more favorable for minorities hired this year compared to minorities hired more than one year ago?
Tests of statistical significance help to determine which results are worthy of closer attention, and which results do not warrant additional focus.
This article walks through:
- What is Statistical Significance?
- Type I and Type II Errors
- What Affects Statistical Significance?
- Practical Significance
- How Statistical Significance is Calculated in Perceptyx Reporting
- Consultative Questions
What is Statistical Significance?
A result (usually a comparison between two scores of interest) has statistical significance when it is very unlikely to have occurred given the null hypothesis. In inferential statistics, the null hypothesis is the default position that there is no underlying effect or difference. In the example question "Are females more engaged than males?", the null hypothesis (often denoted as H0) states that the mean female engagement score is the same as the mean male engagement score. We are far more interested in the alternative hypothesis (often denoted as H1), which states that the mean female engagement score is not equal to the mean male engagement score.
Example:
H0: μFemale = μMale
H1: μFemale ≠ μMale (or μFemale > μMale; μFemale < μMale)
where
H0 = the null hypothesis
H1 = the alternative hypothesis
μFemale = the mean female engagement score = 4.8
μMale = the mean male engagement score = 4.0
p = .03 (the observed p-value)
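To make the example concrete, here is a minimal sketch in Python of how such a comparison could be run as an independent-samples t-test (the test described later in this article for mean-score differences). The data and group sizes are synthetic and purely illustrative; this is not how the Perceptyx Platform implements its calculations.

```python
# A minimal sketch (not Perceptyx's implementation) of testing whether mean
# female and male engagement scores differ, using an independent-samples
# t-test on synthetic, purely illustrative response data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical 1-5 engagement responses for each group (made-up data).
female_scores = rng.normal(loc=4.8, scale=0.6, size=200).clip(1, 5)
male_scores = rng.normal(loc=4.0, scale=0.6, size=200).clip(1, 5)

# Two-sided test of H0: the two group means are equal.
t_stat, p_value = stats.ttest_ind(female_scores, male_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the difference in mean engagement is statistically significant.")
```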
p-value
The p-value is the calculated probability of finding the observed or more extreme results when, in actuality, there is no underlying effect or difference (i.e., given that the null hypothesis is true). Common misconceptions are that the p-value is the probability that the null hypothesis is true or the probability that the alternative hypothesis is false; it is neither. If the p-value of a test statistic (e.g., z, t, F, r) is less than a preselected significance level, typically set to .05, then the result is considered statistically significant.
Again, using the example above, because the observed p-value is less than .05, we would reject the null hypothesis. These data provide reasonable evidence to support the alternative hypothesis. The calculated p-value of .03 indicates that if there is no underlying difference between females and males, we would observe data as extreme or more extreme 3% of the time. When presenting p-values, an asterisk rating system is conventionally used to denote the level of confidence:
*p < .05 – less than 5 in 100 chance of observing data as or more extreme
**p < .01 – less than 1 in 100 chance of observing data as or more extreme
***p < .001 – less than 1 in 1,000 chance of observing data as or more extreme
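As a small illustration, the asterisk convention above can be written as a simple helper function. The thresholds are exactly those listed here; the function itself is hypothetical.

```python
# A small helper expressing the asterisk convention described above; the
# thresholds (.05, .01, .001) are exactly those listed in this article, and
# the function name is just for illustration.
def significance_stars(p_value: float) -> str:
    """Return the conventional asterisk rating for a p-value."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not statistically significant at the .05 level

print(significance_stars(0.03))    # "*"   (the example above, p = .03)
print(significance_stars(0.0004))  # "***"
```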
Type I and Type II Errors
When you perform a null hypothesis significance test, there are four possible outcomes, as depicted in the following table:

| | H0 is true | H0 is false |
| --- | --- | --- |
| Reject H0 | Type I error (α) | Correct decision (statistical power, 1 - β) |
| Fail to reject H0 | Correct decision | Type II error (β) |
Each of these errors occurs with a particular probability. Alpha (α) is the probability of a Type I error (i.e., rejecting the null hypothesis when the null hypothesis is true), and is the p-value below which you will reject the null hypothesis (e.g., α = .05). Beta (β) is the probability of a Type II error (i.e., failing to reject the null hypothesis when the alternative hypothesis is true). Whereas alpha is conventionally set to .05, beta is conventionally set to .20. Statistical power (1 - β) is the probability of rejecting a false null hypothesis; that is, the probability that a study will detect an effect when there is an effect to be detected. When statistical power is high, the probability of making a Type II error is low.
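These probabilities can be illustrated with a rough simulation sketch using made-up engagement scores: when there is no true difference, tests at α = .05 reject about 5% of the time (the Type I error rate), and when a real difference exists, the share of rejections estimates statistical power.

```python
# A rough simulation sketch (illustrative numbers, not Perceptyx data):
# with no true difference, tests at alpha = .05 should reject roughly 5%
# of the time (the Type I error rate); with a real difference, the
# rejection rate estimates statistical power (1 - beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_per_group, n_sims = 0.05, 100, 5_000

def rejection_rate(true_diff: float) -> float:
    """Share of simulated studies in which H0 is rejected at alpha."""
    rejections = 0
    for _ in range(n_sims):
        group_a = rng.normal(4.0, 1.0, n_per_group)
        group_b = rng.normal(4.0 + true_diff, 1.0, n_per_group)
        _, p = stats.ttest_ind(group_a, group_b)
        rejections += p < alpha
    return rejections / n_sims

print("Type I error rate (no true difference):", rejection_rate(0.0))  # ~ .05
print("Power when a true difference exists:   ", rejection_rate(0.4))  # ~ .80
```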
What Affects Statistical Significance?
The p-value provided by the significance test is a function of three factors: effect size, sample size, and error.
- Effect size: The larger the effect size, the more likely a test will find a significant effect if one exists. For example, a 20-point underlying difference between female and male favorability is much easier to detect than a 1-point difference.
- Sample size: Generally, the larger the sample size, the more likely a test will find a significant effect if one exists. This increased sensitivity allows for the detection of smaller and smaller effects. For example, in testing whether a coin is fair, flipping 5 heads in a row has a 1 in 32 chance (3.125%), whereas flipping 10 heads in a row has a 1 in 1,024 chance (0.098%). The sketch after this list illustrates the same point with favorability scores.
- Error: The smaller the error, the more likely a test will find a significant effect if one exists. There are two types of error that can affect a p-value: random error and systematic error. As the name implies, random error is essentially noise above and below the "true" value. For example, when measuring the heights of a basketball team, each measurement may randomly differ (up or down) from the true height depending on how the measuring tape was held, which researcher took the measurement, etc. Fortunately, the larger the sample size, the less impact random error will have on the findings. Systematic error, by contrast, applies uniformly across a sample and can mask, inflate, or deflate the true underlying effect. For example, a scale that measures everyone 5 pounds heavier would systematically overestimate each person's weight, and increasing the sample size would have no effect on this error.
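The sketch below illustrates the sample-size factor with made-up favorability figures: the same 5-point gap is not significant for small groups but becomes highly significant as the groups grow.

```python
# An illustrative sketch (made-up favorability numbers): the same 5-point
# favorability gap (70% vs. 65%) is not significant with small groups but
# becomes highly significant as group sizes grow.
from statsmodels.stats.proportion import proportions_ztest

for group_size in (40, 400, 4000):
    favorable_counts = [int(0.70 * group_size), int(0.65 * group_size)]
    z, p = proportions_ztest(count=favorable_counts,
                             nobs=[group_size, group_size])
    print(f"n = {group_size:>4} per group: z = {z:.2f}, p = {p:.4f}")
```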
Practical Significance
When a difference is statistically significant, it does not necessarily mean that it is big, important, or helpful in decision-making. It simply means you can be more confident that a difference exists. While statistical significance tells you how unlikely the observed result would be if there were no underlying difference, practical significance refers to the strength or magnitude of an effect. If enough cases are observed, even the smallest of differences will be significant at p < .05 or even p < .001. Thus, there is a huge difference between statistical significance and practical significance.
Let’s say, for example, that you evaluate the effect of standing and cheering on heart rate for all the attendees at a concert (n = 30,000). The mean resting heart rate when seated is 70 beats per minute, and the mean after standing and cheering is 70.2 beats per minute. Although you find that the difference is statistically significant (because of the extremely large sample size), the difference isn’t practically significant – you are not going to see warning labels on chairs about the risk that standing will increase your heart rate.
Some small effects can be incredibly meaningful, however. Suppose a study evaluating the impact of engagement on nursing turnover finds that engaged nurses have a turnover rate of 15% and disengaged nurses have a turnover rate of 16%. A one-percentage-point difference may seem trivial, but for a large health system this equates to potential savings of $5M annually. In summary, no statistical test provides evidence about whether an effect is large enough to care about; you must apply subject matter expertise to determine practical significance.
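One common way to put a number on practical significance for a difference in proportions is Cohen's h, sketched below for the turnover example; the benchmarks referenced in the comments are Cohen's general conventions rather than anything specific to Perceptyx.

```python
# A hedged sketch of quantifying practical significance for a difference in
# proportions with Cohen's h, applied to the turnover example above (16% vs.
# 15%). The "small/medium/large" benchmarks are Cohen's general conventions,
# not a Perceptyx standard.
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

h = cohens_h(0.16, 0.15)
print(f"Cohen's h = {h:.3f}")  # ~0.03; tiny by Cohen's benchmarks (0.2 = small)
# Whether that statistically tiny effect matters (e.g., $5M in annual
# savings) is a judgment for subject matter experts, not the test itself.
```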
How Statistical Significance is Calculated in Perceptyx Reporting
The Perceptyx Platform uses a Two-Proportion Z-Test to compare favorability between data groups.
When comparing two proportions, such as favorability between data groups, a two-proportion z-test has advantages over a two-sample t-test:
- It allows us to directly compare the proportions in question rather than comparing means, as a t-test would.
- It is well suited to larger sample sizes (greater than 20).
For differences in mean scores, the Perceptyx reporting system uses an independent-samples t-test.
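For illustration only, the sketch below shows the general form of these two tests using standard Python statistics libraries and made-up group data; it is not the Perceptyx reporting code.

```python
# A minimal sketch of the two tests named above, using standard Python
# statistics libraries and made-up group data; this illustrates the general
# technique, not the Perceptyx reporting code itself.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Favorability comparison: 420 favorable of 600 in group A vs. 380 of 610 in group B.
z, p_prop = proportions_ztest(count=[420, 380], nobs=[600, 610])
print(f"Two-proportion z-test:      z = {z:.2f}, p = {p_prop:.4f}")

# Mean-score comparison on synthetic 1-5 scores for the same two groups.
rng = np.random.default_rng(1)
scores_a = rng.normal(4.1, 0.8, 600).clip(1, 5)
scores_b = rng.normal(3.9, 0.8, 610).clip(1, 5)
t, p_mean = stats.ttest_ind(scores_a, scores_b)
print(f"Independent-samples t-test: t = {t:.2f}, p = {p_mean:.4f}")
```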
Consultative Questions
Is there one number I can give to managers so they can use it to judge if differences are statistically significant?
No. The statistical significance of the difference between two scores depends on three factors:
- The difference between the two scores being compared. For any question, the greater the numerical difference, the more likely it is to be considered significant.
- The size of the groups being compared. The larger the group, the less change it takes to be considered significant; a small group requires a greater change to be considered significant.
- The variance of the scores (in other words, how close together or how spread out the scores are in each of the two groups being compared). If the scores in a group are very similar, they have low variance. If the scores are spread wide apart (some people with low scores, some with midrange scores, and some with high scores), the variance will be larger than if most people in the group all have low scores or all have high scores. When variance is high, it takes a bigger change before the difference will be significant, since there is more "noise" in the data. As the variance gets smaller, it takes less of a change to be considered significant.
For these reasons, the difference needed to reach statistical significance will vary from question comparison to question comparison. There is no single value you can use to say whether a change is significant.
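As a rough illustration of why, the sketch below approximates the favorability gap needed to reach p < .05 in a two-proportion comparison at a few group sizes. The formula and numbers are simplifying assumptions for illustration, not values used in Perceptyx reporting.

```python
# A rough, illustrative approximation (not a value used in Perceptyx
# reporting) of the favorability gap needed to reach p < .05 in a
# two-proportion comparison: the gap shrinks as groups grow and widens as
# variance (p * (1 - p), largest near 50% favorability) increases.
import math

def approx_detectable_gap(baseline_favorability: float, n1: int, n2: int) -> float:
    """Approximate gap needed for significance at alpha = .05 (two-sided)."""
    variance_term = baseline_favorability * (1 - baseline_favorability)
    standard_error = math.sqrt(variance_term * (1 / n1 + 1 / n2))
    return 1.96 * standard_error

for n in (30, 100, 1000):
    gap = approx_detectable_gap(0.70, n, n)
    print(f"Groups of {n:>4}: roughly a {gap:.0%} favorability gap is needed")
```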
Where does statistical significance show up in Perceptyx reporting?
There are a variety of data group comparisons that show up in Perceptyx reporting. These comparisons enable our clients to understand if observed differences in scores are meaningful and worthy of action focus.
The comparisons that indicate if observed differences are statistically meaningful are:
- Trend (significance of change)
- Internal Comparison (significance of difference)
These difference scores are color-coded in the platform and in toolkits (reports) as follows:
Statistical Significance in the Trend Report
Statistical Significance in the Demographic Crosstab Report
Other color-coded comparisons are available in the platform that highlight differences, without a reference to statistical significance:
- Relative: Compares scores to each other. Uses a percentile rank and colors cells accordingly.
- Absolute: Compares scores to an absolute threshold. Colors cells based on their absolute score.
- Outliers: Compares scores to a weighted average. Uses predetermined thresholds and highlights scores that fall above or below the weighted average.