Detecting Outliers - Univariate
From PsychWiki - A Collaborative Psychology Wiki
Revision as of 20:53, 7 September 2009 by Doug
- What are univariate outliers? Univariate outliers are unusual scores within a single variable. They are to be contrasted with bivariate and multivariate outliers, which are unusual in the joint combination of two (bivariate) or more (multivariate) variables. See below for a concrete example of a univariate outlier.
- How do I detect outliers?
- One way is to visually inspect your data with a FREQUENCY DISTRIBUTION.
  - Imagine a study that asks the American public how many sexual partners they have had over their lifetime. See the frequency distribution below for the findings from this hypothetical study. The people who reported 100+ sexual partners in their lifetime appear disconnected from the rest of the data.
- One statistical benchmark is to use a BOXPLOT to determine "mild" and "extreme" outliers. Mild outliers are any score more than 1.5*IQR beyond the edges of the box (that is, below the 25th percentile or above the 75th percentile), and are indicated by open dots. Extreme outliers are any score more than 3*IQR beyond those edges. IQR stands for the interquartile range, which is the distance between the 25th and 75th percentiles, i.e., the spread of the middle 50% of the scores. In other words, an outlier is determined by comparison to the bulk of the scores in the middle.
  - The output below is from SPSS for a variable called "system1". A boxplot is a graphical display of the data that shows: (1) the median, which is the middle black line, (2) the middle 50% of scores, which is the shaded region, (3) the top and bottom 25% of scores (excluding outliers), which are the lines extending out of the shaded region, (4) the smallest and largest (non-outlier) scores, which are the horizontal lines at the top/bottom of the boxplot, and (5) outliers. For this variable, there is 1 mild outlier (subject #52) and 1 extreme outlier (subject #18).
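The 1.5*IQR and 3*IQR rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reproduction of the SPSS output: the function name classify_outliers and the list of scores are hypothetical, and the quartiles are computed with Python's "inclusive" interpolation method, which can differ slightly from the hinges SPSS uses for its boxplots.

```python
from statistics import quantiles


def classify_outliers(scores):
    """Label each score 'none', 'mild', or 'extreme' using IQR fences.

    Hypothetical helper: quartiles use the 'inclusive' method, which is an
    assumption and may not match SPSS exactly.
    """
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1  # spread of the middle 50% of the scores

    # Inner fences mark mild outliers, outer fences mark extreme outliers.
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    labels = []
    for x in scores:
        if x < outer_low or x > outer_high:
            labels.append("extreme")
        elif x < inner_low or x > inner_high:
            labels.append("mild")
        else:
            labels.append("none")
    return q1, q3, labels


# Invented data in the spirit of the sexual-partners example above.
partners = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 12, 100]
q1, q3, labels = classify_outliers(partners)
for score, label in zip(partners, labels):
    if label != "none":
        print(score, label)
```

With these invented scores the quartiles are 2 and 5, so the IQR is 3, the mild cutoff on the high side is 9.5, and the extreme cutoff is 14. The score of 12 is therefore flagged as mild and the score of 100 as extreme.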
- Some things to keep in mind when looking for outliers...
- Outliers can be found in many (many!) of the variables in every study. If you are going to check for outliers, then you have to check for outliers in all your variables (e.g., could be 100+ in some surveys), and also check for outliers in the bivariate and multivariate relationships between your variables (e.g., 1000+ in some surveys). Given the large number of outlier analyses you have to conduct in every study, you will invariably find outliers.
- You are less likely to find outliers after you create composites. It is common practice to use multiple questions to measure constructs because it increases the power of your statistical analysis. You typically create a “composite” score (average of all the questions) when analyzing your data.
  - In a study about happiness, you may use an established happiness scale, or create your own happiness questions that measure all the facets of the happiness construct. When analyzing your data, you average together all the happiness questions into 1 happiness composite measure. While there may be some outliers in each individual question, averaging the items together reduces the probability of outliers because more data are combined into the composite variable.
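To make the composite point concrete, here is a small sketch using invented responses from 9 people to 4 happiness questions on a 1-7 scale. The data, the helper name outlier_indices, and the choice of the 1.5*IQR cutoff with "inclusive" quartiles are all assumptions made for illustration. It shows one contrived case of the mechanism and is not evidence that composites always remove outliers.

```python
from statistics import mean, quantiles


def outlier_indices(scores, k=1.5):
    """Return positions of scores more than k*IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, x in enumerate(scores) if x < low or x > high]


# Hypothetical data: one row per respondent, one column per question.
responses = [
    [1, 6, 6, 6],  # respondent 0 gives an unusually low answer to question 1
    [5, 5, 5, 5],
    [5, 6, 5, 6],
    [5, 5, 6, 5],
    [6, 6, 6, 6],
    [6, 5, 6, 7],
    [6, 6, 5, 5],
    [6, 7, 6, 6],
    [7, 6, 7, 6],
]

# Check each question (column) separately.
item_outliers = [outlier_indices(column) for column in zip(*responses)]

# Average each respondent's answers into a composite, then check that.
composite = [mean(row) for row in responses]
composite_outliers = outlier_indices(composite)

print(item_outliers)       # [[0], [], [], []]
print(composite_outliers)  # []
```

Respondent 0 is flagged on the first question, but the same person's composite score of 4.75 falls inside the cutoffs because the other three answers pull the average back toward the rest of the sample.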