What Is an Outlier
In statistics, an outlier is an observation that lies far from the bulk of the data. Precise definitions vary by context, but common criteria include values more than three standard deviations from the mean, or values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile, which is the limit a conventional box plot uses for its whiskers.
Outliers fall into two categories: "erroneous outliers" caused by measurement errors or data entry mistakes, and "genuine outliers" representing real extreme values. The former should be corrected or removed, but the latter may contain critical information. Jeff Bezos's wealth appearing as an outlier in a wealth distribution is not an error; it is reality.
How Outliers Distort the Mean
Consider ten people with annual incomes of 4.0, 4.2, 4.5, 4.6, 4.8, 5.0, 5.2, 5.5, 5.8, and 6.0 million yen. The mean is 4.96 million and the median is 4.9 million, nearly identical. Add one person earning 500 million yen, and the mean jumps to about 49.96 million (549.6 / 11), roughly ten times its previous value, while the median barely shifts to 5.0 million.
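The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch using the standard library's statistics module; the variable names are illustrative and not part of any MyRank code.

```python
from statistics import mean, median

# Annual incomes in millions of yen, taken from the example above.
incomes = [4.0, 4.2, 4.5, 4.6, 4.8, 5.0, 5.2, 5.5, 5.8, 6.0]

# The same group with one person earning 500 million yen added.
with_outlier = incomes + [500.0]

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"before: mean={mean_before:.2f}, median={median_before:.2f}")
print(f"after:  mean={mean_after:.2f}, median={median_after:.2f}")
```

The single added value moves the mean by an order of magnitude, while the median only steps from the midpoint of the fifth and sixth values to the sixth value itself.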
This example starkly illustrates the vulnerability of means to outliers. The statistic that "average Japanese household savings are 19.01 million yen" feels disconnected from reality because a small number of ultra-wealthy households pull the mean upward. The median (10.61 million yen) better represents the "typical" Japanese household.
Outlier Impact on Rankings
Percentile-based rankings are inherently resistant to outlier influence. Because percentiles are rank-order statistics, no matter how extreme the maximum value becomes, other individuals' ranks remain unchanged. This robustness is one reason MyRank employs percentiles as its primary metric.
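The following sketch illustrates this rank-order property. It assumes one common definition of percentile rank (the share of values strictly below a given value); the function name percentile_rank is hypothetical, and MyRank's actual calculation may differ.

```python
def percentile_rank(values, x):
    """Percentage of values strictly below x (one common definition)."""
    below = sum(1 for v in values if v < x)
    return 100.0 * below / len(values)

# Incomes in millions of yen, including one extreme value.
incomes = [4.0, 4.2, 4.5, 4.6, 4.8, 5.0, 5.2, 5.5, 5.8, 6.0, 500.0]

# Same data, but the top value is made ten times more extreme.
more_extreme = incomes[:-1] + [5000.0]

# Percentile ranks of everyone except the top earner, before and after.
ranks_before = [percentile_rank(incomes, v) for v in incomes[:-1]]
ranks_after = [percentile_rank(more_extreme, v) for v in more_extreme[:-1]]

print(ranks_before == ranks_after)  # True
```

Because only the ordering of values enters the calculation, stretching the maximum changes nobody else's rank.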
However, outliers do affect how percentiles should be interpreted. In an income distribution, the gap between the 99th and 99.9th percentiles is vastly larger than the gap between the 50th and 51st percentiles. Percentiles space ranks evenly regardless of the underlying values, so equal steps in percentile can hide very unequal steps in the quantity itself, especially in the tails of the distribution.
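To get a feel for how uneven these gaps can be, the sketch below compares them for a log-normal distribution. The log-normal shape and the parameters mu = 0 and sigma = 1 are assumptions chosen purely for illustration, not a fitted model of real incomes, and the helper lognormal_quantile is hypothetical.

```python
from math import exp
from statistics import NormalDist

def lognormal_quantile(p, mu=0.0, sigma=1.0):
    """Value at cumulative probability p for a log-normal distribution."""
    # A log-normal quantile is exp() of the matching normal quantile.
    return exp(mu + sigma * NormalDist().inv_cdf(p))

# Gap in the underlying value across one percentile step in the middle...
mid_gap = lognormal_quantile(0.51) - lognormal_quantile(0.50)

# ...and across a step of less than one percentile in the upper tail.
tail_gap = lognormal_quantile(0.999) - lognormal_quantile(0.99)

print(f"50th -> 51st percentile gap:   {mid_gap:.3f}")
print(f"99th -> 99.9th percentile gap: {tail_gap:.3f}")
print(f"ratio: {tail_gap / mid_gap:.0f}x")
```

Under these assumed parameters the tail gap is several hundred times the mid-distribution gap, even though it spans a smaller step in percentile terms.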
Detecting and Handling Outliers
Multiple outlier detection methods exist. The Z-score method (flagging values where |z| exceeds 3), the IQR method (flagging values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR), Grubbs' test, and clustering approaches such as DBSCAN are among the most widely used techniques.
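The two simplest of these rules, the Z-score and IQR methods, can be sketched with the Python standard library alone. The function names, the use of the population standard deviation, and the "inclusive" quartile method are choices made for this illustration; other conventions give slightly different cut-offs.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Return values whose |z| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical: nothing can be flagged
    return [v for v in values if abs(v - mu) / sigma > threshold]

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Incomes in millions of yen, including one extreme value.
data = [4.0, 4.2, 4.5, 4.6, 4.8, 5.0, 5.2, 5.5, 5.8, 6.0, 500.0]

print(zscore_outliers(data))  # [500.0]
print(iqr_outliers(data))     # [500.0]
```

Note that the Z-score rule relies on the mean and standard deviation, which the outlier itself inflates; with the population standard deviation a single point in a sample of n values can never reach a |z| above the square root of n − 1, so in samples of ten or fewer the |z| > 3 rule cannot flag anything. The IQR rule does not have this limitation because quartiles are barely affected by one extreme value.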
Handling detected outliers requires caution. Mechanical removal risks discarding important information. MyRank's approach is to retain outliers while using robust statistics (medians, percentiles) that naturally attenuate their influence on results.
What Outliers Can Teach Us
Outliers are often dismissed as "noise," yet they frequently contain the most interesting information in a dataset. In medical research, analyzing patients who respond exceptionally well to treatment (super responders) has led to discoveries of new therapeutic approaches.
In the ranking context, individuals or data points at extreme positions generate the question "why are they there?" Answering that question often deepens understanding of the entire distribution. Rather than excluding outliers, investigating the reasons for their existence is an essential stance for extracting insight from data.