Outliers: the Stats
- An outlier[1], in statistical terminology, is an observation that is
numerically distant from the rest of the data. In effect, an observation that
lies towards either extreme end of the spectrum is an outlier: a value that is
unusually large or small relative to the other data values. Outliers can occur
by chance in any distribution, but they often indicate either measurement error
or a population with a heavy-tailed distribution. A physical apparatus for
taking measurements may have suffered a transient malfunction, or there may
have been an error in data transmission or transcription. Outliers can also
arise from changes in system behaviour, fraudulent behaviour, human error,
instrument error or simply natural deviations in populations. In addition,
outliers of a particular form recur in a variety of datasets, which suggests
that the mechanism generating the data may differ at the extreme end.
- When you look at frequency polygons or histograms[2], the questions to ask are whether the curve is bell-shaped in the middle, peaked at either end, or flat, and whether the data values are spread out or cluster in one segment. Extreme values where there are few entries could be outliers. For bivariate data, a better visualization is the bagplot, which shows location, spread, correlation, skewness and the tails of the data without assuming that the data are symmetrically distributed.[3] The Galbraith plot is a graphical method for identifying outliers in a meta-analysis: the standardized effect size is plotted against precision (the reciprocal of the standard error).[5] Outliers can distort the arithmetic mean, giving a misleading value. The alpha-trimmed mean, which is less affected by outliers than the arithmetic mean, involves dropping a proportion (alpha) of the observations from both ends of the sample before calculating the mean of the remainder.[4]
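As a rough illustration of the alpha-trimmed mean described above, here is a minimal Python sketch. The function name `trimmed_mean`, the sample values, and the choice to drop the proportion alpha from each end (rounding the count down) are assumptions made for this example, not something taken from the cited texts.

```python
from statistics import mean, median


def trimmed_mean(values, alpha):
    """Mean after dropping a proportion alpha of the observations from each end."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    ordered = sorted(values)
    k = int(alpha * len(ordered))  # number dropped from each end, rounded down
    return mean(ordered[k:len(ordered) - k])


# Hypothetical sample: nine ordinary readings plus one extreme value.
sample = [12, 14, 13, 15, 14, 13, 12, 15, 14, 98]

plain = mean(sample)                 # pulled upward by the single value 98
trimmed = trimmed_mean(sample, 0.1)  # drops the lowest and highest observation
middle = median(sample)              # robust: unaffected by how large 98 is

print(plain, trimmed, middle)
```

With this made-up sample, the single extreme reading drags the arithmetic mean well above the bulk of the data, while the trimmed mean and the median stay close to the typical values.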
- Outliers, being the most extreme observations, may include the sample
maximum or sample minimum, or both, depending on whether they are extremely
high or low. However, the sample maximum and minimum are not always outliers,
because they may not be unusually far from other observations. Simplistic
interpretation of statistics derived from data sets that include outliers may
be misleading. The median is a robust statistic, while the arithmetic mean, as
noted earlier, is not.
Many existing methods for finding outliers in large data sets can deal with
only two dimensions or attributes. Knowledge discovery in databases, commonly
referred to as data mining, is generating enormous interest in both the
research and software arenas. With the development of better analytic
algorithms for statistical exploration (and for studying outliers),
high-performance computing (HPC), and advances in graphics capability, we are
now better equipped not only to calculate significant values, but also to
visualize data in ways that facilitate our decision-making processes.
- Nevertheless, one should remember that “the whole is greater than the sum of its parts.” Unless it can be ascertained that the deviation is not significant, it is ill-advised to ignore the presence of outliers. Outliers that cannot be readily explained demand special attention. In this area, we deal with probability and how we can conclude, with reasonable assurance, that we are dealing with either a numerical anomaly or a more serious situation. This brings us to the other question at hand: what constitutes an epidemic? Since we are dealing with life, the right analysis and decision will spell the difference between a case-by-case response and a nationwide pathogen alert.
Outbreaks
- Outbreak is a term used in epidemiology to describe an occurrence of disease
greater than would otherwise be expected at a particular time and place. It may
affect a small group in a specific location, or it may affect hundreds of
people, either in one location or across states. According to the CDC, an
outbreak is the occurrence of more cases of disease, injury, or other health
condition than expected in a given area or among a specific group of persons
during a specific period. Usually, the cases are presumed to have a common
cause or to be related to one another in some way. An outbreak is more
localized than an epidemic, and less likely to provoke panic in the population.
(Neither term should be confused with endemic, which describes an ailment that
is commonly found in a certain location.)
- According to the CDC, an epidemic is the occurrence of more cases of disease,
injury, or other health condition than expected in a given area or among a
specific group of persons during a particular period. Usually, the cases are
presumed to have a common cause or to be related to one another in some way. On
these definitions, there appears to be no difference between an outbreak and an
epidemic. In epidemiology, an epidemic occurs when new cases of a certain
disease, in a given human population and during a given period, substantially
exceed what is expected based on recent experience. For this reason, it is very
important to spell out the conditions we set for differentiating an outbreak
from an epidemic. One authority[6] describes an epidemic as an unusual,
widespread increase in infection in the community that causes waves of
infection, which spread through communities and affect all people who have no
active immunity to that infection.
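To make "substantially exceed what is expected based on recent experience" concrete, the sketch below flags a new count that lies more than two standard deviations above the mean of recent counts. This threshold rule, the function name `exceeds_expected`, and the weekly counts are all assumptions chosen for illustration; they are not the CDC's definition or any official surveillance method.

```python
from statistics import mean, stdev


def exceeds_expected(baseline_counts, new_count, n_sd=2.0):
    """Return True if new_count is above baseline mean + n_sd * baseline SD.

    The mean-plus-n_sd rule is an assumed, simplified threshold.
    """
    threshold = mean(baseline_counts) + n_sd * stdev(baseline_counts)
    return new_count > threshold


# Hypothetical weekly case counts representing "recent experience".
baseline = [4, 6, 5, 7, 5, 6, 4, 5]

print(exceeds_expected(baseline, 7))   # within the usual week-to-week variation
print(exceeds_expected(baseline, 15))  # well above what the baseline predicts
```

A rule like this only says that a count is unusual. Whether the excess amounts to an outbreak or an epidemic still depends on the conditions one sets for scale and spread.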
- While epidemics due to exogenous pathogens have diminished in developed
countries with good health systems, they may still be found in third-world
areas where nutrition and other aspects of healthcare are substandard. There
have been exceptions, however. One example from the last two decades is the
large diphtheria epidemic in Russia in the 1990s, which resulted from the
collapse of the public health infrastructure. It demonstrated that pathogenic
microbes are still in the environment and can become epidemic even in
technologically advanced countries if we relax our efforts to contain them.
- One mechanism that may give rise to epidemics is antigenic shift, which
refers to the emergence of a novel influenza virus in humans, due either to
direct introduction of an avian strain or to a new strain produced by
recombination and reassortment of two different influenza viruses. Recent
influenza A pandemics occurred in 1957 (the H2N2 ‘Asian Flu’) and 1968 (the
H3N2 ‘Hong Kong Flu’). An outbreak of avian influenza from exposure to infected
poultry in Hong Kong in 1997 caused 18 human cases, 6 of them fatal. A
genetically different strain of A/H5N1 circulated in domestic birds throughout
Asia, causing 387 cases and 245 deaths between 2003 and 2008, raising concerns
that a new pandemic might arise.[7]
Another infection that is being tracked is TB. Since 1998, the percentage of
US-born patients with multidrug-resistant TB (MDR-TB) has remained at less than
0.7%. However, the proportion of MDR-TB cases occurring in foreign-born persons
increased from 25% (103 of 407) in 1993 to 80% (73 of 91) in 2006.
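As a quick arithmetic check, the counts quoted above can be converted back into percentages. The helper `percent` and the variable names are made up for this sketch. The A/H5N1 figure is simply deaths divided by reported cases from the numbers in the text, a ratio the text itself does not state.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


tb_1993 = percent(103, 407)         # quoted in the text as 25%
tb_2006 = percent(73, 91)           # quoted in the text as 80%
h5n1_fatality = percent(245, 387)   # deaths among reported A/H5N1 cases, 2003-2008

print(tb_1993, tb_2006, h5n1_fatality)
```

The two TB percentages round to the quoted 25% and 80%. Note that the denominators differ sharply (407 versus 91), so the rising percentage sits on top of a much smaller total count.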
Current Hantavirus Outbreak
- As of September 13, the National Park Service (NPS) has
announced a total of 9 confirmed cases of hantavirus infection in people who
recently visited Yosemite National Park. Officials believe that 8 out of the 9
patients acquired the virus while staying at the Signature Tent Cabins in Curry
Village in Yosemite National Park. The 9th person may have acquired
it at a location in the park 15 miles from Curry Village. The
visitors to Yosemite are residents of: California (7), Pennsylvania (1), and
West Virginia (1). Three of the confirmed cases were fatal. The National Park
Service issued a notification to all Park visitors. You can view this at the
NPS page here.
Decision Making and Response
- The park is contacting visitors who stayed in the Signature Tent Cabins from
mid-June through the end of August, advising them to seek immediate medical
attention if they exhibit symptoms of Hantavirus Pulmonary Syndrome (HPS), a
rare but serious illness caused by hantavirus. The park is also providing
information about HPS risks and symptoms to visitors who stayed at the High
Sierra Camps this summer. The Signature Tent Cabins have been closed. In
addition, the CDC is supporting the NPS response by testing patient samples for
evidence of hantavirus infection, providing guidance on clinical management of
HPS and epidemiologic support for the response, and maintaining a Hantavirus
Hotline for public inquiries. The park is providing educational materials about
hantavirus and HPS to all visitors to the park.
- While the number of hantavirus cases is very small, the fatality rate is
substantial (<40%). That makes it imperative to track the illness, carry out
decontamination where possible, test persons who were exposed, and ensure rapid
hospitalization of cases even when infection is only suspected.
- The question to be asked at this point is: are the victims of this current outbreak actually outliers in the immunocompetent population? Are they any more susceptible to the viral illness than the rest of the visitors to Yosemite National Park? As we dig deeper into this scenario, it is likely that we shall discover a novel mechanism for the outbreak. In the meantime, we have to respond and take care of those who have fallen sick.
Yosemite National Park Hantavirus Infection Epi Curves:
http://www.cdc.gov/hantavirus/outbreaks/yosemite/epi.html
Evan Esar, Esar's Comic Dictionary
American Humorist (1899 - 1995)
- Time to wrap this up for now. Stay safe…
- Fernando Yaakov Lalana, M.D.
Bibliography:
1. Barnett, V. and Lewis, T.: Outliers in Statistical Data, 3rd Ed., John Wiley & Sons, 1994.
2. Bluman, A. G.: Elementary Statistics: A Step by Step Approach, 8th Ed., McGraw-Hill, 2012.
3. Everitt, B. S. and Skrondal, A.: The Cambridge Dictionary of Statistics, 4th Ed., Cambridge University Press, 2010.
4. Fisher, L. D. and Van Belle, G.: Biostatistics, J. Wiley & Sons, New York, 1993.
5. Everitt, B. S.: Medical Statistics from A to Z, 2nd Ed., Cambridge University Press, 2006.
6. Kumar, P. and Clark, M.: Kumar and Clark’s Clinical Medicine, Elsevier, 2009.
7. Cohen, J., Opal, S. M. and Powderly, W. G. (Eds.): Infectious Diseases, 3rd Ed., Elsevier, 2010.
Helpful Links:
Calculation & Visualization of Outlier
CDC Glossary of Terms
CDC TB Report
Journal of Statistical Software
Yosemite National Park Hantavirus Infection Epi Curves:
Electron Cryo-Tomography of Tula Hantavirus