Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data." Exploratory data analysis is a technique to analyze and investigate a dataset and summarize its main characteristics. A main advantage of EDA is providing the visualization of data after conducting analysis. Tukey's championing of EDA encouraged the development of
statistical computing packages, especially
S at
Bell Labs. The S programming language inspired the systems
S-PLUS and
R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify
outliers,
trends and
patterns in data that merited further study. Tukey's EDA was related to two other developments in
statistical theory:
robust statistics and
nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating
statistical models. Tukey promoted the use of
five number summary of numerical data—the two
extremes (
maximum and
minimum), the
median, and the
quartiles—because these median and quartiles, being functions of the
empirical distribution are defined for all distributions, unlike the
mean and
standard deviation. Moreover, the quartiles and median are more robust to
skewed or
heavy-tailed distributions than traditional summaries (the mean and standard deviation). The packages
S,
S-PLUS, and
R included routines using
resampling statistics, such as Quenouille and Tukey's
jackknife and
Efron bootstrap, which are nonparametric and robust (for many problems). Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, both of which were of interest to Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the
analytic theory of
testing statistical hypotheses, particularly the
Laplacian tradition's emphasis on
exponential families. Additionally, there are arguments to first visualize data during EDA before modeling in order to avoid misleading conclusions as in
Anscombe's Quartet. == Development ==