Initial data analysis
The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:
Quality of data
The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analysis: frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms), normal
imputation is needed.
• Analysis of extreme observations: outlying observations in the data are analyzed to see if they seem to disturb the distribution.
• Comparison and correction of differences in coding schemes: variables are compared with coding schemes of variables external to the data set, and possibly corrected if coding schemes are not comparable.
• Test for common-method variance.
The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.
Quality of measurements
The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature. There are two ways to assess measurement quality:
• Confirmatory factor analysis
• Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's α of the scales, and the change in Cronbach's α when an item is deleted from a scale.
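The homogeneity check can be illustrated with a minimal sketch that uses the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of the total score). The function names and the response data below are hypothetical and serve only as an illustration; they do not come from any particular statistics package.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of item-score lists, one inner list per item, with the
    same respondents in the same order in every inner list.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total scale score per respondent (sum across items).
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


def alpha_if_item_deleted(items):
    """Alpha recomputed with each item left out in turn (needs >= 3 items)."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]


# Hypothetical responses: 3 items answered by 5 respondents.
scores = [
    [2, 4, 3, 5, 4],
    [3, 4, 3, 5, 5],
    [2, 5, 4, 4, 4],
]
alpha = cronbach_alpha(scores)
deleted = alpha_if_item_deleted(scores)
```

An item whose removal raises α noticeably is a candidate for closer inspection, although whether to drop it remains a substantive decision.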
Initial transformations
After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase. Possible transformations of variables are:
• Square root transformation (if the distribution differs moderately from normal)
• Log-transformation (if the distribution differs substantially from normal)
• Inverse transformation (if the distribution differs severely from normal)
• Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)
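The transformations listed above can be sketched in a few lines. The helper names, the sample values, and the cutoff are hypothetical; the domain restrictions in the comments are the usual mathematical ones.

```python
import math


def sqrt_transform(xs):
    # Square root transformation; requires every x >= 0.
    return [math.sqrt(x) for x in xs]


def log_transform(xs):
    # Natural-log transformation; requires every x > 0.
    return [math.log(x) for x in xs]


def inverse_transform(xs):
    # Inverse (reciprocal) transformation; requires every x != 0.
    # For positive values this reverses the rank order of the observations.
    return [1.0 / x for x in xs]


def dichotomize(xs, cutoff):
    # Make the variable dichotomous: 1 if above the cutoff, else 0.
    return [1 if x > cutoff else 0 for x in xs]


values = [1.0, 4.0, 9.0, 100.0]  # hypothetical right-skewed values
roots = sqrt_transform(values)
logs = log_transform(values)
inverses = inverse_transform(values)
binary = dichotomize(values, 5.0)
```

Variables containing zeros or negative values would need to be shifted before the log or square root transformation is applied; that step is not shown here.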
Did the implementation of the study fulfill the intentions of the research design?
One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups. If the study did not need or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample. Other possible data distortions that should be checked are:
• Dropout (this should be identified during the initial data analysis phase)
• Item non-response (whether this is random or not should be assessed during the initial data analysis phase)
• Treatment quality (using manipulation checks).
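One possible way to check whether a background variable is equally distributed across two randomized groups is the standardized mean difference. This diagnostic is not named in the text above; it is an assumption chosen for illustration, and the function name and the age data are hypothetical.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """Difference in group means divided by the pooled standard deviation.

    Values near 0 suggest the variable is balanced across the groups.
    Raises ZeroDivisionError if both groups have zero variance.
    """
    pooled_sd = math.sqrt((variance(group_a) + variance(group_b)) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical baseline ages in two randomized groups.
treatment_age = [34, 41, 29, 38, 45, 33]
control_age = [36, 40, 31, 37, 44, 35]
smd = standardized_mean_difference(treatment_age, control_age)
```

The same function would be applied to each background variable in turn; what counts as an acceptable imbalance depends on the study and is not fixed by this sketch.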
Characteristics of data sample
In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the size of the subgroup when subgroup analyses will be performed during the main analysis phase. The characteristics of the data sample can be assessed by looking at:
• Basic statistics of important variables
• Scatter plots
• Correlations and associations
• Cross-tabulations
Final stage of the initial data analysis
During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken. Also, the original plan for the main data analyses can and should be specified in more detail or rewritten. In order to do this, several decisions about the main data analyses can and should be made:
• In the case of non-normals: should one transform variables; make variables categorical (ordinal/dichotomous); adapt the analysis method?
• In the case of missing data: should one neglect or impute the missing data; which imputation technique should be used?
• In the case of outliers: should one use robust analysis techniques?
• In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability with other (uses of the) measurement instrument(s)?
• In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small sample techniques, like exact tests or bootstrapping?
• In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as covariates in the main analyses?
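For the small-subgroup case mentioned above, a percentile bootstrap is one commonly used small sample technique. The following is a minimal sketch for the difference in means between two subgroups; the function name, default settings, and data are hypothetical, and the percentile method is only one of several bootstrap interval constructions.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(a, b, n_resamples=2000, level=0.95, seed=0):
    """Approximate percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each subgroup with replacement, keeping its size.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical scores in two small subgroups.
group_a = [5, 6, 7, 8]
group_b = [1, 2, 3, 4]
lower, upper = bootstrap_mean_diff_ci(group_a, group_b)
```

An interval that excludes zero points to a difference between the subgroups, but with very small samples the bootstrap distribution itself is coarse, so the result should be read with caution.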
Analysis
Several analyses can be used during the initial data analysis phase:
• Univariate statistics (single variable)
• Bivariate associations (correlations)
• Graphical techniques (scatter plots)
It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:
• Nominal and ordinal variables
  • Frequency counts (numbers and percentages)
  • Associations
    • Cross-tabulations
    • Hierarchical loglinear analysis (restricted to a maximum of 8 variables)
    • Loglinear analysis (to identify relevant/important variables and possible confounders)
  • Exact tests or bootstrapping (in case subgroups are small)
  • Computation of new variables
• Continuous variables
  • Distribution
    • Statistics (M, SD, variance, skewness, kurtosis)
    • Stem-and-leaf displays
    • Box plots
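The distribution statistics for continuous variables can be computed directly from their definitions. The sketch below uses population-moment formulas for skewness and excess kurtosis; statistical packages often apply bias corrections, so their output may differ slightly. The function names and data are hypothetical.

```python
from statistics import mean, pstdev


def skewness(xs):
    # Third standardized moment: 0 for a symmetric distribution,
    # positive when the right tail is longer.
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def excess_kurtosis(xs):
    # Fourth standardized moment minus 3, so that a normal
    # distribution has a value of 0.
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3.0


symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric sample
right_skewed = [1, 1, 1, 2, 10]    # hypothetical right-skewed sample

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
kurt_symmetric = excess_kurtosis(symmetric)
```

Both functions divide by the standard deviation and therefore fail on a constant variable.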
Main data analysis
In the main analysis phase, analyses aimed at answering the research question are performed as well as any other relevant analysis needed to write the first draft of the research report.
Exploratory and confirmatory approaches
In the main analysis phase, either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis, clear hypotheses about the data are tested.
Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance of finding at least one of them to be significant, but this can be due to a type I error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. Cross-validation is generally inappropriate, though, if there are correlations within the data, e.g. with panel data. Hence other methods of validation sometimes need to be used. For more on this topic, see statistical model validation.
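The Bonferroni correction mentioned above divides the significance level by the number of tests. The sketch below shows the adjustment, together with the chance of at least one type I error among several uncorrected tests under the simplifying assumption that the tests are independent. The function names and p-values are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    # Probability of at least one false positive among n_tests
    # independent tests, each run at significance level alpha.
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    # A test is declared significant only if its p-value does not
    # exceed alpha divided by the number of tests.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


p_values = [0.003, 0.020, 0.040, 0.300]  # hypothetical p-values
significant = bonferroni(p_values)       # per-test threshold is 0.05 / 4
uncorrected_risk = familywise_error_rate(0.05, 10)
```

With these hypothetical values only the first test survives the correction, whereas three of the four would be called significant at an unadjusted level of 0.05.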
Sensitivity analysis. A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do that is via bootstrapping.
Free software for data analysis