=== Descriptive tools ===
Data can be represented through tables or graphical representations, such as line charts, bar charts, histograms, and scatter plots. Measures of central tendency and variability can also be very useful for giving an overview of the data. Some examples follow:
==== Frequency tables ====
One type of table is the frequency table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of a value. Frequency can be:
Absolute: the number of times that a given value appears; the absolute frequencies add up to the total number of observations:
: N = f_1 + f_2 + f_3 + \cdots + f_n
Relative: the absolute frequency divided by the total number of observations:
: n_i = \frac{f_i}{N}
In the next example, we have the number of genes in ten operons of the same organism.
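As a minimal sketch of how absolute and relative frequencies can be computed, the Python snippet below builds a frequency table from a list of counts. The gene counts are hypothetical values chosen for illustration, not the data of the example mentioned above, and the function name `frequency_table` is likewise an assumption.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, absolute frequency, relative frequency)."""
    counts = Counter(values)
    total = sum(counts.values())  # N = f_1 + f_2 + ... + f_n
    return [(value, f, f / total) for value, f in sorted(counts.items())]

# Hypothetical number of genes in ten operons (illustrative values only)
genes_per_operon = [2, 3, 3, 4, 2, 3, 5, 4, 3, 2]

print("genes\tabsolute\trelative")
for value, absolute, relative in frequency_table(genes_per_operon):
    print(f"{value}\t{absolute}\t{relative:.2f}")
```

The relative frequencies always add up to 1, which is a quick check that the table is consistent.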
==== Line graph ====
[Figure: … for the December months from 2010 to 2016; Figure C: example of a box plot, showing the number of glycines in the proteome of eight different organisms (A-H); Figure D: example of a scatter plot.]
Line graphs represent the variation of a value over another metric, such as time. In general, the values are plotted on the vertical axis and time on the horizontal axis.
==== Bar chart ====
A bar chart is a graph that shows categorical data as bars whose heights (vertical bars) or lengths (horizontal bars) are proportional to the values they represent. Bar charts provide an image of data that could also be presented in a tabular format.
==== Scatter plot ====
A scatter plot is a mathematical diagram that uses Cartesian coordinates to display the values of a dataset. It shows the data as a set of points, where the value of one variable determines each point's position on the horizontal axis and the value of another variable determines its position on the vertical axis. Scatter plots are also called scatter graphs, scatter charts, scattergrams, or scatter diagrams.
==== Mean ====
The arithmetic mean is the sum of a collection of values (x_1 + x_2 + x_3 + \cdots + x_n) divided by the number of items in this collection (n).
: \bar{x} = \frac{1}{n}\left(\sum_{i=1}^n x_i\right) = \frac{x_1 + x_2 + \cdots + x_n}{n}
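==== Median ====
The median is the value in the middle of a dataset once its values are sorted. When the number of values is even, it is usually taken as the mean of the two middle values.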
==== Mode ====
The mode is the value of a set of data that appears most often.
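The three measures of central tendency can be illustrated with Python's standard `statistics` module. This is a minimal sketch: the sample values are hypothetical and serve only to show how the measures can differ for the same data.

```python
import statistics

# Hypothetical sample (illustrative values only)
values = [2, 3, 3, 4, 2, 3, 5, 4, 3, 2]

mean = statistics.mean(values)      # sum of the values divided by n
median = statistics.median(values)  # middle of the sorted values
mode = statistics.mode(values)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
```

In this sample the median and the mode coincide, while the mean is slightly larger because of the single high value.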
==== Box plot ====
A box plot is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines (whiskers), and the box spans the interquartile range (IQR), from the 25th to the 75th percentile of the data.
Outliers may be plotted as circles.
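The quantities drawn in a box plot can be computed directly. The sketch below uses `statistics.quantiles` to obtain the quartiles and the IQR. It also flags outliers with the common 1.5 × IQR rule, which is an assumption of this example (the text above does not specify how outliers are identified). The data and the function name `box_plot_summary` are hypothetical.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Return quartiles, IQR, and the points lying beyond the fences."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical counts for eight groups (illustrative values only)
data = [10, 12, 13, 14, 15, 16, 18, 40]
summary = box_plot_summary(data)
print(summary)
```

Other quartile definitions exist (for example the default `method="exclusive"`), so the exact box edges can differ slightly between tools.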
==== Correlation coefficients ====
Although correlations between two different kinds of data can be inferred from graphs, such as scatter plots, it is necessary to validate them through numerical information. For this reason, correlation coefficients are required. They provide a numerical value that reflects the strength of an association.
=== Inferential statistics ===
Inferential statistics is used to make inferences about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters that describe the population of interest, but since the data are limited, a representative sample must be used to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The standard error of the mean is a measure of variability that is crucial for making inferences.
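As a minimal sketch, the standard error of the mean can be estimated as the sample standard deviation divided by the square root of the sample size. The sample values and the function name `standard_error_of_mean` are hypothetical.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical measurements (illustrative values only)
sample = [4.0, 6.0, 8.0, 10.0, 12.0]
sem = standard_error_of_mean(sample)
print(f"mean={statistics.mean(sample)}, SEM={sem:.3f}")
```

Because the sample size appears in the denominator, larger samples give a smaller standard error and therefore a more precise estimate of the population mean.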
==== Hypothesis testing ====
Hypothesis testing is essential for making inferences about populations in order to answer research questions, as established in the "Research planning" section. Authors have defined four steps to be followed:
• The hypothesis to be tested: as stated earlier, we work with a null hypothesis (H0), which is the one to be tested, and an alternative hypothesis. Both must be defined before the experiment is carried out.
• Significance level and decision rule: a decision rule depends on the level of significance, that is, the acceptable rate of type I error (α). It is easiest to think of it as defining a critical value against which a test statistic is compared to determine statistical significance. So α also has to be defined before the experiment.
• Experiment and statistical analysis: this is when the experiment is actually carried out following the appropriate experimental design, data are collected, and the most suitable statistical tests are applied.
• Inference: made when the null hypothesis is rejected or not rejected, based on the evidence provided by comparing the p-value with α. Failure to reject H0 means only that there is not enough evidence to support its rejection, not that the hypothesis is true.
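The four steps above can be sketched with a one-sample, two-sided z-test. This is only one possible test, chosen here for simplicity. It assumes that the population standard deviation is known and that the sample mean is approximately normally distributed. The data, the hypothesized mean `mu0`, and the function name `two_sided_z_test` are hypothetical.

```python
import math
from statistics import NormalDist, mean

def two_sided_z_test(sample, mu0, sigma, alpha=0.05):
    """One-sample, two-sided z-test of H0: population mean == mu0.

    Assumes the population standard deviation `sigma` is known.
    """
    # Steps 1 and 2: H0 (mu0) and alpha are fixed before looking at the data
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 3: compute the test statistic from the collected data
    standard_error = sigma / math.sqrt(len(sample))
    z = (mean(sample) - mu0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 4: same decision as comparing abs(z) with critical_value
    reject_h0 = p_value < alpha
    return z, critical_value, p_value, reject_h0

# Hypothetical measurements (illustrative values only)
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7, 5.9]
z, critical_value, p_value, reject_h0 = two_sided_z_test(sample, mu0=5.0, sigma=0.5)

if reject_h0:
    print(f"z={z:.2f}, p={p_value:.4f}: reject H0 at alpha=0.05")
else:
    print(f"z={z:.2f}, p={p_value:.4f}: not enough evidence to reject H0")
```

When the population standard deviation is unknown, which is the usual case, a t-test would be the more common choice.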
==== Confidence intervals ====
A confidence interval is a range of values that is expected to contain the true parameter value at a given level of confidence. The first step is to obtain the best unbiased estimate of the population parameter. The upper limit of the interval is this estimate plus the product of the standard error of the mean and the critical value associated with the chosen confidence level (for example, about 1.96 for a 95% interval under a normal approximation). The lower limit is calculated in the same way, but with a subtraction instead of a sum.
== Statistical considerations ==