How to perform descriptive statistical analytics


Wouldn't it be good to summarize your data using a few numbers and graphs?
That's what I'll be talking about in this article. We'll be using descriptive statistics to describe our data.

What do I mean by describing data?
Here's what I mean:

  • Understanding measures of central tendency for our data.
  • Measuring the dispersion of our data.
  • Using frequency tables and scatter plots to gain insight.


Understanding measures of central tendency :

Remember studying about mean, median, and mode in high school?
We use these measures to summarize our data with a single value.
The idea is to find a number that best represents all the numbers we have in a data set.
Let's discuss each one of these measures for better understanding.

Mean : Sum of all values / Number of values 

             example : data = 1,3,5,4,2

                              mean = (1+3+5+4+2) / 5 = 3

Median : Arranging all the values in ascending order, and then finding the value in the middle.

            example : data = 1,3,5,4,2
                              arranging data in ascending order : 1,2,3,4,5
                               There are 5 values, thus we pick the 3rd number as median.
                               Median = 3

                   Note  :   Let n be the number of elements in the data set.
                                 When n is odd, pick the ((n+1)/2)th number.
                                 When n is even, take the average of the (n/2)th and (n/2 + 1)th numbers.
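
The steps for finding the median can be sketched in Python. This is only an illustration: the function name median_of is made up for this example, and for an even number of values it follows the usual convention of averaging the two middle values.

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # arrange the values in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # odd n: the ((n+1)/2)th value, which is index n // 2 when counting from 0
        return ordered[middle]
    # even n: average of the (n/2)th and (n/2 + 1)th values
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_of([1, 3, 5, 4, 2]))     # odd number of values
print(median_of([1, 3, 5, 4, 2, 6]))  # even number of values
```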
    

               
Mode : The value that occurs most frequently in the data.

              example : data = 1,3,5,4,2,1
                                Mode = 1
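
If you would rather not compute these by hand, Python's built-in statistics module has a function for each measure. Here is a small sketch using the same example data as above (the variable names are my own):

```python
import statistics

data = [1, 3, 5, 4, 2]
mean_value = statistics.mean(data)      # (1 + 3 + 5 + 4 + 2) / 5
median_value = statistics.median(data)  # middle value after sorting

data_with_repeat = [1, 3, 5, 4, 2, 1]
mode_value = statistics.mode(data_with_repeat)  # most frequent value

print(mean_value, median_value, mode_value)
```

If several values tie for most frequent, statistics.multimode returns all of them as a list.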

Note : We prefer using median over mean when there are outliers in the data.
Let me make this point using an example.

                data = 1,2,3,4,5,6,7,8,9,1000
                mean = (1+2+3+4+5+6+7+8+9+1000) / 10 = 104.5
                median = (5 + 6) / 2 = 5.5
 

The median represents the data better than the mean here. A single value (1000) drags the mean far above nine of the ten values, while the median stays in the middle of them. So it is better to use the median when there are outliers in the data.
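
To see the effect for yourself, here is a quick sketch that computes both measures with and without the extreme value. The list without_outlier is just the same data with 1000 dropped, added for comparison.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
without_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # same data, 1000 removed

# the mean jumps when 1000 is included, the median barely moves
print("with outlier   :", statistics.mean(data), statistics.median(data))
print("without outlier:", statistics.mean(without_outlier), statistics.median(without_outlier))
```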

Understanding measures of dispersion :

Want to know how spread out our data is?
There are two main measures used for this purpose : range and standard deviation.

Range : Greatest value - Smallest value

example : data = 1,2,3,4,5
                range = 5-1 = 4
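
In Python the range takes one line, using the built-in max and min functions (the variable name data_range is my own choice):

```python
data = [1, 2, 3, 4, 5]

# greatest value minus smallest value
data_range = max(data) - min(data)
print(data_range)
```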

 Standard deviation : It tells us how far the values typically lie from the mean. The larger it is, the more spread out the data.
 For data that follows a normal distribution :
         About 68% of the data lies within 1 standard deviation of the mean.
         About 95% of the data lies within 2 standard deviations of the mean.
         About 99.7% of the data lies within 3 standard deviations of the mean.
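
Here is a rough sketch of both ideas. The first part computes the standard deviation by hand, assuming we treat the data as the whole population (so we divide by n). The second part asks the standard library's NormalDist what share of a normal distribution lies within 1, 2 and 3 standard deviations of the mean.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]
mean_value = statistics.mean(data)

# population standard deviation:
# square root of the average squared distance from the mean
squared_distances = [(x - mean_value) ** 2 for x in data]
std_dev = math.sqrt(sum(squared_distances) / len(data))

# the standard library gives the same result
assert math.isclose(std_dev, statistics.pstdev(data))
print("standard deviation:", round(std_dev, 3))

# share of a normal distribution within k standard deviations of the mean
standard_normal = statistics.NormalDist()  # mean 0, standard deviation 1
for k in (1, 2, 3):
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {share:.1%}")
```

These percentages only hold for normally distributed data. For other shapes of data the shares can be quite different.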



Any value that lies more than three standard deviations from the mean is commonly considered to be an outlier.

Note : An outlier is a number so far removed from the rest of the data that it distorts summaries such as the mean.
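
The three standard deviations rule can be sketched as a simple filter. The data below is made up for illustration (19 ordinary values plus one extreme value), and I am assuming the population standard deviation.

```python
import statistics

# hypothetical data: 19 ordinary values plus one extreme value
data = list(range(1, 20)) + [1000]

mean_value = statistics.mean(data)
std_dev = statistics.pstdev(data)  # population standard deviation

# keep only the values more than 3 standard deviations from the mean
outliers = [x for x in data if abs(x - mean_value) > 3 * std_dev]
print(outliers)
```

One caveat with this rule of thumb: in a data set of only 10 values, no single value can lie more than three population standard deviations from the mean, so the 1000 in the earlier 10-value example would not be flagged. The rule works better on larger data sets.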

Z-Score : Used for comparing a specific value to a population.
 It tells us how many standard deviations away from the mean the value is.

Z-score = (Value -  Mean) / Standard deviation
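
As a small sketch, here is the formula applied to the data 1,2,3,4,5. The helper name z_score is my own, and I am assuming the population standard deviation.

```python
import statistics

data = [1, 2, 3, 4, 5]
mean_value = statistics.mean(data)
std_dev = statistics.pstdev(data)  # population standard deviation


def z_score(value):
    """How many standard deviations `value` lies from the mean of `data`."""
    return (value - mean_value) / std_dev


print(round(z_score(5), 2))  # above the mean, so positive
print(round(z_score(3), 2))  # equal to the mean, so zero
print(round(z_score(1), 2))  # below the mean, so negative
```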

 

Frequency tables and scatter plots :

We want to get an idea of counts of values in specific categories. This is where frequency tables come in.

Frequency Table : Visualization of counts of values in specific categories.
This is what a frequency table looks like :
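
As a minimal sketch, a frequency table can be built with Counter from Python's standard library. The survey answers below are made up purely for illustration.

```python
from collections import Counter

# made-up survey answers, used only for illustration
answers = ["tea", "coffee", "tea", "juice", "coffee", "tea"]

frequency = Counter(answers)  # maps each category to its count

# print the table, most frequent category first
print(f"{'Category':<10}{'Count':>5}")
for category, count in frequency.most_common():
    print(f"{category:<10}{count:>5}")
```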



But how do we make a frequency table for more than one variable?
This is where you will use a contingency table.

Contingency table : Frequency table that cross-tabulates two or more categorical variables.
This is what a contingency table looks like :
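
Here is a rough sketch of a two-variable contingency table, again with made-up records. Each record is a pair (age group, preferred drink), and each cell of the table counts how often that pair occurs.

```python
from collections import Counter

# made-up records of (age group, preferred drink), for illustration only
records = [
    ("adult", "tea"), ("adult", "coffee"), ("adult", "coffee"),
    ("teen", "tea"), ("teen", "juice"), ("teen", "tea"),
]

counts = Counter(records)  # count each (row, column) pair
rows = sorted({group for group, _ in records})
columns = sorted({drink for _, drink in records})

# header row, then one row per age group
print(f"{'':<8}" + "".join(f"{c:>8}" for c in columns))
for r in rows:
    print(f"{r:<8}" + "".join(f"{counts[(r, c)]:>8}" for c in columns))
```

A pair that never occurs (here, an adult who prefers juice) simply shows up as 0.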

Let's move on to scatter plots now.

Scatter plot : Visualization of two numerical variables.
This is what a scatter plot looks like :


Any idea why scatter plots are used?
Scatter plots are used to find how two variables are correlated.

Correlation : Measure of how related two variables are.

Three types of correlation :

1) Positive correlation : Both the variables increase together.
2) Negative correlation : One variable increases while the other decreases.
3) No correlation : The variables are not related.

We measure correlation using the correlation coefficient.

Correlation Coefficient : A number that measures how related two variables are.

The correlation coefficient varies between -1 and 1.
A value greater than 0.7 means high positive correlation.
A value less than -0.7 means high negative correlation.
A value close to 0 means little or no linear correlation.
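
The article does not say which correlation coefficient it means, so the sketch below assumes the most common one, the Pearson correlation coefficient. The function name pearson and the three small data sets are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # how the two variables move together around their means
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # how much each variable moves on its own
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross_products / (spread_x * spread_y)


hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]  # rises as hours rise
errors = [9, 7, 6, 4, 2]      # falls as hours rise

print(round(pearson(hours, score), 2))   # close to 1: high positive correlation
print(round(pearson(hours, errors), 2))  # close to -1: high negative correlation
```

Keep in mind that this coefficient only measures how well a straight line describes the relationship. Two variables can be strongly related in a curved way and still have a coefficient near 0.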

This was all I had to share about descriptive statistics.
Thank you :)


 

