Many of us depend on third-party libraries such as NumPy, Pandas, and SciPy for statistics. What is often overlooked is Python’s own built-in statistics library. Although it is not meant for big-data statistics, it is a convenient choice when the data set is small, since it ships with Python and needs no installation. It can do our school homework, or at least validate it. Today we are going to take a look at Python’s built-in statistics library and see what we can do with it.
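Python’s built-in statistics library is mainly used for descriptive statistics, so most of its functions are dedicated to measures of either centrality or spread. Support for probability, and by extension inferential statistics, is minimal: recent Python versions add only a NormalDist class and a few correlation and regression helpers.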
Measures of central location are used to find the centre of the data. There are three main measures: mean, median, and mode.
The mean function returns the arithmetic mean, or average, of the given data set.
    from statistics import mean

    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    mean_numbers = mean(numbers)
    print(mean_numbers)  # 5.5
Median is the centre value of a given data set. It is found by arranging the data in increasing or decreasing order and taking the middle value. If there are two middle values, their mean is taken. In the example above, the median of the list numbers is 5.5. Now we deliberately reorder the list to show that the median function does the sorting for us as well.
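    from statistics import median

    # The same ten values as before, deliberately out of order.
    numbers = [4, 9, 7, 1, 5, 6, 3, 8, 2, 10]
    median_numbers = median(numbers)  # median() sorts the data internally
    print(median_numbers)  # 5.5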
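To make the sorting and the "two middle values" rule concrete, here is a minimal sketch of how a median could be computed by hand. The helper name manual_median and the odd_sized list are made up for this illustration; this is not how the statistics module is implemented internally.

```python
from statistics import median


def manual_median(values):
    """Hypothetical helper: sort the data, then pick the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of items: there is exactly one middle value.
        return ordered[mid]
    # Even number of items: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


shuffled = [4, 9, 7, 1, 5, 6, 3, 8, 2, 10]  # even number of items
odd_sized = [7, 1, 5]                       # odd number of items

print(manual_median(shuffled), median(shuffled))
print(manual_median(odd_sized), median(odd_sized))
```

Both calls on each line should agree, which is a quick way to validate a hand calculation against the library.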
Mode is the most frequently occurring value in a data set. To demonstrate it, we are going to modify our sample so that certain items are repeated.
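    from statistics import mode

    numbers = [4, 2, 2, 2, 9, 9, 9, 7, 7, 1, 5, 5, 5, 5, 5, 6, 2, 2, 2, 2,
               6, 6, 6, 6, 6, 6, 3, 3, 8, 10, 10, 10, 10, 2, 10]
    # Use a new name so the mode() function is not overwritten by its result.
    mode_numbers = mode(numbers)
    print(mode_numbers)  # 2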
Now that the three measures of centrality are out of the way, it is time to discuss the measures of spread. We measure spread by calculating the variance and the standard deviation. Readers who are not familiar with these two concepts may want to read an introduction to them first.
    from statistics import variance

    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    variance_numbers = variance(numbers)
    print(variance_numbers)
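The greater the variance, the more spread out our data set is. We know that the mean of our data set is 5.5, as calculated above. The variance measures how far the data points lie from that mean, as an average of their squared deviations. Note that variance() computes the sample variance, which divides by n - 1; pvariance() is its population counterpart, which divides by n. If the variance is 0, all members of our data set are identical.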
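The calculation behind the variance can be sketched by hand as follows. The variable names are chosen for this illustration only; the point is to show the squared deviations from the mean and the difference between dividing by n - 1 (sample) and by n (population).

```python
from math import isclose
from statistics import mean, pvariance, variance

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

centre = mean(numbers)
# Squared distance of every data point from the mean.
squared_deviations = [(x - centre) ** 2 for x in numbers]

# Sample variance divides by n - 1, population variance divides by n.
sample_variance = sum(squared_deviations) / (len(numbers) - 1)
population_variance = sum(squared_deviations) / len(numbers)

print(isclose(sample_variance, variance(numbers)))       # True
print(isclose(population_variance, pvariance(numbers)))  # True

# A data set whose members are all identical has zero variance.
print(variance([3, 3, 3]))
```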
Standard deviation is simply the square root of the variance. It expresses the typical distance of the data points from the mean, in the same units as the data itself.
    from statistics import stdev

    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    standard_deviation = stdev(numbers)
    print(standard_deviation)
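As a quick check of the square-root relationship, the sketch below compares stdev() with the square root of variance(), and does the same for the population versions pstdev() and pvariance(). The variable names are illustrative.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

sample_sd = stdev(numbers)       # roughly 3.03
population_sd = pstdev(numbers)  # roughly 2.87

# Each standard deviation is the square root of the matching variance.
print(isclose(sample_sd, sqrt(variance(numbers))))       # True
print(isclose(population_sd, sqrt(pvariance(numbers))))  # True
```

The sample value is slightly larger than the population value because the sample variance divides by n - 1 rather than n.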
Now that we are done with schoolboy statistics, there are a few more functions in the statistics module that are worth being aware of.
- fmean(): converts the data to floats and returns the mean as a float; it is faster than mean().
- geometric_mean(): finds the geometric mean.
- harmonic_mean(): finds the harmonic mean.
- median_low(): when there are two middle values, the lower one is returned as the median instead of taking the average.
- median_high(): when there are two middle values, the higher one is returned as the median instead of taking the average.
- multimode(): returns a list of the most common values instead of a single one.
- quantiles(): divides the data into the given number n of continuous intervals with equal probability and returns the n - 1 cut points that separate them.
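The functions listed above can be tried out with a short sketch like the following. It assumes Python 3.8 or newer, where fmean(), geometric_mean(), multimode(), and quantiles() are available; the sample values are made up for illustration.

```python
from statistics import (
    fmean,
    geometric_mean,
    harmonic_mean,
    median_high,
    median_low,
    multimode,
    quantiles,
)

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(fmean(numbers))                # 5.5, always a float
print(geometric_mean([1, 2, 4]))     # about 2.0, the cube root of 1 * 2 * 4
print(harmonic_mean([40, 60]))       # 48.0

# With two middle values (5 and 6), pick one instead of averaging them.
print(median_low(numbers))           # 5
print(median_high(numbers))          # 6

# Both 1 and 2 occur twice, so both are returned.
print(multimode([1, 1, 2, 2, 3]))    # [1, 2]

# n=4 asks for quartiles: three cut points splitting the data into four groups.
quartiles = quantiles(numbers, n=4)
print(quartiles)
```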