The Most Essential Python Libraries for Data Science

Abhinav Kumar
Published in Analytics Vidhya
9 min read · Nov 11, 2020


With the increase in the amount of data available on the internet, Python has seen a huge boom in use (by one measure, from about 9% in 2010 to 25% in 2020). This in turn has led to more job opportunities for data scientists, machine learning engineers and data analysts (the average salary is pretty good too).

Basically, Python has been popular in the world of data science for a while now and hence, is an in-demand skill in the tech industry.

To perform any data science task in Python, one needs to import packages, also called libraries. These packages help you achieve what you want in an efficient manner, making your job easier. Packages are basically the Robin to your Batman, except in this case there are way too many Robins, some good, some great, some… well, they get the job done.

In this story, I attempt to be your Alfred and help you screen and find the Robins that are of paramount importance when it comes to data science and will help you achieve your goals, based on my experiences and explorations.

Libraries for Mathematical and Scientific Calculation

1. NumPy

One of the most commonly used libraries for scientific computing, NumPy lets you efficiently create and manipulate large multi-dimensional arrays without sacrificing speed. It provides tools to work with array objects.

It can be used to do,

  • arithmetic operations
  • handling complex numbers
  • exponential operations
  • trigonometric operations

and many more mathematical operations. It also comes in handy for array manipulation: slicing, indexing, splitting, broadcasting et cetera.

NumPy can also be used to read and write files, and it is an important tool for pre-processing data for machine learning.

It should be noted that NumPy deals mainly with homogeneous multidimensional arrays.
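As a quick illustration (the array values are chosen arbitrarily), here is a sketch of those operations:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])            # a homogeneous 2x3 integer array

print(a * 2)                          # element-wise arithmetic
print(a[:, 1])                        # slicing: the second column -> [2 5]
print(a + np.array([10, 20, 30]))     # broadcasting a 1-D array across rows
print(np.exp(a))                      # exponential, applied element-wise
```

Notice that the 1-D array is "broadcast" across each row of the 2-D array without any explicit loop, which is where much of NumPy's speed comes from.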

Latest Version: 1.19.4.


2. SciPy

SciPy, much like NumPy, is an open-source library used for mathematical, scientific and engineering computing. SciPy depends on NumPy: it operates on NumPy arrays and makes significant use of them.

The SciPy package is part of the SciPy stack that consists of libraries like NumPy, Matplotlib, Pandas, SymPy, IPython and nose.

When it comes to calculations, SciPy has modules that help in executing mathematical routines like,

  • linear algebra
  • differential equations
  • integration
  • signal processing
  • statistical analysis, and
  • interpolation

in an efficient manner.
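As a minimal sketch of two of these routines — numerical integration and linear algebra — on some toy numbers:

```python
import numpy as np
from scipy import integrate, linalg

# Integrate sin(x) from 0 to pi (the exact answer is 2.0).
area, abs_err = integrate.quad(np.sin, 0, np.pi)
print(round(area, 6))   # ~2.0

# Solve the linear system Ax = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)                # [2. 3.]
```

Note that both routines take and return plain NumPy arrays, which is what makes SciPy and NumPy compose so naturally.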

Latest Version: 1.5.3.


3. Statsmodels

Statsmodels complements SciPy for mathematical computations and is widely considered the best option for descriptive statistics and for estimation and inference of statistical models.

Statsmodels makes it very easy and convenient to perform statistical operations in Python itself. R is widely regarded as the best and easiest language for statistics; Statsmodels attempts to provide a similar ease of use in Python.

Statsmodels can be used for

  • linear regression models
  • multivariate calculations
  • time series analysis
  • hypothesis testing
  • mixed linear model, generalized linear model and Bayesian model

Latest Version: 0.12.1.


4. Pandas

Yes, the library shares a name with the cute cuddly animals from China, and it is also very useful.


Pandas is widely considered the most important library for applied data science in Python.

You can perform a tremendous number of operations using Pandas. It provides fast, flexible and expressive data structures that make it very easy to work with large structured and time-series datasets.

Typically, data is read from a SQL table or a CSV file into a DataFrame, and using these DataFrames you can easily manipulate the data to get the desired outcome. The closest everyday analogue to a Pandas DataFrame is an Excel table.

Some of the functions you can perform using Pandas are,

  • loading data from flat files (CSV and delimited), Excel files and databases
  • writing/saving data to the aforementioned file types and databases
  • label-based slicing, fancy indexing, and subsetting of datasets
  • merging and joining of datasets
  • time series functionality
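A minimal sketch of that workflow, using a made-up sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "year":  [2019, 2019, 2020, 2020],
    "sales": [100, 150, 120, 180],
})

recent = df[df["year"] == 2020]             # boolean subsetting
print(recent)

totals = df.groupby("city")["sales"].sum()  # split-apply-combine aggregation
print(totals)                               # Delhi 220, Mumbai 330

# df.to_csv("sales.csv", index=False)       # write back to a flat file
```

In real use, `pd.read_csv`, `pd.read_excel` or `pd.read_sql` would replace the hand-built DataFrame above.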

Fun fact: the name Pandas is derived from "panel data", an econometrics term, and doubles as a play on "Python data analysis".

Latest Version: 1.1.4.


Libraries for Visualization

1. Matplotlib

When I first heard about Matplotlib, I thought it sounded like Matlab and hence would provide similar functions. Well, I wasn't totally correct. Matplotlib is used to plot graphs and diagrams and is also part of the SciPy stack. It can also be used to display images.

Matplotlib is a library for Python that provides an object oriented API for embedding plots into applications. Using Matplotlib, you can visualize data, make stories and present your findings in a clean and conclusive way. The code used to plot these graphs is syntactically similar to that in Matlab.

Matplotlib also allows you to format grids, labels, titles, legends and other components of the graph.

It can be used to make,

  • Line Graphs
  • Scatter Plots
  • Area Graphs
  • Histograms
  • Bar Charts
  • Pie Charts
  • Contour Plots
  • Box Plots

and many more!!!
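Here is a minimal sketch combining a line graph and a bar chart, with the grid, labels, title and legend formatting mentioned above (the Agg backend is used so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, runs headless
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="squares")   # line graph
ax.bar(x, y, alpha=0.3, label="bars")        # bar chart on the same axes
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple Matplotlib figure")
ax.legend()
ax.grid(True)
fig.savefig("simple_plot.png")               # write the figure to disk
```

The object-oriented `fig`/`ax` style shown here is what the Matplotlib documentation recommends over the Matlab-like `plt.plot(...)` state machine, though both work.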

Latest Version: 3.3.2.


2. Seaborn

Seaborn, to put it simply, is Matplotlib with a friendlier face. It is built on top of Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.


So what is the difference between the two, you might ask… well I (actually this article) have you covered: Matplotlib is mainly used for plotting basic graphs like bars, pies, lines and scatter plots. Seaborn, on the other hand, provides a variety of complex visualization patterns while requiring less code.

In addition to the common graphs (bar, scatter, line, area, pie et cetera), Seaborn can be used to make,

  • Joint Distribution Plots
  • Density Plots
  • Factor Plots
  • Swarm Plots
  • Violin Plots
  • Lollipop Graphs

and many more!!
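As a small illustration, here is a violin plot on a made-up in-memory DataFrame (using Matplotlib's headless Agg backend, and avoiding Seaborn's sample-dataset download):

```python
import matplotlib
matplotlib.use("Agg")            # headless backend
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [1, 2, 2, 3, 4, 5, 6, 6, 7, 8],
})

ax = sns.violinplot(data=df, x="group", y="value")
ax.figure.savefig("violin.png")
```

One line of Seaborn replaces what would be a fair amount of manual kernel-density and patch-drawing code in raw Matplotlib — which is exactly the trade-off described above.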

Latest Version: 0.11.0.


3. Plotly

Plotly is a graphing library for Python that can be used to make interactive graphs. Like in the case of other visualization tools, data is imported and then visualized and analyzed.

Plotly is more of an 'enterprise' version of Matplotlib and Seaborn, as it can be integrated into web applications that are ML- or data-science-oriented.

Plotly can be used for,

  • making basic charts (scatter, line, bar, pie, bubble, gantt)
  • making statistical charts (box plots, histograms, distplots, error bars, trellis plots, violin plots)
  • making financial charts (candlestick, waterfall, funnel, OHLC)
  • making visualizations using maps
  • subplots
  • 3-D charts
  • making graphs using various transforms (aggregation, group by, filter)
  • Jupyter Widgets Interaction

Latest Version: 4.12.0.


4. Bokeh

Bokeh is a Python library that can be used to make interactive visualizations.

One of its strengths is visualizing learning algorithms. To anyone who has just started learning machine learning concepts, I would recommend Bokeh, since it helps you better understand simple ML techniques like K-means or KNN.

Its applications include:

  • creating basic graphs
  • quickly and easily making interactive plots, dashboards, and data applications
  • output to HTML, Jupyter Notebooks or a server
  • integrating visualizations into Django and Flask apps.
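A minimal sketch of an interactive Bokeh line plot saved as standalone HTML:

```python
from bokeh.plotting import figure, output_file, save

# Build a figure with axis labels and a title, then add a line glyph.
p = figure(title="Simple Bokeh line", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4], [2, 4, 8, 16], line_width=2)

output_file("bokeh_line.html")   # target: a standalone HTML page
save(p)                          # writes the interactive plot to disk
```

Opening `bokeh_line.html` in a browser gives pan, zoom and save tools for free; swapping `save(p)` for `show(p)` opens it directly.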

Latest Version: 2.2.3.


Libraries for Machine Learning

1. Tensorflow

Tensorflow is the most popular deep learning and machine learning framework among data enthusiasts. It is a free, open-source software library developed by the Google Brain team.

Tensorflow is straightforward to use for developing and deploying machine learning applications. It lets you work with neural networks of many layers, aided by GPU integration that can run an Estimator model across multiple GPUs on one machine.

Some of the most popular uses and applications of Tensorflow are:

  • Speech Recognition Systems
  • Text Summarization
  • Image/Video Recognition and Tagging
  • Sentiment Analysis
  • Self-Driving Cars
  • Recommendation Systems
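As a tiny sketch of what Tensorflow does under the hood of every such application — tensor math plus automatic differentiation (the numbers below are arbitrary):

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
w = tf.Variable([[0.5], [0.5]])

with tf.GradientTape() as tape:
    y = tf.matmul(x, w)            # a tiny "layer": y = x @ w
    loss = tf.reduce_sum(y ** 2)

grad = tape.gradient(loss, w)      # d(loss)/dw, computed automatically
print(y.numpy().ravel())           # [1.5 3.5]
print(grad.numpy().ravel())        # [24. 34.]
```

This gradient machinery is exactly what backpropagation uses when a full network is trained.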


Latest Version: 2.3.1.


2. Keras

Keras is a high-level library that can be imported from Tensorflow. Keras simplifies many tasks, relieving you from writing tons of monotonous code, but it may not be suitable for complicated procedures or tasks.

Since Keras runs on top of Tensorflow, the question arises: what is the difference between the two libraries?

Firstly, Keras is not as complicated as Tensorflow, making it simpler and easier to use. Keras is also a high-level API that works as a wrapper over Tensorflow and Theano. If you are interested in just creating and executing a machine learning model, Keras is for you, but if you also want to understand the deeper intricacies and workings, then Tensorflow is the better choice.

Some applications of Keras are:

  • Image Classification
  • Feature Extraction
  • Fine-tuning and loss computation
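To show how little code a small image classifier takes in Keras, here is a sketch (the architecture below is illustrative, not a recommendation):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny convolutional classifier for 28x28 grayscale images
# (MNIST-sized), ending in 10 output classes.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

From here, a single `model.fit(x_train, y_train, epochs=...)` call would train it — all the gradient and layer plumbing that raw Tensorflow exposes is hidden behind the API.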

Latest Version: 2.4.3.


3. Scikit-Learn

Scikit-Learn is a free, open-source machine learning library for Python built on NumPy, SciPy and Matplotlib.

It provides various supervised and unsupervised machine learning models and is the perfect package for a beginner in ML. The documentation is pretty simple and intuitive and most importantly, compact. Using very few lines, you can train a model and subsequently implement it.

It is one of the best libraries for working with data.

Scikit-Learn helps you with:

  • Classification Models (SVM, Nearest Neighbours, Random Forest, Naive Bayes, Decision Tree, Supervised Neural Networks et cetera)
  • Regression Models (SVM, Nearest Neighbours, Decision Tree, Linear Models, Supervised Neural Networks et cetera)
  • Clustering
  • Dimensionality Reduction
  • Model Selection
  • Preprocessing of data
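"Very few lines" is not an exaggeration — here is a minimal end-to-end sketch (split, fit, score) using the iris dataset that ships with the library:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)                # train the model
print(clf.score(X_test, y_test))         # accuracy on held-out data
```

Every estimator in the library follows this same `fit`/`predict`/`score` pattern, which is why swapping in, say, an SVM is a one-line change.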

Latest Version: 0.23.2.


4. NLTK

When it comes to text or textual analysis, the relevant branch of machine learning is Natural Language Processing, which happens to be NLTK's forte.


NLTK, which stands for Natural Language Toolkit, is a collection of libraries that help you analyze and process text in order to reach meaningful conclusions from text alone.

It must be noted that NLTK is used mainly for processing the data rather than modeling it. After preprocessing, you can use an LSTM, BERT or another deep learning model to train a model that derives results with text as input.

The main features and uses of NLTK are:

  • Stemming and Lemmatization
  • Tokenization
  • Tagging
  • Sentiment Analysis
  • Topic Modeling/Text Classification.
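As a small taste of the first item, here is stemming with NLTK's `PorterStemmer` (which needs no corpus downloads; the word list is arbitrary):

```python
from nltk.stem import PorterStemmer

# Reduce words to their stems before feeding them to a model,
# so "running" and "runs" count as the same token.
stemmer = PorterStemmer()
words = ["running", "flies", "studies", "cats"]
stems = [stemmer.stem(w) for w in words]
print(stems)   # ['run', 'fli', 'studi', 'cat']
```

Note that stems need not be dictionary words ("fli", "studi") — lemmatization is the heavier alternative when you need real lemmas.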

Latest Version: 3.5.


Libraries for Scraping Data

1. Scrapy

Scrapy is an “open source and collaborative framework” that can be used to extract data from websites, i.e., web scraping. It provides a fast and powerful framework to extract information you need from a webpage. It is considered to be the best web crawler for Python.

The best thing about Scrapy is its extensibility, since it can be used to extract data from APIs and functionalities can be plugged in without touching the core.

Why use Scrapy you might ask, well let me tell you why:

  • makes it possible to scrape any website
  • requests are scheduled and processed asynchronously
  • can decode JSON directly from websites that provide JSON data
  • makes use of spider bots that scan web pages and collect structured data.

It must be noted that current Scrapy releases (2.x) require Python 3; only older versions supported Python 2.7.

Latest Version: 2.4.0.


2. BeautifulSoup

It is beautiful, but it definitely isn’t soup.

You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help.

If you go to their website, these will be the first few sentences you see. Honestly, what more is there to say?

Make sure to check out their Hall of Fame page to see some high-profile projects that were made using BeautifulSoup.

BeautifulSoup,

  • is easy to use and master
  • has simple syntax and extraordinarily clear, informative documentation
  • is a smaller library than Scrapy, hence requires minimal setup and less attention.

Also, unlike Scrapy, which is a crawler, BeautifulSoup is an HTML parser: you fetch the page yourself (for example with the requests library) and hand the HTML to BeautifulSoup to extract the information.
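A minimal sketch: hand BeautifulSoup an HTML string (here hard-coded; in practice fetched with, say, requests) and pull data out of it:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Library list</h1>
  <ul>
    <li class="lib">NumPy</li>
    <li class="lib">Pandas</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)                                   # Library list
names = [li.text for li in soup.find_all("li", class_="lib")]
print(names)                                          # ['NumPy', 'Pandas']
```

The `class_` keyword (with the trailing underscore) is BeautifulSoup's way of filtering by CSS class without clashing with Python's `class` keyword.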

Latest Version: 4.9.3.


Best of luck on your journey in data science and thank you for reading :)

Abhinav Kumar

Machine Learning Engineer at Fyllo | Data science enthusiast | I like to write and roll