The Most Essential Python Libraries for Data Science

Abhinav Kumar
Published in Analytics Vidhya
9 min read · Nov 11, 2020


With the increase in the amount of data available on the internet, Python has seen a huge boom in use (by one measure, from about 9% in 2010 to 25% in 2020). This in turn has led to more job opportunities for data scientists, machine learning engineers and data analysts (the average salary is pretty good too).

Basically, Python has been popular in the world of data science for a while now and hence, is an in-demand skill in the tech industry.

To perform any data science task in Python, one needs to import packages, also called libraries. These packages help you achieve what you want in an efficient manner, making your job easier. Packages are basically the Robin to your Batman, except in this case there are way too many Robins, some good, some great, some… well, they get the job done.

In this story, I attempt to be your Alfred and help you screen and find the Robins that are of paramount importance when it comes to data science and will help you achieve your goals, based on my experiences and explorations.

Libraries for Mathematical and Scientific Calculation

1. NumPy

One of the most commonly used libraries for scientific computing, NumPy lets you efficiently create and manipulate large multi-dimensional arrays without sacrificing speed. It provides tools to work with array objects.

It can be used to do,

  • arithmetic operations
  • handling complex numbers
  • exponential operations
  • trigonometric operations

and many more mathematical operations. It also comes in handy for array manipulation: slicing, indexing, splitting, broadcasting et cetera.

NumPy can also be used to read and write files, and it is an important tool for pre-processing data for machine learning.

It should be noted that NumPy deals mainly with homogeneous multidimensional arrays.
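As a quick illustration (the array values are chosen arbitrarily), here is a sketch of those operations:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])            # a homogeneous 2x3 integer array

print(a * 2)                          # element-wise arithmetic
print(a[:, 1])                        # slicing: the second column -> [2 5]
print(a + np.array([10, 20, 30]))     # broadcasting a 1-D array across rows
print(np.exp(a))                      # exponential, applied element-wise
```

Notice that the 1-D array is "broadcast" across each row of the 2-D array without any explicit loop, which is where much of NumPy's speed comes from.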

Latest Version: 1.19.4.


2. SciPy

SciPy, much like NumPy, is an open-source library used for mathematical, scientific and engineering computing. SciPy depends on NumPy: it operates on NumPy arrays and makes significant use of them.

The SciPy package is part of the SciPy stack that consists of libraries like NumPy, Matplotlib, Pandas, SymPy, IPython and nose.

When it comes to calculations, SciPy has modules that help in executing mathematical routines like,

  • linear algebra
  • differential equations
  • integration
  • signal processing
  • statistical analysis, and
  • interpolation

in an efficient manner.
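As a minimal sketch of two of these routines — numerical integration and linear algebra — on some toy numbers:

```python
import numpy as np
from scipy import integrate, linalg

# Integrate sin(x) from 0 to pi (the exact answer is 2.0).
area, abs_err = integrate.quad(np.sin, 0, np.pi)
print(round(area, 6))   # ~2.0

# Solve the linear system Ax = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)                # [2. 3.]
```

Note that both routines take and return plain NumPy arrays, which is what makes SciPy and NumPy compose so naturally.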

Latest Version: 1.5.3.


3. Statsmodels

Statsmodels complements SciPy for mathematical computations and is widely considered the best option for descriptive statistics and for estimation and inference of statistical models.

Statsmodels makes it very easy and convenient to perform statistical operations in Python itself. R is widely regarded as the best and easiest language for statistics; Statsmodels attempts to provide a similar ease of use in Python.

Statsmodels can be used for

  • linear regression models
  • multivariate calculations
  • time series analysis
  • hypothesis testing
  • mixed linear model, generalized linear model and Bayesian model

Latest Version: 0.12.1.


4. Pandas

Yes, the library shares a name with the cute cuddly animals from China, and it is also very useful.


Pandas is widely considered the most important library for applied data science in Python.

You can perform a tremendous number of operations using Pandas. It provides fast, flexible and expressive data structures that make it very easy to work with large structured and time-series datasets.

Typically, data is read from a SQL table or a CSV file into a DataFrame, and using these DataFrames you can easily manipulate the data to get the desired outcome. The closest everyday analogue to a Pandas DataFrame is an Excel table.

Some of the functions you can perform using Pandas are,

  • loading data from flat files (CSV and delimited), Excel files and databases
  • writing/saving data to the aforementioned file types and databases
  • label-based slicing, fancy indexing, and subsetting of datasets
  • merging and joining of datasets
  • time series functionality
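A minimal sketch of that workflow, using a made-up sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "year":  [2019, 2019, 2020, 2020],
    "sales": [100, 150, 120, 180],
})

recent = df[df["year"] == 2020]             # boolean subsetting
print(recent)

totals = df.groupby("city")["sales"].sum()  # split-apply-combine aggregation
print(totals)                               # Delhi 220, Mumbai 330

# df.to_csv("sales.csv", index=False)       # write back to a flat file
```

In real use, `pd.read_csv`, `pd.read_excel` or `pd.read_sql` would replace the hand-built DataFrame above.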

Fun fact: the name Pandas is derived from "panel data", an econometrics term, and doubles as a play on "Python data analysis".

Latest Version: 1.1.4.


Libraries for Visualization

1. Matplotlib

When I first heard about Matplotlib, I thought it sounded like Matlab and hence would provide similar functions. Well, I wasn't totally correct. Matplotlib is used to plot graphs and diagrams and is also part of the SciPy stack. It can also be used to display images.

Matplotlib is a library for Python that provides an object oriented API for embedding plots into applications. Using Matplotlib, you can visualize data, make stories and present your findings in a clean and conclusive way. The code used to plot these graphs is syntactically similar to that in Matlab.

Matplotlib also allows you to format grids, labels, titles, legends and other components of the graph.

It can be used to make,

  • Line Graphs
  • Scatter Plots
  • Area Graphs
  • Histograms
  • Bar Charts
  • Pie Charts
  • Contour Plots
  • Box Plots

and many more!!!
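Here is a minimal sketch combining a line graph and a bar chart, with the grid, labels, title and legend formatting mentioned above (the Agg backend is used so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, runs headless
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="squares")   # line graph
ax.bar(x, y, alpha=0.3, label="bars")        # bar chart on the same axes
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple Matplotlib figure")
ax.legend()
ax.grid(True)
fig.savefig("simple_plot.png")               # write the figure to disk
```

The object-oriented `fig`/`ax` style shown here is what the Matplotlib documentation recommends over the Matlab-like `plt.plot(...)` state machine, though both work.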

Latest Version: 3.3.2.


2. Seaborn

Seaborn, to put it simply, is Matplotlib with a friendlier face. It is built on top of Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.


So what is the difference between the two, you might ask… well I (actually this article) have you covered: Matplotlib is mainly used for plotting basic graphs like bars, pies, lines and scatter plots. Seaborn, on the other hand, provides a variety of complex visualization patterns while requiring less code.

In addition to the common graphs (bar, scatter, line, area, pie et cetera), Seaborn can be used to make,

  • Joint Distribution Plots
  • Density Plots
  • Factor Plots
  • Swarm Plots
  • Violin Plots
  • Lollipop Graphs

and many more!!
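As a small illustration, here is a violin plot on a made-up in-memory DataFrame (using Matplotlib's headless Agg backend, and avoiding Seaborn's sample-dataset download):

```python
import matplotlib
matplotlib.use("Agg")            # headless backend
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [1, 2, 2, 3, 4, 5, 6, 6, 7, 8],
})

ax = sns.violinplot(data=df, x="group", y="value")
ax.figure.savefig("violin.png")
```

One line of Seaborn replaces what would be a fair amount of manual kernel-density and patch-drawing code in raw Matplotlib — which is exactly the trade-off described above.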

Latest Version: 0.11.0.


3. Plotly

Plotly is a graphing library for Python that can be used to make interactive graphs. Like in the case of other visualization tools, data is imported and then visualized and analyzed.

Plotly is more of an 'enterprise' version of Matplotlib and Seaborn, as it can be integrated into web applications that are ML- or data-science-oriented.

Plotly can be used for,

  • making basic charts (scatter, line, bar, pie, bubble, gantt)
  • making statistical charts (box plots, histograms, distplots, error bars, trellis plots, violin plots)
  • making financial charts (candlestick, waterfall, funnel, OHLC)
  • making visualizations using maps
  • subplots
  • 3-D charts
  • making graphs using various transforms (aggregation, group by, filter)
  • Jupyter Widgets Interaction

Latest Version: 4.12.0.


4. Bokeh

Bokeh is a Python library that can be used to make interactive visualizations.

One of its strengths is visualizing learning algorithms. To anyone who has just started learning machine learning concepts, I would recommend Bokeh, since it helps you better understand simple ML techniques like K-means or KNN.

Its applications include:

  • creating basic graphs
  • quickly and easily making interactive plots, dashboards, and data applications
  • output to HTML, Jupyter Notebooks or a server
  • integrating visualizations into Django and Flask apps.
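A minimal sketch of an interactive Bokeh line plot saved as standalone HTML:

```python
from bokeh.plotting import figure, output_file, save

# Build a figure with axis labels and a title, then add a line glyph.
p = figure(title="Simple Bokeh line", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4], [2, 4, 8, 16], line_width=2)

output_file("bokeh_line.html")   # target: a standalone HTML page
save(p)                          # writes the interactive plot to disk
```

Opening `bokeh_line.html` in a browser gives pan, zoom and save tools for free; swapping `save(p)` for `show(p)` opens it directly.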

Latest Version: 2.2.3.


Libraries for Machine Learning

1. Tensorflow

Tensorflow is the most popular deep learning and machine learning framework among data enthusiasts. It is a free, open-source software library developed by the Google Brain team.

Tensorflow is straightforward to use for developing and deploying machine learning applications. It lets you work with neural networks of many layers, aided by GPU integration that can run an Estimator model across multiple GPUs on one machine.

Some of the most popular uses and applications of Tensorflow are:

  • Speech Recognition Systems
  • Text Summarization
  • Image/Video Recognition and Tagging
  • Sentiment Analysis
  • Self-Driving Cars
  • Recommendation Systems
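As a tiny sketch of what Tensorflow does under the hood of every such application — tensor math plus automatic differentiation (the numbers below are arbitrary):

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
w = tf.Variable([[0.5], [0.5]])

with tf.GradientTape() as tape:
    y = tf.matmul(x, w)            # a tiny "layer": y = x @ w
    loss = tf.reduce_sum(y ** 2)

grad = tape.gradient(loss, w)      # d(loss)/dw, computed automatically
print(y.numpy().ravel())           # [1.5 3.5]
print(grad.numpy().ravel())        # [24. 34.]
```

This gradient machinery is exactly what backpropagation uses when a full network is trained.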


Latest Version: 2.3.1.


2. Keras

Keras is a high-level library that can be imported from Tensorflow. Keras simplifies many tasks, relieving you from writing tons of monotonous code, but it may not be suitable for complicated procedures or tasks.

Since Keras runs on top of Tensorflow, the question arises: what is the difference between the two libraries?

Firstly, Keras is not as complicated as Tensorflow, making it simpler and easier to use. Keras is also a high-level API that works as a wrapper over Tensorflow and Theano. If you are interested in just creating and executing a machine learning model, Keras is for you, but if you also want to understand the deeper intricacies and workings, then Tensorflow is the better choice.

Some applications of Keras are:

  • Image Classification
  • Feature Extraction
  • Fine-tuning and loss computation
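To show how little code a small image classifier takes in Keras, here is a sketch (the architecture below is illustrative, not a recommendation):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny convolutional classifier for 28x28 grayscale images
# (MNIST-sized), ending in 10 output classes.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

From here, a single `model.fit(x_train, y_train, epochs=...)` call would train it — all the gradient and layer plumbing that raw Tensorflow exposes is hidden behind the API.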

Latest Version: 2.4.3.


3. Scikit-Learn

Scikit-Learn is a free, open-source machine learning library for Python built on NumPy, SciPy and Matplotlib.

It provides various supervised and unsupervised machine learning models and is the perfect package for a beginner in ML. The documentation is pretty simple and intuitive and most importantly, compact. Using very few lines, you can train a model and subsequently implement it.

It is one of the best libraries for working with data.

Scikit-Learn helps you with:

  • Classification Models (SVM, Nearest Neighbours, Random Forest, Naive Bayes, Decision Tree, Supervised Neural Networks et cetera)
  • Regression Models (SVM, Nearest Neighbours, Decision Tree, Linear Models, Supervised Neural Networks et cetera)
  • Clustering
  • Dimensionality Reduction
  • Model Selection
  • Preprocessing of data
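"Very few lines" is not an exaggeration — here is a minimal end-to-end sketch (split, fit, score) using the iris dataset that ships with the library:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)                # train the model
print(clf.score(X_test, y_test))         # accuracy on held-out data
```

Every estimator in the library follows this same `fit`/`predict`/`score` pattern, which is why swapping in, say, an SVM is a one-line change.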

Latest Version: 0.23.2.


4. NLTK

When it comes to text or textual analysis, the relevant branch of machine learning is Natural Language Processing, which happens to be NLTK's forte.


NLTK, which stands for Natural Language Toolkit, is a collection of libraries that help you analyze and process text in order to reach meaningful conclusions from text alone.

It must be noted that NLTK is used mainly for processing the data rather than modeling it. After preprocessing, you can use an LSTM, BERT or another deep learning model to train a model that derives results with text as input.

The main features and uses of NLTK are:

  • Stemming and Lemmatization
  • Tokenization
  • Tagging
  • Sentiment Analysis
  • Topic Modeling/Text Classification.
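As a small taste of the first item, here is stemming with NLTK's `PorterStemmer` (which needs no corpus downloads; the word list is arbitrary):

```python
from nltk.stem import PorterStemmer

# Reduce words to their stems before feeding them to a model,
# so "running" and "runs" count as the same token.
stemmer = PorterStemmer()
words = ["running", "flies", "studies", "cats"]
stems = [stemmer.stem(w) for w in words]
print(stems)   # ['run', 'fli', 'studi', 'cat']
```

Note that stems need not be dictionary words ("fli", "studi") — lemmatization is the heavier alternative when you need real lemmas.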

Latest Version: 3.5.


Libraries for Scraping Data

1. Scrapy

Scrapy is an “open source and collaborative framework” that can be used to extract data from websites, i.e., web scraping. It provides a fast and powerful framework to extract information you need from a webpage. It is considered to be the best web crawler for Python.

The best thing about Scrapy is its extensibility, since it can be used to extract data from APIs and functionalities can be plugged in without touching the core.

Why use Scrapy you might ask, well let me tell you why:

  • makes it possible to scrape any website
  • requests are scheduled and processed asynchronously
  • can decode JSON directly from websites that provide JSON data
  • makes use of spider bots that scan web pages and collect structured data.

It must be noted that current Scrapy releases (2.x) require Python 3; only older versions supported Python 2.7.

Latest Version: 2.4.0.


2. BeautifulSoup

It is beautiful, but it definitely isn’t soup.

You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help.

If you go to their website, these will be the first few sentences you see. Honestly, what more is there to say?

Make sure to check out their Hall of Fame page to see some high-profile projects that were made using BeautifulSoup.

BeautifulSoup,

  • is easy to use and master
  • has simple syntax and extraordinarily clear, informative documentation
  • is a smaller library than Scrapy, hence requires minimal setup and less attention.

Also, unlike Scrapy, which is a crawler, BeautifulSoup is an HTML parser: you fetch the page yourself (for example with the requests library) and hand the HTML to BeautifulSoup to extract the information.
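A minimal sketch: hand BeautifulSoup an HTML string (here hard-coded; in practice fetched with, say, requests) and pull data out of it:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Library list</h1>
  <ul>
    <li class="lib">NumPy</li>
    <li class="lib">Pandas</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)                                   # Library list
names = [li.text for li in soup.find_all("li", class_="lib")]
print(names)                                          # ['NumPy', 'Pandas']
```

The `class_` keyword (with the trailing underscore) is BeautifulSoup's way of filtering by CSS class without clashing with Python's `class` keyword.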

Latest Version: 4.9.3.


Best of luck on your journey in data science and thank you for reading :)

Abhinav Kumar

Machine Learning Engineer at Fyllo | Data science enthusiast | I like to write and roll