9 Python Libraries for Data Science You’re Probably Not Using (But Definitely Should)

Discover 9 underrated Python libraries that can supercharge your data workflows and give you a serious edge.

There’s more to data science than pandas and NumPy.

When it comes to data science, the Python ecosystem is a goldmine. While libraries like NumPy, pandas, and scikit-learn dominate the conversation (and rightly so), there’s a whole world of lesser-known libraries quietly doing incredible work behind the scenes.

If you’re looking to step up your data science game and break out of the basics, this list is for you. These libraries may not be the rockstars of the Python world — yet — but they absolutely deserve a spot in your toolkit.


1. Polars — A Lightning-Fast Alternative to pandas

If you’ve ever had to wrangle a massive dataset and watched your machine beg for mercy, Polars is here to save the day. It’s a blazing-fast DataFrame library written in Rust, with a Python API that feels comfortably familiar.

Installation

pip install polars

Example

>>> import polars as pl 
>>> df = pl.DataFrame( 
...     { 
...         "A": [1, 2, 3, 4, 5], 
...         "fruits": ["banana", "banana", "apple", "apple", "banana"], 
...         "B": [5, 4, 3, 2, 1], 
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"], 
...     } 
... ) 
 
# embarrassingly parallel execution & very expressive query language 
>>> df.sort("fruits").select( 
...     "fruits", 
...     "cars", 
...     pl.lit("fruits").alias("literal_string_fruits"), 
...     pl.col("B").filter(pl.col("cars") == "beetle").sum(), 
...     pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"), 
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"), 
...     pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"), 
...     pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"), 
... ) 
shape: (5, 8) 
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐ 
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │ 
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │ 
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │ 
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │ 
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡ 
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │ 
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │ 
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │ 
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │ 
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │ 
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘
Great for data preprocessing, exploratory data analysis, and anything where performance matters.
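
Much of that speed comes from the lazy API: you describe the whole query up front, and Polars optimizes the plan before it ever touches the data. Here's a minimal sketch (the file name and column names are just placeholders):

import polars as pl

# Nothing is read yet: scan_csv only builds a lazy query plan, so Polars can
# push the filter and the column selection down into the scan itself.
lazy_query = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

result = lazy_query.collect()  # the optimized plan executes here
print(result)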

2. Sweetviz — Visualize Data with One Line

Exploratory Data Analysis (EDA) can be tedious. Sweetviz makes it fun — and automatic. Generate high-density, visual EDA reports with just one line of code.

Installation

pip install sweetviz

Example

import pandas as pd
import sweetviz as sv

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

# Generates a self-contained HTML report with per-column statistics and plots
report = sv.analyze(df)
report.show_html()
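
Sweetviz can also compare two datasets in a single report, which is great for spotting train/test skew. A quick sketch, assuming a pandas DataFrame with a binary target column named "target" (both the file and the column are placeholders):

import pandas as pd
import sweetviz as sv
from sklearn.model_selection import train_test_split

# Hypothetical dataset: any pandas DataFrame with a "target" column will do
df = pd.read_csv("my_dataset.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# One report comparing the two splits side by side, focused on the target
report = sv.compare([train_df, "Train"], [test_df, "Test"], target_feat="target")
report.show_html("compare_report.html")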

3. DABL — Data Analysis Made Ridiculously Simple

Created by a core developer of scikit-learn, DABL (Data Analysis Baseline Library) helps you automate model selection, feature type detection, and preprocessing with minimal code.

Installation

pip install dabl

Example

Let’s start with a classic: the titanic.csv file, where the goal is to predict whether a passenger survived based on the other information recorded about them. For tabular data like this, pandas is our friend, so we start by loading the data:

import pandas as pd 
import dabl 
 
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))

Let’s familiarize ourselves with the data a bit: what’s the shape, what are the columns, what do they look like?

>>> titanic.shape 
(1309, 14)
>>> titanic.head()  
   pclass  survived  ... body                        home.dest 
0       1         1  ...    ?                     St Louis, MO 
1       1         1  ...    ?  Montreal, PQ / Chesterville, ON 
2       1         0  ...    ?  Montreal, PQ / Chesterville, ON 
3       1         0  ...  135  Montreal, PQ / Chesterville, ON 
4       1         0  ...    ?  Montreal, PQ / Chesterville, ON 
 
[5 rows x 14 columns]
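
From here, dabl can take over most of the boilerplate. A minimal sketch following dabl's quickstart flow (clean the data, plot it against the target, fit a quick baseline):

# Detect column types and fix common problems (like the "?" placeholders above)
titanic_clean = dabl.clean(titanic, verbose=0)

# One call plots the most informative views of the features against the target
dabl.plot(titanic_clean, target_col="survived")

# A fast baseline classifier to see roughly what accuracy is achievable
model = dabl.SimpleClassifier(random_state=0)
model.fit(titanic_clean, target_col="survived")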

4. Vaex — Handle Billion-Row DataFrames in Memory

Vaex is a DataFrame library built for lazy out-of-core computation. It lets you work with billions of rows — all from your laptop.

Installation

pip install vaex

Example

If your data is already in one of the supported binary file formats (HDF5, Apache Arrow, Apache Parquet, FITS), opening it with Vaex is rather simple:

import vaex 
 
# Reading an HDF5 file
df_names = vaex.open('../data/io/sample_names_1.hdf5')
df_names
#   #  name    age  city
#   0  John     17  Edinburgh
#   1  Sally    33  Groningen
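
From there, everything stays lazy and out-of-core. A small sketch using the example dataset that ships with Vaex: virtual columns are stored as expressions and evaluated on the fly, so they cost essentially no extra memory.

import vaex

# Built-in example dataset bundled with Vaex
df = vaex.example()

# A virtual column: defined by an expression, computed on the fly
df["r"] = (df.x**2 + df.y**2 + df.z**2) ** 0.5

# Aggregations stream over the data in chunks instead of loading it all
print(df.mean(df.r))
print(df.minmax(df.r))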

5. Scikit-Optimize — Smarter Hyperparameter Tuning

Grid search is fine. Random search is better. But Bayesian optimization? That’s where scikit-optimize comes in. It’s a simple and efficient library for minimizing (or maximizing) objective functions — ideal for hyperparameter tuning.

Installation

pip install scikit-optimize

Example

Find the minimum of the noisy function f(x) over the range -2 < x < 2 with skopt:

>>> import numpy as np 
>>> from skopt import gp_minimize 
>>> np.random.seed(123) 
>>> def f(x): 
...     return (np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2)) * 
...             np.random.randn() * 0.1) 
>>> 
>>> res = gp_minimize(f, [(-2.0, 2.0)], n_calls=20) 
>>> print("x*=%.2f f(x*)=%.2f" % (res.x[0], res.fun)) 
x*=0.85 f(x*)=-0.06

For more control over the optimization loop you can use the skopt.Optimizer class:

>>> from skopt import Optimizer 
>>> opt = Optimizer([(-2.0, 2.0)]) 
>>> 
>>> for i in range(20): 
...     suggested = opt.ask() 
...     y = f(suggested) 
...     res = opt.tell(suggested, y) 
>>> print("x*=%.2f f(x*)=%.2f" % (res.x[0], res.fun)) 
x*=0.27 f(x*)=-0.15
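
For hyperparameter tuning specifically, skopt also ships BayesSearchCV, a drop-in replacement for scikit-learn's GridSearchCV that explores the search space with Bayesian optimization instead of exhaustively. A short sketch (the model and search space below are just illustrative):

from skopt import BayesSearchCV
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search space: tuples are numeric ranges, lists are categorical choices
opt = BayesSearchCV(
    SVC(),
    {
        "C": (1e-6, 1e6, "log-uniform"),
        "gamma": (1e-6, 1e1, "log-uniform"),
        "kernel": ["linear", "poly", "rbf"],
    },
    n_iter=32,
    cv=3,
)

opt.fit(X_train, y_train)
print("validation score: %.3f" % opt.best_score_)
print("test score: %.3f" % opt.score(X_test, y_test))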

6. PyJanitor — Clean Your Data Like a Pro

Inspired by the R janitor package, PyJanitor extends pandas with convenient methods for cleaning, organizing, and preprocessing data.

Installation

pip install pyjanitor

Example

Let’s import some libraries and begin with some sample data for this example:

# Libraries 
import numpy as np 
import pandas as pd 
import janitor 
 
# Sample Data curated for this example 
company_sales = { 
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'], 
    'Company1': [150.0, 200.0, 300.0, 400.0], 
    'Company2': [180.0, 250.0, np.nan, 500.0], 
    'Company3': [400.0, 500.0, 600.0, 675.0] 
}

There are three ways to use the API. The first, and most strongly recommended, is to call pyjanitor's functions as if they were native pandas methods.

import janitor  # upon import, functions are registered as part of pandas. 
 
# This cleans the column names as well as removes any duplicate rows 
df = pd.DataFrame.from_dict(company_sales).clean_names().remove_empty()

The second is the functional API.

from janitor import clean_names, remove_empty 
 
df = pd.DataFrame.from_dict(company_sales) 
df = clean_names(df) 
df = remove_empty(df)

The final way is to use the pipe() method:

from janitor import clean_names, remove_empty 
df = ( 
    pd.DataFrame.from_dict(company_sales) 
    .pipe(clean_names) 
    .pipe(remove_empty) 
)
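
Because every registered method hands back the DataFrame, these steps chain naturally with anything else you need. A short sketch continuing from the sample data above (the fill value and the new column name are just for illustration):

import pandas as pd
import janitor  # registers the methods below on pandas DataFrames

df = (
    pd.DataFrame.from_dict(company_sales)
    .clean_names()                                   # 'SalesMonth' -> 'salesmonth'
    .remove_empty()                                  # drop fully empty rows/columns
    .fill_empty(column_names="company2", value=0)    # replace the missing value with 0
    .rename_column("salesmonth", "month")            # friendlier column name
)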

7. Lux — Instantly Visualize Your DataFrames

Tired of manually creating charts just to understand your data? With Lux, you can automatically generate visualizations every time you display a pandas DataFrame in a Jupyter notebook.

Installation

pip install lux-api

Example

To start using Lux, simply add an extra import statement along with your pandas import.

import lux 
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/college.csv") 
df

When the dataframe is printed out, Lux automatically recommends a set of visualizations highlighting interesting trends and patterns in the dataset.
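
You can also steer Lux toward the columns you care about by setting an intent on the DataFrame; the next time it is displayed, the recommendations center on those attributes. The column names below are from the college dataset loaded above:

# Focus the recommendations on two attributes of interest
df.intent = ["AverageCost", "SATAverage"]
df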


8. Evidently — Track Your Data and Model Drift

You trained the perfect model — but how do you know it’s still performing well in production? Evidently helps you monitor data and model quality over time.

Installation

pip install evidently

Example

Import the necessary components:

import pandas as pd 
from evidently import Report 
from evidently import Dataset, DataDefinition 
from evidently.descriptors import Sentiment, TextLength, Contains 
from evidently.presets import TextEvals

Create a toy dataset with questions and answers.

eval_df = pd.DataFrame([ 
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."], 
    ["Who painted the Mona Lisa?", "Leonardo da Vinci."], 
    ["Can you write an essay?", "I'm sorry, but I can't assist with homework."]], 
                       columns=["question", "answer"])

Create an Evidently Dataset object and add descriptors (row-level evaluators). We'll check the sentiment of each response, its length, and whether it contains words indicative of a denial.

eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", items=['sorry', 'apologize'], mode="any", alias="Denials"),
    ],
)

You can view the dataframe with added scores:

eval_dataset.as_dataframe()

To get a summary Report showing the distribution of scores:

report = Report([ 
    TextEvals() 
]) 
 
my_eval = report.run(eval_dataset) 
my_eval 
 
# my_eval.json() 
# my_eval.dict()
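
The same Report interface covers the drift monitoring promised in the title. A minimal sketch, assuming the DataDriftPreset from evidently.presets and that report.run takes the current dataset first and the reference dataset second; the two toy DataFrames stand in for your production and training data:

from evidently.presets import DataDriftPreset

# Placeholder data: "reference" is what the model was trained on,
# "current" is what it sees in production
reference_df = pd.DataFrame({"feature": [1, 2, 3, 4, 5]})
current_df = pd.DataFrame({"feature": [4, 5, 6, 7, 9]})

reference = Dataset.from_pandas(reference_df, data_definition=DataDefinition())
current = Dataset.from_pandas(current_df, data_definition=DataDefinition())

# Compare column distributions and flag the ones that drifted
report = Report([DataDriftPreset()])
drift_eval = report.run(current, reference)
drift_eval  # renders the drift summary in a notebook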

9. DTale — A GUI for Your pandas DataFrames

DTale lets you inspect and edit pandas DataFrames via a slick web interface. It’s like having a lightweight spreadsheet tool right inside your Python environment.

Installation

pip install dtale

Example

import dtale 
import pandas as pd 
 
df = pd.DataFrame([dict(a=1,b=2,c=3)]) 
 
# Assigning a reference to a running D-Tale process. 
d = dtale.show(df) 
 
# Accessing data associated with D-Tale process. 
tmp = d.data.copy() 
tmp['d'] = 4 
 
# Altering data associated with D-Tale process 
# FYI: this will clear any front-end settings you have at the time for this process (filter, sorts, formatting) 
d.data = tmp 
 
# Get raw dataframe w/ any sorting or edits made through the UI 
d.data 
 
# Get raw dataframe similar to '.data' along with any filters applied using the UI 
d.view_data 
 
# Shutting down D-Tale process 
d.kill() 
 
# Using Python's `webbrowser` package it will try and open your server's default browser to this process. 
d.open_browser() 
 
# There is also some helpful metadata about the process. 
d._data_id  # The process's data identifier. 
d._url  # The url to access the process. 
 
d2 = dtale.get_instance(d._data_id)  # Returns a new reference to the instance running at that data_id. 
 
dtale.instances()  # Prints a list of all ids & urls of running D-Tale sessions.

Final Thoughts

The Python ecosystem for data science is massive — and constantly evolving. While pandas, NumPy, and scikit-learn will always be essential, these lesser-known libraries can seriously level up your workflow, especially as your projects grow in complexity.

Try a few of these out on your next project — your future self (and your CPU) will thank you.


Enjoyed the read?
Clap, comment, or share your favorite underrated Python library — and let’s help each other discover the hidden gems of data science!