How to Work with Pandas in Python


When we talk about data science, we usually refer to analyzing data through summarization, visualization, sophisticated algorithms that learn patterns in data (machine learning), and other fancy tools. When we discuss the term with software developers, we also hear a lot about Python, the popular programming language.

But why is Python so popular and special in the data science world? There are many reasons, and an important one is the Python ecosystem: the libraries that make data science feel natural in Python.

One of these libraries is pandas, which every data scientist in the world uses, has used, or has at least heard of (if you are a data scientist who has never used pandas, scream in the comments).

Pandas is an essential part of the ecosystem: many other data science tools build on top of it or provide specific functionality for pandas.

This guide introduces pandas for developers and aims to cover the what, why, and how of pandas’ most commonly used features.

Before we get started: if you want to access the full source code for this project to follow along, you can download it from GitHub.

What Is Pandas For?

Pandas, in the past couple of years, has established itself as one of the best libraries for data analysis. Such is its importance and power that one of the first things data scientists do when receiving a data set is to load it into pandas to begin understanding it.

Pandas provides everything they need for cleaning, transforming, and analyzing data. It offers so many features that entire books have been written about them.

Here are some of the things you can do with pandas:

  • Describe: get information about the data set, calculate statistical values, and answer immediate questions about averages, medians, min, max, correlations, distributions, and more.
  • Clean: remove duplicates, replace empty values, and filter rows and columns.
  • Transform: calculate new values, rename columns, and mutate your data.
  • Visualize: build state-of-the-art visualizations with matplotlib, seaborn, or others directly from your pandas dataset.
  • Store: load and save data from and to files like CSV and JSON, or connect directly to databases.

Install and Import

Pandas is an external library and, as such, requires installation for your project. The library name is pandas, and you can install it using your favorite package manager for Python. In my case, I'm using pipenv, as I'm not a big fan of conda, but the process is the same either way.

If you checked out my code from GitHub, simply run the following command, as all dependencies are listed in the Pipfile:

pipenv install

If you want to follow along with this tutorial from a new project, you can simply install pandas by running:

pipenv install pandas

And I highly recommend using a Jupyter notebook, as I did, for experimenting and testing on the data, so you will have to install that as well by running:

pipenv install notebook

If you need more information about how to use Jupyter notebooks, you can check this guide.

The last thing before starting is to import the library.

import pandas as pd

It's not mandatory to provide an alias, but pd is the convention: it keeps the code short, and it's how everyone does it, so let's follow along.

Basis of Pandas: Series and DataFrames

Pandas has two basic building blocks for all its functionality: Series and DataFrames. In simple terms, a Series is a single column, and a DataFrame is a two-dimensional table made up of multiple Series.

Series and DataFrames

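As a quick illustration (the column names and values here are made up), you can build a Series and a DataFrame by hand:

```python
import pandas as pd

# A Series is a single labeled column of values.
ages = pd.Series([22, 38, 26], name="Age")

# A DataFrame is a table: a collection of Series sharing the same index.
people = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Age": [22, 38, 26],
})
```

Selecting a column from a DataFrame, e.g. people["Age"], gives you back a Series.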

Reading Data

Though you can manually create DataFrames in pandas, what makes it easy to use is its ability to directly load data from multiple sources such as CSV, JSON, or databases.

From a CSV file

Reading CSV files is as simple as one line of code with pandas.

df_titanic = pd.read_csv("titanic.csv")
df_titanic.head()

Importing a CSV file is as simple as calling the read_csv function with the file name. The second line calls the head function, which prints the first five rows of data. More on that later.

From a JSON file

Reading JSON files is no more complicated than reading a CSV; it's a matter of a simple function call:

df_json = pd.read_json("sample.json")
0    Desktop Computer    900

From a SQL database

Reading data from a SQL database requires some additional steps compared to reading CSV or JSON files. This is because pandas doesn’t support databases natively but instead relies on third-party libraries to establish the connection.

Once the connection to the database is ready, you can work directly with pandas using the function read_sql_query.

For our scenario, we will read data from SQLite, but other database engines are supported, such as MySQL, Postgres, etc.

Note: the sqlite3 module used below ships with Python's standard library, so no extra installation is needed for SQLite. Other database engines require a third-party driver.

As discussed, first, we need to establish a connection to the database.

import sqlite3
con = sqlite3.connect("chinook.db")

Later, we can directly run SQL queries in pandas to retrieve information. If you need help with SQL, I can recommend the complete SQL bootcamp course on Udemy.

In our case, we will just fetch data from one table.

df_sql = pd.read_sql_query("SELECT * FROM albums", con)
   AlbumId                                  Title  ArtistId
0        1  For Those About To Rock We Salute You         1
1        2                      Balls to the Wall         2
2        3                      Restless and Wild         2
3        4                      Let There Be Rock         1
4        5                               Big Ones         3


So far, we have read data using the default settings, which work great for most cases. However, there may be circumstances in which you need to change how the data is read and parsed. Pandas' reading functions are comprehensive and offer tons of customization; you can read about them in the official docs.

One important and frequently used parameter is index_col, which allows you to specify the column or columns used as the index of the DataFrame. This is very useful when slicing or selecting data, as we will do later in the guide.

df_drinks = pd.read_csv("data/drinks.csv", index_col="country")

Saving Data in Pandas

Sometimes we perform changes to our datasets to facilitate the data analysis. However, those changes are lost unless we save them. Pandas provides simple methods to store a DataFrame into a file or database.

df_sql.to_sql('saved', con)

Note that we can store the data in any supported format, regardless of the format it was originally loaded from.
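For example, a DataFrame loaded from SQL can be written back out as CSV or JSON; each to_* method mirrors a read_* function (the file names and data here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"AlbumId": [1, 2], "Title": ["Big Ones", "Restless and Wild"]})

# Write to different formats.
df.to_csv("albums.csv", index=False)   # index=False skips the row labels
df.to_json("albums.json", orient="records")

# Round-trip: reading the CSV back yields the same data.
df_back = pd.read_csv("albums.csv")
```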

Describing Your Data

When you first load your data set, you want to understand its structure and start making sense of the data. In this section, we will cover basic pandas functions that will help you do exactly that.

The first function we will cover is head, which we already introduced. It prints the first n rows of data, and it is handy for exploring the columns and types of a dataset for the first time.

You can run this function with no arguments, in which case it will default to 5 rows:

df_titanic.head()

Or you can pass the number of rows you would like to see:

df_titanic.head(10)

If you want to see the very last rows, you can use the tail function. As with head, you can provide how many rows you would like to see:

df_titanic.tail(3)

Another interesting function you probably need to run after loading your data is info, which provides essential information about your dataset, such as the number of rows and columns, the number of non-null values, data types, and memory usage.

df_titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

If you just want to know how many rows and columns your data has, you can read the shape attribute of the DataFrame:

df_titanic.shape

(note that shape is an attribute and not a function)

(891, 12)

If you are looking to understand your dataset's values better, the describe function will provide a statistical summary of each column, with information like count, mean, standard deviation, etc.

df_titanic.describe()

Manipulating DataFrames

DataFrames expose hundreds of methods for working on data sets, way too many to know them all. However, some operations come up frequently during data analysis tasks; we will review the ones that are crucial for every data scientist to understand.

Removing duplicates

Duplicates can distort our results, so it's essential to deal with them. Pandas offers utilities to handle duplicates, like drop_duplicates, which automatically removes all duplicate rows from the DataFrame.

df_movies = pd.read_csv('movies.csv')
df_movies = df_movies.drop_duplicates()

Note that for this example we loaded a new file that contains two duplicates; here are the shapes before and after:

(77, 8)
(75, 8)

Keep in mind that drop_duplicates won't affect the original DataFrame by default; instead, it returns a new DataFrame with unique values. You can alternatively pass the argument inplace=True to modify the original.
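drop_duplicates also accepts a subset parameter to look for duplicates in specific columns only, and keep to control which copy survives. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Avatar", "Avatar", "Titanic"],
    "year":  [2009, 2010, 1997],
})

# No full-row duplicates here, so nothing is dropped.
unique_all = df.drop_duplicates()

# Consider only the title column: the second "Avatar" row is dropped.
unique_by_title = df.drop_duplicates(subset="title", keep="first")
```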

Renaming columns

There are several ways to rename DataFrame columns in pandas, but one of the easiest and most useful is the rename function:

df_tmp = df_titanic.rename(columns = {"PassengerId" : "PassId"})

Like drop_duplicates, rename returns a new DataFrame with the updates.

rename also supports renaming more than one column at a time:

df_tmp = df_titanic.rename(columns = {"PassengerId" : "PassId", "Pclass": "PassClass"})

Working with missing values

Missing data in real-life scenarios can be a big problem. Fortunately, pandas is designed to detect missing or NA (Not Available) values. To detect them, we use the isnull() function:

df_titanic.isnull()

The isnull function returns a new DataFrame with the same shape as the original, containing boolean values (True or False) indicating whether each cell is null. This can be very helpful for performing calculations or replacing empty values.
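A common trick is to chain isnull() with sum() to count the missing values per column (the data here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":   [22, np.nan, 26, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# True counts as 1, so summing the boolean frame counts the nulls.
missing_per_column = df.isnull().sum()
# Age      2
# Cabin    3
```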

Sometimes we can't work with missing values, so it's best to simply remove the affected rows entirely. You can achieve that with the dropna function.

tmp = df_titanic.dropna()

Which results in:

(891, 12)
(183, 12)

Judging by the results, most of our records contain at least one empty column, so it wouldn't be wise in this scenario to simply drop all of those rows from the DataFrame.
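In cases like this, dropna can be scoped with the subset parameter (drop a row only when specific columns are missing), or you can keep every row and fill the gaps with fillna. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":   [22, np.nan, 26],
    "Cabin": [np.nan, "C85", np.nan],
})

# Drop rows only when Age is missing; missing Cabin values are tolerated.
with_age = df.dropna(subset=["Age"])

# Or keep every row and replace missing ages with the column mean.
filled = df.fillna({"Age": df["Age"].mean()})
```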

Slicing, selecting, extracting data

So far, we have been working with entire datasets or performing basic filtering like removing empty rows. But how can we intentionally select data from the DataFrames?

There are two ways to access or slice a pandas DataFrame: by column or by row.

By column

Extracting a column from a DataFrame is super easy: you simply use [] with the series (column) name, as follows:

ds_survived = df_titanic['Survived']

In which case it would output:

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64


Note that the result is a Series and not a DataFrame. Because some methods and properties differ between the two, sometimes we want a DataFrame instead; we can get one by using the [[]] notation:

df_survived = df_titanic[['Survived']]

Which outputs:


By row

When selecting data by row, we have more options than by column. We can access a row by index, or search the dataset by applying conditionals.

We will start with the loc property, which allows access to a group of rows and columns by label(s) or a boolean array.

Note that to use loc with labels, you need to have your DataFrame indexed, as we did for the drinks DataFrame above.

df_drinks.loc['Argentina']

Additionally, consider that looking up by label requires an exact match and is case sensitive: in our case, Argentina is correct, but Arg or argentina will raise an exception.

The output for the code is:

beer_servings                   193
spirit_servings                  25
wine_servings                   221
total_litres_of_pure_alcohol    8.3
continent                        SA
Name: Argentina, dtype: object

Interestingly enough, you can also access multiple rows at once using the same slicing techniques we use for lists. For example:

df_drinks.loc['Argentina': 'Austria']

This will return a new DataFrame with all the results between Argentina and Austria inclusive.


How do you filter data without labels? You can use positional indices with the iloc property, which is equivalent to slicing a Python list.

df_drinks.iloc[6:10]

Which will produce the same result as above, selecting from index 6 (Argentina) until index 10-1 (Austria).

It is also possible to use an increment (step), the same way as we do with lists:

df_drinks.iloc[6:10:2]

Sometimes you need to show only the rows that meet a condition, like displaying the countries with more than 320 beer servings. For that, you can use conditional selections.

Here is that example in code:

df_drinks.loc[df_drinks['beer_servings'] > 320]

Let's break it down. df_drinks['beer_servings'] > 320 is a vectorized conditional: it applies to a complete Series. The result is a new boolean Series holding the result of the conditional for each row in the original DataFrame.

Then, by indexing df_drinks[] with a boolean Series, we get back a new DataFrame, filtering out all the rows where the conditional was False.

I wasn’t too surprised about the results, except for the case of Venezuela and Namibia.

                beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
country
Czech Republic            361              170            134                          11.8        EU

Note: in some other tutorials, you may find the same code but with .loc left out, e.g.:

df_drinks[df_drinks['beer_servings'] > 320]

And though that code works in this case, there is a corner case in which it won't, so it's better to be explicit and use .loc. You can read more about this in this StackOverflow thread.
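Conditions can also be combined with & (and) and | (or); each condition must be wrapped in parentheses because of operator precedence. A sketch (the thresholds and the two-country frame are illustrative, using the Argentina and Czech Republic values seen above):

```python
import pandas as pd

df_drinks = pd.DataFrame(
    {"beer_servings": [193, 361], "wine_servings": [221, 134]},
    index=["Argentina", "Czech Republic"],
)

# Countries with high beer AND at least moderate wine consumption.
high_both = df_drinks.loc[
    (df_drinks["beer_servings"] > 300) & (df_drinks["wine_servings"] > 100)
]

# Countries with high beer OR high wine consumption.
high_either = df_drinks.loc[
    (df_drinks["beer_servings"] > 300) | (df_drinks["wine_servings"] > 200)
]
```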


Statistical Calculations

Statistics and math are among the most important skills a data scientist should have. Fortunately, pandas makes it easy to perform statistical calculations on our dataset.

Aggregate Statistics

This section will cover how to perform some statistical calculations on the data set, such as with the mean, groupby, and count functions.

Start with the mean() function, which returns the mean of each numeric column:

df_drinks.mean(numeric_only=True)

beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64

The mean() function returned the mean of only the numerical columns, which makes sense, since you can't perform a statistical calculation on text.

We can also perform this calculation at the Series level, in which case we get back a single number:

df_drinks['beer_servings'].mean()

Let's say you want to calculate the mean of a column for each group in your data, for example the average beer servings per continent. For that, pandas provides the groupby() function:

df_drinks.groupby('continent')['beer_servings'].mean()

AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64

By using groupby we can perform calculations grouped by sets of data, for example the mean of each continent.

In the same way as mean, you can calculate the median, min, max, sum, and much more.
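These all follow the same pattern; a quick sketch with a few made-up servings values:

```python
import pandas as pd

df = pd.DataFrame(
    {"beer_servings": [193, 361, 25]},
    index=["Argentina", "Czech Republic", "Mongolia"],
)

med = df["beer_servings"].median()  # 193.0
low = df["beer_servings"].min()     # 25
high = df["beer_servings"].max()    # 361
total = df["beer_servings"].sum()   # 579
```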

Another popular task is counting the number of rows or columns, which can be done with count.

df_drinks.count(axis=0)

beer_servings                   193
spirit_servings                 193
wine_servings                   193
total_litres_of_pure_alcohol    193
continent                       170
dtype: int64

By specifying axis=0 (the default value), we count the number of non-null rows per column. However, we can also invert the axis (axis=1) and count the number of non-null columns per row.

df_drinks.count(axis=1)

Afghanistan    5
Albania        5
Algeria        5
Andorra        5
Angola         5
Venezuela      5
Vietnam        5
Yemen          5
Zambia         5
Zimbabwe       5
Length: 193, dtype: int64

Common functions

This section will discuss some of the most common statistical functions in pandas, a must-have in any data scientist’s arsenal.

Let's start with the cov function, which calculates the pairwise covariance of the columns:

df_tmp = pd.DataFrame({"A":[6, 8, 3, 4],
                   "B":[51, 2, 6, 3],
                   "C":[7, 2, 9, 5],
                   "D":[5, 4, 4, 22]})
df_tmp.cov()

The cov function has calculated the covariance for every pair of numeric columns.

Another commonly used pandas function is pct_change, which calculates the percentage change of each value compared to the previous element.

ind = pd.date_range('01/01/2020', periods=7, freq='W')
df_tmp = pd.DataFrame({"A":[4, 14, 3, 6, 2, 55, 33],
                   "B":[5, 2, 54, 34, 2, 32, 56],
                   "C":[20, 20, 17, 31, 8, 5, 3],
                   "D":[14, 3, 6, 2, 3, 4, 12]}, index=ind)
df_tmp.pct_change()

First, we created a date index using the date_range() function and a DataFrame containing random values for demonstration purposes, and finally applied the percentage change function to that data.

Now let's see how to calculate the relationship between two values, known as the correlation. The function is called corr(), and it calculates the pairwise relationship of all columns in your DataFrame, excluding null (not available) values and non-numeric data such as text.

df_tmp = pd.DataFrame({"A":[6, 8, 3, 4],
                   "B":[51, 2, 6, 3],
                   "C":[7, 2, 9, 5],
                   "D":[5, 4, 4, 22]})
df_tmp.corr()

We’ve used the previous random values of the covariance example to calculate the correlation of the columns.

Ranking your data by a specific criterion, like ranking the movies dataset by rating score, is one of the most common tasks when dealing with datasets. The example below uses the rank function in pandas:

df_movies['revenue_rank'] = df_movies["Worldwide Gross"].rank()

We've created a new column called revenue_rank with the ranking assigned by the rank function based on the "Worldwide Gross" column.


Plotting Data

Pandas can't plot by itself, but it provides helper methods that use plotting engines such as matplotlib or seaborn to perform the task.

For our guide, we will use matplotlib. If you are running the code on your own, make sure that you have the library installed. If you are using the code from GitHub, it is already a project dependency.

Scatter plot

A scatter plot is a diagram that displays points based on two variables, and luckily pandas makes it easy to create one with a simple call to the plot() function:

colors = df_titanic["Survived"].replace(1, 'green').replace(0, 'red')
df_titanic.plot(x='Fare', y="Age", c=colors, colormap="Reds", kind="scatter")
Sample scatter plot


The first chart shows survival status spread by age and fare, painting those who did not survive in red and the others in green. To set the colors, it is necessary to generate a color map depending on the status; this can be done in a better way, which we will analyze later.

The plot function is doing all the heavy work by rendering the chart.

Line plot

A line chart is a series of values connected with a line, and it is one of the most used chart types. Pandas is also able to display this kind of plot very easily. See the following code:

df_titanic.loc[df_titanic["Survived"] == 1].groupby("Age")['Survived'].count().plot()
Sample line plot


A line plot showing the number of survivors by age.

Bar plot

The bar plot is one of the most common charts in the data visualization field; it represents categorical data with rectangular bars. Pandas offers this kind of chart as well. See the following code:
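As a minimal sketch (using a made-up Pclass column rather than the full Titanic file, and the headless Agg backend so no display is needed), you could count passengers per class and plot the counts as bars:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; assumption for running without a display

import pandas as pd

# Illustrative data standing in for the Titanic Pclass column.
df_titanic = pd.DataFrame({"Pclass": [1, 3, 3, 2, 3, 1]})

# Count passengers per class and draw the counts as bars.
class_counts = df_titanic["Pclass"].value_counts().sort_index()
ax = class_counts.plot(kind="bar")
```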

Sample bar plot


You can also build histograms.
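For instance, a histogram of ages (again a sketch with made-up values and the headless Agg backend) can be drawn directly from a Series:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; assumption for running without a display

import pandas as pd

# Illustrative ages standing in for the Titanic Age column.
df_titanic = pd.DataFrame({"Age": [22, 38, 26, 35, 35, 54, 2, 27]})

# Bucket the ages into 5 bins and draw the frequencies.
ax = df_titanic["Age"].plot(kind="hist", bins=5)
```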

Sample histogram plot


Or much more complex charts. Here is an interesting article on building charts for this particular dataset.


Conclusion

Data scientists first need to explore, clean, and transform their data before moving on to visualization. Pandas makes it easy to perform these actions with just a few simple commands, and it can even plot the data instead of relying directly on libraries such as matplotlib and seaborn. Still, you may find yourself needing to learn those visualization libraries for more complex charts.

You can learn more on how to use pandas, plotting and much more on the Data Science Bootcamp on Udemy.

Thanks for reading!

If you liked what you saw, please support my work!

Juan Cruz Martinez - Author @ Live Code Stream


Juan has made it his mission to help aspiring developers unlock their full potential. With over two decades of hands-on programming experience, he understands the challenges and rewards of learning to code. By providing accessible and engaging educational content, Juan has cultivated a community of learners who share their passion for coding. Leveraging his expertise and empathetic teaching approach, Juan has successfully guided countless students on their journey to becoming skilled developers, transforming lives through the power of technology.