How to Work with Pandas in Python


When we talk about data science, we usually refer to analyzing data through summarization, visualization, sophisticated algorithms that learn patterns in data (machine learning), and other fancy tools. When we discuss the term with software developers, we also hear a lot about Python, the popular programming language.

But why is Python so popular and special in the data science world? There are many reasons, and an important one is Python's ecosystem of libraries that make data science feel natural in the language.

One of these libraries is pandas, which every data scientist in the world uses, has used, or has at least heard of (if you are a data scientist who has never used pandas, scream in the comments).

Pandas is an essential part of the ecosystem: many other data science tools either build on top of it or provide specific functionality for it.

This guide introduces pandas for developers and aims to cover the what, why, and how of pandas’ most commonly used features.

Before we get started: if you want to follow along, you can download the project's full source code from GitHub.


What Is Pandas For?

Pandas has, in the past couple of years, established itself as one of the best libraries for data analysis. Such is its importance and power that one of the first things data scientists do when receiving a data set is to load it into pandas to start understanding it.

Pandas provides them with everything they need for cleaning, transforming, and analyzing data. It offers so many features for analysts that entire books have been written about them.

Here are some of the things you can do with pandas:

  • Describe: get information about the data set, calculate statistical values, and answer immediate questions about averages, medians, min, max, correlations, distributions, and more.
  • Clean: remove duplicates, replace empty values, and filter rows and columns.
  • Transform: calculate new values, rename columns, and otherwise mutate your data.
  • Visualize: build state-of-the-art visualizations with matplotlib, seaborn, or others, directly from your pandas DataFrame.
  • Store: load and save data from and to files like CSV and JSON, or connect directly to databases.

Install and Import

Pandas is an external library and, as such, requires installation for your project. The library name is pandas, and you can install it using your favorite package manager for Python. In my case, I'm using pipenv as I'm not a big fan of conda, but the process is the same regardless.

If you checked out my code from GitHub, simply run the following command, as all dependencies are listed in the Pipfile:

pipenv install

If you want to follow along with this tutorial from a new project, you can simply install pandas by running:

pipenv install pandas

I also highly recommend using a Jupyter notebook, as I did, for experimenting with and testing the data, so you will have to install it as well by running:

pipenv install notebook

If you need more information about how to use Jupyter notebooks, you can check this guide .

The last thing before starting is to import the library.

import pandas as pd

It's not mandatory to provide an alias, but pd is the conventional one: it keeps the code short, and it's how almost everyone does it, so let's follow along.


Basis of Pandas: Series and DataFrames

Pandas has two basic building blocks for all its functionality: Series and DataFrames. In simple terms, a Series is a single column, and a DataFrame is a two-dimensional table made up of multiple Series.

Series and DataFrames
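
Before loading any real data, it can help to see how these two building blocks are created by hand. Here is a minimal sketch (the names and values are made up purely for illustration):

# A Series is a single labeled column of values
ages = pd.Series([22, 38, 26], name="Age")

# A DataFrame is a table assembled from one or more Series (columns)
people = pd.DataFrame({"Name": ["Alice", "Bob", "Carol"], "Age": ages})
people.head()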


Reading Data

Though you can manually create DataFrames in pandas, what makes it so easy to use is its ability to load data directly from multiple sources such as CSV and JSON files or databases.

From a CSV file

Reading CSV files is as simple as one line of code with pandas.

df_titanic = pd.read_csv("titanic.csv")
df_titanic.head()
   PassengerId  Survived  Pclass     Fare Cabin Embarked
0            1         0       3   7.2500   NaN        S
1            2         1       1  71.2833   C85        C
2            3         1       3   7.9250   NaN        S
3            4         1       1  53.1000  C123        S
4            5         0       3   8.0500   NaN        S

Importing a CSV file is as simple as calling the read_csv function with the file name. The second line of code calls the head function, which prints the first five rows of data. More on that later.

From a JSON file

Reading JSON files is no more complicated than reading a CSV; it's a matter of a simple function call:

df_json = pd.read_json("sample.json")
df_json.head()
            Product  Price
0  Desktop Computer    900
1            Tablet    360
2            iPhone    700
3            Laptop   1460

From a SQL database

Reading data from a SQL database requires some additional steps compared to reading CSV or JSON files. This is because pandas doesn’t support databases natively but instead relies on third-party libraries to establish the connection.

Once the connection to the database is ready, you can work directly with pandas using the function read_sql_query.

For our scenario, we will read data from SQLite, but other database engines are supported as well, such as MySQL, Postgres, etc.

Note: If you are running this code on your own and not using the tutorial’s provided project, make sure you have pysqlite3 installed before continuing.

As discussed, first, we need to establish a connection to the database.

import sqlite3
con = sqlite3.connect("chinook.db")

Later, we can directly run SQL queries in pandas to retrieve information. If you need help with SQL, I can recommend the complete SQL bootcamp course on Udemy.

In our case, we will just fetch data from one table.

df_sql = pd.read_sql_query("SELECT * FROM albums", con)
df_sql.head()
   AlbumId                                  Title  ArtistId
0        1  For Those About To Rock We Salute You         1
1        2                      Balls to the Wall         2
2        3                      Restless and Wild         2
3        4                      Let There Be Rock         1
4        5                               Big Ones         3

Customization

So far, we read data using the default settings, which work great for most cases. However, there may be circumstances in which you need to change how the data is read and parsed. Pandas' reading functions are very complete and offer tons of customization. You can read about the options in the official docs.

One important and highly used parameter is index_col, which lets you specify the column (or columns) to use as the index of the DataFrame. This is very useful when slicing or selecting data, as we will do later in the guide.

df_drinks = pd.read_csv("data/drinks.csv", index_col="country")
df_drinks.head()
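
index_col is just one of many options. As a small sketch, here are a few other commonly used read_csv parameters (the na_values strings below are hypothetical placeholders, not values known to be in the real file):

df_drinks_custom = pd.read_csv(
    "data/drinks.csv",
    index_col="country",                                    # column to use as the index
    usecols=["country", "beer_servings", "wine_servings"],  # read only a subset of columns
    na_values=["?", "N/A"],                                 # extra strings to treat as missing
)
df_drinks_custom.head()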

Saving Data in Pandas

Sometimes we perform changes to our datasets to facilitate the data analysis. However, those changes are lost unless we save them. Pandas provides simple methods to store a DataFrame into a file or database.

df.to_csv('saved.csv')
df.to_json('saved.json')
df.to_sql('saved', con)

Note that we can save the data to any of these formats regardless of the format we originally loaded it from.
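
The saving methods accept options too. Here is a sketch of a couple of commonly used ones (the file and table names are only examples):

df_drinks.to_csv("drinks_clean.csv", index=False)                     # don't write the index as a column
df_sql.to_sql("albums_copy", con, if_exists="replace", index=False)   # overwrite the table if it already exists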


Describing Your Data

When you first load your data set, you want to understand its structure and start making sense of the data. In this section, we will cover basic pandas functions that help you do exactly that.

The first function we will cover is head, which we already introduced. It prints the first n rows of data and is handy for exploring the columns and types of a dataset for the first time.

You can run this function with no arguments, in which case it defaults to 5 rows:

df_titanic.head()

Or you can pass the number of rows you would like to see:

df_titanic.head(10)
   PassengerId  Survived  Pclass     Fare Cabin Embarked
0            1         0       3   7.2500   NaN        S
1            2         1       1  71.2833   C85        C
2            3         1       3   7.9250   NaN        S
3            4         1       1  53.1000  C123        S
4            5         0       3   8.0500   NaN        S
5            6         0       3   8.4583   NaN        Q
6            7         0       1  51.8625   E46        S
7            8         0       3  21.0750   NaN        S
8            9         1       3  11.1333   NaN        S
9           10         1       2  30.0708   NaN        C

If you want to see the very last rows, you can use the function tail. As with head, you can provide how many rows you would like to see:

df_titanic.tail()
     PassengerId  Survived  Pclass     Fare Cabin Embarked
886          887         0       2  13.0000   NaN        S
887          888         1       1  30.0000   B42        S
888          889         0       3  23.4500   NaN        S
889          890         1       1  30.0000  C148        C
890          891         0       3   7.7500   NaN        Q

Another interesting function you will probably want to run after loading your data is info, which provides essential information about your dataset, such as the number of rows and columns, the number of non-null values, the data types, and the memory usage.

df_titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

If you just want to know how many rows and columns your data has, you can use the shape attribute of the DataFrame:

df_titanic.shape

(note that shape is an attribute and not a function)

(891, 12)

If you are looking to understand your dataset’s values better, the function describe will provide a statistical summary of each column, with information like count, mean, standard deviation, etc.

df_titanic.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Manipulating DataFrames

DataFrames expose hundreds of methods for working with data sets, far too many to know them all. However, some operations come up over and over during data analysis tasks, and in this section we will review the ones that are crucial for every data scientist to understand.

Removing duplicates

Duplicates can distort our results, so it's essential to deal with them. Pandas offers utilities to handle duplicates, like drop_duplicates, which automatically removes duplicate rows from the DataFrame.

df_movies = pd.read_csv('movies.csv')
df_movies.shape
df_movies = df_movies.drop_duplicates()
df_movies.shape

Note that for our example, we loaded a new file that contains two duplicates, and here are the results:

(77, 8)
(75, 8)

Keep in mind that drop_duplicates won’t affect the original DataFrame (by default), and instead, it will return a new DataFrame with unique values. You can alternatively pass the argument inplace=True to modify the original instead.
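
drop_duplicates also accepts parameters to fine-tune what counts as a duplicate. For example, here is a sketch assuming the movies dataset has a Title column (the column name is an assumption for illustration):

# Treat rows with the same Title as duplicates and keep only the first occurrence
df_movies_unique = df_movies.drop_duplicates(subset=["Title"], keep="first")
df_movies_unique.shape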

Renaming columns

There are several ways to rename DataFrame columns in pandas, but one of the easiest and most useful is the rename function:

df_tmp = df_titanic.rename(columns = {"PassengerId" : "PassId"})
df_tmp.head()
   PassId  Survived  Pclass     Fare Cabin Embarked
0       1         0       3   7.2500   NaN        S
1       2         1       1  71.2833   C85        C
2       3         1       3   7.9250   NaN        S
3       4         1       1  53.1000  C123        S
4       5         0       3   8.0500   NaN        S

Similar to drop_duplicates, rename returns a new DataFrame with the updates.

rename also supports renaming more than one column at a time:

df_tmp = df_titanic.rename(columns = {"PassengerId" : "PassId", "Pclass": "PassClass"})
df_tmp.head()
   PassId  Survived  PassClass     Fare Cabin Embarked
0       1         0          3   7.2500   NaN        S
1       2         1          1  71.2833   C85        C
2       3         1          3   7.9250   NaN        S
3       4         1          1  53.1000  C123        S
4       5         0          3   8.0500   NaN        S

Working with missing values

Missing data in real-life scenarios can be a big problem. Fortunately, pandas is designed to detect missing or NA (not available) values. To detect the null values, we use the isnull() function:

df_titanic.isnull()
     PassengerId  Survived  Pclass  Ticket   Fare  Cabin  Embarked
0          False     False   False   False  False   True     False
1          False     False   False   False  False  False     False
2          False     False   False   False  False   True     False
3          False     False   False   False  False  False     False
4          False     False   False   False  False   True     False
..           ...       ...     ...     ...    ...    ...       ...
886        False     False   False   False  False   True     False
887        False     False   False   False  False  False     False
888        False     False   False   False  False   True     False
889        False     False   False   False  False  False     False
890        False     False   False   False  False   True     False

The isnull function returns a new DataFrame with the same shape as the original, filled with boolean values (True or False) indicating whether each particular cell is null. This can be very helpful for performing calculations or for replacing empty values.
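
For example, a common pattern (a sketch, not part of the tutorial's project code) is to count the missing values per column with sum, or to replace them with fillna:

# Number of missing values in each column
df_titanic.isnull().sum()

# Replace missing ages with the median age (returns a new DataFrame by default)
df_filled = df_titanic.fillna({"Age": df_titanic["Age"].median()})
df_filled.isnull().sum()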

Sometimes we can’t work with missing values, so it’s best to simply remove the entire rows. You can achieve that with the function dropna.

df_titanic.shape
tmp = df_titanic.dropna()
tmp.shape

Which results in:

(891, 12)
(183, 12)

Judging by the results, most of our records contain at least one empty column. It wouldn’t be wise in this scenario to simply drop all these rows from the DataFrame.
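
A gentler option, sketched below, is to drop rows only when specific critical columns are missing, using the subset parameter of dropna:

# Keep rows unless Age or Embarked is missing
tmp = df_titanic.dropna(subset=["Age", "Embarked"])
tmp.shape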

Slicing, selecting, extracting data

So far, we have been working with entire datasets or performing basic filtering like removing empty rows. But how can we intentionally select data from the DataFrames?

There are two ways to access or slice a pandas DataFrame, either by column or by rows.

By column

Extracting a column from a DataFrame is super easy. You simply use [] with the Series (column) name, as follows:

ds_survived = df_titanic['Survived']
ds_survived
type(ds_survived)

In which case it would output:

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

pandas.core.series.Series

Note that the result is a Series and not a DataFrame. Because some methods and properties differ between the two, sometimes we want to transform a Series into a DataFrame; we do that by using the [[]] notation:

df_survived = df_titanic[['Survived']]
type(df_survived)

Which outputs:

pandas.core.frame.DataFrame
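
The same double-bracket notation also lets you select several columns at once, which again returns a DataFrame. A quick sketch:

df_subset = df_titanic[['Survived', 'Age', 'Fare']]
df_subset.head()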

By row

When selecting data by row, we have more options than by column. We can access rows by index, or we can search the dataset by querying it or applying conditionals.

We will start with the property loc, allowing access to a group of rows and columns by label(s) or a boolean array.

Note that to use loc with labels, your DataFrame needs to be indexed, as we did for the drinks DataFrame above.

df_drinks.loc['Argentina']

Additionally, you need to consider that looking up by index requires an exact match and is case-sensitive, so in our case Argentina is correct, but Arg or argentina will raise an exception.

The output for the code is:

beer_servings                   193
spirit_servings                  25
wine_servings                   221
total_litres_of_pure_alcohol    8.3
continent                        SA
Name: Argentina, dtype: object

Interestingly enough, you can also access multiple rows at the same time using the same slicing techniques we use for lists. For example:

df_drinks.loc['Argentina': 'Austria']

This will return a new DataFrame with all the results between Argentina and Austria inclusive.

           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
country
Argentina            193               25            221                           8.3        SA
Armenia               21              179             11                           3.8        EU
Australia            261               72            212                          10.4        OC
Austria              279               75            191                           9.7        EU
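
loc can also slice rows and pick columns in the same call, which is handy when you only care about a few measures. A small sketch:

# Rows from Argentina to Austria (inclusive), but only two columns
df_drinks.loc['Argentina':'Austria', ['beer_servings', 'wine_servings']]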

What if you need to select data without labels? You can use positional indices with the iloc property, which is equivalent to slicing a Python list.

df_drinks.iloc[6:10]

Which will produce the same result as above, selecting from index 6 (Argentina) until index 10-1 (Austria).

It is also possible to assign an increment, the same way as we do with lists.

df_drinks.iloc[6:10:2]

Sometimes you need to show only the rows that meet some condition, like displaying the countries with more than 320 beer servings. For that, you can use conditional selections.

Here is that example in code:

df_drinks.loc[df_drinks['beer_servings'] > 320]

Let's break it down. df_drinks['beer_servings'] > 320 is a special kind of conditional, as it applies to an entire Series. The result is a new boolean Series representing the result of the conditional for each row in the original DataFrame.

Then, by passing that boolean Series to df_drinks.loc[], we get back a new DataFrame with all the rows for which the condition was False filtered out.

I wasn’t too surprised about the results, except for the case of Venezuela and Namibia.

                beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
country
Czech Republic            361              170            134                          11.8        EU
Gabon                     347               98             59                           8.9        AF
Germany                   346              117            175                          11.3        EU
Lithuania                 343              244             56                          12.9        EU
Namibia                   376                3              1                           6.8        AF
Poland                    343              215             56                          10.9        EU
Venezuela                 333              100              3                           7.7        SA

Note: in some other tutorials you may find the same code but with .loc left out, e.g.:

df_drinks[df_drinks['beer_servings'] > 320]

And though that code works in this case, there is a corner case in which it won't, so it's better to be explicit than implicit and use .loc. You can read more about this in this StackOverflow thread.
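
Conditions can also be combined. As a sketch, you can chain several boolean Series with & (and) or | (or), wrapping each condition in parentheses:

# European countries with more than 320 beer servings
df_drinks.loc[(df_drinks['beer_servings'] > 320) & (df_drinks['continent'] == 'EU')]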


Statistics

Statistics and math are among the most important skills a data scientist should have in their tool belt. Fortunately, pandas makes it easy to perform statistical calculations on our datasets.

Aggregate Statistics

This section will cover how to perform statistical calculations on the data set using functions such as mean, groupby, and count.

Let's start with the mean() function, which returns the mean of each numeric column:

df_drinks.mean()

-------------------------
Output
-------------------------
beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64

The mean() function returned the mean values of only the numerical columns, which makes sense since you can't compute a statistical value on text.
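
Note that, depending on your pandas version, calling mean() on a DataFrame that contains text columns may raise an error instead of silently skipping them. If that happens, you can be explicit, as in this sketch:

# Only aggregate the numeric columns
df_drinks.mean(numeric_only=True)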

We can also perform this calculation at the Series level, in which case we get back a single computed number.

df_drinks["beer_servings"].mean()

-------------------------
Output
-------------------------
106.16062176165804

Let's say you want to calculate the mean of some columns for each specific group; for that, you can use the pandas groupby() function.

df_drinks.groupby("continent")["beer_servings"].mean()

-------------------------
Output
-------------------------
continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64

By using groupby we can perform calculations grouped by sets of data, for example the mean of each continent.

The same as with mean, you can calculate the median, min, max, sum, and much more.
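
You can also compute several of these aggregations in one go with agg. A brief sketch:

# Mean, minimum, and maximum beer servings per continent
df_drinks.groupby("continent")["beer_servings"].agg(["mean", "min", "max"])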

Another common task is counting the number of rows or columns, which can be done with count.

df_drinks.count(axis=0)

-------------------------
Output
-------------------------
beer_servings                   193
spirit_servings                 193
wine_servings                   193
total_litres_of_pure_alcohol    193
continent                       170
dtype: int64

By specifying the axis=0 (default value) we count the number of rows per column. However, we can also invert the axis (axis=1) and count the number of columns per row.

df_drinks.count(axis=1)

-------------------------
Output
-------------------------
country
Afghanistan    5
Albania        5
Algeria        5
Andorra        5
Angola         5
              ..
Venezuela      5
Vietnam        5
Yemen          5
Zambia         5
Zimbabwe       5
Length: 193, dtype: int64

Common functions

This section will discuss some of the most common statistical functions in pandas, a must-have in any data scientist’s arsenal.

Let's start with the cov function, which calculates the pairwise covariance of the columns. See the code below:

df_drinks.cov()
                              beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
beer_servings                  10229.927191      4096.953962    4249.458468                    318.992031
spirit_servings                 4096.953962      7794.119765    1370.601306                    218.184985
wine_servings                   4249.458468      1370.601306    6351.707200                    200.762044
total_litres_of_pure_alcohol     318.992031       218.184985     200.762044                     14.237779

The cov function calculated the pairwise covariance of all the numeric columns.

Another commonly used pandas function is pct_change, which calculates the percentage change of each value compared to the previous element.

ind = pd.date_range('01/01/2020', periods=7, freq='W')
df_tmp = pd.DataFrame({"A": [4, 14, 3, 6, 2, 55, 33],
                       "B": [5, 2, 54, 34, 2, 32, 56],
                       "C": [20, 20, 17, 31, 8, 5, 3],
                       "D": [14, 3, 6, 2, 3, 4, 12]}, index=ind)
df_tmp.pct_change()
                    A          B         C         D
2020-01-05        NaN        NaN       NaN       NaN
2020-01-12   2.500000  -0.600000  0.000000 -0.785714
2020-01-19  -0.785714  26.000000 -0.150000  1.000000
2020-01-26   1.000000  -0.370370  0.823529 -0.666667
2020-02-02  -0.666667  -0.941176 -0.741935  0.500000
2020-02-09  26.500000  15.000000 -0.375000  0.333333
2020-02-16  -0.400000   0.750000 -0.400000  2.000000

First, we created a date index using the date_range() function and a DataFrame containing arbitrary values for demonstration purposes, and finally we applied the percentage change function to that data.

Now let's see how to calculate the relationship between two values, which is known as correlation. The function for this is corr(), and it calculates the relationship between all columns in your DataFrame, excluding null or not-available values and non-numeric data such as text.

df_tmp = pd.DataFrame({"A":[6, 8, 3, 4],  
                   "B":[51, 2, 6, 3], 
                   "C":[7, 2, 9, 5], 
                   "D":[5, 4, 4, 22]})
df_tmp.corr()
          A         B         C         D
A  1.000000  0.167894 -0.792908 -0.369624
B  0.167894  1.000000  0.345789 -0.300948
C -0.792908  0.345789  1.000000 -0.154586
D -0.369624 -0.300948 -0.154586  1.000000

We again used a small DataFrame with arbitrary values to calculate the correlation between its columns.

Ranking your data based on a specific criterion, like ranking a movies dataset by revenue, is one of the most common tasks in data science. The example below uses the pandas rank function:

df_movies['revenue_rank'] = df_movies["Worldwide Gross"].rank()
df_movies.head()

We've created a new column called revenue_rank with the ranking assigned by the rank function based on the "Worldwide Gross" column.
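
rank is quite configurable. For instance, here is a sketch ranking from highest to lowest gross and giving ties the same rank (the new column name is just an example):

# Highest gross gets rank 1; ties share the same (lowest) rank
df_movies['revenue_rank_desc'] = df_movies["Worldwide Gross"].rank(ascending=False, method="min")
df_movies.head()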

Plotting

Pandas can't plot by itself, but it provides helper methods that use plotting engines such as matplotlib or seaborn to perform the task.

For our guide, we will use matplotlib. If you are running the code on your own, make sure that you have the library installed. If you are using the code from GitHub, it's already a project dependency.

Scatter plot

A scatter plot is a diagram that displays points based on two variables, and luckily pandas makes it easy for us to create one with the plot() function:

colors = df_titanic["Survived"].replace(1, 'green').replace(0, 'red')
df_titanic.plot(x='Fare', y="Age", c=colors, colormap="Reds", kind="scatter")
Sample scatter plot

The chart shows the survival status spread by age and fare, painting in red those who did not survive and the others in green. To set the colors, we generate a color Series depending on the survival status; this can be done in a cleaner way, as shown below.

The plot function is doing all the heavy work by rendering the chart.
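
As a sketch of that cleaner alternative, you can let pandas map a numeric column straight to a colormap instead of building a color Series by hand (the colormap name is just an example):

# Color each point by the Survived column using a diverging colormap
df_titanic.plot.scatter(x='Fare', y='Age', c='Survived', colormap='RdYlGn')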

Line plot

A line chart is a series of values connected with a line, and it is one of the most widely used chart types. Pandas is also able to display this kind of plot very easily. See the following code:

df_titanic.loc[df_titanic["Survived"] == 1].groupby("Age")['Survived'].count().plot()
Sample line plot

A line plot showing the number of survivors by age.

Bar plot

The bar plot is one of the most common charts in the data visualization field; it essentially represents categorical data with rectangular bars. Pandas offers this kind of chart as well. See the following code:

df_titanic["Survived"].value_counts().plot(kind="bar")
Sample bar plot

You can also build histograms.

df_titanic.hist(bins=10,figsize=(9,7),grid=False)
Sample histogram plot

Or much more complex charts. Here is an interesting article on building charts for this particular dataset.


Conclusion

Data scientists first need to explore, clean, and transform their data before moving on to visualization. Pandas makes it easy to perform these actions with just a few simple commands, and it can even be used to plot the data instead of calling other libraries such as matplotlib and seaborn directly. Still, you may find that you need to learn these visualization libraries for more complex charts.

You can learn more about how to use pandas, plotting, and much more in the Data Science Bootcamp on Udemy.

Thanks for reading!

If you liked what you saw, please support my work!

Juan Cruz Martinez - Author @ Live Code Stream


Juan has made it his mission to help aspiring developers unlock their full potential. With over two decades of hands-on programming experience, he understands the challenges and rewards of learning to code. By providing accessible and engaging educational content, Juan has cultivated a community of learners who share their passion for coding. Leveraging his expertise and empathetic teaching approach, Juan has successfully guided countless students on their journey to becoming skilled developers, transforming lives through the power of technology.