NumPy is one of the first libraries that comes to our mind today when working with numerical data. In data science, especially, it’s impossible to miss the trail of NumPy that’s powering the majority of Python’s data science toolkit, including Pandas, Matplotlib, and Scikit-learn, under the hood.
The reason behind NumPy’s widespread popularity is the robust n-dimensional array object it provides. This data structure allows us to run computations on large amounts of data quickly and efficiently. This quality becomes invaluable in data science when manipulating and analyzing data and running costly mathematical calculations.
In this guide, we are going to explore how NumPy manages to efficiently run mathematical operations and the different methods it provides for achieving this task. Having this understanding of the mechanics of the library will give you a much-needed helping hand in your data science journey. You can find the entire code used here on GitHub if you want to follow along with the tutorial.
What is NumPy?
NumPy is an open-source Python library with built-in support for fast array operations, including shape transformations, mathematical computations, linear algebra, and statistical procedures. It is an essential part of the Python data science toolkit and even powers some of its components under the hood. The library is developed around a homogeneous multidimensional array object called ndarray.
Why use NumPy when we already have Python lists?
Although Python already has a built-in list object to conduct array-like operations, it is not good enough for working with a considerable amount of data. One reason for this is the memory and speed inefficiency of Python lists.
Python is inherently slower than compiled languages like C because it goes through an additional interpretation step at runtime. As a solution to this issue, NumPy uses pre-compiled C code behind the scenes. This allows NumPy to complete usually costly operations like looping with speed and memory efficiency close to C. NumPy’s methods for numerical operations also make our code a lot simpler than regular Python.
In addition to these performance-wise details, here are some key differences between Python lists and NumPy arrays for further reference:
- Unlike Python lists which can dynamically grow in size, NumPy arrays have a fixed size defined at the creation time.
- All the elements stored in a NumPy array should be homogeneous. This is significantly different from Python lists that can store data of different types.
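To make the points about compiled loops and homogeneous elements concrete, here is a minimal sketch. The variable names are illustrative and not part of the original code.

```python
import numpy as np

values = list(range(1_000))

# Plain Python: the interpreter executes the loop one element at a time.
squared_list = [v * v for v in values]

# NumPy: a single vectorized expression; the loop runs in compiled code.
squared_arr = np.array(values) ** 2

# Both approaches produce the same numbers.
results_match = squared_arr.tolist() == squared_list

# Mixed numeric input is converted to one common dtype (here float64).
mixed = np.array([1, 2.5, 3])
print(results_match, mixed.dtype)  # True float64
```

The vectorized version is usually much faster on large inputs, although the exact speedup depends on the machine and the array size.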
NumPy provides several methods of installation. If you already have conda installed on your device, the easiest solution is running the following conda command.
conda install numpy
You can also use pip to install NumPy into your Python environment.
pip install numpy
Once installed, you can import NumPy to your project like this:
import numpy as np
NumPy array introduction
As we mentioned earlier, NumPy arrays are multidimensional arrays defined by the ndarray object. Therefore, the arrays can be one-dimensional, two-dimensional, three-dimensional, or even higher depending on the data and the structure we provide.
In NumPy, dimensions of the array are called “axes.” Each axis has a unique integer index. For example, in a 2D array, the vertical axis has the index 0 and the horizontal axis the index 1.
The following example array has a length of 2 in the vertical direction and 3 in the horizontal direction.
[[1, 4, 9], [3, 0, 6]]
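As a small illustration of how the two axes behave, the sketch below builds the example array and reduces it along each axis. The name grid is only a placeholder.

```python
import numpy as np

grid = np.array([[1, 4, 9], [3, 0, 6]])

# Axis 0 runs down the rows (vertical), axis 1 runs across the columns.
rows, cols = grid.shape

# Summing along axis 0 collapses the rows, leaving one value per column.
column_sums = grid.sum(axis=0)

# Summing along axis 1 collapses the columns, leaving one value per row.
row_sums = grid.sum(axis=1)

print(rows, cols)    # 2 3
print(column_sums)   # [ 4  4 15]
print(row_sums)      # [14  9]
```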
NumPy provides several methods to create new arrays. Passing a Python list or tuple to np.array is one of the simplest of them. The nesting of the passed list or tuple determines the array’s dimensions.
For example, here’s how you can create a one-dimensional array with this method.
arr1 = np.array([1, 2, 3])
arr2 = np.array((4.3, 3.2, 2.1, 9.8))
In this case, NumPy automatically infers the type of the elements stored in the array from the passed data.
arr1.dtype # int32 (or int64, depending on the platform)
arr2.dtype # float64
You can create higher-dimensional arrays following a similar method.
#2D array
arr = np.array([[9, 21, 53], [45, 12, 43], [54, 21, 11]])
#3D array
arr = np.array((((1, 2), (7, 8)), ((9, 10), (11, 12))))
NumPy also allows you to explicitly pass the data type of the elements stored.
arr = np.array([12, 43, 15], dtype=np.float64)
It creates the following array of floats from the passed integers.
array([12., 43., 15.])
NumPy supports more data types than regular Python. Given the library’s underlying C code, they have a closer connection to C data types in shape and size. In addition to the int32 and float64 types seen above, str_ is another example of a NumPy data type. You can find more details on this topic in the official NumPy API reference.
Create arrays of known shape
Rather than creating an array from data you already have, you will often only know the shape of the array you want to create. NumPy provides a set of methods to create an array of a specific shape that you can populate later as required. These methods include:
np.zeros() #Populate the array with 0s
np.ones() #Populate the array with 1s
np.empty() #Leave the array uninitialized (it holds arbitrary leftover values)
You can pass the shape of the required array as a tuple to these methods.
arr = np.zeros((3, 5))
# array([[0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0.]])
arr = np.ones((6,))
# array([1., 1., 1., 1., 1., 1.])
arr = np.empty((2, 3, 2))
# The values are whatever was already in memory, for example:
# array([[[0.0e+000, 0.0e+000], [0.0e+000, 1.5e-323], [4.9e-324, 0.0e+000]],
#        [[4.9e-324, 1.5e-323], [9.9e-324, 0.0e+000], [9.9e-324, 1.5e-323]]])
By default, the data type of arrays created with these methods is float64. But you can explicitly pass another data type during creation to get arrays of different types.
arr = np.zeros((1, 2), dtype=np.int32) # array([[0, 0]])
Create arrays populated with a sequence of numbers
If you want to create a one-dimensional array populated with a sequence of numbers within a given range, use the np.arange method.
arr = np.arange(5) # array([0, 1, 2, 3, 4])
arr = np.arange(10, 16) # array([10, 11, 12, 13, 14, 15])
This method accepts the lower and upper bounds of the sequence and returns an array of the integers in between, excluding the upper bound. Passing a single bound (n) to arange is equivalent to setting the range to (0, n).
You can create an array populated with evenly spaced numbers using the same method. To achieve this, pass the gap between numbers (the step) as the third argument.
arr = np.arange(1, 25, 3) # array([ 1, 4, 7, 10, 13, 16, 19, 22])
If you want floating point numbers included among elements as well, use the np.linspace method, which automatically determines the space between elements based on the element count passed as an argument. For example, here’s how you can get an evenly-spaced array with 5 elements populated by numbers between 10 and 20:
arr = np.linspace(10, 20, 5) # array([10. , 12.5, 15. , 17.5, 20. ])
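The two functions treat the upper bound differently, which is easy to miss. The following sketch, with illustrative variable names, compares them on the same range.

```python
import numpy as np

# arange takes a step and excludes the upper bound.
stepped = np.arange(0, 1, 0.25)

# linspace takes an element count and includes the upper bound by default.
spaced = np.linspace(0, 1, 5)

print(len(stepped), stepped[-1])  # 4 0.75
print(len(spaced), spaced[-1])    # 5 1.0
```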
Basic array attributes
NumPy arrays have several important attributes. Let’s see what these attributes are using the following two-dimensional array.
arr = np.zeros((4, 5))
The number of dimensions (axes) in the array.
arr.ndim # 2
The shape of the array or the lengths of the array across each dimension.
arr.shape # (4, 5)
Total number of elements in the array.
arr.size # 20
The data type of stored array elements
arr.dtype # float64
The size of each element in the array in bytes.
arr.itemsize # 8
In this example, given the data type of float64, the item size is 64/8 = 8.
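These attributes combine in a simple way: the total memory taken by the elements is size multiplied by itemsize. The sketch below checks this against the nbytes attribute, which is not covered above and is used here only for illustration.

```python
import numpy as np

arr = np.zeros((4, 5))

# Total bytes = number of elements * bytes per element.
expected_bytes = arr.size * arr.itemsize
print(expected_bytes, arr.nbytes)  # 160 160

# A smaller dtype shrinks the footprint without changing the shape.
small = np.zeros((4, 5), dtype=np.int8)
print(small.shape, small.nbytes)   # (4, 5) 20
```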
Broadcasting is a special concept in NumPy that determines how it behaves during a mathematical operation involving arrays with different shapes. When the shapes differ but are compatible, NumPy “broadcasts” the smaller array across the larger one so that both end up with the same shape for the operation.
Most NumPy use cases involve operating on a pair of arrays. In such cases, NumPy performs the operation element by element on the two arrays. This allows it to run vectorized code that relies on C behind the scenes for tasks that are usually costly in Python, like loops and indexing. Having array pairs with compatible shapes is therefore essential for NumPy to work as efficiently as it does.
The simplest case of broadcasting occurs when an array is combined with a scalar. Think of multiplication between an array a and a scalar b. Here, broadcasting automatically stretches the scalar into an array with the same shape as a.
For example, in the following case, broadcasting automatically converts b into the array [[5, 5], [5, 5]] before multiplication.
a = np.array([[1, 2], [3, 4]])
b = 5
a*b # array([[ 5, 10], [15, 20]])
According to broadcasting rules, two dimensions are compatible if,
- They are equal in size.
- One of the dimension sizes is 1.
In cases where one of the dimension sizes is 1, NumPy stretches it to become compatible with the other, like in the above example.
If neither of these two conditions is satisfied, NumPy throws a ValueError for that operation. You can read more about the rules of broadcasting and compatible array dimensions in the NumPy documentation on broadcasting.
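The compatibility rules can be tried out with a few small arrays. This is a minimal sketch with made-up array names. It covers one case where both arrays are stretched, one where shapes are matched from the trailing dimension, and one that fails.

```python
import numpy as np

column = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4).reshape(1, 4)      # shape (1, 4)

# Each size-1 dimension is stretched to match the other array.
table = column + row
print(table.shape)  # (3, 4)

# Shapes are compared from the trailing dimension: (2, 3) and (3,) match.
matrix = np.ones((2, 3))
offsets = np.array([10, 20, 30])
shifted = matrix + offsets
print(shifted.shape)  # (2, 3)

# (2, 3) and (2,) do not match: trailing sizes are 3 and 2, and neither is 1.
try:
    matrix + np.array([1, 2])
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```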
Changing the shape of an array
One of the most common ways of reshaping an array in NumPy is using the reshape method. It accepts a tuple containing the shape of the expected array and returns the new reshaped array if it’s compatible with the original array dimensions.
arr = np.arange(20)
arr.shape # (20,)
reshaped = arr.reshape((4, 5))
reshaped.shape # (4, 5)
You can also leave one dimension to be automatically calculated with this method.
reshaped_auto = arr.reshape((5, -1))
reshaped_auto.shape # (5, 4)
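The automatic dimension also works with more than two axes, and an incompatible shape is rejected rather than silently truncated. The sketch below illustrates both cases with names chosen only for the example.

```python
import numpy as np

arr = np.arange(20)

# -1 asks NumPy to work out the missing dimension: 20 / 2 / 5 = 2.
cube = arr.reshape((2, -1, 5))
print(cube.shape)  # (2, 2, 5)

# A shape whose sizes do not multiply to 20 is rejected.
try:
    arr.reshape((3, 7))
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)  # True
```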
Flatten & ravel
NumPy provides two methods to flatten an array: flatten & ravel. The difference between the two is that flatten always returns a copy, so any change you make to the returned array doesn’t affect the parent array used in the operation. ravel returns a view of the parent whenever possible, so altering the array it returns usually affects its parent.
flattened = reshaped.flatten()
flattened.shape # (20,)
flattened[0] = 50
reshaped[0, 0] # 0
unravelled = reshaped.ravel()
unravelled.shape # (20,)
unravelled[0] = 50
reshaped[0, 0] # 50
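One way to check whether a flattened result is tied to its parent is np.shares_memory, which is not part of the walkthrough above and is used here as an assumption about how you might verify the behavior. The sketch also shows a case where ravel has to fall back to a copy.

```python
import numpy as np

reshaped = np.arange(20).reshape((4, 5))

flattened = reshaped.flatten()   # always a copy
ravelled = reshaped.ravel()      # a view when the memory layout allows it

print(np.shares_memory(reshaped, flattened))  # False
print(np.shares_memory(reshaped, ravelled))   # True

# ravel has to copy when the elements are not contiguous in the requested
# order, for example after a transpose.
ravelled_t = reshaped.T.ravel()
print(np.shares_memory(reshaped, ravelled_t))  # False
```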
Resize is another method that can be used to reshape an array. Unlike the previous reshape method, however, this one modifies the array itself without creating a new one.
arr = np.arange(20)
arr.resize((2, 10))
arr.shape # (2, 10)
Indexing & slicing
Indexing and slicing a one-dimensional NumPy array works the same way as a Python list. You can access and modify elements of the array using their indices like this.
arr = np.arange(15)
# array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
arr[5] # 5
arr[-1] # 14
arr[7] # 7
arr[7] = 100
arr[7] # 100
You can also use the indices to slice the arrays to generate subarrays like this.
#Slice the elements between indices 2 and 7
arr[2:7] # array([2, 3, 4, 5, 6])
#Choose every 3rd element between the indices 0 and 10
arr[:10:3] # array([0, 3, 6, 9])
#Reverse the entire array
arr[::-1] # array([ 14, 13, 12, 11, 10, 9, 8, 100, 6, 5, 4, 3, 2, 1, 0])
In multidimensional arrays, each element gets one index per one dimension. An element’s position is therefore represented by a tuple containing these indices. When slicing a multidimensional array, you can provide the relevant index ranges for each dimension separated by commas.
arr = np.arange(18).reshape(3, 6)
# array([[ 0,  1,  2,  3,  4,  5],
#        [ 6,  7,  8,  9, 10, 11],
#        [12, 13, 14, 15, 16, 17]])
arr[1, 4] # 10
arr[-1, 4] # 16
#Slice columns 2 to 4 of each row of the array
arr[:, 1:4]
# array([[ 1,  2,  3],
#        [ 7,  8,  9],
#        [13, 14, 15]])
#1st, 3rd, 5th columns of the 2nd row of the array
arr[1, 0:6:2] # array([ 6, 8, 10])
#Slice the last column of the array
arr[:, -1] # array([ 5, 11, 17])
Filter elements using special conditions
NumPy also allows you to retrieve specific elements in an array based on certain conditions. For example, if you want to retrieve all the elements that are smaller than 5, it can be executed like this.
arr[arr < 5] # array([0, 1, 2, 3, 4])
More examples of selecting elements based on special conditions:
arr[arr % 4 == 0] # array([ 0, 4, 8, 12, 16])
arr[(arr > 3) & (arr < 12)] # array([ 4, 5, 6, 7, 8, 9, 10, 11])
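It can help to look at the condition on its own: it evaluates to a boolean array, often called a mask, which is then used as the index. The following sketch, where the name mask is just a placeholder, also shows that a mask can be used for assignment.

```python
import numpy as np

arr = np.arange(18).reshape(3, 6)

# The condition itself is a boolean array with the same shape as arr.
mask = arr % 4 == 0
print(mask.shape, mask.dtype)  # (3, 6) bool

# Indexing with the mask returns the matching elements as a 1D array.
print(arr[mask])  # [ 0  4  8 12 16]

# Assigning through the mask changes only the matching elements.
arr[mask] = -1
print(arr[0])  # [-1  1  2  3 -1  5]
```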
Basic arithmetic operations
NumPy conducts basic arithmetic operations like addition and multiplication on the arrays element-wise.
a = np.arange(1, 7).reshape((2, 3))
# array([[1, 2, 3], [4, 5, 6]])
b = np.array([[12, 32, 21], [99, 34, 43]])
# array([[12, 32, 21], [99, 34, 43]])
#Multiply by a scalar
a*5 # array([[ 5, 10, 15], [20, 25, 30]])
#Summation
a + b # array([[ 13, 34, 24], [103, 39, 49]])
#Subtraction
a - b # array([[-11, -30, -18], [-95, -29, -37]])
#Multiplication
a * b # array([[ 12, 64, 63], [396, 170, 258]])
#Division
b / a # array([[12. , 16. , 7. ], [24.75 , 6.8 , 7.16666667]])
#Power
a**3 # array([[ 1, 8, 27], [ 64, 125, 216]], dtype=int32)
NumPy provides dedicated methods for operations like getting the sum of all array elements and finding the minimum and maximum.
a.sum() # 21
b.min() # 12
b.max() # 99
#Maximum in each column
b.max(axis=0) # array([99, 34, 43])
#Minimum in each row
b.min(axis=1) # array([12, 34])
Similarly, you can find the median, mean, and other simple statistics or sort the elements using built-in methods.
# Mean
np.mean(b) # 40.166666666666664
# Median
np.median(b) # 33.0
# Median of each row
np.median(b, axis=1) # array([21., 43.])
# Sort each row of the array
np.sort(b) # array([[12, 21, 32], [34, 43, 99]])
Basic matrix operations
Considering how two-dimensional arrays can be used to represent matrices, NumPy becomes a useful tool for conducting matrix operations. It contains built-in support for basic operations like calculating the dot product, transpose, inverse, and determinant.
Here are a few example matrix operations you can do with NumPy.
a = np.arange(6).reshape((2, 3))
b = np.arange(9).reshape((3, 3))
#Dot product
#One way of calculating dot product
a @ b # array([[15, 18, 21], [42, 54, 66]])
#Another way of calculating using built-in dot method
a.dot(b) # array([[15, 18, 21], [42, 54, 66]])
#Transpose
transposed = a.transpose()
transposed.shape # (3, 2)
# Determinant
determinant = np.linalg.det(b)
determinant # 0.0
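The inverse mentioned above does not appear in the example, partly because the matrix b used there has a determinant of 0 and therefore no inverse. As a hedged sketch, here is the same idea with a small invertible matrix chosen only for illustration.

```python
import numpy as np

# An invertible 2x2 matrix (illustrative values).
m = np.array([[4.0, 7.0], [2.0, 6.0]])

det = np.linalg.det(m)        # 4*6 - 7*2 = 10
inverse = np.linalg.inv(m)

# A matrix times its inverse gives the identity, up to rounding error.
identity_check = np.allclose(m @ inverse, np.eye(2))
print(round(float(det), 6), identity_check)  # 10.0 True
```

Calling np.linalg.inv on a singular matrix raises numpy.linalg.LinAlgError, so checking the determinant or catching that error first is a reasonable precaution.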
Stacking and splitting arrays
NumPy supports stacking two arrays together to build new arrays. The stacking can happen in both horizontal and vertical directions as long as the two arrays have compatible sizes along that axis.
Let’s use the following three arrays, a, b, and c to understand how stacking works in practice.
a = np.linspace(1, 50, 8).reshape(2, 4)
b = np.arange(4).reshape(2, 2)
c = np.arange(12).reshape(3, 4)
When stacking arrays horizontally, a and b are compatible because they share a length of 2 along axis 0 (both have two rows). The pairs a and c or b and c are not compatible for this operation.
np.hstack((a, b)) #array([[ 1., 8., 15., 22., 0., 1.], [29., 36., 43., 50., 2., 3.]])
Similarly, you can stack the compatible a and c arrays along the vertical direction using the vstack method.
np.vstack((a, c))
# array([[ 1.,  8., 15., 22.],
#        [29., 36., 43., 50.],
#        [ 0.,  1.,  2.,  3.],
#        [ 4.,  5.,  6.,  7.],
#        [ 8.,  9., 10., 11.]])
When it comes to splitting arrays, one way to achieve this task is the familiar array slicing option. NumPy, however, also provides two dedicated methods, hsplit and vsplit, to split arrays along the horizontal and vertical axes. You can pass either the number of equal-sized parts the array should be split into or the positions where the splits should happen to these methods as arguments.
Horizontal split:
#Split into two equal-sized arrays
split_a = np.hsplit(a, 2)
print(split_a[0])
# [[ 1.  8.]
#  [29. 36.]]
print(split_a[1])
# [[15. 22.]
#  [43. 50.]]
#Split the array after the 1st and 3rd columns
split_a = np.hsplit(a, (1, 3))
print(split_a[0])
# [[ 1.]
#  [29.]]
print(split_a[1])
# [[ 8. 15.]
#  [36. 43.]]
print(split_a[2])
# [[22.]
#  [50.]]
Vertical split:
#Split into three equal-sized arrays
split_c = np.vsplit(c, 3)
print(split_c[0]) # [[0 1 2 3]]
print(split_c[1]) # [[4 5 6 7]]
print(split_c[2]) # [[ 8 9 10 11]]
#Split the array after the 1st row
split_c = np.vsplit(c, (1, ))
print(split_c[0]) # [[0 1 2 3]]
print(split_c[1])
# [[ 4  5  6  7]
#  [ 8  9 10 11]]
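Splitting and stacking are inverse operations, which gives a quick way to sanity-check a split. The sketch below, using illustrative names, rebuilds the original array from its pieces and shows what happens when an equal split is not possible.

```python
import numpy as np

a = np.linspace(1, 50, 8).reshape(2, 4)

# Split after the 1st and 3rd columns, then glue the pieces back together.
pieces = np.hsplit(a, (1, 3))
rebuilt = np.hstack(pieces)

print([p.shape for p in pieces])   # [(2, 1), (2, 2), (2, 1)]
print(np.array_equal(a, rebuilt))  # True

# An uneven "equal parts" request fails: 4 columns cannot form 3 equal parts.
try:
    np.hsplit(a, 3)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)  # True
```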
When working with arrays, you’ll often need a copy of an array for a variety of reasons. NumPy provides two special methods to create either a shallow or a deep copy of a parent array.
A shallow copy of an array is an array that contains the same data as the parent. Even though a shallow copy returns a new array object, its data still references the parent. So, if you alter the elements of a shallow copy, it automatically changes the referenced parent elements. However, changing the shallow-copied array object doesn’t affect the parent array. The subarray you get after a slicing operation is an example of a shallow copy.
To get a shallow copy of an array, you can use the view method.
a = np.arange(10)
b = a.view() # b is a shallow copy of a
b is a # False
b = b.reshape(2, 5) # reshaping the view leaves the parent's shape untouched
a.shape # (10,)
b[0, 0] = 99
a[0] # 99
A deep copy, on the other hand, creates entirely new copies of the array and its stored elements so that the copy doesn’t have to reference the parent data. You can make any type of change to a deep copy without affecting its parent. Here’s how you can make a deep copy of an array.
c = a.copy() # c is a deep copy of a
c is a # False
c = c.reshape(2, 5)
a.shape # (10,)
c[1, 0] = 55
a[5] # 5
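If you are unsure whether an array is a view or an independent copy, the base attribute and np.shares_memory can tell you. Neither is covered above, so treat this as a supplementary sketch with illustrative names.

```python
import numpy as np

a = np.arange(10)
view = a.view()
copy = a.copy()
sliced = a[2:5]

# A view (and a slice) borrows the parent's data; a copy owns its own.
print(view.base is a, sliced.base is a, copy.base is None)   # True True True
print(np.shares_memory(a, view), np.shares_memory(a, copy))  # True False

# Writing through the slice is visible in the parent.
sliced[0] = 99
print(a[2])  # 99
```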
Importing & exporting data
Often, when working with a large amount of data, you’ll want to either import data from an external file or save the resulting arrays to a file. NumPy already has built-in methods for supporting this task, like load and save (which work with binary files with the .npy extension) and loadtxt and savetxt (which work with text files). You can read more about these methods in the official NumPy documentation.
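As a rough sketch of how these built-in functions fit together, the example below writes an array to a temporary directory and reads it back in both formats. The file names are hypothetical, and the comma delimiter is an assumption chosen to produce CSV-style text.

```python
import os
import tempfile

import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Binary round trip: .npy keeps the shape and dtype exactly.
    npy_path = os.path.join(tmp_dir, "example.npy")
    np.save(npy_path, arr)
    from_npy = np.load(npy_path)

    # Text round trip: one row per line, values separated by commas.
    txt_path = os.path.join(tmp_dir, "example.csv")
    np.savetxt(txt_path, arr, delimiter=",")
    from_txt = np.loadtxt(txt_path, delimiter=",")

print(np.array_equal(arr, from_npy), from_npy.dtype)  # True float64
print(np.allclose(arr, from_txt), from_txt.shape)     # True (2, 3)
```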
However, since working with CSV files is more common in data science, I’ll go through how to import and export NumPy data from and to CSV files with the help of the Pandas library.
In the first step, I will import data from a file named “drinks.csv” and create a NumPy array using its “wine_servings” column. Pandas comes with a built-in to_numpy method to help with this task.
import pandas as pd
df = pd.read_csv("drinks.csv")
arr = df["wine_servings"].to_numpy()
arr.shape # (193,)
Similarly, to save the manipulated data in a NumPy array, you can convert it to a Pandas dataframe and save it to a CSV file.
df = pd.DataFrame(arr)
df.to_csv("wine_servings.csv")
Today, NumPy has become a tool that data scientists can’t live without. The library brings fast and efficient array operations and numerical calculations to Python by combining with pre-compiled C code under the hood.
While NumPy has many use cases as a standalone tool, it’s the most powerful when combined with other Python computational libraries like scikit-learn, scikit-image, and SciPy.
Therefore, understanding the basics of NumPy I’ve discussed here will give you a good foundation to get started with these powerful libraries as well.
Thanks for reading!