Introduction to 8 Essential Data Structures
Our understanding of data structures as programmers is mostly limited to using them at a high level of abstraction through a programming language. Though we know how to store and retrieve data from different data structures in a particular language, most of us don’t try to unravel what goes on in the lower-level implementation of these data structures.
Surface-level knowledge of data structures is enough to get our work done in most cases. But understanding how different data structures behave at the lower levels is crucial when it comes to selecting the optimal data structure for a given task. In this article, we will look under the hood of 8 different data structures and see how they handle data.
Arrays
An array stores a fixed number of elements of a single data type. Elements (items) in an array are stored in a block of contiguous memory slots. Because of this, elements in an array are assigned consecutive numbers, usually starting from 0, as their “indexes”.
One can access a single element stored in an array at random using its unique index. Accessing an element by index has a time complexity of Θ(1). Reading and updating array elements are easily achieved in this manner. Because array elements are stored contiguously, array traversal is faster than in most other data structures.
Inserting into or deleting from an array is a comparatively complex and time-consuming task. Because the array size is fixed, insertion typically means copying all the elements of the current array into a newly allocated, larger array and adding the new element at the end. Deletion is implemented in a similar manner to reduce the array size.
Arrays can be multi-dimensional (array of arrays). This makes arrays a good choice for storing matrices and vectors. Arrays are frequently used to implement other data structures like lists, heaps, stacks, and queues.
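The copy-on-insert behavior described above can be sketched in Python. This is only a simulation: Python lists are dynamic, so we use one here purely to stand in for the contiguous block of slots (the function name is illustrative).

```python
# A sketch of how inserting into a fixed-size array forces a copy.

def insert_at_end(arr, value):
    """Copy every element into a new, larger array, then append."""
    new_arr = [None] * (len(arr) + 1)   # allocate a bigger block
    for i, item in enumerate(arr):      # O(n) copy of existing elements
        new_arr[i] = item
    new_arr[len(arr)] = value           # place the new element at the end
    return new_arr

numbers = [3, 7, 11]
print(numbers[1])                     # O(1) access by index -> 7
numbers = insert_at_end(numbers, 42)
print(numbers)                        # [3, 7, 11, 42]
```

Note the asymmetry: reading by index is constant time, while growing the array costs a full O(n) copy.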
Queues
The queue data structure is similar to a queue in daily life: the first person to enter the queue is the first person to exit it. In the programming world’s version of the queue, every new data element added to the queue is stored at the rear end, and every element removed from the queue is taken from the front end, on a First In, First Out (FIFO) basis.
- Enqueue: Inserting an element at the end of the queue. A newly added element becomes the rear element of the queue.
- Dequeue: Removing an element from the front of the queue. Both enqueue and dequeue operations have a time complexity of Θ(1).
- Peek: Reads the element at the front of the queue without removing or modifying it.
Queues are used to implement buffers. Multithreaded programs use queues to manage tasks waiting to be executed by threads.
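The three operations above can be sketched with Python’s `collections.deque`, which supports O(1) appends and pops at both ends (the task names are illustrative):

```python
# A minimal FIFO queue sketch backed by collections.deque.
from collections import deque

queue = deque()
queue.append("task1")    # enqueue at the rear
queue.append("task2")
queue.append("task3")

front = queue[0]         # peek: read the front without removing it
first = queue.popleft()  # dequeue from the front (FIFO)
print(front, first)      # task1 task1
```

A plain Python list would also work, but `list.pop(0)` is O(n) because every remaining element must shift, which is why a deque is the usual choice.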
Stacks
Stacks are quite similar to queues, but they operate on a Last In, First Out (LIFO) basis instead of First In, First Out. Think of a stack of dishes: the last dish added is the first one removed.
- Push: Inserting a new element to the top of the stack. A newly added element becomes the new top element.
- Pop: Removing an element from the top of the stack. Both push and pop operations have a time complexity of Θ(1).
- Peek: Reads the element at the top of the stack without removing or modifying it.
Stacks are used for handling and evaluating mathematical expressions. They are also used in algorithms that employ a backtracking procedure. Handling function calls in recursive programming is another application.
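As one concrete example of stack-based expression handling, here is a minimal sketch that checks whether brackets are balanced, using a plain Python list as the stack (the function name is illustrative):

```python
# A stack-based check for balanced brackets in an expression.

def is_balanced(expression):
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in expression:
        if ch in "([{":
            stack.append(ch)          # push an opening bracket
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False          # pop must match the opener
    return not stack                  # true only if every opener was closed

print(is_balanced("(1 + [2 * 3])"))   # True
print(is_balanced("(1 + 2]"))         # False
```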
Linked Lists
A linked list is a dynamic data structure. This means the number of data items stored in a linked list can easily expand or shrink. This gives linked lists more flexibility compared to arrays, which have a fixed size. Linked lists achieve this dynamic quality by storing each item as a separate object.
Elements in a linked list don’t have to be stored in contiguous memory slots; instead, each element (called a node) stores a pointer to the location of the next node. These pointers maintain the connections between the separate nodes of the linked list. Besides the pointer to the next node, a node also stores a data field.
There are two important nodes in a linked list: head and tail.
- Head: the first node of the linked list.
- Tail: the last node of the linked list. Tail’s pointer value is set to null.
When inserting a new element into a linked list, the new node is stored at a particular location in memory, and the pointer in the preceding node is updated to point to it. The new node takes over the pointer previously stored in the preceding node.
When deleting a node, the node preceding the deleted node is given the pointer previously stored in the deleted node.
However, with linked lists, you cannot directly access a single data item without traversing through the list starting from the head. This gives the access operation a time complexity of Θ(n).
Linked list types
- Singly linked list: Linked lists shown in the above examples are all singly linked lists. A node contains only a pointer to the next node.
- Doubly linked list: A node in a doubly linked list contains pointers to the node before and after the given node. Traversal of the list can be done in both forward and backward directions.
- Circular linked list: The pointer of the tail points to the head instead of being null. Essentially, a circular linked list doesn’t have a tail; it only has a head.
Linked lists are used to implement data structures like stacks, queues, and graphs. When performing polynomial algebra operations, linked lists are used to store the coefficients.
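The node-and-pointer structure described above can be sketched as a minimal singly linked list in Python (class and method names are illustrative):

```python
# A minimal singly linked list: each node stores a data field and a
# pointer (reference) to the next node.

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None        # the tail's pointer stays None (null)

class LinkedList:
    def __init__(self):
        self.head = None

    def insert_front(self, data):
        node = Node(data)
        node.next = self.head   # new node points at the old head
        self.head = node

    def traverse(self):
        values, current = [], self.head
        while current is not None:       # O(n) walk from the head
            values.append(current.data)
            current = current.next
        return values

lst = LinkedList()
for value in (3, 2, 1):
    lst.insert_front(value)
print(lst.traverse())   # [1, 2, 3]
```

Note how `traverse` has no choice but to start at the head and follow pointers, which is exactly why random access is Θ(n).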
Graphs
A graph consists of a finite number of data items called vertices (V). Some pairs of these vertices are linked to each other through edges (E). Two vertices connected by an edge are adjacent to each other.
Graphs can be categorized using different attributes. One such categorization is into directed and undirected graphs.
In directed graphs, an edge connecting two vertices has a starting vertex and an ending vertex. When traversing the graph, the edge can be crossed only from the starting vertex to the ending vertex.
In undirected graphs, an edge can be traversed in both directions without a limitation.
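One common way to represent such a graph in code is an adjacency list, where each vertex maps to the set of vertices adjacent to it. A minimal sketch for an undirected graph (the vertex labels are illustrative):

```python
# An undirected graph stored as an adjacency list.
from collections import defaultdict

edges = [("A", "B"), ("A", "C"), ("B", "D")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)   # undirected: record the edge both ways
    adjacency[v].add(u)

print(sorted(adjacency["A"]))   # ['B', 'C']
```

For a directed graph, only `adjacency[u].add(v)` would be recorded, reflecting that the edge can be crossed in one direction only.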
Social media applications like Facebook use graphs to represent their users as vertices and friendships as edges. Google’s PageRank algorithm uses graphs to represent web pages and the links connecting them. Google Maps uses graphs to represent road networks in its transportation systems.
Binary Trees
A binary tree bears some resemblance to a directed graph. The difference between the two is that, in a binary tree, data is stored in a hierarchical structure, where upper-level nodes are called parents and lower-level nodes are called children. A node in a binary tree can have only one parent and at most two children.
Let’s look at a few terms associated with binary trees.
- Root: The node at the top of the tree. It has no parent node.
- Leaf: A node at the bottom of the tree. It has no child nodes.
- Key: The data value stored in a node.
- Subtree: The tree consisting of all the descendants of a node.
There are numerous special binary trees, like the binary search tree, treap, binary trie, and heap.
Binary Search Tree
A binary search tree stores data values in sorted order. The left child of a node in a binary search tree must hold a value less than its parent’s, and the right child a value greater than its parent’s.
The main advantage of a BST, as the name suggests, is fast searching of stored data. On a balanced tree, searching for a stored element has a time complexity of O(log n).
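The left-smaller/right-greater rule can be sketched directly in Python (class and function names are illustrative; this minimal version does no balancing, so its O(log n) search holds only when the tree happens to stay balanced):

```python
# A minimal binary search tree: smaller keys go left, larger keys go right.

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    while root is not None and root.key != key:
        # Each comparison discards one whole subtree.
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for key in (33, 24, 45, 12, 90):
    root = insert(root, key)

print(search(root, 45))   # True
print(search(root, 99))   # False
```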
- Binary search trees are used to implement map and set objects in programming languages.
- Binary tries are used to store router tables.
- Treaps are used in wireless networking.
Heaps
A heap is another special case of a binary tree. In a heap, the key of each parent node is compared with the keys of its children, and the nodes are arranged accordingly. There are two types of heaps.
- Max heap: The key of the parent is greater than or equal to that of the child. The root node stores the maximum value in the given dataset.
- Min heap: The key of the parent is less than or equal to that of the child. The root node stores the minimum value in the given dataset.
Suppose we are given the integer values (33, 24, 45, 12, 90, 78, 23, 53) as the dataset. We can construct a separate max heap and min heap from this data.
Inserting, deleting, and extracting the maximum (or minimum) in a heap each have a time complexity of O(log n). Finding the maximum (or minimum), however, has a time complexity of only O(1).
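These operations can be demonstrated on the dataset above with Python’s `heapq` module, which implements a binary min heap on top of a list:

```python
# Min-heap operations on the example dataset, via heapq.
import heapq

data = [33, 24, 45, 12, 90, 78, 23, 53]

heap = []
for value in data:
    heapq.heappush(heap, value)   # O(log n) insert

print(heap[0])              # find-min is O(1) -> 12
print(heapq.heappop(heap))  # extract-min is O(log n) -> 12
print(heap[0])              # new minimum -> 23
```

`heapq` only provides a min heap; a common workaround for max-heap behavior is to push negated values and negate them again on the way out.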
Heaps are used in implementing the heapsort algorithm. Heaps are also used to implement priority queues because the first element of the heap is always storing the value with the maximum (or minimum) priority.
Hash Tables
A hash table is one of the most efficient data structures we can use when we want to maintain the speed of searching and insertion operations on a large dataset. Every data value stored in a hash table is associated with a key that gives fast access to the stored value if we know the key. Think of a student registration system where each student has a unique student ID that can be used as a key to store their data in a hash table.
Hash tables use arrays to store data values. The key is used to find the index within the array where the value is stored. But how do hash tables map these keys to their values?
One of the methods that can be used is direct addressing. It uses one-to-one mapping: each key points to the exact location its data is stored at. But this approach does not use memory efficiently, especially as the number of key-value pairs grows and the keys become larger in size. So instead, hash tables use hash functions.
Hash tables use a hash function to map keys to array indexes. It converts a range of key values into a range of array indexes. The index generated by passing the key to the hash function is called the hash value. Here is an example hash function.
h(k) = k % m
- h is the hash function
- h(k) is the hash value corresponding to the key k
- k is the key
- m is the size of the hash table. A good choice for m is a prime number that is not close to a power of 2.
Let’s consider the hash values of several keys, taking m = 20 for simplicity (in practice, a prime would be a better choice).
- k=1001, h(k) = 1001%20 = 1
- k=1055, h(k) = 1055%20 = 15
- k=20123, h(k) = 20123%20 = 3
For the keys 1001, 1055, and 20123, the associated values are stored at indexes 1, 15, and 3 of the hash table, respectively.
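The hash values above can be verified directly; this is just the h(k) = k % m formula from the text written as a function:

```python
# The modular hash function h(k) = k % m from the text.

def h(k, m=20):
    return k % m   # maps a key to an index in a table of size m

for key in (1001, 1055, 20123):
    print(key, "->", h(key))   # 1, 15, 3
```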
Consider the hash value of the key 2021, which is also 1. We previously saw that the value associated with the key 1001 is stored at index 1 in the hash table. When two keys generate the same hash value, we call it a collision. Hash tables use techniques like chaining and open addressing to resolve collisions.
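Chaining can be sketched by letting each table slot hold a list of the key–value pairs that hash to that index (the class, method names, and stored values here are all illustrative):

```python
# A sketch of collision handling by chaining: each slot stores a
# chain (list) of the key-value pairs that hash to that index.

class ChainedHashTable:
    def __init__(self, size=20):
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        return key % len(self.slots)      # the hash function h(k) = k % m

    def put(self, key, value):
        chain = self.slots[self._index(key)]
        for pair in chain:
            if pair[0] == key:            # update an existing key in place
                pair[1] = value
                return
        chain.append([key, value])        # otherwise extend the chain

    def get(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return None

table = ChainedHashTable()
table.put(1001, "Alice")   # hashes to index 1
table.put(2021, "Bob")     # also hashes to index 1: a collision
print(table.get(1001), table.get(2021))   # Alice Bob
```

Both keys land in slot 1, yet both values remain retrievable because the chain keeps them side by side.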
Hash tables have an average-case time complexity of O(1) for searching and insertion.
Hash tables are used to implement database indexes. Compilers use hash tables to identify the keywords of a programming language. Computers use hash tables to link file names with their paths.
This article provided a basic introduction to the underlying logic of 8 data structures we interact with daily as programmers. With this knowledge of the unique properties of different data structures, from today onwards you can be more mindful when choosing the most appropriate data structure for your programming task. But remember, this is only a basic introduction. There is a lot more you can, and should, learn about data structures.