How to Use Generator and yield in Python

Work with large datasets or files using Python generators


Today we are going to talk about generators in Python, how they differ from normal functions, and why you should use them.

What are generators in Python?

Have you ever run into a situation where you needed to read large datasets or files that were too big to load into memory? Or maybe you wanted to build an iterator, but the producer logic was so simple that most of your code was boilerplate for the iterator rather than for producing the desired values? These are some of the scenarios where generators can be really useful and simple.

Introduced with PEP 255, generator functions are a special kind of function that returns a lazy iterator. These are objects that you can loop over like a list; however, unlike lists, lazy iterators do not store their contents in memory. One of the advantages of generator functions over hand-written iterators is how little code they require.

After that introduction, let’s see some examples of generators in action:


Some use cases of generators

Reading large files

A common use case for generators is working with large files or data streams, for example CSV files. Let’s say we need to count how many rows there are in a text file; our code could look something like:

csv_gen = csv_reader("some_file.txt")
row_count = 0

for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

with our csv_reader function implemented in the following way:

def csv_reader(file_name):
    file = open(file_name)
    result = file.read().split("\n")
    return result

That’s very clear and simple. Our csv_reader function opens the file, reads its entire contents into memory, and splits the text on newlines to form a list with the file data, so our code above should work perfectly. Or so we think.

If the file contains a few thousand lines, this code will probably work on any modern computer. However, if the file is large enough, we will start having issues. These can range from the machine slowing down, to the program hogging resources until we have to terminate it, to the ultimate failure:

Traceback (most recent call last):
  File "ex1_naive.py", line 22, in <module>
    main()
  File "ex1_naive.py", line 13, in main
    csv_gen = csv_reader("file.txt")
  File "ex1_naive.py", line 6, in csv_reader
    result = file.read().split("\n")
MemoryError

We crashed the program. The file was too big to be loaded into memory, causing Python to raise a MemoryError and crash.

So how can we fix it? Well… we know that generators give us a way to build simple iterators, so that should help. Let’s now take a look at the csv_reader function built using a generator.

def csv_reader(file_name):
    for row in open(file_name, "r"):
        yield row

Still very simple; it looks even more elegant than before. But what is that yield keyword over there?

The yield keyword is what makes this function a generator instead of a normal function. Unlike return, yield pauses the function, saving all of its state, and later continues from that point on successive calls. In both cases, the value of the expression is returned to the caller.

When a function contains yield, Python will automatically (and behind the scenes) implement the iterator protocol, providing the required methods like __iter__() and __next__() for us, so we don’t need to worry about any of it.
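We can verify this with a small sketch (count_up is just an illustrative name, not from the text above): calling a generator function returns an object that already carries the iterator methods.

```python
def count_up(limit):
    """Yield the integers 0, 1, ..., limit - 1."""
    n = 0
    while n < limit:
        yield n
        n += 1

gen = count_up(3)

# Python generated __iter__() and __next__() for us:
print(hasattr(gen, "__iter__"))  # True
print(hasattr(gen, "__next__"))  # True
print(iter(gen) is gen)          # True: a generator is its own iterator
print(list(gen))                 # [0, 1, 2]
```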

Going back to our example, if we now decide to execute our code, we would get something as follows:

Row count is 65123455

The exact number will depend on your file, but what’s important is that it works! We are lazy-loading the file, minimizing our memory load, and it’s a very easy and elegant solution.

But that’s not the end of the story. There are even easier and more interesting ways to create generators: a generator expression (also called a generator comprehension) has a syntax that looks very much like a list comprehension.

Let’s see how that would look:

csv_gen = (row for row in open(file_name))

Beautiful, isn’t it? Just remember these main differences:

  • Using yield (or a generator expression) results in a generator object
  • Using return inside the loop would have given us only the first line of the file.
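As a rough sketch of the memory difference (exact byte counts vary by Python version), compare a list comprehension with the equivalent generator expression:

```python
import sys

# The list comprehension materializes every element up front...
squares_list = [n * n for n in range(100_000)]
# ...while the generator expression produces them lazily, one at a time.
squares_gen = (n * n for n in range(100_000))

print(sys.getsizeof(squares_list))  # hundreds of kilobytes
print(sys.getsizeof(squares_gen))   # a couple hundred bytes at most
print(sum(squares_gen))             # we can still consume every value
```

The generator object stays the same size no matter how large the range is, because it only ever holds its current state, not the produced values.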

Generating an infinite sequence

Another common scenario for generators is generating an infinite sequence. In Python, when you need a finite sequence, you can simply call range() and evaluate it in a list context, for example:

a = range(5)
print(list(a))
[0, 1, 2, 3, 4]

We could do the same, generating an infinite sequence using generators like this:

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

And you can use it, for example, to print the values:

for i in infinite_sequence():
    print(i, end=" ")

This will run very fast and would go on “forever”, so you will have to stop it manually by pressing CTRL+C (or the Mac equivalent), but you will see the numbers printed very quickly on the screen.
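If you would rather not reach for CTRL+C, itertools.islice can take a finite slice of the infinite generator. A small sketch using the same infinite_sequence as above:

```python
from itertools import islice

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

# islice stops after 10 items, so the infinite generator is never exhausted.
first_ten = list(islice(infinite_sequence(), 10))
print(first_ten)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```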

There are other ways to get values, like for example you can get the values one by one as follows:

>>> gen = infinite_sequence()
>>> next(gen)
0
>>> next(gen)
1
>>> next(gen)
2
....

More on yielding

So far we have looked at simple cases of generators and the yield statement. However, as with all things Python, it doesn’t end there. There is more to it, though the core idea is what you have learned so far.

As we already discussed, when we use yield we save the local state of the function and return the value of the expression to the caller. But what do we mean by saving the local state? Well… here is where it gets very interesting. When the yield statement is hit, the program suspends the function’s execution and returns the yielded value to the caller. When the function is suspended, its state is saved; this includes any variable bindings, the instruction pointer, the internal stack, and any exception handling. When the generator is called again, the state is restored and the function continues from the last yield statement it hit, as if it had never been suspended.

Pretty neat! Let’s see an example to understand this better

>>> def multiple_yield():
...     value = "I'm here for the first time"
...     yield value
...     value = "My Second time here"
...     yield value
...
>>> multi_gen = multiple_yield()
>>> print(next(multi_gen))
I'm here for the first time
>>> print(next(multi_gen))
My Second time here
>>> print(next(multi_gen))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The first time we called next(), the execution pointer was at the beginning, so we hit the first yield on line 3, and “I’m here for the first time” was returned and printed on the screen. The second time next() is called, execution resumes from line 4, hitting the second yield statement on line 5 and returning “My Second time here”. When we call next() a third time, we get an error. This is because generators, like all iterators, can be exhausted, and if you call next() after that happens you will get this StopIteration error.
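If you would rather not deal with the StopIteration exception, next() accepts a default value that is returned once the generator is exhausted. A quick sketch with the same generator:

```python
def multiple_yield():
    value = "I'm here for the first time"
    yield value
    value = "My Second time here"
    yield value

multi_gen = multiple_yield()
# Passing a default to next() suppresses StopIteration:
print(next(multi_gen, "exhausted"))  # I'm here for the first time
print(next(multi_gen, "exhausted"))  # My Second time here
print(next(multi_gen, "exhausted"))  # exhausted
```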


Advanced generator methods

So far we have covered the most common uses and constructions of generators, but there are a few more things to cover. Over time, Python added some extra methods to generators, and I’d like to discuss the following here:

  • .send()
  • .throw()
  • .close()

Before we go into the details of each of these methods, let’s create a sample generator to use as an example. Our generator will produce prime numbers and is implemented as follows:

def isPrime(n):
    # Reject anything below 2 and non-integer values
    if n < 2 or n % 1 > 0:
        return False
    elif n == 2 or n == 3:
        return True
    # Check divisors up to the square root of n
    for x in range(2, int(n**0.5) + 1):
        if n % x == 0:
            return False
    return True

def getPrimes():
    value = 0
    while True:
        if isPrime(value):
            yield value
        value += 1
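As a quick sanity check, we can slice the first few primes out of this infinite stream with itertools.islice (the definitions are repeated here so the snippet runs standalone):

```python
from itertools import islice

def isPrime(n):
    if n < 2 or n % 1 > 0:
        return False
    elif n == 2 or n == 3:
        return True
    for x in range(2, int(n**0.5) + 1):
        if n % x == 0:
            return False
    return True

def getPrimes():
    value = 0
    while True:
        if isPrime(value):
            yield value
        value += 1

# Take the first eight primes without exhausting the infinite generator.
print(list(islice(getPrimes(), 8)))  # [2, 3, 5, 7, 11, 13, 17, 19]
```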

How to use .send()

.send() allows you to send a value into the generator, setting its internal state at any time. Let’s say you want to generate only the prime numbers from 1000 onward; that’s where .send() comes in handy. Let’s take a look at an example:

prime_gen = getPrimes()
print(next(prime_gen))
print(prime_gen.send(1000))
print(next(prime_gen))

And when we run it, we get:

2
3
5

Hmm… that did not go quite as planned, and the issue is in the generator function we implemented. In order to use the send() method, we need to make a few changes so it looks like this:

def getPrimes():
    value = 0
    while True:
        if isPrime(value):
            i = yield value
            if i is not None:
                value = i
        value += 1

Now we run again:

prime_gen = getPrimes()
print(next(prime_gen))
print(prime_gen.send(1000))
print(next(prime_gen))

and we obtain:

2
1009
1013

Nice! Good work!

How to use .throw()

.throw(), as you probably guessed, allows you to throw exceptions into the generator. This can be useful, for example, to end the iteration at a certain value.

Let’s see it in action:

prime_gen = getPrimes()

for x in prime_gen:
    if x > 10:
        prime_gen.throw(ValueError, "I think it was enough!")
    print(x)

and we get:

2
3
5
7
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    prime_gen.throw(ValueError, "I think it was enough!")
  File "test.py", line 15, in getPrimes
    i = yield value
ValueError: I think it was enough!

The interesting characteristic of doing this is that the exception is raised from within the generator, at the paused yield, as can be seen in the stack trace.
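Because the exception surfaces at the paused yield, the generator itself can catch it with a try/except around the yield and keep running. A small illustrative sketch (resilient_counter is a made-up name, not from the examples above):

```python
def resilient_counter():
    n = 0
    while True:
        try:
            yield n
        except ValueError:
            # A ValueError thrown into the generator resets the count
            # instead of terminating the generator.
            n = 0
        else:
            n += 1

gen = resilient_counter()
print(next(gen))              # 0
print(next(gen))              # 1
print(gen.throw(ValueError))  # 0 -- the generator caught it and kept going
print(next(gen))              # 1
```

Note that .throw() returns the next yielded value when the generator handles the exception and yields again.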

How to use .close()

In the previous example, we stopped the iteration by raising an exception; however, that’s not very elegant. A better way to end the iteration is by using .close().

prime_gen = getPrimes()

for x in prime_gen:
    if x > 10:
        prime_gen.close()
    print(x)

with output:

2
3
5
7
11

In this case, the generator stopped and we left the loop without raising any exception. Note that 11 is still printed because print(x) runs after the call to .close(); on the next iteration the loop simply ends.
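Under the hood, .close() raises a GeneratorExit exception at the paused yield, which means a try/finally inside the generator can run cleanup code such as closing files. A small sketch (noisy_counter and the cleaned flag list are made-up names for illustration):

```python
cleaned = []

def noisy_counter():
    try:
        n = 0
        while True:
            yield n
            n += 1
    finally:
        # .close() raises GeneratorExit at the paused yield,
        # so this cleanup runs even on early termination.
        cleaned.append(True)

gen = noisy_counter()
print(next(gen))  # 0
print(next(gen))  # 1
gen.close()
print(cleaned)    # [True]
```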


Conclusion

Generators, whether used as generator functions or generator expressions, can be really useful to optimize the performance of our Python applications, especially when we work with large datasets or files. They also bring clarity to your code by avoiding complicated iterator implementations or handling the data yourself by other means.

I hope that you now have a better understanding of generators, and that you can use them on your next project.

Thanks for reading!
