Convert Web Pages into PDFs with Puppeteer and NodeJS


As web developers, we love working with HTML, CSS, and JavaScript, but when we are working on projects, sometimes we are asked to solve problems that involve other technologies. A common request I have seen in my career, especially when building web applications, is to generate PDF documents that users of the app can download.

When faced with such a task, our first instinct is to google “JavaScript create pdf”, which turns up capable SDKs (like PDFKit) that deal with PDF generation. But those libraries are often hard to use: they involve a ton of steps and APIs we are not familiar with.

Fortunately, there is an easier way, which is to convert web pages into PDF format, and that can be done with the help of a tool called Puppeteer.


What Is Puppeteer, and Why Is It Awesome?

In Google’s own words, Puppeteer is, “A Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol”.

What is a headless browser?

If you are unfamiliar with the term, a headless browser is simply a browser without a GUI. It still knows how to render HTML web pages and process JavaScript, but because there is no window to click on, you interact with it from the command line or from code.

Even though Puppeteer runs the browser in headless mode by default, you can configure it to drive a regular, non-headless Chrome or Chromium.

What can you do with Puppeteer?

Puppeteer lets you work with a browser in headless mode, which makes it possible to do things like:

  • Create a PDF document and/or an image of a web page.
  • Automate form submission.
  • Scrape a website.
  • Perform automated UI tests.
  • Generate pre-rendered content for Single Page Applications (SPAs).

Using Puppeteer to Load a Web Page

Before we do anything with Puppeteer, we need to install it, and it is as easy as installing any other package with NPM.

npm i puppeteer

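That installation will download, among other things, a Chromium browser, so be patient; it may take a minute or two longer than a regular package. Depending on the Puppeteer version, the browser lands in your node modules folder or in a cache directory under your home folder.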

The next step is to write code to open up a browser and load a website:

import puppeteer from 'puppeteer';

(async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    
    await page.goto('https://livecodestream.dev');
})();

The code is quite simple. One key aspect, however, is that I purposely set the headless parameter of the launch function to false, so that when we run the code we can see the browser opening up and loading the page. Here are the results you should get:

Image of the browser window opened by Puppeteer with the website on the screen

But if we plan to run this in the background, or on a server somewhere, we don’t need to see the window, and we need to close the browser when we are done so that instances don’t pile up.

To close the browser you can use the close API:

await browser.close();

If you run it again, you’ll see how the browser opens, loads the website, and immediately after closes the browser and terminates the app. It all happens very fast (at least if you have a fast computer), so you probably won’t see much of the site.

Next, let’s remove the headless parameter, leaving its default value, where everything will happen without showing the user the browser window.

If you run it again, it’s like nothing ever happened. But don’t worry: if you look at your process monitor, you’ll see Chromium opening and closing there, so it’s all good.
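One thing worth keeping in mind: if page.goto throws (a typo in the URL, a timeout), the line with browser.close() is never reached and the Chromium process stays around. A minimal sketch of one way to guard against that is a try/finally wrapper. The name withBrowser is my own and not part of Puppeteer, and launch is assumed to be any function that resolves to a browser, for example () => puppeteer.launch().

```javascript
// Hypothetical helper (not part of Puppeteer): runs `task` with a browser and
// always closes that browser afterwards, even when `task` throws.
// `launch` is assumed to be a function resolving to a browser,
// e.g. () => puppeteer.launch().
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        // Hand the browser to the caller and pass its result through.
        return await task(browser);
    } finally {
        // Runs on success and on failure, so no Chromium process is left behind.
        await browser.close();
    }
}

// Possible usage, assuming `puppeteer` has been imported as in the example above:
//
// await withBrowser(() => puppeteer.launch(), async (browser) => {
//     const page = await browser.newPage();
//     await page.goto('https://livecodestream.dev');
// });
```

Passing the launch function in, instead of calling puppeteer.launch() inside the helper, keeps the helper independent of how you configure the browser.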


Using Puppeteer to Convert a Web Page into a PDF File

Now that we know how to load a web page, let’s convert it into a PDF. Interestingly enough, this uses the same functionality Chrome runs when you print a page to PDF from its regular print dialog.

The API to generate a PDF exists in the page object and can be used as follows:

await page.pdf({
    path: 'lcs.pdf',
    format: 'letter',
});

But we’ll also have to change a parameter on the page.goto API, because we need to define what it means for a page to be loaded. Is it when the load event fires? When the DOMContentLoaded event fires? Or is it something else?

Puppeteer offers a waitUntil property where you can specify exactly that, and it has four possible options:

  • load: consider navigation to be finished when the load event is fired.
  • domcontentloaded: consider navigation to be finished when the DOMContentLoaded event is fired.
  • networkidle0: consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
  • networkidle2: consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.

You should choose the proper one based on the site you are loading, but if you are in no rush to generate the PDF, you can always set it to networkidle2.
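Since waitUntil is just a string, a typo is easy to miss. Here is a small sketch, assuming you want to validate the value before handing it to page.goto. The helper name navigationOptions is hypothetical; the timeout option is part of page.goto and is expressed in milliseconds.

```javascript
// The four waitUntil values listed above.
const WAIT_UNTIL_VALUES = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];

// Hypothetical helper: builds the options object passed as the second
// argument of page.goto(url, options).
function navigationOptions(waitUntil = 'networkidle2', timeoutMs = 30000) {
    if (!WAIT_UNTIL_VALUES.includes(waitUntil)) {
        throw new Error(`Unknown waitUntil value: ${waitUntil}`);
    }
    // `timeout` is the maximum navigation time in milliseconds.
    return { waitUntil, timeout: timeoutMs };
}

// Possible usage with an existing page object:
//
// await page.goto('https://livecodestream.dev/', navigationOptions('networkidle0'));
```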

Here is the full example of the PDF generation source code:

import puppeteer from 'puppeteer';

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    
    await page.goto('https://livecodestream.dev/', {
      waitUntil: 'networkidle2'
    });
    
    await page.pdf({
        path: 'lcs.pdf',
        format: 'letter',
    });
    
    await browser.close();
})();
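The same flow can produce an image instead of a PDF by swapping page.pdf for page.screenshot. The sketch below assumes page is the object returned by browser.newPage(); the helper name captureScreenshot and the file name in the usage comment are made up for illustration.

```javascript
// Hypothetical helper: `page` is assumed to be a Puppeteer page
// obtained from browser.newPage().
async function captureScreenshot(page, url, path) {
    // Load the page the same way as in the PDF example.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // fullPage captures the whole scrollable page, not only the visible viewport.
    await page.screenshot({ path, fullPage: true });
    return path;
}

// Possible usage inside the async function from the example above:
//
// await captureScreenshot(page, 'https://livecodestream.dev/', 'lcs.png');
```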

Customizing the PDF Generation

If you open the PDF from the previous section, you’ll notice that some things are off: the resolution in which the website loaded is terrible, the margins are off, and there’s no background, so some things are hard to read.

Example of PDF generated with default values

Let’s fix that by passing more options to the pdf function.

The first option, printBackground, renders the backgrounds so things are easier to read.

await page.pdf({
    path: 'lcs.pdf',
    format: 'letter',
    printBackground: true,
});

Example PDF with backgrounds

Looking better, but I’d like to see the website in a higher resolution, not that mobile-like view of the site.

The rendering size of the page is controlled by the format, which in our case is set to letter. However, we can also specify a scale property, which takes a value anywhere from 0.1 to 2. A value below 1 shrinks the content, so more of the site fits on each page:

await page.pdf({
    path: 'lcs.pdf',
    format: 'letter',
    printBackground: true,
    scale: 0.5,
});

And the result:

Now the PDF looks more like the site looks in the desktop version

There are many more options you can customize. I recommend visiting the official docs, in particular the PDF options interface, to learn more about them.
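To give a taste of those extra options, here is a sketch that also sets the margins and the page orientation, which addresses the “margins are off” complaint from earlier. The helper buildPdfOptions and its defaults are my own assumptions; margin and landscape are options of page.pdf.

```javascript
// Hypothetical helper: assembles an options object for page.pdf().
// The default values are assumptions chosen for illustration.
function buildPdfOptions({ path, scale = 1, landscape = false, margin = '1cm' } = {}) {
    // Keep scale inside the 0.1 to 2 range mentioned above.
    const safeScale = Math.min(2, Math.max(0.1, scale));

    const options = {
        format: 'letter',
        printBackground: true,
        landscape,
        scale: safeScale,
        // The same margin on every side; each side accepts a value with units.
        margin: { top: margin, right: margin, bottom: margin, left: margin },
    };

    // Only write to disk when a path was requested.
    if (path) options.path = path;
    return options;
}

// Possible usage with an existing page object:
//
// await page.pdf(buildPdfOptions({ path: 'lcs.pdf', scale: 0.5 }));
```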


Writing a Puppeteer PDF API with Express

What we have learned so far is fantastic, but unless you plan to generate PDFs from the command line, it is not very useful on its own. One popular use of such code is as part of an API endpoint, and today we will build one using ExpressJS.

Before continuing, make sure you have express installed in your project:

npm i express

Then let’s wrap our existing code in an endpoint, and return the PDF file as part of the server response:

import puppeteer from 'puppeteer';
import express from 'express';

const app = express();

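app.get("/pdf", async (req, res) => {
    // Read the page to convert from the query string, e.g. /pdf?target=livecodestream.dev
    const target = req.query.target;
    // page.goto needs an absolute URL, so default to https:// when no protocol is given
    const url = /^https?:\/\//i.test(target) ? target : `https://${target}`;
    
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    
    await page.goto(url, {
        waitUntil: 'networkidle2'
    });
    
    // No path option here: the PDF is returned in memory instead of being written to disk
    const pdf = await page.pdf({
        format: 'letter',
        printBackground: true,
        scale: 0.5,
    });
    
    await browser.close();
    
    res.contentType("application/pdf");
    // Wrap in a Buffer so Express sends raw bytes (some Puppeteer versions return a Uint8Array)
    res.send(Buffer.from(pdf));
});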

app.listen(3000, () => {
    console.log("Server started");
});

Now start your server, and visit, for example:

http://localhost:3000/pdf?target=livecodestream.dev

There you go, an API that takes a target URL and renders a PDF version of it.

Notice: the code provided is very simple and does not include any error handling, so please be aware of that if you plan to use it in a production setup.
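As a starting point for that hardening, here is a rough sketch of the endpoint with input validation and a try/catch/finally around the browser work. The names parseTargetUrl and createPdfHandler, the status codes, and the error messages are all my own choices, and launch is assumed to be a function such as () => puppeteer.launch(). Note that only allowing http(s) URLs does not stop someone from pointing the service at addresses inside your own network, so treat this as a sketch rather than a complete solution.

```javascript
// Hypothetical helper: turns the `target` query value into an absolute
// http(s) URL, or returns null when it is missing or not acceptable.
function parseTargetUrl(raw) {
    if (typeof raw !== 'string' || raw.trim() === '') return null;
    const value = raw.trim();
    // Default to https:// when the caller did not include a protocol.
    const candidate = /^[a-z][a-z0-9+.-]*:\/\//i.test(value) ? value : `https://${value}`;
    try {
        const url = new URL(candidate);
        return url.protocol === 'http:' || url.protocol === 'https:' ? url.href : null;
    } catch {
        return null;
    }
}

// Hypothetical factory: returns an Express-style (req, res) handler.
// `launch` is assumed to resolve to a browser, e.g. () => puppeteer.launch().
function createPdfHandler(launch) {
    return async (req, res) => {
        const url = parseTargetUrl(req.query.target);
        if (!url) {
            res.status(400).send('Query parameter "target" must be an http(s) URL');
            return;
        }

        let browser;
        try {
            browser = await launch();
            const page = await browser.newPage();
            await page.goto(url, { waitUntil: 'networkidle2' });
            const pdf = await page.pdf({ format: 'letter', printBackground: true, scale: 0.5 });
            res.contentType('application/pdf');
            res.send(Buffer.from(pdf));
        } catch (err) {
            console.error(err);
            res.status(502).send('Could not render the requested page');
        } finally {
            // Always release the browser, whether rendering worked or not.
            if (browser) await browser.close().catch(() => {});
        }
    };
}

// Possible wiring, assuming `app` and `puppeteer` from the example above:
//
// app.get('/pdf', createPdfHandler(() => puppeteer.launch()));
```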


Notice on PDF Size

The PDFs generated by Puppeteer may not be optimal in file size, and how large they get depends a lot on the website. You can try to reduce the size by playing with the options, but often you’ll have to take a second step to compress the PDFs.

For most scenarios this won’t be necessary, but if size is a concern, know that you will need to take additional steps to reduce it.


Conclusion

Puppeteer is an excellent tool for converting web pages into other formats such as PDFs and images. It has a lot of options that allow for customization and optimization, and it is super easy to use.

In just a few lines of code we created a simple API that will perform such a task, and taking that code to production levels won’t be that hard.

Thanks for reading!
