HTML to PDF Generation using Puppeteer: From Basics to Advanced

Puppeteer
PDF
Docker
LocalStack
AWS
Lambda
Node.js
2024-07-17

Introduction

Converting HTML to PDF is essential for a wide variety of tasks such as generating invoices, digital receipts, or high-quality reports. With Puppeteer, a Node.js library that provides a robust API for headless Chrome/Chromium, we can produce consistently clean and professional PDFs that mirror the layout and styling of our web content. In this guide, I take you through the basics of Puppeteer’s HTML-to-PDF generation, walk through advanced customizations, and then scale it up with Docker, LocalStack, and AWS Lambda (using the AWS SAM CLI) so it’s ready for production. I’ll sprinkle in tips, best practices, and gotchas I’ve learned from building real-world PDF generation pipelines.

1. The Basics of Puppeteer PDF Generation

Puppeteer makes PDF generation simple. Under the hood, it launches a headless browser, renders your HTML content exactly as Chrome would, and exports the resulting layout as a PDF. This ensures layout fidelity, including CSS styling, responsive designs, and web fonts (if properly loaded).

Let’s jump into a minimal example. We provide Puppeteer with some inline HTML, tell it to render the page (allowing enough time for resources to load), and then produce a simple PDF:

Minimal HTML-to-PDF Example
import puppeteer from 'puppeteer';

async function generatePdfFromHtml(html: string, outputPath: string): Promise<void> {
  // Launch a headless browser
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();

    // Set the page content and wait until network activity has settled
    await page.setContent(html, { waitUntil: 'networkidle0' });

    // Generate the PDF and write it to outputPath
    await page.pdf({ path: outputPath });
  } finally {
    // Close the browser even if rendering fails
    await browser.close();
  }
}

// Usage example
(async () => {
  const sampleHtml = `<html>
  <head>
    <title>Sample PDF</title>
  </head>
  <body>
    <h1>Hello, Puppeteer!</h1>
    <p>This is a PDF generated from a simple HTML.</p>
  </body>
</html>`;

  await generatePdfFromHtml(sampleHtml, 'sample.pdf');
  console.log('PDF generated successfully!');
})();

This snippet is perfect for scenarios where you only need a straightforward PDF. But as soon as you want to add complex layout or incorporate real data, Puppeteer’s additional configuration options and automation features become indispensable.

2. Advanced Puppeteer PDF Options

Puppeteer’s page.pdf() method provides a wealth of options, including paper format, background printing, margins, and page ranges:

Advanced PDF Options
await page.pdf({
  path: 'custom.pdf',
  format: 'A4',
  printBackground: true,
  margin: {
    top: '1cm',
    right: '1cm',
    bottom: '1cm',
    left: '1cm',
  },
  pageRanges: '1-2', // specify pages to include
});

By mixing and matching these options, you can create a variety of specialized outputs, from smaller receipt-type prints to large-format pages with bleeds for more sophisticated design requirements.
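As a rough sketch of that mixing and matching, the helpers below build two option objects: a narrow receipt page sized with an explicit width and height, and a large landscape page with a page-number footer. The function names, the `PdfOptionsSketch` stand-in type, and the 80 mm receipt width are assumptions made for illustration. The option names themselves (width, height, landscape, displayHeaderFooter, footerTemplate) are ones page.pdf() accepts.

```typescript
// Local stand-in for the subset of Puppeteer's PDF options used here,
// so the sketch type-checks without importing puppeteer.
interface PdfOptionsSketch {
  format?: string;
  width?: string;
  height?: string;
  landscape?: boolean;
  printBackground?: boolean;
  displayHeaderFooter?: boolean;
  headerTemplate?: string;
  footerTemplate?: string;
  margin?: { top?: string; right?: string; bottom?: string; left?: string };
}

// Narrow receipt-style page: explicit width/height instead of a named format.
// The 80 mm default width is an illustrative assumption.
export function receiptPdfOptions(heightMm: number, widthMm = 80): PdfOptionsSketch {
  if (heightMm <= 0 || widthMm <= 0) {
    throw new RangeError('Page dimensions must be positive');
  }
  return {
    width: `${widthMm}mm`,
    height: `${heightMm}mm`,
    printBackground: true,
    margin: { top: '4mm', right: '4mm', bottom: '4mm', left: '4mm' },
  };
}

// Large landscape page with a "Page X of Y" footer.
export function reportPdfOptions(): PdfOptionsSketch {
  return {
    format: 'A3',
    landscape: true,
    printBackground: true,
    displayHeaderFooter: true,
    headerTemplate: '<span></span>', // empty header
    footerTemplate:
      '<div style="font-size:9px; width:100%; text-align:center;">' +
      'Page <span class="pageNumber"></span> of <span class="totalPages"></span>' +
      '</div>',
    // Leave room at the bottom so the footer is not clipped.
    margin: { top: '1cm', right: '1cm', bottom: '2cm', left: '1cm' },
  };
}
```

In real code you would type these with Puppeteer’s own PDFOptions and spread the result into the call, for example page.pdf({ path: 'receipt.pdf', ...receiptPdfOptions(200) }).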

3. Dockerizing Puppeteer

Anyone who’s tried running Puppeteer in production knows that missing system dependencies quickly become a stumbling block. To avoid “it works on my machine” issues, I wrap my Puppeteer-based projects in a Docker container that has all the necessary libraries and fonts. This ensures a consistent runtime environment wherever the container is deployed.

Dockerfile
# Dockerfile
FROM node:18-bullseye

# Install required dependencies for Chromium.
# gconf-service and libappindicator1 are left out because they are not
# installable from the Debian bullseye repositories this image is based on.
RUN apt-get update && apt-get install -y \
    libasound2 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libc6 \
    libcairo2 \
    libcups2 \
    libdbus-1-3 \
    libexpat1 \
    libfontconfig1 \
    libgcc1 \
    libgdk-pixbuf2.0-0 \
    libglib2.0-0 \
    libgbm-dev \
    libgtk-3-0 \
    libx11-6 \
    libx11-xcb1 \
    libxcb1 \
    libxcomposite1 \
    libxcursor1 \
    libxdamage1 \
    libxext6 \
    libxfixes3 \
    libxi6 \
    libxrandr2 \
    libxrender1 \
    libxss1 \
    libxtst6 \
    ca-certificates \
    fonts-ipafont-gothic \
    fonts-wqy-zenhei \
    fonts-thai-tlwg \
    fonts-kacst \
    fonts-freefont-ttf \
    libnss3 \
    lsb-release \
    xdg-utils \
    wget \
    --no-install-recommends \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .

# Expose a port for our service
EXPOSE 3000

CMD ["npm", "run", "start"]

Now, whenever I run docker build and docker run, my Puppeteer environment starts from the same set of system libraries and fonts, so there’s no more scrambling for missing fonts or GPU library errors.
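One thing the image alone does not solve: the node base image runs as root by default, and Chromium refuses to start as root unless its sandbox is disabled. Here is a minimal sketch of how launch flags might be chosen. The helper name and its `isRoot` parameter are hypothetical; the flags are standard Chromium switches. Disabling the sandbox is a trade-off that is best limited to containers rendering HTML you trust.

```typescript
// Hypothetical helper: picks Chromium flags for a containerized run.
export function chromiumArgsForContainer(isRoot: boolean): string[] {
  const args = [
    // Docker gives containers a small /dev/shm by default (64 MB);
    // this flag makes Chromium use /tmp for shared memory instead.
    '--disable-dev-shm-usage',
  ];
  if (isRoot) {
    // Chromium will not start as root with its sandbox enabled.
    args.push('--no-sandbox', '--disable-setuid-sandbox');
  }
  return args;
}
```

The result would be passed at launch time, for example puppeteer.launch({ args: chromiumArgsForContainer(process.getuid?.() === 0) }).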

4. Running Locally with Docker Compose and LocalStack

If you want to store your PDFs in S3 or need other AWS services, you’ll want to test your application locally without constant round-trips to the AWS cloud. LocalStack fits this well: it emulates AWS services, including S3, on your development machine.

docker-compose.yml
version: '3.8'

services:
  pdf-service:
    build: .
    container_name: pdf-service
    ports:
      - "3000:3000"
    environment:
      - AWS_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=test
      - AWS_SECRET_ACCESS_KEY=test
    # Start LocalStack before the PDF service
    depends_on:
      - localstack

  localstack:
    image: localstack/localstack
    container_name: localstack
    ports:
      - "4566:4566"
      - "4571:4571"
    environment:
      - SERVICES=s3
      - DEBUG=1
      - DATA_DIR=/tmp/localstack/data
      - AWS_DEFAULT_REGION=us-east-1
    volumes:
      - "./.localstack:/tmp/localstack"

After a quick docker-compose up, you’ll have both your PDF service and LocalStack running in tandem. You can now point your AWS SDK to LocalStack’s endpoint, treat your local environment as if it were AWS, and store PDFs in a “fake” S3 bucket during development.

5. Saving PDFs to S3 (Emulated with LocalStack)

To illustrate how we might store generated PDFs in an S3 bucket, here’s a short snippet using the AWS SDK v3 for JavaScript. Make sure you’ve installed @aws-sdk/client-s3 before running the code:

Generate PDF and Upload to S3
import puppeteer from 'puppeteer';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3Client = new S3Client({
  region: process.env.AWS_REGION || 'us-east-1',
  endpoint: 'http://localstack:4566',
  forcePathStyle: true, // needed for localstack
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID || 'test',
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY || 'test',
  },
});

async function generatePdfAndUpload(html: string, bucketName: string, key: string): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent(html, { waitUntil: 'networkidle0' });

  // Instead of writing to a path, we generate a buffer
  const pdfBuffer = await page.pdf();
  await browser.close();

  // Upload to S3. The bytes are passed directly as the body so the SDK
  // knows the content length up front; a stream of unknown length would not provide it.
  const putParams = {
    Bucket: bucketName,
    Key: key,
    Body: pdfBuffer,
    ContentType: 'application/pdf',
  };
  await s3Client.send(new PutObjectCommand(putParams));
  console.log(`PDF uploaded to S3 as ${key}`);
}

export { generatePdfAndUpload };

It’s often more flexible to generate a PDF as an in-memory buffer rather than writing it to the local file system, especially when working with cloud services or microservices. This snippet uploads the PDF to “S3” (LocalStack, in our case) and keeps your Docker container’s file system usage minimal.
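The snippet above hardcodes the LocalStack endpoint, which is fine locally but would follow the code into any real deployment. One possible way around that, sketched under assumptions, is to derive the client options from the environment. `S3_ENDPOINT` is a hypothetical variable name, and `resolveS3Config` and `S3ConfigSketch` are illustrative names rather than part of the AWS SDK.

```typescript
// Shape of the options object passed to `new S3Client(...)` in the snippet above.
interface S3ConfigSketch {
  region: string;
  endpoint?: string;
  forcePathStyle?: boolean;
  credentials?: { accessKeyId: string; secretAccessKey: string };
}

// S3_ENDPOINT is a hypothetical variable name; set it only when targeting LocalStack.
export function resolveS3Config(env: Record<string, string | undefined>): S3ConfigSketch {
  const region = env.AWS_REGION || 'us-east-1';
  const endpoint = env.S3_ENDPOINT;

  if (!endpoint) {
    // Real AWS: no endpoint override, and credentials are left to the
    // SDK's default provider chain.
    return { region };
  }

  return {
    region,
    endpoint,
    forcePathStyle: true, // needed for LocalStack
    credentials: {
      accessKeyId: env.AWS_ACCESS_KEY_ID || 'test',
      secretAccessKey: env.AWS_SECRET_ACCESS_KEY || 'test',
    },
  };
}
```

With this in place the client would be created as new S3Client(resolveS3Config(process.env)), and the Compose file would set S3_ENDPOINT=http://localstack:4566 for the PDF service.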

6. Deploying as an AWS Lambda with AWS SAM CLI

Building a local Docker-based environment is great, but eventually, you might want to deploy your PDF generation service to AWS. By combining Puppeteer with AWS Lambda, you can build a highly scalable PDF generation function that only costs money when it’s used—a perfect pay-per-request model. AWS SAM (Serverless Application Model) CLI can simplify both packaging and deploying your Lambda.

template.yaml
Transform: AWS::Serverless-2016-10-31
Description: PDF generation service

Resources:
  PdfLambda:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs18.x
      CodeUri: ./dist
      Timeout: 30
      MemorySize: 1024 # headless Chromium needs far more than the 128 MB default
      Policies:
        # SAM policy template that limits the function to writing into this bucket
        - S3WritePolicy:
            BucketName: my-pdf-bucket
      Environment:
        Variables:
          AWS_NODEJS_CONNECTION_REUSE_ENABLED: "1"
          BUCKET_NAME: "my-pdf-bucket"

Outputs:
  PdfLambdaFunction:
    Description: "PDF Lambda Function ARN"
    Value: !GetAtt PdfLambda.Arn

And here’s a minimal index.ts file that picks up the incoming payload, generates a PDF, and stores it in S3:

index.ts (Lambda Entry)
// index.ts - Lambda entry point
import { APIGatewayEvent, Context } from 'aws-lambda';
import { generatePdfAndUpload } from './pdfService'; // your Puppeteer logic

export const handler = async (event: APIGatewayEvent, _context: Context) => {
  const bucketName = process.env.BUCKET_NAME;
  if (!bucketName) {
    // Fail fast instead of attempting an upload with an empty bucket name
    return {
      statusCode: 500,
      body: JSON.stringify({ error: 'BUCKET_NAME is not set' }),
    };
  }

  const html = event.body || '<h1>Hello from Lambda</h1>';

  // Derive key from event or time
  const key = `test-${Date.now()}.pdf`;

  await generatePdfAndUpload(html, bucketName, key);

  return {
    statusCode: 200,
    body: JSON.stringify({ message: 'PDF generated and uploaded!', key }),
  };
};
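The handler above takes event.body as-is and names the object by timestamp. Two small refinements are sketched here under assumptions: decoding the body when API Gateway marks it as base64-encoded (the proxy event carries an isBase64Encoded flag), and deriving a date-partitioned key from an identifier. The helper names, the `ProxyEventLike` stand-in type, and the "pdfs" prefix are hypothetical.

```typescript
// Minimal stand-in for the two API Gateway proxy event fields used here.
interface ProxyEventLike {
  body: string | null;
  isBase64Encoded: boolean;
}

// Returns the HTML payload, decoding it when it arrived base64-encoded.
export function extractHtml(event: ProxyEventLike): string | null {
  if (!event.body) return null;
  return event.isBase64Encoded
    ? Buffer.from(event.body, 'base64').toString('utf-8')
    : event.body;
}

// Builds a date-partitioned key such as "pdfs/2024/07/17/<id>.pdf".
// The "pdfs" prefix and the layout are illustrative assumptions.
export function buildPdfKey(id: string, now: Date = new Date(), prefix = 'pdfs'): string {
  const yyyy = now.getUTCFullYear();
  const mm = String(now.getUTCMonth() + 1).padStart(2, '0');
  const dd = String(now.getUTCDate()).padStart(2, '0');
  // Replace anything outside a conservative character set with "_".
  const safeId = id.replace(/[^A-Za-z0-9_-]/g, '_');
  return `${prefix}/${yyyy}/${mm}/${dd}/${safeId}.pdf`;
}
```

Inside the handler, the Lambda request ID would be a natural identifier, for example buildPdfKey(context.awsRequestId).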

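If your resulting deployment package is too large (Puppeteer can be hefty!), consider using a dedicated Puppeteer “layer” or a minimal Chromium build that reduces the overall size, for example puppeteer-core paired with a Lambda-oriented Chromium package such as @sparticuz/chromium. AWS Lambda Layers let you share common dependencies across multiple functions and reduce your per-function deploy size. Before deploying, also make sure the S3 client no longer points at the LocalStack endpoint or uses the test credentials; in Lambda, the SDK should pick up its region and credentials from the function’s execution role.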

7. Handling Dynamic Data & Templating

Often, you won’t just be dumping static HTML into Puppeteer. You’ll need to inject dynamic data—like user info, purchase histories, or real-time analytics—into your document. For this scenario, templating engines are your friend.

You might choose libraries like ejs, handlebars, or pug. For example, with Handlebars, you can separate your presentation (HTML layout) from your logic, making the code more maintainable:

Install Handlebars
npm install handlebars
Using Handlebars
import Handlebars from 'handlebars';
import fs from 'fs/promises';

async function generateDynamicPdf() {
  // Load an HTML template from disk
  const template = await fs.readFile('./invoiceTemplate.html', 'utf-8');
  const compileTemplate = Handlebars.compile(template);

  // Data to be inserted in the template
  const data = {
    customerName: 'Jane Doe',
    items: [
      { description: 'Laptop', price: 1599 },
      { description: 'Monitor', price: 299 },
    ],
    total: 1898,
  };

  const htmlWithData = compileTemplate(data);

  // Now pass htmlWithData to Puppeteer as before
}

This approach ensures that your PDFs can adapt to a wide range of data inputs while keeping your templates organized and straightforward to modify.
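In the example above the total is typed in by hand, which can drift from the line items. A small sketch of preparing the template data instead: it computes the total and adds display-ready currency strings. The `priceLabel` and `totalLabel` field names are hypothetical, and a template would need to reference them; US dollars are assumed only because the example does not name a currency.

```typescript
interface InvoiceItem {
  description: string;
  price: number; // whole currency units, as in the example above
}

// Formats a number as US dollars, e.g. 1898 -> "$1,898.00".
const usd = new Intl.NumberFormat('en-US', { style: 'currency', currency: 'USD' });

// Builds the object handed to the compiled template.
export function buildInvoiceData(customerName: string, items: InvoiceItem[]) {
  // Derive the total from the items instead of hardcoding it.
  const total = items.reduce((sum, item) => sum + item.price, 0);
  return {
    customerName,
    items: items.map((item) => ({ ...item, priceLabel: usd.format(item.price) })),
    total,
    totalLabel: usd.format(total),
  };
}
```

The result can be passed straight to the compiled template, as in compileTemplate(buildInvoiceData('Jane Doe', items)).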

8. Troubleshooting Common Issues and Performance Tips

Anytime I’ve used Puppeteer for PDF generation, I’ve come across a few repeat issues or performance pitfalls. Here’s a quick overview:

These practices can help resolve common headaches and keep your PDF service stable, even when the real world doesn’t always match a controlled workshop environment.
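One performance practice that often comes up with Puppeteer, offered here as an illustration rather than as something taken from this article’s own list, is to launch the browser once and reuse it across requests instead of paying the startup cost for every PDF. A minimal sketch, with `shareInstance` as a hypothetical helper name:

```typescript
// Wraps an expensive async factory so it runs at most once;
// later calls reuse the same promise.
export function shareInstance<T>(create: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = create().catch((err) => {
        // Forget a failed launch so the next call can retry.
        pending = undefined;
        throw err;
      });
    }
    return pending;
  };
}
```

With Puppeteer this might look like const getBrowser = shareInstance(() => puppeteer.launch()); each request would then open its own page from the shared browser and close only that page when it is done.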

9. A Full Example: LocalStack-Hosted PDF Service with S3 Storage

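Let’s tie together the Docker + Puppeteer + LocalStack trifecta in a single sample application. Our final service will accept HTML via a POST request, render it to a PDF with Puppeteer, upload the result to an S3 bucket emulated by LocalStack, and respond with JSON containing the object key.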

Below is a rough directory structure and relevant code snippets to get you running:

Project Structure
my-pdf-service/
├── Dockerfile
├── docker-compose.yml
├── src/
│   ├── index.ts
│   ├── pdfService.ts
│   └── server.ts
├── package.json
└── tsconfig.json

server.ts: A simple Express server to accept HTML via POST, forward it along to our Puppeteer logic, and return a JSON response.

server.ts
import express from 'express';
import bodyParser from 'body-parser';
import { generatePdfAndUpload } from './pdfService';

const app = express();
app.use(bodyParser.text({ type: '*/*' }));

app.post('/pdf', async (req, res) => {
  try {
    const html = req.body;
    // body-parser leaves req.body as {} when there is no body, so check the type too
    if (typeof html !== 'string' || html.trim() === '') {
      return res.status(400).send({ error: 'No HTML provided' });
    }

    const bucket = 'my-local-bucket';
    const key = `pdf-${Date.now()}.pdf`;

    await generatePdfAndUpload(html, bucket, key);
    res.status(200).send({ message: 'PDF generated and uploaded!', key });
  } catch (error) {
    console.error(error);
    res.status(500).send({ error: 'Something went wrong!' });
  }
});

const port = process.env.PORT || 3000;
app.listen(port, () => {
  console.log(`PDF service listening on port ${port}`);
});

pdfService.ts: Our Puppeteer logic plus integration with LocalStack’s S3 endpoint.

pdfService.ts
import puppeteer from 'puppeteer';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3Client = new S3Client({
  region: 'us-east-1',
  endpoint: 'http://localstack:4566',
  forcePathStyle: true,
  credentials: {
    accessKeyId: 'test',
    secretAccessKey: 'test',
  },
});

export async function generatePdfAndUpload(html: string, bucketName: string, key: string): Promise<void> {
  // The container runs as root, and Chromium will not start as root with its
  // sandbox enabled. Only disable the sandbox when rendering trusted HTML.
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  let pdfBuffer: Uint8Array;
  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: 'networkidle0' });
    pdfBuffer = await page.pdf({ format: 'A4', printBackground: true });
  } finally {
    // Close the browser even if rendering fails
    await browser.close();
  }

  // Pass the bytes directly so the SDK can send a known content length
  const putParams = {
    Bucket: bucketName,
    Key: key,
    Body: pdfBuffer,
    ContentType: 'application/pdf',
  };
  await s3Client.send(new PutObjectCommand(putParams));
}

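With these files in place, along with the earlier docker-compose.yml and Dockerfile, you’re ready to run docker-compose up. Once everything starts, create the bucket the server writes to (for example, aws --endpoint-url=http://localhost:4566 s3 mb s3://my-local-bucket), then test the service using a simple cURL command: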

cURL Request
curl -X POST \
  -H "Content-Type: text/plain" \
  --data "<h1>Hello Container!</h1>" \
  http://localhost:3000/pdf

That’s it! You’ll get a JSON response telling you the PDF has been successfully stored in your LocalStack S3 bucket. This system closely mirrors a live cloud environment but keeps your development loops pleasantly fast and offline.

10. Conclusion

We’ve taken quite the journey—starting with a simple HTML-to-PDF approach using Puppeteer, then scaling up to containerization, local AWS emulation via LocalStack, and even serverless deployment with AWS Lambda. This robust workflow allows you to develop, test, and deploy PDF generation pipelines quickly, ensuring consistent results from local environments all the way to production.

By layering Docker, LocalStack, and AWS SAM on top of Puppeteer’s PDF generation features, you’re equipped to tackle everything from simple invoice creation to complex, on-the-fly, data-driven reports. I hope this tutorial has provided a clear and enjoyable path to building your own professional-grade PDF generation service.

Whether you need dynamic templating, specialized fonts, or advanced print layouts, you can fine-tune your setup to match your unique requirements. Puppeteer’s flexibility combined with microservices, containers, and the serverless paradigm sets you up for success in just about any environment.

Further Reading

Additional resources to deepen your understanding:

Key Resources

Puppeteer GitHub Repository

Official source code for Puppeteer – indispensable for PDF generation

LocalStack Documentation

Everything you need to emulate AWS services locally

AWS SAM CLI Documentation

Official docs for building serverless apps using AWS SAM
