phantom.py: A lean replacement for bulky headless browser frameworks

Note: This post is 6 years old. Some information may no longer be correct or even relevant. Please, keep this in mind while reading.

This is a simple but fully scriptable headless QtWebKit browser using PyQt5 in Python3, specialized in executing external JavaScript and generating PDF files. A lean replacement for other bulky headless browser frameworks. (Source code at end of this post as well as in this github gist)

Usage

If you have a display attached:

./phantom.py <url> <pdf-file> [<javascript-file>]

If you don’t have a display attached (i.e. on a remote server):

xvfb-run ./phantom.py <url> <pdf-file> [<javascript-file>]

Arguments:

  • <url> Can be a http(s) URL or a path to a local file
  • <pdf-file> Path and name of PDF file to generate
  • [<javascript-file>] (optional) Path and name of a JavaScript file to execute

Features

  • Generate a PDF screenshot of the web page after it is completely loaded.
  • Optionally execute a local JavaScript file specified by the argument <javascript-file> after the web page is completely loaded, and before the PDF is generated.
  • Output from console.log() is printed to stdout.
  • Easily add new features by changing the source code of this script, without compiling C++ code. For more advanced applications, consider attaching PyQt objects/methods to WebKit’s JavaScript space by using QWebFrame::addToJavaScriptWindowObject().

If you execute an external <javascript-file>, phantom.py has no way of knowing when that script has finished doing its work. For this reason, the external script should execute console.log("__PHANTOM_PY_DONE__"); when done. This will trigger the PDF generation, after which phantom.py will exit. If no __PHANTOM_PY_DONE__ string is seen on the console for 10 seconds, phantom.py will exit without doing anything. This behavior could be implemented more elegantly without console.log’s but it is the simplest solution.

It is important to remember that since you’re just running WebKit, you can use everything that WebKit supports, including the usual JS client libraries, CSS, CSS @media types, etc.
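If you want to call phantom.py from another Python program, the usage above can be wrapped roughly like this. It is only a sketch: the helper names `build_command` and `render` are made up for illustration, and it assumes that phantom.py sits in the current working directory. Because phantom.py exits without writing anything when the magic string never shows up, the sketch judges success by checking whether the PDF file exists afterwards.

```python
import os
import subprocess


def build_command(url, pdf_file, js_file=None, use_xvfb=False):
    """Return the argument list for one phantom.py run (hypothetical helper)."""
    cmd = ["xvfb-run"] if use_xvfb else []
    cmd += ["python3", "phantom.py", url, pdf_file]
    if js_file is not None:
        cmd.append(js_file)
    return cmd


def render(url, pdf_file, js_file=None, use_xvfb=False):
    """Run phantom.py and report whether the PDF was written."""
    if os.path.exists(pdf_file):
        os.remove(pdf_file)  # avoid mistaking a stale file for success
    subprocess.run(build_command(url, pdf_file, js_file, use_xvfb))
    return os.path.exists(pdf_file)


# Only prints the command; call render(...) to actually run it.
print(build_command("/tmp/test.html", "/tmp/out.pdf", "/tmp/test.js", use_xvfb=True))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.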

Dependencies

  • Python3
  • PyQt5
  • xvfb (optional for display-less machines)

Installation of dependencies in Debian Stretch is easy:

apt-get install xvfb python3-pyqt5 python3-pyqt5.qtwebkit

Finding the equivalent for other OSes is an exercise that I leave to you.

Examples

Given the following file /tmp/test.html:

<html>
  <body>
    <p>foo <span id="id1">foo</span> <span id="id2">foo</span></p>
  </body>
  <script>
    document.getElementById('id1').innerHTML = "bar";
  </script>
</html>

… and the following file /tmp/test.js:

document.getElementById('id2').innerHTML = "baz";
console.log("__PHANTOM_PY_DONE__");

… and running this script (without attached display) …

xvfb-run python3 phantom.py /tmp/test.html /tmp/out.pdf /tmp/test.js

… you will get a PDF file /tmp/out.pdf with the contents “foo bar baz”.

Note that the second occurrence of “foo” has been replaced by the web page’s own script, and the third occurrence of “foo” by the external JS file.

Source Code

"""
# phantom.py

Simple but fully scriptable headless QtWebKit browser using PyQt5 in Python3,
specialized in executing external JavaScript and generating PDF files. A lean
replacement for other bulky headless browser frameworks.

Copyright 2017 Michael Karl Franzl

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

"""

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtPrintSupport import QPrinter
from PyQt5.QtCore import QTimer
import traceback

  
class Render(QWebPage):
  def __init__(self, url, outfile, jsfile):
    self.app = QApplication(sys.argv)
    
    QWebPage.__init__(self)

    self.jsfile = jsfile
    self.outfile = outfile
    
    qurl = QUrl.fromUserInput(url)
    
    print("phantom.py: URL=", qurl, "OUTFILE=", outfile, "JSFILE=", jsfile)
    
    # The PDF generation only happens when the special string __PHANTOM_PY_DONE__
    # is sent to console.log(). The following JS string will be executed by
    # default, when no external JavaScript file is specified.
    self.js_contents = "setTimeout(function() { console.log('__PHANTOM_PY_DONE__') }, 500);"
    
    if jsfile:
      try:
        # Read the external script; "with" closes the file even on errors
        with open(self.jsfile) as f:
          self.js_contents = f.read()
      except Exception:
        print(traceback.format_exc())
        self._exit(10)
        
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(qurl)
    self.javaScriptConsoleMessage = self._onConsoleMessage
    
    # Run for a maximum of 10 seconds
    watchdog = QTimer()
    watchdog.setSingleShot(True)
    watchdog.timeout.connect(lambda: self._exit(1))
    watchdog.start(10000)
    
    self.app.exec_()
    
    
  def _onConsoleMessage(self, txt, lineno, filename):
    print("CONSOLE", lineno, txt, filename)
    if "__PHANTOM_PY_DONE__" in txt:
      # If we get this magic string, it means that the external JS is done
      self._print()
      self._exit(0)
  
  
  def _loadFinished(self, result):
    print("phantom.py: Loading finished!")
    print("phantom.py: Evaluating JS from", self.jsfile)
    self.frame = self.mainFrame()
    self.frame.evaluateJavaScript(self.js_contents)
    

  def _print(self):
    print("phantom.py: Printing...")
    printer = QPrinter()
    printer.setPageMargins(10, 10, 10, 10, QPrinter.Millimeter)
    printer.setPaperSize(QPrinter.A4)
    printer.setCreator("phantom.py by Michael Karl Franzl")
    printer.setOutputFormat(QPrinter.PdfFormat)
    printer.setOutputFileName(self.outfile)
    self.frame.print(printer)
    
  def _exit(self, val):
    print("phantom.py: Exiting with val", val)
    self.app.exit(val) # Qt exit
    exit(val) # Python exit
    
    
def main():
  if (len(sys.argv) < 3):
    print("USAGE: ./phantom.py <url> <pdf-file> [<javascript-file>]")
  else:
    url = sys.argv[1]
    outfile = sys.argv[2]
    jsfile = sys.argv[3] if len(sys.argv) > 3 else None
    r = Render(url, outfile, jsfile)


if __name__ == "__main__":
  main()

Digitize books: Searchable OCR PDF with text overlay from scanned or photographed books on Linux

Note: This post is 7 years old. Some information may no longer be correct or even relevant. Please, keep this in mind while reading.

Here is my method to digitize books. It is a tutorial about how to produce searchable, OCR (Optical Character Recognition) PDFs from a hardcopy book using free software tools on Linux distributions. You probably can find more convenient proprietary software, but that’s not the objective of this post.

Important: Depending on the copyright attached to a particular work, you may not be allowed to digitize it. Please inform yourself beforehand so that you don’t break copyright laws!

Digitize books

To scan a book, you basically have 2 choices:

  1. Scan each double page with a flatbed scanner
  2. Take a good photo camera, mount it on a tripod, have it point vertically down on the book, and then take photos of each double page. Professional digitizers use this method due to less strain on the originals.

No matter which method, the accuracy of OCR increases with the resolution and contrast of the images. The resolution should be high enough so that each letter is at least 25 pixels tall.
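As a rough worked example of the 25-pixel rule: if you know how tall a typical letter is on paper, you can estimate the image size you need. The numbers below (2 mm tall letters on a 240 mm tall page) are made-up measurements, not values from a particular book, and the two helper functions are purely illustrative.

```python
import math


def required_image_height(page_height_mm, letter_height_mm, min_letter_px=25):
    """Pixels needed along the page height so that letters reach min_letter_px."""
    pixels_per_mm = min_letter_px / letter_height_mm
    return math.ceil(page_height_mm * pixels_per_mm)


def required_dpi(letter_height_mm, min_letter_px=25):
    """Equivalent scanner resolution in dots per inch (1 inch = 25.4 mm)."""
    return math.ceil(min_letter_px / letter_height_mm * 25.4)


# Assumed example: 2 mm tall letters on a 240 mm tall page.
print(required_image_height(240, 2))  # 3000
print(required_dpi(2))                # 318
```

So for this assumed book, a scanner setting of 300 dpi would fall slightly short, and a camera would need about 3000 pixels along the page height.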

Since taking a photo is almost instant, you can be much faster with the photographing method than using a flatbed scanner. This is especially true for voluminous books which are hard to repeatedly take on and off a scanner. However, getting sharp high-resolution images with a camera is more difficult than using a flatbed scanner. So it’s a tradeoff that depends on your situation, equipment and your skills.

Using a flatbed scanner doesn’t need explanation, so I’ll only explain the photographic method next.

Photographing each page

If you use a camera, and you don’t have some kind of remote trigger or interval-trigger at hand, you would need 2 people: someone who operates the camera, and another one who flips the pages. You can easily scan 1 double page every 2 seconds once you get more skilled in the process.

Here are the steps:

  • Set the camera on a tripod and have it point vertically down. The distance between camera and book should be at least 1 meter to approximate orthogonal projection (imitates a flatbed scanner). Too much perspective projection would skew the text lines.
  • Place the book directly under the camera – avoid pointing the camera at any non-90-degree angles that would cause perspective skewing of the contents. Later we will unskew the images, but the less skewing you get at this point, the better.
  • Set up uniform lighting, as bright as you are able. Optimize lighting directions to minimize possible shadows (especially in the book fold). Don’t place the lights near the camera or it will cause reflections on paper or ink.
  • Set the camera to manual mode. Use JPG format. Turn the camera flash off. All pictures need to have uniform exposure characteristics to make later digital processing easier.
  • Maximize zoom so that a margin of about 1 cm around the book is still visible. This way, aligning of the book will take less time. The margin will be cropped later.
  • Once zoom and camera position are finalized, mark the position of the book on the table with tape. After moving the book, place it back onto the original position with the help of these marks.
  • Take test pictures. Inspect and optimize the results by finding a balance between the following camera parameters:
    • Minimize aperture size (high f/value) to get sharper images.
    • Maximize ISO value to minimize exposure time so that wiggling of the camera has less of an effect. Bright lighting lets you lower the ISO, which reduces noise.
    • Maximize resolution so that the letter size in the photos is at least 25 pixels tall. This will be important to increase the quality of the OCR step below, and you’ll need a good camera for this.
  • Take one picture of each double page.
One double page of a book that will be digitized. This is actually a scan, but you also can use a good photo camera. Make sure that letters are at least 25 pixels tall. Note that the right page is slightly rotated.

Image Preprocessing

Let’s remember our goal: We want a PDF …

  • which is searchable (the text should be selectable)
  • whose file size is minimized
  • has the same paper size as the original
  • is clearly legible

The following steps are the preprocessing steps to accomplish this. We will use ImageMagick command line tools (available for all platforms) and a couple of other tools, all available for Linux distributions.

A note on image formats

Your input files can be JPG or TIFF, or whatever format your scanner or camera supports. However, this format must also be supported by ImageMagick. We’ll convert these images to black-and-white PBM images to save space and speed up further processing. PBM is a very simple, uncompressed image format that only stores 1 bit per pixel (2 colors). This image format can be embedded into the PDF directly, and it will be losslessly compressed extremely well, resulting in the smallest possible PDF size.
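To make the "1 bit per pixel" point concrete, here is a small Python sketch that encodes a tiny image in the binary ("P4") variant of PBM as defined by Netpbm, and computes the size of the raw pixel data. The function names are made up for illustration, and the 2400×2000 size is simply the crop size used as an example further down.

```python
def pbm_bytes(rows):
    """Encode rows of 0/1 pixels (1 = black) as a binary "P4" PBM file."""
    height = len(rows)
    width = len(rows[0])
    data = bytearray(("P4\n%d %d\n" % (width, height)).encode("ascii"))
    for row in rows:
        # Pack 8 pixels per byte, most significant bit first;
        # each row is padded with zero bits to a whole byte.
        for start in range(0, width, 8):
            byte = 0
            for i, pixel in enumerate(row[start:start + 8]):
                if pixel:
                    byte |= 0x80 >> i
            data.append(byte)
    return bytes(data)


def raw_pixel_bytes(width, height):
    """Size of the pixel data alone; rows are padded to whole bytes."""
    return ((width + 7) // 8) * height


print(pbm_bytes([[1, 0, 1], [0, 1, 0]]))
print(raw_pixel_bytes(2400, 2000))  # 600000
```

A 2400×2000 double page therefore needs about 600 kB before compression, compared to 4.8 MB for the same image with 8 bits per pixel.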

Find processing parameters by using just a single image

Before we process all the images as a batch, we’ll just pick one image and find the right processing parameters. Copy one photograph into a new empty folder and do the following steps.

Converting to black and white

Suppose we have chosen one image in.JPG. Run:

convert -normalize -threshold 50% -brightness-contrast 0x10 in.JPG 1blackwhite.pbm

Inspect the generated 1blackwhite.pbm file. Optimize the parameters threshold (50% in above example), brightness (0 in above example), and contrast (10 in above example) for best legibility of the text.
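If the threshold parameter feels abstract, the following toy model in Python shows what thresholding does to grayscale values. It is a deliberate simplification under my own assumptions (8-bit values, no normalize step, made-up pixel values) and not ImageMagick's actual code.

```python
def threshold(gray_rows, percent=50, max_value=255):
    """Map grayscale values to PBM pixels: 1 = black, 0 = white.

    Values above the limit become white, everything else becomes black.
    """
    limit = max_value * percent / 100
    return [[0 if value > limit else 1 for value in row] for row in gray_rows]


# One made-up row: dark ink on the left, bright paper on the right.
row = [[0, 100, 128, 200, 255]]
print(threshold(row, percent=50))  # [[1, 1, 0, 0, 0]]
print(threshold(row, percent=80))  # [[1, 1, 1, 1, 0]]
```

A higher threshold turns more of the mid-gray pixels black, which thickens faint letters but also darkens shadows and stains.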

Black-white conversion of the original image. Contrast and resolution are important.

Cropping away the margins

Next we will crop away the black borders so that the image will correspond to the paper size.

convert -crop 2400x2000+760+250 1blackwhite.pbm 2cropped.pbm

In this example, the cropped image will be a rectangle of 2400×2000 pixels, taken from the offset 760,250 of the input image. Inspect 2cropped.pbm and adjust the parameters until they are right; this takes some trial and error. The vertical book fold should be very close to the horizontal middle of the cropped image (important for the next step).
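One way to cut down on the trial and error is to read the corner coordinates of the page area off an image viewer and compute the geometry string from them. The sketch below does that in Python; the helper names and the fold position of 1975 are hypothetical, and the corner coordinates are chosen to reproduce the example geometry above.

```python
def crop_geometry(left, top, right, bottom):
    """Build a WxH+X+Y geometry string from pixel coordinates."""
    width = right - left
    height = bottom - top
    if width <= 0 or height <= 0:
        raise ValueError("right/bottom must be larger than left/top")
    return "%dx%d+%d+%d" % (width, height, left, top)


def fold_offset(left, right, fold_x):
    """Distance in pixels of the book fold from the horizontal middle of the crop."""
    return fold_x - (left + right) / 2


print(crop_geometry(760, 250, 3160, 2250))  # 2400x2000+760+250
print(fold_offset(760, 3160, 1975))         # 15.0
```

A positive offset means the fold sits to the right of the middle, so shifting both the left and right edge by that amount would center it.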

The cropped image

Split double pages into single pages

convert +repage -crop 50%x100% +repage 2cropped.pbm split%04d.pbm

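This will generate 2 images. Inspect both of them. Note that ImageMagick numbers the output files starting from 0 by default, so unless you pass a -scene option they will be named split0000.pbm and split0001.pbm. Use 50% for the horizontal cut; a smaller percentage produces more than 2 images.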
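Why the percentage matters can be seen with a simplified model of tile cropping: when no offset is given, the image is covered with tiles of the requested size, and leftover strips become extra, smaller tiles. The Python sketch below is my own model of that behavior, with the assumption that tile sizes are simply rounded to whole pixels; the real tool may round differently for odd image sizes.

```python
def tile_boxes(width, height, percent_w, percent_h=100):
    """Return (x, y, w, h) boxes that cover the image with equally sized tiles.

    Assumption: the tile size is the given percentage of the image size,
    rounded to whole pixels. Tiles at the right and bottom edge may be smaller.
    """
    tile_w = max(1, round(width * percent_w / 100))
    tile_h = max(1, round(height * percent_h / 100))
    boxes = []
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            boxes.append((x, y, min(tile_w, width - x), min(tile_h, height - y)))
    return boxes


print(tile_boxes(2400, 2000, 50))       # two 1200x2000 halves
print(len(tile_boxes(2400, 2000, 40)))  # 3
```

With 40%, the third tile is the leftover 480 pixel wide strip, which is why the cut has to be at 50% and the fold has to sit in the middle of the cropped image.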

Left split page
Right split page

Deskewing the image

Your text lines are probably not exactly horizontal (page angles, camera angles, perspective distortion, etc.). However, having exactly horizontal text lines is very important for accuracy of OCR software. We can deskew an image with the following command:

convert -deskew 40% split0001.pbm 3deskewed.pbm

Inspect the output file 3deskewed.pbm for best results.

The deskewed left page
The deskewed right page. Notice that the text lines are now perfectly horizontal. However, deskewing can have its limits, so the original image should already be good!

Process all the images

Now that you’ve found the parameters that work for you, it’s simple to convert all of your images as a batch, by passing all the parameters at the same time to convert. Run the following in the folder where you stored all the JPG images (not in the folder where you did the previous single-image tests):

convert -normalize -threshold 50% -brightness-contrast 0x10 -crop 2400x2000+760+250 +repage -crop 50%x100% +repage -deskew 40% *.JPG book%04d.pbm

Now, for each .JPG input file, we’ll have two .pbm output files. Inspect all .pbm files and make manual corrections if needed.
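If you digitize more than one book, it can be handy to keep the tuned parameters in one place and assemble the batch command from them. The following Python sketch does this; `batch_convert_command` is a made-up helper, and its default values are simply the example values from this post, so replace them with your own.

```python
import glob
import subprocess


def batch_convert_command(inputs, threshold=50, brightness=0, contrast=10,
                          crop="2400x2000+760+250", deskew=40,
                          output_pattern="book%04d.pbm"):
    """Assemble the batch convert call from the tuned parameters."""
    return [
        "convert",
        "-normalize",
        "-threshold", "%d%%" % threshold,
        "-brightness-contrast", "%dx%d" % (brightness, contrast),
        "-crop", crop, "+repage",
        "-crop", "50%x100%", "+repage",
        "-deskew", "%d%%" % deskew,
    ] + list(inputs) + [output_pattern]


# Sort the input files so that the page order follows the file names.
inputs = sorted(glob.glob("*.JPG"))
print(" ".join(batch_convert_command(inputs)))
# To actually run it (requires ImageMagick):
# subprocess.run(batch_convert_command(inputs), check=True)
```

The script only prints the command by default, so you can compare it with the one above before letting it loose on your photos.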

Note: If you have black borders on the pages, consider using unpaper to remove them. I’ll save writing about using unpaper for a later time.

 

Producing OCR PDFs with text overlay

The tesseract OCR engine can generate PDFs with a selectable text layer directly from our PBM images. Since OCR is CPU intensive, we’ll make use of parallel processing on all of our CPU cores with the parallel tool. You can install both by running

apt-get install tesseract-ocr parallel

For each PBM file, create one PDF file:

find . -name '*.pbm' | parallel 'echo "Processing {}"; tesseract {} {.} pdf'

To merge all the PDF files into one, run pdfunite from the poppler-utils package:

pdfunite *.pdf book.pdf
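The same two steps can also be driven from Python, which makes it easier to control the page order and to keep an already existing book.pdf out of the merge when you run the process a second time. This is a sketch under my own assumptions: the function names are invented, a thread pool stands in for the parallel tool, and tesseract and pdfunite must be installed for `ocr_all` to work.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tesseract_command(pbm_path):
    """Command that turns e.g. book0001.pbm into book0001.pdf."""
    pbm_path = Path(pbm_path)
    return ["tesseract", str(pbm_path), str(pbm_path.with_suffix("")), "pdf"]


def pdfunite_command(pdf_paths, output="book.pdf"):
    """Merge pages in sorted order; never feed the output file back in as input."""
    pages = sorted(str(p) for p in pdf_paths if str(p) != str(output))
    return ["pdfunite"] + pages + [str(output)]


def ocr_all(folder="."):
    """OCR every PBM file in folder, then merge the resulting PDFs."""
    pbms = sorted(Path(folder).glob("*.pbm"))
    # Threads are enough here: the heavy work happens in the tesseract processes.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda p: subprocess.run(tesseract_command(p), check=True), pbms))
    pdfs = [p.with_suffix(".pdf") for p in pbms]
    subprocess.run(pdfunite_command(pdfs, Path(folder) / "book.pdf"), check=True)


print(tesseract_command("book0001.pbm"))
```

Sorting works here because the file names are zero-padded (book0001, book0002, ...), so alphabetical order equals page order.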

Success! And this is our result:

The PDF has been OCR’d and a selectable text layer has been generated