phantom.py: A lean replacement for bulky headless browser frameworks

This is a simple but fully scriptable headless QtWebKit browser using PyQt5 in Python 3, specialized in executing external JavaScript and generating PDF files. It is a lean replacement for other bulky headless browser frameworks. (The source code is at the end of this post as well as in this GitHub gist.)

Usage

If you have a display attached:

./phantom.py <url> <pdf-file> [<javascript-file>]

If you don’t have a display attached (i.e. on a remote server):

xvfb-run ./phantom.py <url> <pdf-file> [<javascript-file>]

Arguments:

  • <url> Can be an http(s) URL or a path to a local file
  • <pdf-file> Path and name of PDF file to generate
  • [<javascript-file>] (optional) Path and name of a JavaScript file to execute
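When phantom.py is driven from another Python program, the command lines above can be assembled programmatically. The following is only a minimal sketch: the helper names build_command and render and the use_xvfb parameter are made up for this illustration and are not part of phantom.py, and the sketch assumes phantom.py is executable and sits in the current directory.

```python
import subprocess


def build_command(url, pdf_file, js_file=None, use_xvfb=False):
    """Assemble the phantom.py command line described under Usage.

    build_command and use_xvfb are hypothetical names for this sketch.
    """
    cmd = ["./phantom.py", url, pdf_file]
    if js_file is not None:
        # The JavaScript file is the optional third argument.
        cmd.append(js_file)
    if use_xvfb:
        # Without an attached display, wrap the call in xvfb-run.
        cmd.insert(0, "xvfb-run")
    return cmd


def render(url, pdf_file, js_file=None, use_xvfb=False):
    """Run phantom.py and return the exit status of the process."""
    return subprocess.run(build_command(url, pdf_file, js_file, use_xvfb)).returncode
```

On a display-less server, render("/tmp/test.html", "/tmp/out.pdf", use_xvfb=True) would run the same command as the xvfb-run line above.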

Features

  • Generate a PDF screenshot of the web page after it is completely loaded.
  • Optionally execute a local JavaScript file specified by the argument <javascript-file> after the web page is completely loaded, and before the PDF is generated.
  • Messages passed to console.log() are printed to stdout.
  • Easily add new features by changing the source code of this script, without compiling C++ code. For more advanced applications, consider attaching PyQt objects/methods to WebKit’s JavaScript space by using QWebFrame::addToJavaScriptWindowObject().

If you execute an external <javascript-file>, phantom.py has no way of knowing when that script has finished doing its work. For this reason, the external script should execute console.log("__PHANTOM_PY_DONE__"); when done. This will trigger the PDF generation, after which phantom.py will exit. If no __PHANTOM_PY_DONE__ string is seen on the console within 10 seconds, phantom.py will exit without generating a PDF. This behavior could be implemented more elegantly without console.log() calls, but this is the simplest solution.
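The decision rule behind this handshake can be sketched without Qt. The snippet below is an illustration only: the function outcome and the (seconds, message) event list are invented for this sketch, and timing is simplified to "seconds since start", whereas the real script relies on a QTimer watchdog.

```python
DONE_MARKER = "__PHANTOM_PY_DONE__"
TIMEOUT_SECONDS = 10.0


def outcome(console_events, timeout=TIMEOUT_SECONDS):
    """Decide what phantom.py would do for a sequence of console messages.

    console_events is a list of (seconds_since_start, message) tuples,
    a made-up representation used only for this illustration.
    """
    for seconds, message in sorted(console_events):
        if seconds >= timeout:
            # The watchdog fires first; later messages are never seen.
            break
        if DONE_MARKER in message:
            # Substring match, as in the script: the marker may be
            # embedded in a longer log line.
            return "pdf"
    return "timeout"
```

Because the check is a substring match, any console message containing the marker triggers PDF generation, including one logged by the web page itself.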

It is important to remember that since you’re just running WebKit, you can use everything that WebKit supports, including the usual JS client libraries, CSS, CSS @media types, etc.

Dependencies

  • Python3
  • PyQt5
  • xvfb (optional for display-less machines)

Installation of dependencies in Debian Stretch is easy:

apt-get install xvfb python3-pyqt5 python3-pyqt5.qtwebkit

Finding the equivalent for other OSes is an exercise that I leave to you.

Examples

Given the following file /tmp/test.html:

<html>
  <body>
    <p>foo <span id="id1">foo</span> <span id="id2">foo</span></p>
  </body>
  <script>
    document.getElementById('id1').innerHTML = "bar";
  </script>
</html>

… and the following file /tmp/test.js:

document.getElementById('id2').innerHTML = "baz";
console.log("__PHANTOM_PY_DONE__");

… and running this script (without attached display) …

xvfb-run python3 phantom.py /tmp/test.html /tmp/out.pdf /tmp/test.js

… you will get a PDF file /tmp/out.pdf with the contents “foo bar baz”.

Note that the second occurrence of “foo” has been replaced by the web page’s own script, and the third occurrence of “foo” by the external JS file.

Source Code

"""
# phantom.py

Simple but fully scriptable headless QtWebKit browser using PyQt5 in Python3,
specialized in executing external JavaScript and generating PDF files. A lean
replacement for other bulky headless browser frameworks.

Copyright 2017 Michael Karl Franzl

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

"""

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtPrintSupport import QPrinter
from PyQt5.QtCore import QTimer
import traceback

  
class Render(QWebPage):
  def __init__(self, url, outfile, jsfile):
    self.app = QApplication(sys.argv)
    
    QWebPage.__init__(self)

    self.jsfile = jsfile
    self.outfile = outfile
    
    qurl = QUrl.fromUserInput(url)
    
    print("phantom.py: URL=", qurl, "OUTFILE=", outfile, "JSFILE=", jsfile)
    
    # The PDF generation only happens when the special string __PHANTOM_PY_DONE__
    # is sent to console.log(). The following JS string will be executed by
    # default, when no external JavaScript file is specified.
    self.js_contents = "setTimeout(function() { console.log('__PHANTOM_PY_DONE__') }, 500);"
    
    if jsfile:
      try:
        # The context manager closes the file even if read() fails.
        with open(self.jsfile) as f:
          self.js_contents = f.read()
      except Exception:
        print(traceback.format_exc())
        self._exit(10)
        
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(qurl)
    self.javaScriptConsoleMessage = self._onConsoleMessage
    
    # Run for a maximum of 10 seconds
    watchdog = QTimer()
    watchdog.setSingleShot(True)
    watchdog.timeout.connect(lambda: self._exit(1))
    watchdog.start(10000)
    
    self.app.exec_()
    
    
  def _onConsoleMessage(self, txt, lineno, filename):
    print("CONSOLE", lineno, txt, filename)
    if "__PHANTOM_PY_DONE__" in txt:
      # If we get this magic string, it means that the external JS is done
      self._print()
      self._exit(0)
  
  
  def _loadFinished(self, result):
    print("phantom.py: Loading finished!")
    print("phantom.py: Evaluating JS from", self.jsfile)
    self.frame = self.mainFrame()
    self.frame.evaluateJavaScript(self.js_contents)
    

  def _print(self):
    print("phantom.py: Printing...")
    printer = QPrinter()
    printer.setPageMargins(10, 10, 10, 10, QPrinter.Millimeter)
    printer.setPaperSize(QPrinter.A4)
    printer.setCreator("phantom.py by Michael Karl Franzl")
    printer.setOutputFormat(QPrinter.PdfFormat)
    printer.setOutputFileName(self.outfile)
    self.frame.print(printer)
    
  def _exit(self, val):
    print("phantom.py: Exiting with val", val)
    self.app.exit(val) # Qt exit
    sys.exit(val) # Python exit
    
    
def main():
  if (len(sys.argv) < 3):
    print("USAGE: ./phantom.py <url> <pdf-file> [<javascript-file>]")
  else:
    url = sys.argv[1]
    outfile = sys.argv[2]
    jsfile = sys.argv[3] if len(sys.argv) > 3 else None
    r = Render(url, outfile, jsfile)


if __name__ == "__main__":
  main()

Reasonably secure unattended SSH logins from untrusted machines

There are certain cases where you want to operate a not completely trusted networked machine, and write scripts to automate some task which involves an unattended SSH login to a server.

By “not completely trusted machine” I mean a computer which is reasonably secured against unauthorized logins but is physically unattended (which means that unknown persons can have physical access to it).

An established SSH connection has a number of security implications. As I have argued in a previous blog post, “Unprivileged Unix Users vs. Untrusted Unix Users”, having access to a shell on a server is problematic if the user is untrusted (as is always the case when the user originates from an untrusted machine), even if the user is unprivileged on the server. In that blog post I presented a method to confine an SSH user to a jail directory (via a PAM module using the Linux kernel’s chroot system call) to prevent reading of all world-readable files on the server. However, such a jail directory still doesn’t prevent SSH port forwarding (which I illustrated in this blog post).

In short, any kind of SSH access allows access to at least all of the server’s open TCP ports, even if they are behind its firewall.

Does this mean that giving any kind of SSH access to an untrusted machine should not be done in principle? It does seem so, but there are ways to make the attack surface smaller and make the setup reasonably secure.

Remember that SSH requires some form of authentication. This is either a plain password or a public/private key pair. In both cases there are secrets which should not be stored on the untrusted machine in a way that allows them to be revealed.

So the question becomes: How to supply the secrets to SSH without making it too easy to reveal them?

A private SSH key is permanent and must be stored on a permanent medium of the untrusted machine. To mitigate the possibility that the medium (e.g. hard drive) is extracted and the private key revealed, the private key should be encrypted with a long passphrase. An SSH passphrase needn’t be typed manually every time an SSH connection is made. ssh connects to ssh-agent (if running) to use private keys which have previously been decrypted via a passphrase. ssh-agent holds this information in RAM.

I said “RAM”: For the solution to our present problem, this will be as good as it gets. With the method presented below, an attacker would have to read out the RAM of a running machine with hardware probes, which requires (extremely) specialized skills. In this blog post, this is the meaning of the term “reasonably secure”.

On desktop machines, ssh-agent is usually started together with the graphical user interface. Keys and their passphrases can be “added” to it with the command ssh-add. The ssh program itself connects to ssh-agent through the socket named in the environment variable SSH_AUTH_SOCK; ssh-agent also exports SSH_AGENT_PID, the process ID of the agent. This means that any kind of shell script (even unattended ones called from cron) can benefit from this: no passphrase is requested if the corresponding key has already been decrypted in memory. The main advantage is that the passphrase has to be entered only once after a reboot of the machine (because the reboot clears the RAM).

On a headless client without a graphical interface, ssh-agent may not be running (or even installed), so we have to start it in a custom way. There is an excellent program called keychain which makes this very easy. The sequence of our method will look like this:

  1. The machine is rebooted.
  2. An authorized administrator logs into the machine and uses the keychain command to enter the passphrase which is now stored in RAM by ssh-agent.
  3. The administrator now can log out. The authentication data will remain in the RAM and will be available to unattended shell scripts.
  4. Every login to the machine will clear the authentication information. This ensures that even a successful login by an attacker will render the private key useless. This implies a minor inconvenience for the administrator, who has to enter the passphrase at every login too.

Keychain is available in the repositories of the major distros:

apt-get install keychain

Add the following line to either ~/.bashrc or to the system-wide /etc/bash.bashrc:

eval `keychain --clear --eval /path/to/.ssh/id_rsa`

This line will be executed at each login to the machine. What does this command do?

  1. keychain will read the private key from the specified path.
  2. keychain will prompt for the passphrase belonging to this key (if there is one).
  3. keychain will look for a running instance of ssh-agent. If there is none, it will start it. If there is one, it will re-use it.
  4. Due to the --clear switch, keychain will clear all keys from ssh-agent. This renders the private key useless even if an attacker manages to successfully log in.
  5. keychain adds the private key plus entered passphrase to ssh-agent which stores it in the RAM.
  6. keychain outputs a short shell script (to stdout) which exports two environment variables (mentioned above) which point to the running instance of ssh-agent for consumption by ssh.
  7. The eval command executes the shell script from keychain, which does nothing more than set the two environment variables.

Environment variables are not fully global; they always belong to a running process. Thus, in every unattended script which uses ssh, you need to set these environment variables by evaluating the output of

keychain --eval

for example, in a Bash script:

#!/bin/bash

# Set up environment variables pointing to ssh-agent.
eval `keychain --eval`

# Do tasks involving ssh
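An unattended Python script cannot eval shell code, but it can parse the same output. The sketch below assumes that keychain --eval prints sh-style lines of the form NAME=value; export NAME; (check this against the output of your keychain version). The function names parse_agent_env and load_agent_env are made up for this illustration.

```python
import os
import re
import subprocess

# Matches sh-style assignments such as "SSH_AUTH_SOCK=/tmp/agent.1;".
# The assumed output format is an assumption of this sketch.
_ASSIGNMENT = re.compile(r"\b(SSH_AUTH_SOCK|SSH_AGENT_PID)=([^;\n]+);")


def parse_agent_env(shell_snippet):
    """Extract the two ssh-agent variables from 'NAME=value;' text."""
    return {
        name: value.strip().strip("\"'")
        for name, value in _ASSIGNMENT.findall(shell_snippet)
    }


def load_agent_env():
    """Run 'keychain --eval' and copy its variables into os.environ."""
    result = subprocess.run(
        ["keychain", "--eval"],
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    os.environ.update(parse_agent_env(result.stdout))
```

After load_agent_env() has run, any ssh process started from the same Python script inherits the variables through os.environ.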

It makes sense to gracefully catch SSH connection problems in your scripts. If you don’t, a script may hang indefinitely at a passphrase prompt if the key has not been added properly. To avoid this, make a ‘preflight’ ssh connection which fails with an error instead of prompting (BatchMode=yes disables password and passphrase prompts):

#!/bin/bash

# Set up environment variables pointing to ssh-agent.
eval `keychain --eval`

# 'Preflight' connection test.
ssh -q -o "BatchMode=yes" -o "ConnectTimeout=10" user@host echo ok
if [ "$?" != "0" ]; then
  echo "SSH connection could not be established"
  exit 99
fi

# At this point, the SSH connection will work.
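The same preflight test can be written in Python with the identical ssh options. This is a sketch only: the function names preflight_command and ssh_reachable are invented here, and the run parameter exists solely so that the check can be exercised without a real SSH server.

```python
import subprocess


def preflight_command(target, timeout_seconds=10):
    """Build the non-interactive ssh test command used in the Bash script."""
    return [
        "ssh", "-q",
        "-o", "BatchMode=yes",  # fail instead of prompting for a passphrase
        "-o", "ConnectTimeout=%d" % timeout_seconds,
        target, "echo", "ok",
    ]


def ssh_reachable(target, run=subprocess.call):
    """Return True if the preflight command exits with status 0."""
    return run(preflight_command(target)) == 0
```

A script would call ssh_reachable("user@host") once at startup and abort with a non-zero exit status if it returns False.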

Conclusion

In everyday practice, security is never perfect. This method is just one way to protect, within reasonable limits, an SSH connection from an unattended/untrusted machine “in the field” to a protected server. As always when dealing with questions of security, any solution needs to be carefully vetted before deployment in production!

Encrypt backups at an untrusted remote location

In a previous blog post I argued that a good backup solution includes backups at different geographical locations to compensate for local disasters. If you don’t fully trust the location, the only solution is to keep an encrypted backup.

In this tutorial we’re going to set up an encrypted, mountable backup image which allows us to use regular file system operations like rsync.

First, on any kind of permanent medium available, create a file large enough to hold the encrypted file system. You can later grow the file system (with dd and resize2fs) if needed. We will use dd to create this file and fill it with zeros. This can take a while, depending on the write speed of the hard drive. Here, we create a 500 GiB file (5120 blocks of 100 MiB each):

dd if=/dev/zero of=/path/to/backup.img bs=100M count=5120

A quicker method to do the same (the file will not be filled with zeros) is:

fallocate -l 500G /path/to/backup.img
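dd writes bs × count bytes, so the block count for a given image size is a simple division. The sketch below works it out for a 500 GiB image; note that the M suffix of GNU dd and the G suffix of fallocate are binary units (MiB and GiB). The helper name dd_count is made up for this illustration.

```python
MIB = 1024 ** 2  # GNU dd's "M" suffix means MiB
GIB = 1024 ** 3  # fallocate's "G" suffix means GiB


def dd_count(total_bytes, block_bytes):
    """Number of blocks dd must write to reach total_bytes exactly."""
    count, remainder = divmod(total_bytes, block_bytes)
    if remainder:
        raise ValueError("total size is not a multiple of the block size")
    return count


print(dd_count(500 * GIB, 100 * MIB))  # 5120
```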

Now we will use LUKS to set up a virtual mapping device node for us:

apt-get install cryptsetup

First, we generate a key/secret which will be used to generate the longer symmetric encryption key which in turn protects the actual data. We tap into the entropy pool of the Linux kernel and convert 32 bytes of random data into base64 format (this may take a long time; consider installing haveged as an additional entropy source):

dd if=/dev/random bs=1 count=32 | base64

Store the Base64-encoded key in a secure location and create backups! If this key/secret is lost, you will lose the backup. You have been warned!
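For comparison, a key of the same shape can be generated in Python. This is a sketch, not a drop-in equivalent: the secrets module draws from the operating system's non-blocking random source rather than reading /dev/random directly, and the function name generate_key is made up for this illustration.

```python
import base64
import secrets


def generate_key(num_bytes=32):
    """Return num_bytes of OS randomness as a Base64 string."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")
```

The returned string for the default of 32 bytes is 44 characters long and can be used wherever "Base64-encoded key" appears in the commands of this post.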

Next, we will write the LUKS header into the backup image:

echo "Base64-encoded key" | base64 --decode | cryptsetup luksFormat --key-file=- /path/to/backup.img

Next, we “open” the encrypted image under the mapping name “backup_crypt”:

echo "Base64-encoded key" | base64 --decode | cryptsetup luksOpen --key-file=- /path/to/backup.img backup_crypt

This will create a device node /dev/mapper/backup_crypt which can be mounted like any other hard drive. Next, create an Ext4 file system on this raw device (“formatting”):

mkfs.ext4 /dev/mapper/backup_crypt

Now, the formatted device can be mounted like any other file system:

mkdir -p /mnt/backupspace_loop
mount -o loop /dev/mapper/backup_crypt /mnt/backupspace_loop

You can inspect the mount status by typing mount. Data written to this mount point is transparently encrypted before it reaches the underlying image file.

If you are done writing data to it, you can unmount it as follows:

umount /mnt/backupspace_loop
cryptsetup luksClose /dev/mapper/backup_crypt

To re-mount it:

echo "Base64-encoded key" | base64 --decode | cryptsetup luksOpen --key-file=- /path/to/backup.img backup_crypt
mount -o loop /dev/mapper/backup_crypt /mnt/backupspace_loop

Note that we always specify the Base64-encoded key on the command line and pipe it into cryptsetup. This is better than creating a file somewhere on the hard drive, because it only resides in the RAM. If the machine is powered off, the decrypted mount point is lost and only the encrypted image remains.
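A backup script written in Python can follow the same principle by handing the decoded key to cryptsetup on stdin, so the key never touches the disk. This is a minimal sketch under stated assumptions: the function name luks_open is made up here, the run parameter exists only to make the function testable, and the script is assumed to run with sufficient privileges for cryptsetup.

```python
import base64
import subprocess


def luks_open(b64_key, image_path, mapper_name, run=subprocess.run):
    """Open a LUKS image, passing the decoded key via stdin."""
    key = base64.b64decode(b64_key)
    # "--key-file=-" tells cryptsetup to read the key from stdin,
    # mirroring the shell pipeline used in this post.
    cmd = ["cryptsetup", "luksOpen", "--key-file=-", image_path, mapper_name]
    return run(cmd, input=key, check=True)
```

How the Base64 string reaches the script (for example an interactive prompt) is left open here; it decides whether the key really stays in RAM only.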

If you are really security-conscious, you should read the cryptsetup manual to optimize its parameters. You may want to use a key/secret longer than the 32 bytes used here.