

URLshot is a Perl hack. It uses the X Virtual Framebuffer server (Xvfb) and a browser (currently skipstone, because it's tiny and needs minimal user intervention) to grab screenshots of web sites. These are then reduced to thumbnail size and sharpened, using the PerlMagick package (a set of Perl bindings for ImageMagick).
Since it's a Perl script, it'll run on any architecture that satisfies its quirky requirements. Future versions will progressively do away with these unreasonable dependencies, but hacks are almost always built on whatever tools the author has to hand.
This hack is used on The Machine Room to generate thumbnails for links. For an example of the script's output, look at this particular view of the Links page of the Machine Room.
URLshot is available as source, Debian and RPM packages. It's provided under the terms of the GNU General Public License, version 2.
|
urlshot − Obtain screenshots of web pages en masse |
|
urlshot [-NVachknqsv] [-B INTEGER] [-W INTEGER] [-D INTEGER] [-H INTEGER] [-w INTEGER] [--bpp=INTEGER] [--display=INTEGER] [--screen-height=INTEGER] [--screen-width=INTEGER] [--width=INTEGER] [--help] [--version] [--quiet] [--verbose] [--noserver] [--kill] [--noclobber] [--any-content] [--nosharpen] [--noscale] FILE...
|
urlshot is a Perl hack. It reads in a list of filename-URL pairs, loads each URL in a web browser, takes a screenshot of the browser, post-processes it and saves it to the corresponding filename.
To do this, like any self-respecting Perl hack, it uses a lot of other facilities. Xvfb, the Virtual Framebuffer X Server, hosts the display and web browser without need for interactivity or framebuffer hardware (urlshot can, however, contact any running X server). Although any web browser can be used, this version of urlshot was designed with skipstone in mind: a small, Gecko-based browser with a minimal but workable feature set. All the image processing is done with PerlMagick (Image::Magick), which depends on the ImageMagick package. As a side effect, urlshot can save files in virtually any raster format you could think of. The base X clients are also needed, as well as lsof and the GET script from libwww-perl.
urlshot reads its input from the specified FILE arguments, or from the standard input if no files are given on the command line or if the - ‘file’ is specified. When reading from standard input, if standard input is a teletype (e.g. if urlshot is run with no file arguments), the program enters interactive mode, where it presents brief instructions and a prompt before each filename-URL pair.
The download process first checks whether the page exists and what content type it returns. If the page exists, the browser is instructed to load it, and urlshot waits for loading to complete. Once the browser has loaded the URL, a screenshot is generated using xwd(1), cropped and, if required, resized and sharpened.
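The input handling described above can be sketched in a few lines. This is a Python illustration, not urlshot's actual Perl: the function names (parse_pairs, open_inputs, is_interactive) are made up for the example, and quietly skipping lines with fewer than two fields is an assumption, since nothing here says how malformed lines are treated.

```python
import sys


def parse_pairs(lines):
    """Yield (filename, url) tuples from lines of 'FILENAME URL' text."""
    for raw in lines:
        # split() with no argument splits on runs of whitespace,
        # so it accepts one or more spaces (and, as a relaxation, tabs).
        fields = raw.split()
        if len(fields) < 2:
            # Assumption: blank or incomplete lines are ignored.
            continue
        yield fields[0], fields[1]


def open_inputs(args, stdin=sys.stdin):
    """Yield one line source per FILE argument; '-' or no arguments means stdin."""
    for name in (args or ["-"]):
        if name == "-":
            yield stdin
        else:
            with open(name) as handle:
                yield handle


def is_interactive(args, stdin=sys.stdin):
    """True when input comes from standard input and that input is a terminal."""
    return (not args or "-" in args) and stdin.isatty()
```

A driver would loop over open_inputs(args), feed each stream to parse_pairs, and print a prompt before each pair only when is_interactive(args) is true.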
|
urlshot returns an exit code of 0 upon successful termination. |
|
urlshot accepts the following options: |
|
-a, --any-content |
|
Normally, only URLs that yield text/html content will be processed. This option forces urlshot to process URLs regardless of their MIME types. |
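The content-type rule amounts to a small predicate. A sketch in Python, under assumptions: should_process is a hypothetical name, how urlshot actually parses the header is not stated, and treating a missing header as "not HTML" is a guess.

```python
def should_process(content_type, any_content=False):
    """Decide whether a URL gets a screenshot, given its Content-Type header."""
    if any_content:
        # -a / --any-content: accept every MIME type.
        return True
    if not content_type:
        # Assumption: no Content-Type header means the URL is skipped.
        return False
    # Drop parameters such as '; charset=utf-8' and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"
```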
|
-B, --bpp=DEPTH |
|
Specifies the bit depth of the virtual X display. The default is 24, with acceptable values being 1, 2, 4, 8, 15, 16 and 24. |
|
-c, --noclobber |
|
Do not overwrite files. If a file already exists, skip processing of the corresponding URL. |
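In other words, -c is a filter on the list of pairs. A minimal Python sketch (pending_pairs is a hypothetical name; the real script may well check for the file at a different point, e.g. just before saving):

```python
import os


def pending_pairs(pairs, noclobber=False):
    """Yield the (filename, url) pairs that still need a screenshot."""
    for filename, url in pairs:
        if noclobber and os.path.exists(filename):
            # The output already exists: skip the URL entirely.
            continue
        yield filename, url
```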
|
-D, --display=NUMBER |
|
Run the server on the specified display number. Be warned: the display is not in the standard X format: you cannot specify a host part. If starting Xvfb, you cannot specify a screen part either (so --display=20 is acceptable, but --display=20.1 is not; --display=yogg-sothoth:20 is right out). If -N is specified, this option sets the X display where the existing server is running. The default value is 20. |
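The rules for the display argument can be written down as a validator. This Python sketch reflects one reading of the paragraph above: a bare number is always fine, a screen part is tolerated only together with -N, and a host part is never accepted. The names parse_display and x_display_name are hypothetical, and defaulting the screen to 0 is an assumption.

```python
import re


def parse_display(value, noserver=False):
    """Return (display, screen) for a -D argument, or raise ValueError."""
    # With -N an existing server is contacted, so 'N.S' is allowed here;
    # when Xvfb is started by the script, only a bare display number is.
    pattern = r"(\d+)(?:\.(\d+))?" if noserver else r"(\d+)"
    match = re.fullmatch(pattern, str(value))
    if match is None:
        raise ValueError(f"unacceptable display: {value!r}")
    display = int(match.group(1))
    screen = int(match.group(2)) if noserver and match.group(2) else 0
    return display, screen


def x_display_name(display, screen=0):
    """Build the name X clients expect in DISPLAY, e.g. ':20.0'."""
    return f":{display}.{screen}"
```

So parse_display("20") is accepted, parse_display("20.1") is accepted only with noserver=True, and parse_display("yogg-sothoth:20") is, as promised, right out.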
|
-H, --screen-height=HEIGHT |
|
Set the vertical size of the virtual X display. Enter urlshot -h to see the default.
|
-h, --help |
|
Display usage information. |
|
-k, --kill |
|
Attempt to kill any X server running on the specified display before commencing. |
|
-N, --noserver |
|
Do not start an X server. Contact a server already handling the display specified using -D. |
|
-n, --noscale |
|
By default, thumbnail screenshots are created. Specify this option to leave images in their original size. |
|
-q, --quiet |
|
Decrease the verbosity level, printing less information. The default verbosity level is 1, so specifying -q once is meaningful if you need less output.
|
-s, --nosharpen |
|
By default, thumbnails are sharpened using ImageMagick to enhance detail. Specify this option to inhibit this behaviour. Only thumbnails are sharpened: if the -n flag is specified to disable scaling down of screenshots, the screenshots will not be sharpened anyway.
|
-v, --verbose |
|
Increase the verbosity level, printing more information. This option is cumulative (specify it twice for more details). Specified more than twice, this argument causes copious amounts of debugging and status information to be printed (this is almost certainly useless for anything but debugging). The default verbosity level is 1. |
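How -v and -q combine with the default level of 1 can be shown with Python's argparse and its counting action. This only illustrates the arithmetic; it is not how the Perl script parses its options. verbosity_level is a hypothetical name, and clamping the result at 0 is an assumption.

```python
import argparse


def verbosity_level(argv, default=1):
    """Return the effective verbosity for a list of command-line arguments."""
    parser = argparse.ArgumentParser(add_help=False)
    # action="count" makes the option cumulative: -vv counts as 2.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    parser.add_argument("-q", "--quiet", action="count", default=0)
    options, _rest = parser.parse_known_args(argv)
    # Assumption: the level never drops below 0.
    return max(default + options.verbose - options.quiet, 0)
```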
|
-W, --screen-width=WIDTH |
|
Set the horizontal size of the virtual X display. Enter urlshot -h to see the default. |
|
-w, --width=INTEGER |
|
Scale screenshots down to INTEGER pixels horizontally. The thumbnail size is chosen automatically so as to retain proper aspect ratio. This option has no effect if -n was specified. Enter urlshot -h to see the default. |
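The automatic height is just the source aspect ratio applied to the requested width. A Python sketch (thumbnail_size is a hypothetical name, and rounding to the nearest pixel is an assumption; the actual rounding is up to ImageMagick):

```python
def thumbnail_size(src_width, src_height, target_width):
    """Return (width, height) of a thumbnail that keeps the aspect ratio."""
    if src_width <= 0 or src_height <= 0 or target_width <= 0:
        raise ValueError("all dimensions must be positive")
    # Scale the height by the same factor as the width; never go below 1 pixel.
    height = max(1, round(src_height * target_width / src_width))
    return target_width, height
```

For a 1024x480 image and -w256, this gives 256x120. The real source image is the cropped browser drawing area, so it is somewhat smaller than the virtual screen.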
|
-V, --version
|
Show version information. |
|
The input format is very simple: a text file with two fields per line, separated by one or more spaces. The first field is the filename of an image file, the second is the URL to obtain a screenshot of. The choice of filename will influence the file format. Depending on the file’s extension, ImageMagick will save different formats. Perhaps the most useful for web work are PNG (extension .png) and JPEG (.jpg). Please refer to ImageMagick(1) for more information on the acceptable formats and their extensions.
Waiting for the browser is implemented as a kludge: lsof(1) is used to count the number of open TCP sockets. When it drops to zero, the page is considered to have loaded. This seems to work very well in practice. A timeout stops the program from waiting for ever.
Cropping is also a kludge: if a screenshot of the entire root window was generated, the browser’s user interface would have been visible. So urlshot ‘guesses’ the size of the browser’s drawing area. To do this, it asks the browser to load a simple HTML and Javascript file. The Javascript maximises the browser window; the HTML loads a green background. A screenshot is then obtained and the bounding box of all the video green (#00ff00) pixels is evaluated. This works well in practice, provided no browser or theme has video green pixels in the widgets (this is relatively unlikely).
A lesser kludge is used to detect whether the browser has crashed, or merely unmapped its window. A screenshot of the default X hatch pattern is taken upon startup, and a SHA256 signature is evaluated. This signature is compared with that of subsequent screenshots. If the signatures are identical, urlshot terminates its execution, on the grounds that it’s taken a screenshot of the X server background (hence, something is wrong with the browser). This avoids clobbering valid screenshots with useless ones of the background, and fixes one of the more annoying bugs of version 0.3.0.
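Two of these kludges, the wait for the browser and the background check, are easy to sketch. The Python below is an illustration under assumptions, not the script's code: count_sockets is a hypothetical callable standing in for the lsof(1) query (whose exact invocation is not given here), and the timeout and polling interval are invented defaults.

```python
import hashlib
import time


def wait_until_idle(count_sockets, timeout=60.0, interval=0.5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll count_sockets() until it returns 0.

    Returns True once no TCP sockets are open, or False if the
    timeout expires first. sleep and clock are injectable for testing.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if count_sockets() == 0:
            return True
        sleep(interval)
    return False


def signature(image_bytes):
    """SHA-256 hex digest of a screenshot's raw bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def is_background(image_bytes, background_signature):
    """True if a screenshot is byte-identical to the bare X background."""
    return signature(image_bytes) == background_signature
```

At startup one would store signature() of the empty-server screenshot, then refuse to save any later screenshot for which is_background() is true.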
|
A simple run: grab a screenshot of Google and save it as google.png:
    echo google.png http://www.google.com | urlshot
Grab a larger screenshot of each URL mentioned in file url-list, using a widescreen browser:
    urlshot -W1024 -H480 -w256 url-list
After a while, we’ve updated url-list with new URLs and we want to update our database of thumbnails, so we specify the -c option, and increase verbosity to see what is happening:
    urlshot -cv -W1024 -H480 -w256 url-list
|
It’s still an early release. There are numerous bugs. For instance, you can’t change the browser without messing with the script source.
You need to configure the browser to ignore pop-ups.
Every attempt is made to resize the browser window with Javascript. Unfortunately, this means that a Javascript-enabled browser is necessary, and it has to have Javascript turned on. And the Javascript hack to maximise the browser window might not work on other browsers.
Better control of various settings (like the name of the browser) should be provided.
If the browser has full intensity pure video green pixels, the bounding box guesstimator will fail. The colour should optimally be user-configurable.
Better control of the content type should be provided.
No control of the timeouts is provided.
Sometimes it falls over with X authority errors.
Web browser scrollbars are visible in the screenshots.
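To make the green-pixel limitation concrete, here is the bounding box guess sketched in Python, with the key colour as a parameter in the spirit of the wish above. It works on a plain list of rows of (r, g, b) tuples so that it needs no imaging library; the real script uses PerlMagick, and key_colour_bbox is a hypothetical name.

```python
def key_colour_bbox(pixels, key=(0, 255, 0)):
    """Return (left, top, right, bottom), inclusive, of all pixels equal to key.

    pixels is a list of rows, each a list of (r, g, b) tuples.
    Returns None if the key colour does not occur at all.
    """
    box = None
    for y, row in enumerate(pixels):
        for x, colour in enumerate(row):
            if tuple(colour) != key:
                continue
            if box is None:
                box = [x, y, x, y]
            else:
                box[0] = min(box[0], x)
                box[2] = max(box[2], x)
                box[3] = y  # rows are visited top to bottom
    return tuple(box) if box else None
```

A single key-coloured pixel in a toolbar or theme stretches the box to include it, which is exactly the failure described above; passing a different key (say, magenta) would sidestep a clash with a particular theme.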
|
Written by Alexios Chouchoulas. |
|
Report bugs to Alexios Chouchoulas <alexios@bedroomlan.org>. |
|
Copyright © 2002-2003 Alexios Chouchoulas <alexios@bedroomlan.org>.
|
GET(1p), ImageMagick(1), lsof(1), skipstone(1), Xvfb(1). |