Writing Translators

Introduction

The current workflow for translator production involves creating an SQL file, then coding the translator from within the SQL file. This method is far from ideal, but it generally works well enough. An example of this workflow can be seen in the scrapers.sql file in SVN.

While this approach is simple, it can be irritating. Because the translator code sits inside an SQL string literal, every single quote in the code must be followed by a second single quote (' becomes '') in order to escape it.
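The quote-doubling rule can be automated before pasting code into the SQL file. The following is a minimal sketch; escapeSqlString and toSqlLiteral are hypothetical helper names, not part of Scholar.

```javascript
// Hypothetical helper: doubles every single quote so the text can sit
// inside a single-quoted SQL string literal.
function escapeSqlString(text) {
    return text.replace(/'/g, "''");
}

// Hypothetical helper: wraps the escaped text in single quotes, ready to
// paste into an INSERT statement.
function toSqlLiteral(text) {
    return "'" + escapeSqlString(text) + "'";
}

var code = "var title = 'My Book';";
var literal = toSqlLiteral(code);
// literal is now: 'var title = ''My Book'';'
```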

To add or edit translators, you’ll want to have an up-to-date copy of the sqlite3 command-line client (since the version included with OS X is obsolete). You can obtain the source for SQLite from the SQLite website. You’ll then be able to dynamically reload translators without restarting Firefox through syntax such as:

$ /usr/local/bin/sqlite3 \
~/Library/Application\ Support/Firefox/Profiles/<random string>/scholar/scholar.sqlite \
< scrapers.sql

All translators are run in a sandbox. Only functions available to unprivileged JavaScript and functions defined in this document will be available to your translator. This also has implications for the loadDocument and processDocuments utilities, described below.

Translator parsing and execution errors are displayed in the console. To view them, you’ll need to run Firefox from a terminal window. To do this on a Mac, execute /Applications/Firefox.app/Contents/MacOS/firefox from within the Terminal (assuming Firefox is located in the Applications folder). On other platforms, simply execute the Firefox binary from the command line.

Translator Basics

Scholar translators are stored in the translators table in the Scholar database, which can be found at ~/Library/Application Support/Firefox/Profiles/<random string>/scholar/scholar.sqlite

The translators table is defined as follows:

CREATE TABLE translators (
    translatorID TEXT PRIMARY KEY,
    lastUpdated DATETIME,
    type INT,
    label TEXT,
    creator TEXT,
    target TEXT,
    detectCode TEXT,
    code TEXT
);

translatorID

The translatorID field must contain a valid GUID in order to distinguish a translator from other translators with similar names. GUIDs are used as primary keys on the translator table, for updating translators from the central repository, and for loading one translator from another (see below).

A simple Google search for GUID generator will reveal many possible options for generating GUIDs. I’ve been using this one while developing. Remember to delete the braces.
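If you would rather generate the identifier yourself, something like the following sketch could be used. generateGUID is a hypothetical helper, not a Scholar function. It produces the 8-4-4-4-12 hexadecimal layout without braces, and it relies on Math.random(), which is adequate for a one-off identifier but not cryptographically strong.

```javascript
// Hypothetical helper: builds a random GUID in the 8-4-4-4-12 layout,
// written without surrounding braces.
function generateGUID() {
    var template = "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx";
    return template.replace(/[xy]/g, function(c) {
        var r = Math.floor(Math.random() * 16);
        // "y" is limited to 8, 9, a or b, as in version 4 UUIDs
        var v = (c == "x") ? r : ((r & 0x3) | 0x8);
        return v.toString(16);
    });
}

var translatorID = generateGUID();
```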

lastUpdated

The lastUpdated field should contain the date a translator was last modified as a standard SQL-style date. At the moment, all of our translators are in one file (scrapers.sql), so I haven’t been too good about updating this value. When the central repository comes into play, however, this field will be critical to keeping all clients up to date.
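As a rough illustration, a date can be formatted for this field as follows. The sketch assumes that "SQL-style" means the YYYY-MM-DD HH:MM:SS layout and that times are recorded in UTC. Both are assumptions, and toSqlDate is a hypothetical helper.

```javascript
// Pads a number to two digits.
function pad(n) {
    return (n < 10 ? "0" : "") + n;
}

// Hypothetical helper: formats a Date as YYYY-MM-DD HH:MM:SS in UTC.
function toSqlDate(date) {
    return date.getUTCFullYear() + "-" + pad(date.getUTCMonth() + 1) + "-" +
           pad(date.getUTCDate()) + " " + pad(date.getUTCHours()) + ":" +
           pad(date.getUTCMinutes()) + ":" + pad(date.getUTCSeconds());
}

// JavaScript months are zero-based, so 9 is October.
var lastUpdated = toSqlDate(new Date(Date.UTC(2006, 9, 2, 17, 30, 5)));
// lastUpdated is "2006-10-02 17:30:05"
```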

type

NOTE: type is an SQL reserved keyword, and so I may decide to change its name to translatorType. Don’t let this distract you; the behavior will remain the same.

The type field specifies the function of a given translator. The base values are as follows:

Type | Value
import | 1
export | 2
web | 4
search | 8

To create a translator that performs multiple functions, simply sum the base types. For example, an import/export translator would be of type 3, while a web/search translator would be of type 12.
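Because the base values are distinct powers of two, summing them is the same as combining them with a bitwise OR, and a single base type can be tested with a bitwise AND. The sketch below assumes the base values import = 1, export = 2, web = 4 and search = 8, which is consistent with the sums quoted above. The constant and function names are hypothetical.

```javascript
// Assumed base values for the type field
var TYPE_IMPORT = 1;
var TYPE_EXPORT = 2;
var TYPE_WEB    = 4;
var TYPE_SEARCH = 8;

// Combines any number of base types into a single type value.
function combineTypes() {
    var type = 0;
    for(var i=0; i<arguments.length; i++) {
        type |= arguments[i];
    }
    return type;
}

// Tests whether a combined type value includes a given base type.
function hasType(type, base) {
    return (type & base) != 0;
}

var importExport = combineTypes(TYPE_IMPORT, TYPE_EXPORT);  // 3
var webSearch = combineTypes(TYPE_WEB, TYPE_SEARCH);        // 12
```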

label and creator

The label and creator fields specify basic translator metadata. For import and export translators, label is used to generate the list of acceptable file types in the file selection dialog, and therefore should contain only the name of the file type, without such suffixes as “Translator” or “Scraper.” While, at the moment, users never see the label for web or search translators, it’s probably a good idea to abide by these rules for all types (even if I haven’t).

target

The use of the target field depends on translator type.

detectCode

For all translator types, Scholar loads the detectCode before translation, ensuring that any methods defined within it are also accessible from functions defined in the main code field.

For import, web, and search, the detectCode should define the following functions to detect if it is capable of extracting metadata from a given resource:

For import and export, the detectCode may also specify configuration options. Configuration options must be set by calling Scholar.configure(option, value) from the detectCode block (not within any function), where option and value are selected from the list below:

Option | Purpose | Possible values
getCollections | Specifies whether Scholar should prepare the collection structure for export. | true
dataMode | Specifies the mechanism used to read files for import. If set to “rdf”, also configures RDF datasource export (see below). | "line", "block", "rdf"

For export, the detectCode may also specify options to be shown to the user. Display options must be enabled by calling Scholar.addOption(option, defaultValue) from the detectCode block (not within any function). At the moment, all options are checkboxes, and all defaultValues must be either true or false.

Option | English representation
exportNotes | Export Notes
exportFileData | Export Files

If option is arbitrary text rather than one of the names listed above, that text is printed as-is next to the interface element.
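Put together, the detectCode of an export translator might contain calls like the ones in this sketch. The Scholar object defined at the top is only a stand-in that records the calls, so that the example runs outside Scholar. A real translator would omit it and use the object the sandbox provides.

```javascript
// Stand-in for the Scholar object (an assumption for illustration only).
var Scholar = {
    config: {},
    options: {},
    configure: function(option, value) { this.config[option] = value; },
    addOption: function(option, defaultValue) { this.options[option] = defaultValue; }
};

// --- detectCode of a hypothetical export translator ---
// These calls sit at the top level of detectCode, not inside any function.
Scholar.configure("getCollections", true);
Scholar.addOption("exportNotes", true);      // checkbox, checked by default
Scholar.addOption("exportFileData", false);  // checkbox, unchecked by default
```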

code

Just as the translation engine relies on special functions in the detectCode for detection, it relies on special functions in the code for translation. These functions are summarized below. The code should define the functions appropriate to the declared translator type.

Type | Function | Arguments
import | doImport() | N/A
export | doExport() | N/A
web | doWeb(doc, url) | doc is a JavaScript document; url is the URL string, adjusted for any proxying
search | doSearch(item) | item is an item array

More information on each type is presented below.

The Item Array Format

All translators must deal in some way with native Scholar items. The translator engine provides the Scholar.Item class (not to be confused with the other Scholar.Item class defined in data_access.js) to generate new items. The property structure of this class closely mirrors the database structure.

For import, web, and search translators, the following syntax generates a new item:

var newItem = new Scholar.Item("book");
newItem.title = "My Book";
newItem.year = 2006;
newItem.creators.push({firstName:"Simon",
                       lastName:"Kornblith",
                       creatorType:"editor"});
newItem.complete();

As the above example shows, the basics of item creation are relatively simple. In its constructor, Scholar.Item takes a typeName (as defined in the itemTypes table), although types may also be specified later through the itemType property.

All fields applicable to the object are also available as properties (with the exception of the dateAdded and dateModified fields, which are available to export translators, but overridden if set by import/web/search translators). For more complex relationships, Scholar.Item uses arrays.

The creators array is an array of objects, each of which should possess firstName, lastName, and creatorType attributes. The content of the first two is self-evident, but the last should be a valid creatorType as defined in the creatorTypes table.

Scholar provides the Scholar.Utilities.cleanAuthor(author, creatorType, useComma) function to assist in the creation of these objects.

Scholar.Utilities.cleanAuthor("John Doe", "author");
Scholar.Utilities.cleanAuthor("Doe, John", "author", true);
// both calls return {firstName:"John", lastName:"Doe", creatorType:"author"}

The notes array is an array of objects, each of which should possess a note property. Optionally, the objects may also have typeID and seeAlso properties. Refer to documentation on the seeAlso array below.

The tags array is an array of strings. These strings are the tags attached to the item.

To use the seeAlso array, one must specify an itemID property on each item (or note, as the case may be). This ID may be a string, a number, or even an object, but it must be unique. The seeAlso array is an array of previously specified itemIDs.

Independent notes and files behave slightly differently. An independent note has only a note property (and, optionally, an itemID property). Files are not yet implemented (#39).
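The notes, tags, and seeAlso arrays described above might be used together as in this sketch. The Scholar.Item constructor defined here is a minimal stand-in so that the example runs outside Scholar, and the "journalArticle" type name and the itemID values are assumptions chosen for illustration.

```javascript
// Stand-in for Scholar.Item (illustration only); complete() just records the item.
var completed = [];
var Scholar = {
    Item: function(itemType) {
        this.itemType = itemType;
        this.creators = [];
        this.notes = [];
        this.tags = [];
        this.seeAlso = [];
        this.complete = function() { completed.push(this); };
    }
};

var book = new Scholar.Item("book");
book.itemID = "book-1";                      // any unique value will do
book.title = "My Book";
book.tags.push("history");
book.notes.push({note:"Borrowed from the library."});

var article = new Scholar.Item("journalArticle");
article.itemID = "article-1";
article.title = "A Review of My Book";
article.seeAlso.push("book-1");              // refers to the itemID specified above

book.complete();
article.complete();
```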

Writing Import Translators

Import translators take a text or XML file and translate it into Scholar’s native database format.

Import Detection

To detect which import translator is best equipped to handle a given file, Scholar first checks all translators’ target file extensions against the given file’s file extension. If a translator target matches, Scholar calls its detectCode, if present, to determine if it is capable of translating the given file. If the detectImport() function returns true, it then translates the file.

If no translator capable of translating the given file is found during this first pass, the translation engine runs a second pass, running the detectCode of any translator whose target did not match during the first pass.

Because many file types (e.g., MARC, RDF, and RIS) have no defined extension, it is critical that import translators include detectCode.
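As an illustration, detection for an RIS-like format could inspect the first non-blank line of the file. The sketch assumes that detectCode may read the input through Scholar.read() in the line dataMode, which this document does not state explicitly. The Scholar object is a stand-in fed from an array so that the example runs outside Scholar.

```javascript
// Stand-in for the Scholar object (illustration only).
var sampleLines = ["", "TY  - BOOK", "TI  - My Book", "ER  - "];
var Scholar = {
    configure: function(option, value) {},
    read: function() { return sampleLines.length ? sampleLines.shift() : false; }
};

// --- detectCode of a hypothetical import translator ---
Scholar.configure("dataMode", "line");

function detectImport() {
    var line;
    // Compare against false explicitly so that an empty line does not end the loop
    while((line = Scholar.read()) !== false) {
        if(line.replace(/^\s+|\s+$/g, "") == "") continue;  // skip blank lines
        // RIS records begin with a TY tag
        return line.indexOf("TY  - ") == 0;
    }
    return false;  // empty file
}

var isRIS = detectImport();
```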

Import IO

Scholar operates on the principle of stream IO. Rather than reading a file into memory all at once, which can be slow and/or impossible for extremely large data sets, import translators should read a file line-by-line or block-by-block if at all possible.

For the convenience of translator authors, Scholar offers three dataModes for reading files. The dataMode is set using Scholar.configure().

The block dataMode

The block dataMode reads files in blocks of a user-specified length. If no dataMode is defined, Scholar behaves as if the block dataMode were specified.

var block;
while(block = Scholar.read(4096)) {
    // do something
}

When EOF is reached, Scholar.read() will return false.
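A block rarely ends exactly on a line boundary, so a translator that reads blocks but parses lines needs to carry the unfinished tail of one block over to the next. The following is one possible sketch of that bookkeeping. LineBuffer is a hypothetical helper, not part of Scholar, and it handles LF and CRLF line endings but not bare CRs.

```javascript
// Hypothetical helper: collects complete lines from successive blocks and
// keeps any trailing partial line until the next block arrives.
function LineBuffer() {
    this.remainder = "";
}

LineBuffer.prototype.feed = function(block) {
    var pieces = (this.remainder + block).split("\n");
    this.remainder = pieces.pop();  // the last piece may be an unfinished line
    for(var i=0; i<pieces.length; i++) {
        pieces[i] = pieces[i].replace(/\r$/, "");  // drop the CR of a CRLF pair
    }
    return pieces;
};

// In a translator, each block returned by Scholar.read(4096) would be passed
// to feed(); here two literal blocks stand in for the file.
var buffer = new LineBuffer();
var blocks = ["TY  - BOOK\r\nTI  - My", " Book\r\nER  - \r\n"];
var allLines = [];
for(var b=0; b<blocks.length; b++) {
    allLines = allLines.concat(buffer.feed(blocks[b]));
}
```

After the last block, anything left in buffer.remainder is a final line that had no terminator.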

The line dataMode

The line dataMode reads files line by line.

var line;
while((line = Scholar.read()) !== false) {  // an empty line must not end the loop
    // do something
}

When EOF is reached, Scholar.read() will return false.

The rdf dataMode

The rdf dataMode, as its name implies, provides a high-level interface to RDF data. It is currently undocumented. If you really want to use it, talk to Simon.

Writing Export Translators

Export translators are unique among translator types in that they are expected to output text, not Scholar items. The API remains as consistent as possible with other translator types.

Getting Item Data

Rather than passing an array of items, the translator engine prefers to fetch items one at a time, so as to minimize memory footprint.

var item;
while(item = Scholar.nextItem()) {
    // do something
}

If absolutely necessary (e.g., for ID mapping), a translator may read all items into an array beforehand.
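For example, an export format that numbers its records sequentially needs to know every item's new number before it can write cross-references. The following is a sketch under the assumptions that exported items carry itemID and seeAlso properties and that Scholar.nextItem() returns false when no items remain. The Scholar object is a stand-in so that the example runs outside Scholar.

```javascript
// Stand-in for the Scholar object (illustration only).
var queue = [{itemID:17, title:"First", seeAlso:[]},
             {itemID:42, title:"Second", seeAlso:[17]}];
var Scholar = {
    nextItem: function() { return queue.length ? queue.shift() : false; }
};

// First pass: read every item and assign each a sequential output ID
var items = [];
var idMap = {};
var item;
while(item = Scholar.nextItem()) {
    idMap[item.itemID] = items.length + 1;
    items.push(item);
}

// Second pass: cross-references can now be rewritten using the map
for(var i=0; i<items.length; i++) {
    var mapped = [];
    for(var j=0; j<items[i].seeAlso.length; j++) {
        mapped.push(idMap[items[i].seeAlso[j]]);
    }
    items[i].mappedSeeAlso = mapped;
}
```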

Export IO

To write to a file, export translators should use the relatively straightforward Scholar.write(data) function. Remember to add both CRs and LFs to line endings.

Scholar.write("Scholar for Firefox 1.0a1\r\n");
Scholar.write("Test output format\r\n");
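Combining item fetching with writing, a very small doExport() might look like the sketch below. The output layout ("Title: ...") is made up for illustration, writeLine is a hypothetical helper, and the Scholar object is a stand-in that collects output in a string so that the example runs outside Scholar.

```javascript
// Stand-in for the Scholar object (illustration only).
var output = "";
var queue = [{title:"The First Book"}, {title:"The Second Book"}];
var Scholar = {
    nextItem: function() { return queue.length ? queue.shift() : false; },
    write: function(data) { output += data; }
};

// Hypothetical helper: appends the CR and LF so callers cannot forget them.
function writeLine(text) {
    Scholar.write(text + "\r\n");
}

function doExport() {
    var item;
    while(item = Scholar.nextItem()) {
        writeLine("Title: " + item.title);
    }
}

doExport();
```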

If the rdf dataMode is specified, export translators may write using the same high-level RDF interface available to import translators.

Writing Web Translators

Web translators are the most common translator type. They are also generally the most complicated to write, owing to the complexities of scraping and decoding webpages. To combat this complexity, Scholar offers a number of convenience functions to assist web translator developers.

Detecting Web Data

As described above, the detectWeb(doc, url) function in web translator detectCode determines what icon will appear on the web toolbar, while the target must match the page URL.

If the target is left blank, the translator is run on all pages. Obviously, one should avoid this situation if at all possible, although it may be useful for implementing the high number of (barely adopted) standards for embedding metadata in webpages.

Using Solvent and XPath

XPath provides an easy, convenient way to reference a specific element in a page’s DOM. A brief introduction is available at Wikipedia, while the full specification is available from the W3C. It is recommended, but not necessary, that you use XPath to simplify referencing DOM nodes in your scrapers.

The SIMILE project’s Solvent provides a visual tool for working with XPath expressions. You’ll need to install it using the Nightly Tester Tools since it’s not officially Firefox 2-compatible. To open Solvent, click the soda bottle-ish icon at the bottom of the screen. To view XPaths of items on a page, click “Capture.” The XPath appears in the box below the button. Typing an XPath into the box will highlight all elements matching that XPath. For our purposes, the other features of Solvent are useless.

To get the value of a single node as referenced via an XPath, use the following syntax:

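// elmt is the context node (often doc itself); nsResolver is a namespace
// resolver, which may be null when the XPath uses no prefixes
var heading = doc.evaluate("/html/body/h1/text()", elmt, nsResolver,
                   XPathResult.ANY_TYPE, null).iterateNext().nodeValue;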

If the referenced item is an element, you may access its properties just as you would if you had obtained it through other, non-XPath syntax. To get a list of all nodes matching an XPath, continue to call iterateNext() on the object returned by doc.evaluate() until it returns null.
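The iterateNext() loop is easy to wrap in a small function. The following is a minimal sketch. gatherNodes is a hypothetical helper, and the fakeResult object merely imitates an XPath result so that the example runs outside a browser.

```javascript
// Hypothetical helper: drains an XPath result into an array of nodes.
function gatherNodes(xpathResult) {
    var nodes = [];
    var node;
    while(node = xpathResult.iterateNext()) {
        nodes.push(node);
    }
    return nodes;
}

// Inside a translator it might be called like this (the XPath is illustrative):
//   var cells = gatherNodes(doc.evaluate("//td", doc, null,
//                                        XPathResult.ANY_TYPE, null));

// Stand-in result object used to exercise the helper.
var fakeResult = {
    pending: [{nodeValue:"First"}, {nodeValue:"Second"}],
    iterateNext: function() {
        return this.pending.length ? this.pending.shift() : null;
    }
};
var found = gatherNodes(fakeResult);
```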

Search Results

Generally, a web translator should work on both search results and item description pages. On search results pages, Scholar presents a list of items available on the page and allows users to select which ones they would like to import into their libraries. Because scraping the complete metadata for every item before presenting the list would often be time-consuming, Scholar instead provides an interface for selecting items by title (or other identifier) before fetching them.

The Scholar.selectItems(associativeArray) function is responsible for displaying this list. It takes an associative array of key => value pairs, where the key is an identifier (typically a URL or other ID) that the translator will use later to retrieve the document, while the value is the human-readable resource title. It returns an array of key => value pairs representing the items the user checked off in the Select Items dialog.

var itemArray = {book1:"The First Book",
                 book2:"The Second Book"};
itemArray = Scholar.selectItems(itemArray);

In many situations, Scholar.Utilities.getItemArray(doc, contextNode, urlRegexp, ignoreRegexp) dramatically simplifies generation of the associative array passed as an argument to Scholar.selectItems().

// backslashes are doubled so that the dots are escaped in the resulting regexp
var urlRegexp = "^http://www\\.amazon\\.com/(gp/product/|exec/obidos/tg/detail/)";
var ignoreRegexp = "^(Buy new|Hardcover|Paperback|Digital)$";
var itemArray = Scholar.Utilities.getItemArray(doc, doc,
                                               urlRegexp, ignoreRegexp);
itemArray = Scholar.selectItems(itemArray);

Scholar.Utilities.getItemArray() searches for links whose targets match the given urlRegexp within the given node contextNode (usually the document object). Optionally, contextNode may also be an array of nodes.
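To make the filtering concrete, the sketch below builds a comparable associative array from a plain list of links. It is not Scholar's implementation. buildItemArray is a hypothetical function, and it assumes that ignoreRegexp is matched against the link text, which the example patterns above suggest but this document does not state.

```javascript
// Hypothetical function: keeps links whose URL matches urlRegexp and whose
// text does not match ignoreRegexp, keyed by URL.
function buildItemArray(links, urlRegexp, ignoreRegexp) {
    var urlRe = new RegExp(urlRegexp);
    var ignoreRe = ignoreRegexp ? new RegExp(ignoreRegexp) : null;
    var items = {};
    for(var i=0; i<links.length; i++) {
        var link = links[i];
        if(!urlRe.test(link.href)) continue;                 // not an item link
        if(ignoreRe && ignoreRe.test(link.text)) continue;   // ignored label
        items[link.href] = link.text;
    }
    return items;
}

var links = [
    {href:"http://www.example.com/item/1", text:"The First Book"},
    {href:"http://www.example.com/item/2", text:"Paperback"},
    {href:"http://www.example.com/help", text:"Help"}
];
var itemArray = buildItemArray(links, "^http://www\\.example\\.com/item/",
                               "^(Hardcover|Paperback)$");
```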

Loading Other Pages

Scholar provides two mechanisms for loading pages. The first, based on XMLHttpRequest, does not parse the page content, but should work regardless of the domain of the page to be fetched, and permits POST requests.

var onDone = function(responseText, requestObject) {
    // do something
    Scholar.done();
}

Scholar.Utilities.doGet("http://www.example.com/", onDone);
Scholar.Utilities.doPost("http://www.example.com/cgi-bin/form.cgi",
                         "page=1&request=2%2F3", onDone);
Scholar.wait();

The onDone function receives the response text as the first argument (usually, the page source) and the XMLHttpRequest object as the second. You may receive a permission denied error when attempting to access the XMLHttpRequest object directly.

Scholar.wait() specifies that the translation engine should treat this translator as asynchronous, waiting until Scholar.done() is called before dismissing the dialog indicating translation is in progress. Remember to call Scholar.done() when your code is complete.

The second mechanism loads a page’s DOM using Mozilla’s HTML parser and returns a document object. Working with the resulting HTML document is much easier than working with the raw response text of the first mechanism. Unfortunately, this mechanism will not work across domains, it is slower, and it does not support POST requests.

var onDone = function(document) {
    // do something
    Scholar.done();
}
var onError = function() {
    // do something
    Scholar.done();
}

Scholar.Utilities.loadDocument("http://www.example.com/",
                               onDone, onError);
Scholar.wait();

Using this mechanism, you may also load multiple documents in succession.

var onDocumentLoad = function(document) {
    // do something
}
var onAllDocumentsLoaded = function() {
    Scholar.done();
}
var onError = function() {
    // do something
    Scholar.done();
}

var urlsToLoad = new Array("http://www.example.com/1.html",
                           "http://www.example.com/2.html");

Scholar.Utilities.processDocuments(urlsToLoad, onDocumentLoad,
                                   onAllDocumentsLoaded, onError);
Scholar.wait();

Loading Other Translators

Because many websites provide a way to export data to standard formats such as RIS or MARC, in many cases web translators may not have to deal with item creation and metadata extraction at all. Instead, they may extract data in a standard format, then rely on one of the existing import translators to perform the conversion.

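To load a translator, use the Scholar.loadTranslator(type, scraperID) function, or call Scholar.loadTranslator(type) and then select the translator by its GUID with setTranslator(), as in the following example.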

var translation = Scholar.loadTranslator("import");
translation.setTranslator("a6ee60df-1ddc-4aae-bb25-45e0537be973");
translation.setString(text);
translation.translate();

In this example, the web translator loads the MARC translator. It then passes the MARC data to the translator with setString() and performs the import operation with translate(). The Scholar.eof() method should be called after writing is complete but before doImport.

All of the methods of a given translator become available when that translator is loaded. For example, in many situations, scripts may deal with MARC data without receiving it in binary MARC format.

var translation = Scholar.loadTranslator("import");
translation.setTranslator("a6ee60df-1ddc-4aae-bb25-45e0537be973");
var marc = translation.getTranslatorObject();

Scholar.Utilities.processDocuments(newURLs, function(newDoc) {
    var record = new marc.record();
    for(var i=0; i<fields.length; i++) {
        // extract code, indicators, value from fields
        record.addField(code, indicators, value);
    }

    var newItem = new Scholar.Item();
    record.translate(newItem);
    newItem.complete();
}, function() { Scholar.done(); });
Scholar.wait();

In this example, the web translator creates a new MARC record using the record class from the MARC translator, adds MARC fields with the addField() method, and performs translations with the translate() method.