Headless Browsing

PhantomJS

PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. RSelenium can drive PhantomJS using two methods: directly or via the standalone Selenium Server.

Driving PhantomJS Directly

The PhantomJS binary can be driven directly with RSelenium. PhantomJS must be started in webdriver mode; RSelenium can then communicate with it directly, without the need for Selenium Server. The command line options for PhantomJS are outlined at http://phantomjs.org/api/command-line.html. Note that PhantomJS must be started with the --webdriver option and an optional IP/port. As of v1.3.2, RSelenium has a utility function phantom that handles starting the PhantomJS binary in webdriver mode, by default on port 4444. PhantomJS can therefore be driven without Selenium Server as follows:

require(RSelenium)
pJS <- phantom()
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("http://www.google.com/ncr")
remDr$getTitle()[[1]] # [1] "Google"
remDr$close()
pJS$stop() # close the PhantomJS process; note we don't call remDr$closeServer()
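The fixed Sys.sleep(5) above is a guess at how long the binary takes to start. As a minimal sketch, one could instead poll the port until something accepts connections. The helper wait_for_port is hypothetical (it is not part of RSelenium) and uses only base R; the default port 4444 is taken from the text above.

```r
# wait_for_port is a hypothetical helper, not part of RSelenium.
# It polls host:port until a TCP connection succeeds or `timeout`
# seconds have passed, and returns TRUE or FALSE accordingly.
wait_for_port <- function(host = "localhost", port = 4444L,
                          timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    con <- tryCatch(
      suppressWarnings(
        socketConnection(host = host, port = port, open = "r+",
                         blocking = TRUE, timeout = 1)
      ),
      error = function(e) NULL  # connection refused or timed out
    )
    if (!is.null(con)) {
      close(con)
      return(TRUE)
    }
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Possible usage (assumes RSelenium and PhantomJS are installed):
# pJS <- phantom()
# if (wait_for_port(port = 4444L)) {
#   remDr <- remoteDriver(browserName = "phantomjs")
#   remDr$open()
# }
```

A successful TCP connection only shows that something is listening on the port, not that it is PhantomJS, so this is a convenience rather than a guarantee.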

Driving PhantomJS Using Selenium Server

For completeness, we outline the process of opening a PhantomJS browser using Selenium Server. It is assumed that the PhantomJS binary is in the user's path.

require(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://www.google.com/ncr")
remDr$close()
remDr$closeServer()

Providing the PhantomJS Path

It may not be possible for a user to have the PhantomJS binary in their path. In this case a user may pass the path of the PhantomJS binary to Selenium Server:

require(RSelenium)
RSelenium::startServer()
eCap <- list(phantomjs.binary.path = "C:/Users/john/Desktop/phantomjs.exe")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
....

In the above example, the PhantomJS binary has been moved to the Desktop, which we assume is not in the path. The extra capability phantomjs.binary.path, detailed at https://github.com/detro/ghostdriver, can be used to provide the path to PhantomJS to Selenium Server.
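A small sketch of how the two cases (binary on the path, binary elsewhere) might be handled in one place. The helper phantom_path_caps and its arguments are hypothetical and not part of RSelenium; only the capability name phantomjs.binary.path comes from the text above.

```r
# phantom_path_caps is a hypothetical helper, not part of RSelenium.
# If `binary` is found on the PATH it returns an empty capability list;
# otherwise it returns a list pointing Selenium Server at `fallback`.
phantom_path_caps <- function(fallback, binary = "phantomjs") {
  on_path <- nzchar(unname(Sys.which(binary)))
  if (on_path) {
    return(list())
  }
  if (!file.exists(fallback)) {
    stop("PhantomJS binary not found at: ", fallback)
  }
  list(phantomjs.binary.path = normalizePath(fallback, winslash = "/"))
}

# Possible usage (the path is the one used in the example above):
# eCap <- phantom_path_caps("C:/Users/john/Desktop/phantomjs.exe")
# remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
```

Failing early when the fallback file does not exist tends to give a clearer message than the server-side error that would otherwise surface at remDr$open().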

Additional PhantomJS Capabilities

Setting a User Agent

A user agent can be set using the phantomjs.page.settings.userAgent capability.

pJS <- phantom()
Sys.sleep(5)
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://www.whatsmyuseragent.com/")
remDr$findElement("id", "userAgent")$getElementText()[[1]]

## [1] "Your User Agent String is:\nMozilla/5.0 (Unknown; Linux x86_64)
## AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.7 Safari/534.34"

remDr$close()
eCap <- list(
  phantomjs.page.settings.userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
)
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
remDr$navigate("http://www.whatsmyuseragent.com/")
remDr$findElement("id", "userAgent")$getElementText()[[1]]

## [1] "Your User Agent String is:\nMozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0)
## Gecko/20120101 Firefox/29.0"

remDr$close()
pJS$stop()

In the above example it can be seen that the default user agent identifies us as PhantomJS. Some web content may be inaccessible or blocked for PhantomJS users. Here we demonstrate changing our user agent so that the website sees us as Firefox 29.0. The page settings are documented at https://github.com/ariya/phantomjs/wiki/API-Reference-WebPage#webpage-settings.
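As a quick illustration of why the default user agent gives PhantomJS away, the two strings returned above can be checked for the PhantomJS token. The helper is_phantom_ua is hypothetical; real sites may use other detection methods.

```r
# is_phantom_ua is a hypothetical helper: TRUE when the user agent
# string contains the literal token "PhantomJS".
is_phantom_ua <- function(ua) grepl("PhantomJS", ua, fixed = TRUE)

# The two user agent strings reported in the example above.
default_ua <- paste(
  "Mozilla/5.0 (Unknown; Linux x86_64)",
  "AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.7 Safari/534.34"
)
spoofed_ua <- paste(
  "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0)",
  "Gecko/20120101 Firefox/29.0"
)

is_phantom_ua(default_ua)  # TRUE
is_phantom_ua(spoofed_ua)  # FALSE
```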

Other Possible Options

Settings on the PhantomJS internal page object are specified in the form phantomjs.page.settings.SETTING = VALUE, where SETTING is the name of the appropriate page setting. As an example, we inhibit the loading of inline images:

require(RSelenium)
pJS <- phantom()
Sys.sleep(5)
eCap <- list(phantomjs.page.settings.loadImages = FALSE)
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
remDr$navigate("http://www.google.com/ncr")
remDr$screenshot(display = TRUE)
remDr$close()
pJS$stop()

We can see that the images are not loaded:

[Figure: PhantomJS loadImages = FALSE]
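The phantomjs.page.settings.SETTING = VALUE pattern can be wrapped in a small helper so that several settings are prefixed consistently. This is only a sketch: phantom_page_settings is a hypothetical function, not part of RSelenium, and the two settings used are the ones that appear in the examples above.

```r
# phantom_page_settings is a hypothetical helper, not part of RSelenium.
# It takes named arguments and prefixes each name with
# "phantomjs.page.settings." to form an extraCapabilities list.
phantom_page_settings <- function(...) {
  settings <- list(...)
  if (length(settings) == 0L) {
    return(list())
  }
  if (is.null(names(settings)) || any(!nzchar(names(settings)))) {
    stop("every setting must be named")
  }
  names(settings) <- paste0("phantomjs.page.settings.", names(settings))
  settings
}

eCap <- phantom_page_settings(
  loadImages = FALSE,
  userAgent  = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
)
names(eCap)

# Possible usage (assumes PhantomJS is already running in webdriver mode):
# remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
```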

X Virtual Frame Buffer

For a discussion of xvfb and the related VPS, I refer you to this blog entry, which outlines how to set up a VPS with RStudio Server, Shiny Server, etc.

Setup

The VPS I am connecting to has an IP of 128.199.255.233. I have RStudio Server running on port 8787. On the remote server we observe:

library(RSelenium)
RSelenium::startServer()
Sys.which('phantomjs')

##                  phantomjs
## "/usr/local/bin/phantomjs"

Sys.which('firefox')

## firefox
##      ""

Sys.which('chrome')

## chrome
##     ""

So we have started a Selenium Server running on the default port, 4444. Firefox and Google Chrome are not currently installed on this remote machine. Let's install Firefox first. On the remote VPS we run:

sudo apt-get install firefox

Now, checking in the remote RStudio session:

Sys.which('firefox')

##            firefox 
## "/usr/bin/firefox"

If we now try to connect to the remote server and open Firefox:

remDr <- remoteDriver(remoteServerAddr = "128.199.255.233")
remDr$open()

## [1] "Connecting to remote server"
## Error:    Summary: UnknownError
##    Detail: An unknown server-side error occurred while processing the command.
##    class: org.openqa.selenium.WebDriverException

The problem becomes apparent if we try to run firefox in the remote shell: Firefox is installed, but there is no display on our headless VPS. We can use xvfb to provide a virtual display for our browser to run in.

Xvfb :0 -screen 0 1024x768x24 > /dev/null 2>&1 &
export DISPLAY=:0
nohup xvfb-run java -jar selenium-server-standalone.jar > selenium.log &
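Before calling remDr$open() for a browser that needs a display, it can help to check both conditions discussed above from R: whether the browser binary is found, and whether a DISPLAY is set. The following is a sketch only. headless_preflight is a hypothetical helper, and it inspects the environment of the R session, which matches the Selenium Server's environment only if the server was launched from that session.

```r
# headless_preflight is a hypothetical helper, not part of RSelenium.
# It reports whether `browser` is on the PATH and whether the DISPLAY
# environment variable is set in the current R session.
headless_preflight <- function(browser = "firefox") {
  binary  <- unname(Sys.which(browser))
  display <- Sys.getenv("DISPLAY", unset = "")
  list(
    browser_found = nzchar(binary),
    browser_path  = binary,
    display_set   = nzchar(display),
    display       = display
  )
}

str(headless_preflight("firefox"))
```

PhantomJS itself does not need a display, so this check is only relevant for browsers such as Firefox.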

PhantomJS API Examples

The phantomExecute method of the remoteDriver class allows the user to interact with the PhantomJS API. Currently the method only works for direct calls to PhantomJS using the phantom utility function. Calling phantomExecute while driving PhantomJS through Selenium Server does not currently work; this is an open issue in the GhostDriver project. In the following sections we outline examples of using the PhantomJS API.

Interacting with the Console

The PhantomJS API implements a number of callbacks that can be defined; onLoadFinished is one of them. This callback is invoked when the page finishes loading. It may accept a single argument indicating the page's status: success if no network errors occurred, otherwise fail.

We give a simple example of writing to the console log when a page is loaded.

library(RSelenium)
pJS <- phantom()
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
result <- remDr$phantomExecute("var page = this;
                                page.onLoadFinished = function(status) {
                                var url = page.url;
                                console.log(\"Status:  \" + status);
                                console.log(\"Loaded:  \" + url);
                               };")
remDr$navigate("http://www.google.com/ncr")

## Status:  success
## Loaded:  http://www.google.com/

remDr$navigate("http://www.bbc.co.uk")

## Status:  success
## Loaded:  http://www.bbc.co.uk/

remDr$navigate("http://www.bbc.com")

## Status:  success
## Loaded:  http://www.bbc.com/

pJS$stop()

It can be seen that the callback persists across page calls.
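The JavaScript passed to phantomExecute is an ordinary R string, so every double quote inside it has to be escaped. As a sketch of one alternative, R 4.0.0 and later support raw strings of the form r"(...)", which avoid the backslashes. The snippet below is a slightly shortened version of the callback above and only compares the two spellings; it does not contact PhantomJS.

```r
# Requires R >= 4.0.0 for the r"(...)" raw string syntax.

# The callback written with escaped double quotes.
js_escaped <- "var page = this;
page.onLoadFinished = function(status) {
  console.log(\"Status:  \" + status);
  console.log(\"Loaded:  \" + page.url);
};"

# The same script as a raw string: no escaping needed.
js_raw <- r"(var page = this;
page.onLoadFinished = function(status) {
  console.log("Status:  " + status);
  console.log("Loaded:  " + page.url);
};)"

identical(js_escaped, js_raw)  # TRUE

# Possible usage (assumes an open remDr driving PhantomJS directly):
# result <- remDr$phantomExecute(js_raw)
```

A raw string delimited by r"( and )" cannot itself contain the sequence )", so scripts that include it need a different delimiter such as r"[...]".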

PhantomJS Writing to File

The next example demonstrates writing to file from PhantomJS. Once again the onLoadFinished callback is utilised. In this example the HTML source of the page navigated to is written to output.htm, relative to getwd(). The example also uses phantom.exit() to close PhantomJS from the API.

library(RSelenium)
library(XML) # provides htmlParse()
pJS <- phantom()
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
result <- remDr$phantomExecute("var page = this;
                                var fs = require(\"fs\");
                                page.onLoadFinished = function(status) {
                                var file = fs.open(\"output.htm\", \"w\");
                                file.write(page.content);
                                file.close();
                                phantom.exit();
                               };")
remDr$navigate("http://www.google.com/ncr")
htmlParse("output.htm")['//title/text()'][[1]]

## Google

pJS$stop()
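The file is written from inside a callback, so, assuming it may not be complete at the instant navigate returns, one might wait for it before parsing. The helper wait_for_file below is a hypothetical base R sketch and not part of RSelenium.

```r
# wait_for_file is a hypothetical helper, not part of RSelenium.
# It polls until `path` exists with a non-zero size, or until
# `timeout` seconds have passed, and returns TRUE or FALSE.
wait_for_file <- function(path, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    size <- file.info(path)$size  # NA when the file does not exist
    if (!is.na(size) && size > 0) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Possible usage with the example above (assumes the XML package):
# if (wait_for_file("output.htm")) {
#   htmlParse("output.htm")['//title/text()'][[1]]
# }
```

A non-zero size does not prove the write has finished, so for large pages a short additional pause or a size-stability check may be needed.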

Injecting a Library into PhantomJS

Next we look at includeJs.

This includes an external script from the specified URL (usually a remote location) on the page and executes the callback upon completion. The library we shall include is jQuery, using the Google CDN. Now any page we call with PhantomJS will have the jQuery library loaded after the page has finished loading.

library(RSelenium)
pJS <- phantom()
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://www.google.com/ncr")
# check whether the jQuery library is loaded
remDr$executeScript("return window.jQuery == undefined;")[[1]]
# TRUE is returned, indicating jQuery is not present
result <- remDr$phantomExecute("var page = this;
                                page.onLoadFinished = function(status) {
                                 var url = page.url;
                                 var jURL = 'http://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js';
                                 console.log(\"Status:  \" + status);
                                 console.log(\"Loaded:  \" + url);
                                 page.includeJs(jURL, function() {console.log(\"Loaded jQuery!\");})
                                };"
                               )
remDr$navigate("http://www.google.com/ncr")

## Status:  success
## Loaded:  http://www.google.com/
## Loaded jQuery!

remDr$executeScript("return window.jQuery == undefined;")[[1]]
# FALSE is returned, indicating that jQuery is present
webElem <- remDr$executeScript("return $(\"[name='q']\").get(0);")[[1]]
webElem$sendKeysToElement(list("PhantomJS was here"))
remDr$screenshot(display = TRUE)
pJS$stop()

[Figure: PhantomJS with jQuery injected]

Starting a PhantomJS Web Server

PhantomJS has the ability to act as a web server. Here we demonstrate setting PhantomJS up as a web server on localhost, port 8080. When a user browses to http://localhost:8080 they are returned a list of the current blog titles on http://www.r-bloggers.com. The jQuery library is also injected to aid extraction of the blog titles.

pJS <- phantom()
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
"
var server = require('webserver').create();
server.listen(8080, function (request, response) {
  var page = new WebPage();
  page.open('http://www.r-bloggers.com/', function (status) {
    var jURL = 'http://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js';
    page.includeJs(jURL, function() {
      console.log(\"Loaded jQuery!\");
      var blogs = page.evaluate(function () {
        res = $('#mainwrapper .post a[rel=\"bookmark\"]');
        return res.map(function(){return this.innerHTML}).toArray().join('\\n');
      });
      response.statusCode = 200;
      response.write('Current blogs on r-bloggers:\\n');
      response.write(blogs);
      response.write('\\n');
      response.close();
      page.close();
    });
  });
});" -> wsScript

remDr$phantomExecute(wsScript)

head(readLines("http://localhost:8080/"))

## Loaded jQuery!
## [1] "Current blogs on r-bloggers:"                        "Specifying complicated groups of time series in hts"
## [3] "Creating Inset Map with ggplot2"                     "R and Vertica"                                      
## [5] "RGolf: NGSL Scrabble"                                "European talks. June-July 2014"

[Figure: PhantomJS as a Web Server]
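The response is plain text: a header line followed by one title per line. A minimal sketch of turning the output of readLines into a vector of titles follows. parse_blog_titles is a hypothetical helper, and the sample lines are taken from the output shown above rather than from a live request.

```r
# parse_blog_titles is a hypothetical helper. It drops the header line
# and any empty lines from the server's plain-text response.
parse_blog_titles <- function(lines,
                              header = "Current blogs on r-bloggers:") {
  titles <- trimws(lines)
  titles <- titles[nzchar(titles)]
  titles[titles != header]
}

# Sample lines copied from the output above, plus a trailing blank line.
sample_lines <- c(
  "Current blogs on r-bloggers:",
  "Specifying complicated groups of time series in hts",
  "Creating Inset Map with ggplot2",
  "R and Vertica",
  ""
)
parse_blog_titles(sample_lines)

# Possible usage against the running server:
# parse_blog_titles(readLines("http://localhost:8080/"))
```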