PhantomJS is a headless WebKit browser scriptable with a JavaScript API. It has fast, native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. RSelenium can drive PhantomJS in two ways: directly, or via the standalone Selenium Server.
The PhantomJS binary can be driven directly with RSelenium. PhantomJS needs to be started in webdriver mode; RSelenium can then communicate with it directly, without the need for Selenium Server. The command line options for PhantomJS are outlined at http://phantomjs.org/api/command-line.html. Note that PhantomJS must be started with the --webdriver option and an optional IP/port. As of v1.3.2, RSelenium has a utility function phantom that starts the PhantomJS binary in webdriver mode, by default on port 4444. PhantomJS can therefore be driven without Selenium Server as follows:
require(RSelenium)
pJS <- phantom()
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("http://www.google.com/ncr")
remDr$getTitle()[[1]] # [1] "Google"
remDr$close()
pJS$stop() # close the PhantomJS process; note we don't call remDr$closeServer()
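For readers who want to see what starting PhantomJS in webdriver mode involves, the sketch below builds the --webdriver argument described above and launches the binary with base R's system2. The helper names webdriver_arg and start_phantom_webdriver are hypothetical and are not part of RSelenium. The sketch assumes a phantomjs binary is on the path, the port 8910 in the second call is arbitrary, and this is not necessarily how the phantom utility itself is implemented.

```r
# Hypothetical helper (not part of RSelenium): build the --webdriver
# argument, with an optional IP in front of the port.
webdriver_arg <- function(port = 4444L, ip = NULL) {
  target <- if (is.null(ip)) as.character(port) else paste0(ip, ":", port)
  paste0("--webdriver=", target)
}

# Hypothetical launcher: start the binary in the background without
# blocking the R session. Assumes `binary` can be found on the path.
start_phantom_webdriver <- function(binary = "phantomjs", port = 4444L, ip = NULL) {
  if (!nzchar(Sys.which(binary))) {
    stop("PhantomJS binary not found on the path: ", binary)
  }
  system2(binary, args = webdriver_arg(port, ip), wait = FALSE)
}

webdriver_arg()                    # "--webdriver=4444"
webdriver_arg(8910L, "127.0.0.1")  # "--webdriver=127.0.0.1:8910"
```

A process started this way would have to be stopped by the user; the phantom utility returns an object with a stop method for that purpose.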
For completeness we outline the process of opening a PhantomJS browser using Selenium Server. It is assumed that the PhantomJS binary is in the user's path.
require(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://www.google.com/ncr")
remDr$close()
remDr$closeServer()
It may not be possible for a user to have the PhantomJS binary in their path. In this case the user may pass the path of the PhantomJS binary to Selenium Server:
require(RSelenium)
RSelenium::startServer()
eCap <- list(phantomjs.binary.path = "C:/Users/john/Desktop/phantomjs.exe")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
....
In the above example the PhantomJS binary has been moved to the Desktop, which we assume is not in the path. The extra capability phantomjs.binary.path, detailed at https://github.com/detro/ghostdriver, provides the path to PhantomJS to Selenium Server.
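A short sketch of the decision described above: pass phantomjs.binary.path only when the binary cannot be found on the path. The helper phantom_caps and its on_path argument are hypothetical names introduced for illustration; Sys.which is base R.

```r
# Hypothetical helper: return the extra capabilities needed to locate
# PhantomJS. If the binary is already on the path nothing extra is needed.
phantom_caps <- function(binary_path, on_path = nzchar(Sys.which("phantomjs"))) {
  if (on_path) {
    list()
  } else {
    list(phantomjs.binary.path = binary_path)
  }
}

# Binary not on the path: the capability is set explicitly.
eCap <- phantom_caps("C:/Users/john/Desktop/phantomjs.exe", on_path = FALSE)
str(eCap)
```

The resulting list can be passed as extraCapabilities to remoteDriver in the same way as the hand-written eCap.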
A user agent can be set using the phantomjs.page.settings.userAgent capability.
pJS <- phantom()
Sys.sleep(5)
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://www.whatsmyuseragent.com/")
remDr$findElement("id", "userAgent")$getElementText()[[1]]
## [1] "Your User Agent String is:\nMozilla/5.0 (Unknown; Linux x86_64)
## AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.7 Safari/534.34"
remDr$close()
eCap <- list(
phantomjs.page.settings.userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
)
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
remDr$navigate("http://www.whatsmyuseragent.com/")
remDr$findElement("id", "userAgent")$getElementText()[[1]]
## [1] "Your User Agent String is:\nMozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0)
## Gecko/20120101 Firefox/29.0"
The available page settings are listed at https://github.com/ariya/phantomjs/wiki/API-Reference-WebPage#webpage-settings. In the above example it can be seen that the default user agent identifies us as PhantomJS. Some web content may be inaccessible or blocked for PhantomJS users. Here we demonstrate changing our user agent so the website sees us as Firefox 29.0.
The general form for specifying PhantomJS internal page objects is phantomjs.page.settings.SETTING = VALUE, where SETTING is the appropriate PhantomJS internal page object. As an example we inhibit the loading of inline images:
require(RSelenium)
pJS <- phantom()
Sys.sleep(5)
eCap <- list(phantomjs.page.settings.loadImages = FALSE)
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
remDr$navigate("http://www.google.com/ncr")
remDr$screenshot(display = TRUE)
remDr$close()
pJS$stop()
The resulting screenshot shows that the images are not loaded.
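Since every page setting follows the same phantomjs.page.settings.SETTING = VALUE pattern, the capability list can be built from plain setting names. The helper page_settings below is a hypothetical convenience, not an RSelenium function; loadImages and userAgent are the settings already used in this section.

```r
# Hypothetical helper: prefix each setting name with
# "phantomjs.page.settings." to form the capability names.
page_settings <- function(...) {
  settings <- list(...)
  if (length(settings) == 0L) {
    return(list())
  }
  if (is.null(names(settings)) || any(names(settings) == "")) {
    stop("All settings must be named, e.g. loadImages = FALSE")
  }
  names(settings) <- paste0("phantomjs.page.settings.", names(settings))
  settings
}

eCap <- page_settings(
  loadImages = FALSE,
  userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
)
names(eCap)
```

The helper only renames list elements; whether PhantomJS accepts a given setting still depends on the settings it documents.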
For the discussion of xvfb and the related VPS, I refer you to this blog entry. It outlines how to set up a VPS with RStudio Server, Shiny Server, etc.
The VPS I am connecting to has an IP of 128.199.255.233. I have RStudio Server running on port 8787. On the remote server we observe
##                  phantomjs                    firefox                     chrome
## "/usr/local/bin/phantomjs"                         ""                         ""
So we have started a Selenium Server running on the default port 4444. Firefox and Google Chrome are not currently installed on this remote machine. Let's install Firefox first. On the remote VPS we run
Now checking in the remote RStudio session
## firefox
## "/usr/bin/firefox"
If we try now to connect to the remote server and open firefox:
## [1] "Connecting to remote server"
## Error: Summary: UnknownError
## Detail: An unknown server-side error occurred while processing the command.
## class: org.openqa.selenium.WebDriverException
We can see the problem if we try to run firefox in the remote shell:
Firefox is installed, but there is no display on our headless VPS. We can use xvfb to provide a virtual display for our browser to run in.
Xvfb :0 -screen 0 1024x768x24 >/dev/null 2>&1 &
export DISPLAY=:0
nohup xvfb-run java -jar selenium-server-standalone.jar > selenium.log &
The phantomExecute method of the remoteDriver class allows the user to interact with the PhantomJS API. Currently the method only works for direct calls to PhantomJS using the phantom utility function. Driving PhantomJS through Selenium Server and calling the phantomExecute method currently doesn't function and is an open issue in the ghostDriver project. In the following sections we outline examples of using the PhantomJS API.
The PhantomJS API implements a number of callbacks which can be defined. onLoadFinished is one such callback. It is invoked when the page finishes loading and may accept a single argument indicating the page's status: success if no network errors occurred, otherwise fail.
We give a simple example of writing to the console log when a page is loaded.
library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
result <- remDr$phantomExecute("var page = this;
page.onLoadFinished = function(status) {
var url = page.url;
console.log(\"Status: \" + status);
console.log(\"Loaded: \" + url);
};")
remDr$navigate("http://www.google.com/ncr")
## Status: success
## Loaded: http://www.google.com/
remDr$navigate("http://www.bbc.co.uk")
## Status: success
## Loaded: http://www.bbc.co.uk/
## Status: success
## Loaded: http://www.bbc.com/
It can be seen that the callback persists across page calls.
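The escaped quotes in the script above are easy to get wrong. One alternative, sketched here on the assumption that the JavaScript itself is unchanged, is to assemble the script from single-quoted R strings so that the double quotes inside need no escaping. The variable name on_load_js is arbitrary.

```r
# Single-quoted R strings let the JavaScript keep its double quotes as-is.
on_load_js <- paste(
  'var page = this;',
  'page.onLoadFinished = function(status) {',
  '  var url = page.url;',
  '  console.log("Status: " + status);',
  '  console.log("Loaded: " + url);',
  '};',
  sep = "\n"
)

# Print the script to check it reads as intended before sending it.
cat(on_load_js)
```

The string could then be passed on with remDr$phantomExecute(on_load_js) in place of the inline literal.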
The next example demonstrates writing to a file from PhantomJS. Once again the onLoadFinished callback is utilised. In this example the HTML source of the page that is navigated to is written to output.htm relative to getwd(). An example is also given of using phantom.exit() to close PhantomJS from the API.
library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
result <- remDr$phantomExecute("var page = this;
var fs = require(\"fs\");
page.onLoadFinished = function(status) {
var file = fs.open(\"output.htm\", \"w\");
file.write(page.content);
file.close();
phantom.exit();
};")
remDr$navigate("http://www.google.com/ncr")
XML::htmlParse("output.htm")['//title/text()'][[1]] # htmlParse comes from the XML package
## Google
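Because the file is written from the callback, it may not exist the instant navigate returns. A minimal sketch of waiting for it from R follows; wait_for_file is a hypothetical helper built on base R only, and the timeout and polling interval are arbitrary choices.

```r
# Hypothetical helper: poll until `path` exists and is non-empty,
# or give up after `timeout` seconds. Returns TRUE or FALSE.
wait_for_file <- function(path, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    size <- file.info(path)$size
    if (!is.na(size) && size > 0) {
      return(TRUE)
    }
    if (Sys.time() >= deadline) {
      return(FALSE)
    }
    Sys.sleep(interval)
  }
}

# Usage, assuming output.htm is written to the working directory:
# if (wait_for_file("output.htm")) XML::htmlParse("output.htm")
```

A non-empty file is only a rough signal that writing has finished; for large pages a short extra pause may still be needed.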
Next we look at includeJs. This includes an external script from the specified URL (usually a remote location) on the page and executes the callback upon completion. The library we shall include is jQuery, using the Google CDN. Now any page we call with PhantomJS will have the jQuery library loaded after the page has finished loading.
library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://www.google.com/ncr")
# check if the JQuery library is loaded
remDr$executeScript("return window.jQuery == undefined;")[[1]]
# TRUE is returned indicating JQuery is not present
result <- remDr$phantomExecute("var page = this;
page.onLoadFinished = function(status) {
var url = page.url;
var jURL = 'http://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js';
console.log(\"Status: \" + status);
console.log(\"Loaded: \" + url);
page.includeJs(jURL, function() {console.log(\"Loaded jQuery!\");})
};"
)
remDr$navigate("http://www.google.com/ncr")
## Status: success
## Loaded: http://www.google.com/
## Loaded jQuery!
remDr$executeScript("return window.jQuery == undefined;")[[1]]
# FALSE is returned indicating that JQuery is present
webElem <- remDr$executeScript("return $(\"[name='q']\").get(0);")[[1]]
webElem$sendKeysToElement(list("PhantomJS was here"))
remDr$screenshot(display = TRUE)
pJS$stop()
PhantomJS has the ability to act as a web server. Here we demonstrate setting PhantomJS up as a web server on localhost on port 8080. When a user browses to http://localhost:8080 they are returned a list of the current blog titles on http://www.r-bloggers.com. The jQuery library is also injected to aid extraction of the blog titles.
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
"
var server = require('webserver').create();
server.listen(8080, function (request, response) {
var page = new WebPage();
page.open('http://www.r-bloggers.com/', function (status) {
var jURL = 'http://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js';
page.includeJs(jURL, function() {
console.log(\"Loaded jQuery!\");
var blogs = page.evaluate(function () {
res = $('#mainwrapper .post a[rel=\"bookmark\"]');
return res.map(function(){return this.innerHTML}).toArray().join('\\n');
});
response.statusCode = 200;
response.write('Current blogs on r-bloggers:\\n');
response.write(blogs);
response.write('\\n');
response.close();
page.close();
});
});
});" -> wsScript
remDr$phantomExecute(wsScript)
head(readLines("http://localhost:8080/"))
## Loaded jQuery!
## [1] "Current blogs on r-bloggers:" "Specifying complicated groups of time series in hts"
## [3] "Creating Inset Map with ggplot2" "R and Vertica"
## [5] "RGolf: NGSL Scrabble" "European talks. June-July 2014"
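The response is plain text with a header line followed by one title per line, so it can be turned into a character vector of titles with base R. The sketch below uses the lines printed above as sample input rather than a live request; parse_blog_titles is a hypothetical helper.

```r
# Hypothetical helper: drop the header line and any empty lines.
parse_blog_titles <- function(lines, header = "Current blogs on r-bloggers:") {
  titles <- lines[lines != header]
  titles[nzchar(trimws(titles))]
}

# Sample input copied from the output shown above.
sample_lines <- c(
  "Current blogs on r-bloggers:",
  "Specifying complicated groups of time series in hts",
  "Creating Inset Map with ggplot2",
  "R and Vertica",
  "RGolf: NGSL Scrabble",
  "European talks. June-July 2014"
)

titles <- parse_blog_titles(sample_lines)
length(titles)  # 5
```

With the PhantomJS web server running, the same helper could be applied to readLines("http://localhost:8080/").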