Is It Possible To Plug A JavaScript Engine With Ruby And Nokogiri?
Solution 1:
You are looking for Watir, which drives a real browser and lets you perform every action you can think of on a web page. There's a similar project called Selenium.
You can even use Watir with a so-called 'headless' browser on a Linux machine.
Watir headless example
Suppose we have this HTML:
<p id="hello">Hello from HTML</p>
and this JavaScript:
document.getElementById('hello').innerHTML = 'Hello from JavaScript';
(Demo: http://jsbin.com/ivihur)
and you wanted to get the dynamically inserted text. First, you need a Linux box with xvfb and firefox installed; for example, on Ubuntu do:
$ apt-get install xvfb firefox
You will also need the watir-webdriver and headless gems, so go ahead and install them as well:
$ gem install watir-webdriver headless
Then you can read the dynamic content from the page with something like this:
require 'rubygems'
require 'watir-webdriver'
require 'headless'

# Start a virtual X display (xvfb) so Firefox can run without a real screen
headless = Headless.new
headless.start

# Launch the browser inside the virtual display
browser = Watir::Browser.new
browser.goto 'http://jsbin.com/ivihur' # our example

# By now the page's JavaScript has run, so the element contains the injected text
el = browser.element :css => '#hello'
puts el.text

browser.close
headless.destroy
If everything went right, this will output:
Hello from JavaScript
I know this runs a browser in the background as well, but it's the easiest solution to your problem I could come up with. It takes quite a while to start the browser, but subsequent requests are quite fast. (Running goto and then fetching the dynamic text above multiple times took about 0.5 seconds per request on my Rackspace Cloud Server.)
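In other words, you pay the slow browser startup once and then reuse the same Watir::Browser for every request. A rough sketch of that pattern (the repeated jsbin URL and the timing printout are purely illustrative):

require 'rubygems'
require 'watir-webdriver'
require 'headless'
require 'benchmark'

headless = Headless.new
headless.start
browser = Watir::Browser.new            # pay the slow startup cost once

# Illustrative: hit the same demo page a few times with one browser instance
3.times do
  elapsed = Benchmark.realtime do
    browser.goto 'http://jsbin.com/ivihur'
    puts browser.element(:css => '#hello').text
  end
  puts "request took %.2fs" % elapsed
end

browser.close
headless.destroy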
Solution 2:
Capybara + PhantomJS
My favorite Ruby-controlled headless browser is PhantomJS, a headless WebKit-based browser. You drive it from Capybara through Poltergeist, a Capybara driver that talks to PhantomJS.
In summary, the stack looks like this:
Capybara -> Poltergeist -> PhantomJS -> WebKit
Note that you can use PhantomJS directly with selenium-webdriver, but the Capybara API is nicer (IMHO).
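For reference, driving PhantomJS straight from selenium-webdriver looks roughly like this (a minimal sketch that assumes the phantomjs binary is on your PATH and a selenium-webdriver version that still ships the :phantomjs driver):

require 'selenium-webdriver'

# Assumes phantomjs is installed and reachable on the PATH
driver = Selenium::WebDriver.for :phantomjs
driver.navigate.to 'http://jsbin.com/ivihur'    # reuse the earlier demo page
puts driver.find_element(:css, '#hello').text
driver.quit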
Being a minimal WebKit implementation, PhantomJS has a faster startup time than a full browser like Chrome or IE.
Sample code to scrape Google result links:

require 'capybara/dsl'
require 'capybara/poltergeist'

# Use the Poltergeist (PhantomJS) driver and point Capybara at Google
Capybara.default_driver = :poltergeist
Capybara.run_server = false
Capybara.app_host = 'https://www.google.com'

module Test
  class Google
    include Capybara::DSL

    def get_results
      visit('/')
      fill_in "q", :with => "Capybara"
      click_button "Google Search"
      all(:xpath, "//li[@class='g']/h3/a").each { |a| puts a[:href] }
    end
  end
end

scraper = Test::Google.new
scraper.get_results
In addition to the standard Capybara features, Poltergeist can do some very convenient things:
- Inject and run your own JavaScript with page.evaluate_script and page.execute_script
- page.within_frame and page.within_window
- page.status_code and page.response_headers
- page.save_screenshot <- This is really nice when things go wrong!
- page.driver.render_base64(format, options)
- page.driver.scroll_to(left, top)
- page.driver.basic_authorize(user, password)
- element.native.send_keys(*keys)
- cookie handling
- drag-and-drop
These features are listed on the Poltergeist GitHub page: https://github.com/teampoltergeist/poltergeist.
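To make a few of those concrete, here is a rough sketch that mixes some of them into a Capybara session (it reuses the jsbin demo page from Solution 1 as a placeholder target, and the screenshot filename is arbitrary):

require 'capybara/dsl'
require 'capybara/poltergeist'

include Capybara::DSL
Capybara.default_driver = :poltergeist
Capybara.run_server = false

visit 'http://jsbin.com/ivihur'
puts page.status_code                              # HTTP status of the last response
puts page.evaluate_script('navigator.userAgent')   # run JavaScript inside the page
page.save_screenshot('debug.png')                  # arbitrary filename; handy when things go wrong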
Celerity
If you really want to eke out as much performance as possible, and don't mind switching to JRuby to do so, I have found Celerity to be super fast.
Celerity is a wrapper around Java's HTMLUnit. It is speedy because HTMLUnit is not a full browser; it is more of an emulator that executes JavaScript. The downside is that it doesn't support all the JavaScript that a full browser does, so it won't work on very JS-heavy sites, but it is sufficient for most sites and getting better all the time.
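Celerity exposes a Watir-style API on top of HTMLUnit, so reading the earlier demo page might look something like the following minimal sketch (run under JRuby; the Watir-style element accessor is an assumption about Celerity's API):

require 'rubygems'
require 'celerity'

# HTMLUnit fetches the page and executes its JavaScript without a real browser
browser = Celerity::Browser.new
browser.goto 'http://jsbin.com/ivihur'
puts browser.p(:id, 'hello').text   # Watir-style element accessor (assumed)
browser.close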
Another advantage is the multithreaded nature of JRuby. With the Peach (parallel each) gem, you can fire off many browsers in parallel. I have done this with a test suite in the past and drastically reduced the time to finish. In fact, we made a load tester using Celerity + Peach that was much more sophisticated than your typical JMeter, Grinder, apachebench, etc. It could really exercise our site!
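A hedged sketch of that parallel pattern, assuming the peach gem's Array#peach and a placeholder list of URLs:

require 'rubygems'
require 'celerity'
require 'peach'

# Placeholder URLs; in practice these would be the pages you want to exercise
urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']

# peach runs each iteration in its own thread; JRuby threads are native threads
urls.peach do |url|
  browser = Celerity::Browser.new
  browser.goto url
  puts "#{url}: #{browser.text[0, 60]}"
  browser.close
end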