
Is It Possible To Plug A JavaScript Engine With Ruby And Nokogiri?

I'm writing an application to crawl some websites and scrape data from them. I'm using Ruby, Curl and Nokogiri to do this. In most cases it's straightforward and I only need to pin

Solution 1:

You are looking for Watir, which drives a real browser and lets you perform nearly any action a user could on a web page. Selenium is a similar project.

You can even use Watir with a so-called 'headless' browser on a Linux machine.

Watir headless example

Suppose we have this HTML:

<p id="hello">Hello from HTML</p>

and this JavaScript:

document.getElementById('hello').innerHTML = 'Hello from JavaScript';

(Demo: http://jsbin.com/ivihur)

and you want to get the dynamically inserted text. First, you need a Linux box with xvfb and Firefox installed. For example, on Ubuntu:

$ apt-get install xvfb firefox

You will also need the watir-webdriver and headless gems, so install them as well:

$ gem install watir-webdriver headless

Then you can read the dynamic content from the page with something like this:

require 'rubygems'
require 'watir-webdriver'
require 'headless'

headless = Headless.new
headless.start
browser = Watir::Browser.new

browser.goto 'http://jsbin.com/ivihur' # our example
el = browser.element :css => '#hello'
puts el.text

browser.close
headless.destroy

If everything went right, this will output:

Hello from JavaScript

I know this runs a browser in the background as well, but it's the easiest solution to your problem I could come up with. Starting the browser takes quite a while, but subsequent requests are fast: running goto and then fetching the dynamic text above several times took about 0.5 seconds per request on my Rackspace Cloud Server.

Source: http://watirwebdriver.com/headless/
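
Since the startup cost is paid once and later requests are cheap, it can help to keep one browser open and loop over the pages. The following is only a minimal sketch of that idea, not part of the original answer: scrape_texts and FakeBrowser are hypothetical names, and the stand-in exists only so the snippet runs without Firefox or xvfb. It assumes the browser object responds to goto and element(...).text the way the Watir::Browser in the example above does.

```ruby
# Sketch: pay the browser startup cost once, then reuse the session for many pages.
# `browser` is anything that responds to goto(url) and element(css: ...).text,
# which is assumed to match the Watir::Browser used in the example above.
def scrape_texts(browser, urls, css)
  urls.map do |url|
    browser.goto(url)
    [url, browser.element(css: css).text]
  end.to_h
end

# Hypothetical stand-in for a real browser; it only mimics the two calls used above.
class FakeBrowser
  FakeElement = Struct.new(:text)

  def initialize(pages)
    @pages = pages
    @current = nil
  end

  def goto(url)
    @current = url
  end

  def element(_selector)
    FakeElement.new(@pages.fetch(@current))
  end
end

pages = {
  'http://example.test/a' => 'Hello A',
  'http://example.test/b' => 'Hello B'
}
results = scrape_texts(FakeBrowser.new(pages), pages.keys, '#hello')
results.each { |url, text| puts "#{url}: #{text}" }
```

With a real setup you would pass the Watir browser created after headless.start instead of FakeBrowser, and call browser.close and headless.destroy once at the end.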


Solution 2:

Capybara + PhantomJS

My favorite Ruby-controlled headless browser is PhantomJS, a headless WebKit-based browser. It is driven from Ruby through Poltergeist, a separate gem that acts as a Capybara driver for PhantomJS.

In summary, the stack looks like this:

Capybara -> Poltergeist -> PhantomJS -> WebKit

Note that you can use PhantomJS directly with selenium-webdriver, but the Capybara API is nicer (IMHO).

Being a minimal WebKit implementation, PhantomJS has a faster startup time than a full browser like Chrome or IE.

Sample code to scrape Google result links:

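require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

# Drive PhantomJS through Poltergeist and target a remote site (no local Rack app).
Capybara.default_driver = :poltergeist
Capybara.run_server = false
Capybara.app_host = 'http://www.google.com'

module Test
  class Google
    include Capybara::DSL

    # Search for "Capybara" and print the href of each result link.
    def get_results
      visit('/')
      fill_in "q", :with => "Capybara"
      click_button "Google Search"
      all(:xpath, "//li[@class='g']/h3/a").each { |a| puts a[:href] }
    end
  end
end

scraper = Test::Google.new
scraper.get_results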

In addition to the standard Capybara features, Poltergeist can do some very convenient things:

  • Inject and run your own JavaScript with page.evaluate_script and page.execute_script
  • page.within_frame and page.within_window
  • page.status_code and page.response_headers
  • page.save_screenshot <- This is really nice when things go wrong!
  • page.driver.render_base64(format, options)
  • page.driver.scroll_to(left, top)
  • page.driver.basic_authorize(user, password)
  • element.native.send_keys(*keys)
  • cookie handling
  • drag-and-drop

These features are listed on the Poltergeist GitHub page: https://github.com/teampoltergeist/poltergeist.
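
A few of these calls combine naturally into a small debugging helper for when a scrape goes wrong. This is only a sketch under stated assumptions: capture_debug_info and FakeSession are hypothetical names that do not come from Poltergeist, and the fake session exists only so the snippet runs without PhantomJS. The helper uses just save_screenshot, status_code and evaluate_script from the list above.

```ruby
# Sketch: gather debugging details from a Capybara session after a failed scrape.
# `session` is assumed to respond to the calls named in the list above:
# save_screenshot, status_code and evaluate_script.
def capture_debug_info(session, screenshot_path)
  session.save_screenshot(screenshot_path)
  {
    status: session.status_code,
    title: session.evaluate_script('document.title'),
    screenshot: screenshot_path
  }
end

# Hypothetical stand-in so the sketch runs without PhantomJS installed.
class FakeSession
  attr_reader :saved_to

  def save_screenshot(path)
    @saved_to = path
  end

  def status_code
    200
  end

  def evaluate_script(_script)
    'Example page'
  end
end

session = FakeSession.new
info = capture_debug_info(session, 'failure.png')
puts info.inspect
```

Inside a class that includes Capybara::DSL, you would pass page as the session, for example from a rescue block around the scraping code.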

Celerity

If you really want to eke out as much performance as possible, and don't mind switching to JRuby to do so, I have found Celerity to be super fast.

Celerity is a wrapper around Java's HtmlUnit. It is speedy because HtmlUnit is not a full browser; it is more of an emulator that executes JavaScript. The downside is that it doesn't support all the JavaScript that a full browser does, so it struggles with very JS-heavy sites, but it is sufficient for most sites and getting better all the time.

Another advantage is the multithreaded nature of JRuby. With the Peach (parallel each) gem, you can fire off many browsers in parallel. I have done this with a test suite in the past and drastically reduced the time to finish. In fact, we made a load tester using Celerity + Peach that was much more sophisticated than your typical JMeter, Grinder, apachebench, etc. It could really exercise our site!
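
To illustrate the "parallel each" idea without depending on Peach, here is a minimal sketch using only Ruby's standard library. It is not Peach's API: parallel_map and its workers parameter are hypothetical names, and the block below just measures string lengths where a real scraper would drive a browser.

```ruby
# Sketch of a "parallel each" built on plain threads and a work queue.
# Each worker thread pulls items until the queue is empty.
def parallel_map(items, workers: 4, &block)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }

  results = Array.new(items.size)
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        begin
          item, index = queue.pop(true) # non-blocking pop
        rescue ThreadError
          break # queue is empty, this worker is done
        end
        value = block.call(item)
        lock.synchronize { results[index] = value } # keep input order
      end
    end
  end

  threads.each(&:join)
  results
end

urls = %w[http://example.test/1 http://example.test/2 http://example.test/3]
lengths = parallel_map(urls, workers: 2) { |url| url.length }
puts lengths.inspect
```

On JRuby the worker threads can run truly in parallel, which is the advantage described above. On standard Ruby (MRI) threads still overlap while waiting on network I/O, so the pattern can help there too, though less for CPU-bound work.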

