intelligent internet agents

August 14, 2009, 5:12 pm
Filed under: crawl, engine, software

htmlunit repository – svn co .
open project in netbeans as maven2 project
run tests (from netbeans or command line mvn test)
run build (from netbeans or from command line mvn package)

This creates the library file needed for celerity, which should then be placed in the htmlunit folder of the celerity folder. After this, build, test and install the gem.


August 14, 2009, 5:06 pm
Filed under: crawl, engine, software

celerity – project home page

celerity – source page

celerity installation

git clone git://

jruby -S gem install hoe – test dependency
jruby -S gem install sinatra – test dependency
jruby -S gem build – build the gem
jruby -S rake watirspec:init – to fetch the watir spec tests
jruby -S rake spec (or jruby -S spec spec/**/*_spec.rb) – to run the tests
jruby -S gem install – to install the gem

August 11, 2009, 10:17 am
Filed under: crawl, engine, software, viewers

browser = Celerity::Browser.new(:proxy     => "localhost:6429",
                                :log_level => :off)

Viewers available at:

Page Caching
August 11, 2009, 7:33 am
Filed under: crawl, engine, software

Note: This does not work. Only the pointer to the object gets marshaled, not the object itself.

require "rubygems"
require "celerity"

browser = Celerity::Browser.new(:browser => :firefox, :log_level => :off)

browser.resynchronized do
  browser.link(:text, "News").click
end

File.open("#{Dir.pwd}/monster.dmp", 'w') do |f|
  Marshal.dump(browser.page, f)
end

File.open("#{Dir.pwd}/monster.dmp", 'r') do |f|
  browser.page = Marshal.load(f)
end

puts browser.url

see also

Breadth First or Depth First
August 10, 2009, 8:53 pm
Filed under: crawl, strategy

Unless you can cache pages, a breadth-first crawl is more expensive than a depth-first crawl. In a dynamic-page environment it is more expensive to do a breadth-first search (which you would normally do in a pure HTML environment), because in the dynamic environment you must always be in the context of the page that contains the link.

A breadth-first crawl requires a constant rollback to the current position in the FIFO queue of the breadth-first search tree.

Root -> Link a - current page root
Root -> Link b - current page root
Root -> Link c - current page root
Click Link a
Link a -> Link d - current page a
Link a -> Link e - current page a
Link a -> Link f - current page a
Goto Root
Click Link b

The deeper the tree gets, the more rollback needs to be done – from the root to wherever the current tree position is.
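The rollback cost can be sketched in plain Ruby. The site below is a hypothetical hash-based link graph (no real browser involved, and the structure mirrors the trace above); `crawl_breadth_first` counts the navigation steps a crawler pays when every dequeued link forces it to re-navigate from the root to the link's parent page:

```ruby
require "set"

# Hypothetical link graph: page => pages it links to (mirrors the trace above).
SITE = {
  "root" => ["a", "b", "c"],
  "a"    => ["d", "e", "f"],
  "b"    => [], "c" => [],
  "d"    => [], "e" => [], "f" => []
}

# Breadth-first crawl where clicking a link requires being on its parent
# page: count the re-navigation (rollback) steps from the root each time.
def crawl_breadth_first(site)
  queue     = [["root", []]]   # FIFO of [page, path from root to its parent]
  visited   = Set.new
  rollbacks = 0
  until queue.empty?
    page, path = queue.shift
    next unless visited.add?(page)
    rollbacks += path.length   # re-navigate root -> ... -> parent page
    site.fetch(page, []).each { |link| queue << [link, path + [page]] }
  end
  rollbacks
end

puts crawl_breadth_first(SITE)  # => 9
```

Pages one level down cost one rollback step each, pages two levels down cost two, and so on – which is the "deeper the tree, the more rollback" observation in code.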

The form percept
August 10, 2009, 2:33 pm
Filed under: forms, percepts

In general the form tag has a method attribute ("GET" or "POST") and an associated action. The action is defined by a target URL (relative or absolute) behind which some piece of software on the server side will pick up and process the data that the user provided through the input elements within the form tags on the HTML page.


This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:

  • If the method is “get” and the action is an HTTP URI, the user agent takes the value of action, appends a `?’ to it, then appends the form data set, encoded using the “application/x-www-form-urlencoded” content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.
  • If the method is “post” and the action is an HTTP URI, the user agent conducts an HTTP “post” transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.

For any other value of action or method, behavior is unspecified.
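The "get" case quoted above can be reproduced with Ruby's standard library. A sketch – the action URL and field names are made up for illustration:

```ruby
require "uri"

action = "http://example.com/search"                  # hypothetical form action
data   = [["q", "intelligent agents"], ["lang", "en"]] # hypothetical form data set

# Per the HTML 4 rule: take the action, append '?', then append the
# form data set encoded as application/x-www-form-urlencoded.
get_uri = action + "?" + URI.encode_www_form(data)
puts get_uri  # => http://example.com/search?q=intelligent+agents&lang=en
```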

Inputs outside form tags.

You might think that this is invalid:


And of course it is.

What does it mean? Why are input elements outside a form tag valid?
Actually, in the case above the form tags are irrelevant because they have no elements within them.

So a more appropriate notation would be:


Inputs can be children of any block element. Why? Well, you can of course let users input data, manipulate it and display the results using client-side JavaScript without ever having to send anything to the server. This is the idea behind inputs outside the form tags.

Think of a simple calculator on an HTML page – this needs no server-side processing at all, but it does need input from the user.

What does this mean?

1. If input elements are outside a form then we can take for granted that JavaScript will be processing them. BUT we do not know whether they will be processed entirely by the client, partly by the client and partly by the server, or entirely by the server.

2. If input elements are within a form then we know that the inputs will be sent to the server IF there is a submit button defined. If there is no submit button defined, we can take it for granted that some client-side JavaScript is listening to some element, such as a link, and will send some or all of the parameters – as-is or transformed – to the server, using either standard or custom JavaScript methods.

3. If input elements are outside a form then we cannot identify which inputs belong together. Again, different pieces of JavaScript code could be responsible for sending different sets of input values to the server.
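Points 1 and 2 amount to a rough classification rule for a crawler. A sketch over a made-up minimal form representation (this is not Celerity API):

```ruby
# Made-up minimal representation of a form and its input elements.
Form = Struct.new(:inputs) do
  def submit_button?
    inputs.any? { |i| i[:type] == "submit" }
  end
end

# Classify how a set of inputs is likely to be processed:
# nil form => inputs live outside any form tag (point 1),
# submit button present => standard server submission (point 2),
# otherwise => some JavaScript handler will do the sending.
def classify(form)
  return :javascript_only if form.nil?
  form.submit_button? ? :server_submit : :javascript_submit
end

puts classify(nil)                                            # => javascript_only
puts classify(Form.new([{ type: "text" }, { type: "submit" }])) # => server_submit
puts classify(Form.new([{ type: "text" }]))                     # => javascript_submit
```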

The answer is clustering inputs using the DOM tree:

This gives the best indication of which inputs belong together from a user perspective. It also has the advantage that a "unique set" can be identified – meaning that we can identify the form as being unique on any leaf of the website tree.
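The clustering idea can be sketched with a made-up DOM node class: each input is assigned to the cluster rooted at its nearest block-level ancestor (here just form, div or body – a simplifying assumption), so inputs under the same block are taken to belong together:

```ruby
# Made-up minimal DOM node: tag, parent, children.
class Node
  attr_reader :tag, :parent, :children
  def initialize(tag, parent = nil)
    @tag, @parent, @children = tag, parent, []
    parent.children << self if parent
  end

  # Nearest ancestor that acts as a clustering root for inputs.
  def cluster_root
    n = parent
    n = n.parent until n.nil? || %w[form div body].include?(n.tag)
    n
  end
end

body    = Node.new("body")
search  = Node.new("div", body)   # a search box area
q       = Node.new("input", search)
go      = Node.new("input", search)
sidebar = Node.new("div", body)   # an unrelated sidebar widget
zip     = Node.new("input", sidebar)

# Inputs grouped by their nearest block ancestor: two distinct clusters.
clusters = [q, go, zip].group_by(&:cluster_root)
clusters.each { |root, inputs| puts "#{root.tag}: #{inputs.size} inputs" }
```

The two search inputs cluster under one div and the sidebar input under another, which is exactly the "unique set" the crawler needs even when no form tags are present.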