Filed under: crawl, strategy | Tags: breadth first, crawl strategy, depth first
Unless you can cache pages, a breadth-first crawl is more expensive than a depth-first one. In a pure HTML environment you would normally do a breadth-first search, but in a dynamic page environment it costs more, because you must always be in the context of the page that contains a link before you can follow it.
A breadth-first crawl therefore requires a constant rollback to the current position in the FIFO list of the breadth-first search tree.
Root -> link a (current page: root)
Root -> link b (current page: root)
Root -> link c (current page: root)
Click link a
Link a -> link d (current page: a)
Link a -> link e (current page: a)
Link a -> link f (current page: a)
Go to root
Click link b
The deeper the tree gets, the more rollback needs to be done: each time, from the root to wherever the current tree position is.
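To make the cost difference concrete, here is a rough Python sketch that counts page loads for both strategies on the small tree from the example. The cost model is my own assumption, not a measurement: without a cache, a breadth-first visit to a page at depth d costs one reload of the root plus d clicks, while a depth-first visit costs one click down and one step back up. The function names and the `TREE` dictionary are made up for illustration.

```python
from collections import deque

# Hypothetical link tree matching the example: root has links a, b, c; a has d, e, f.
TREE = {"root": ["a", "b", "c"], "a": ["d", "e", "f"]}


def bfs_page_loads(tree, root):
    """Count page loads for a breadth-first crawl with no page cache.

    Assumed cost model: reaching a page at depth d means reloading the
    root (1 load) and re-clicking d links to get back into context.
    """
    loads = 0
    queue = deque([(root, 0)])  # FIFO list of (page, depth)
    while queue:
        page, depth = queue.popleft()
        if depth > 0:
            loads += 1 + depth  # rollback to root, then replay the path
        for child in tree.get(page, []):
            queue.append((child, depth + 1))
    return loads


def dfs_page_loads(tree, page):
    """Count page loads for a depth-first crawl.

    Assumed cost model: one load to click into a child, and one load to
    step back to the parent once the child's subtree is done.
    """
    loads = 0
    for child in tree.get(page, []):
        loads += 1                            # click the link
        loads += dfs_page_loads(tree, child)  # crawl below it
        loads += 1                            # step back to this page
    return loads


if __name__ == "__main__":
    print("breadth first:", bfs_page_loads(TREE, "root"))
    print("depth first:", dfs_page_loads(TREE, "root"))
```

Under these assumptions the breadth-first count grows with the depth of every page visited, while the depth-first count stays at two loads per page, so the gap widens as the tree gets deeper.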