Project Phase 3: A master/worker crawler for the Gnutella network.

 

Late policy: no late submissions accepted.

 

Story: You will extend you Gnutella network crawler to harness multiple crawler nodes and employ a master/worker architecture.

 

Starting from the same bootstrapping point as in the previous project phase (reala.ece.ubc.ca:5627) your crawler should attempt to crawl the Gnutella network as fast as possible for a specified period of time and collect information about the nodes participating in the network and the files shared.

 

Similar to previous phase, your system should have two working modes selectable by command line at startup: the “minimal” and the “full” crawling modes. 

 

In minimal” crawling mode you collect: for each Gnutella node: the IP, the ‘agent’, whether the node is an untrapeer or not, and its status. Please see the description of the previous project phase for a classification of node statuses.

 

At the end of the crawl the master should report (a) the list of nodes discovered (IP, agent, status, ultrapeer), and (b) aggregate statistics: the number of nodes discovered (overall and, by status, in each class/subclass); the discovery rate (overall and in each class) in nodes/second.

 

To make results comparable your search should prioritize discovering and crawling supperpeers.

 

A node is considered ‘crawled’ after you have tried to connect it

 

[optional] In “full” crawling mode you collect everything you have collected for the “minimal” mode plus: the number of files shared by the nodes, file’s size, and the files’ extensions.

 

At the end of the crawl the master should report, in addition to the reporting for “minimal” mode aggregate statistics: average/maximum number of files shared per node, average/minimum/maximum files size, and what are the 10 most frequent file extensions. You are welcome to collect more information, for example: the domain name of the nodes, file popularity distributions, or (harder!) geographical node location, and present elaborated statistics.

 

 

Detailed requirements and guidelines

 

For full points:

·         Your crawler nodes should use non-blocking IO for the worker node. For performance/scalability you may also decide to use non-blocking for the master as well.

·         The system should be able to deal with master node failures. (One way to deal with this is a primary/backup replication scheme).

 

The master node:

·         The master should be designed in a way to manage an unspecified number of workers and assign work to each of them.

·         Also the master should accept new workers that may start late and gracegully deal with failures

·         If the user decides to cancel the crawl, the master should have a way to stop the crawling of all workers if not immediately then at least within a reasonable time interval.

·         The master should avoid assigning duplicate work to workers.

 

The worker nodes:

·         The worker nodes can be based on the node you implemented in phase 2 of the project plus some additional code to communicate with the master.  Note, however, that for full points workers have to use non-blocking I/O

·         Workers should avoid performing duplicate work.

 

Scaling requirements: You should aim to have your crawler working on at least 100 PlanetLab nodes (for this you might need to deploy it on a few more nodes since some PlanetLab nodes are unstable).  Also you should design this to scale to the size of your target: the fnutella network has an estimate of 1-2million nodes.

 

Keep in mind that you share the PlanetLab node with many other people, so these nodes tend to be slow and have limited resources. Consequently you should minimize the state kept by your workers and limit the number of open connections you use.

 

To reduce the stress we generate on the PlanetLab platform, and, more importantly, to make results comparable across projects your implementation and deployment should follow the following constraints:

§     Deployment: all components of your system (master and workers) should run on PlanetLab

§     Memory usage: The workers should limit their memory usage to at most 64MB, and the master (or masters) should not use more than 300MB of memory each. This limit can be enforced by setting the JVM maximum heap size at startup by using:

 

java –Xmx32m <your-crawler-command-line>

 

64MB of memory should be more than enough for the worker nodes to maintain the necessary state. For the master (or masters) 300MB should be enough to represent representing the minimal data that needs to be kept ‘hot’, in-memory for a few million nodes. You can always dump the rest of the data on disk and have it ready for post-processing.

§     Number of concurrent open sockets: the workers should never have more than 200 open connections at any point in time.

§     TCP Timeouts: configure the socket created by the crawler to have a default timeout of 30 seconds. The timeout for TCP connections should be a command line argument.

 

 

Marking

·         You will be marked on:

o        Report describing your design choices

o        In-class presentation of your design and performance results

o        Code performance, code correctness (e.g., ability to avoid cycles, to stop the crawling, to integrate workers that join or fail dynamically during the computation), ability to deal with server failures, and code quality (code design).

·         The performance criteria:

o        Scalability and performance of the whole solution: number of distinct Gnutella nodes (superpeers and regular nodes) you discover per hour (reported from a crawl that is at least 30min long), number of workers you actually handle, ability to crawl a large part of the network, etc.

1.     The primary performance metric we ask everyone to report (and you should primarily optimize for): the number of Gnutella supperpeers you discover and you actually crawl.

o        Performance of an individual worker node: average number of IPs discovered per worker node per hour

·         If you choose to build your worker node using blocking IO, your code will be marked out of 65%, while if you use non-blocking IO for your workers you will be graded out of 100%.  The master can use either blocking or non-blocking IO.

·         If your system does not survive master failures then you lose 10% of the mark for this phase.

 

Submissions:

Send to 202020m@gmail.com:

·         A brief readme file containing

o        Group number, your names, student IDs

o        A report outlining your main design choices and reasons to make them

·         Statistics gathered after running your crawler for 30 minutes in “minimal” crawling mode and in “full” crawling mode (if implemented).

 

·         A ZIP or TAR file containing your source files.

·         JAR files containing your compiled code.

 

 

References:

1-    A similar project:

 

Characterizing Files in the Modern Gnutella Network
Daniel Stutzbach, Shanyu Zhao, Reza Rejaie

     More papers from the same group: http://www.cs.uoregon.edu/~reza/PUB/

 

2-     Gnutella protocol information: http://gnutella-specs.rakjar.de/index.php/Main_Page

3-    Non-blocking IO tutorials:

http://www.onjava.com/pub/a/onjava/2002/09/04/nio.html

http://java.sun.com/j2se/1.4.2/docs/guide/nio/example/index.html

http://javanio.info/