Project Phase 2: Crawling the Gnuttela network from a single node

 

Late policy: 33% point deduction for each late day.

 

Story: You will implement a single-node Gnutella network crawler (a program that automatically gathers information on the nodes participating in the Gnutella network). Starting from a bootstrap node your crawler should run for a specified period of time and collect information about the nodes participating in the network and the files shared.

 

Your system should have two working modes selectable by command line at startup: the “minimal” and the “full” crawling modes. 

 

In “minimal” crawling mode you collect: for each Gnutella node: the IP, the ‘agent’, and its status. Node status can be on of: (i) crawler discovered the node should exist but has not tried (yet) to connect to it; (ii) crawler tried to initiate a TCP connection but failed due to a timeout; (iii) crawler tried to initiate a TCP connection but failed due to an exception different from timeout (multiple exceptions are possible– e.g., unroutable IP address, connection rejected by destination - try to provide some subclasses here); (iv) crawler succeeded initiating a TCP connection but there was an error at the Gnutella processing level (again, multiple exceptions are possible, try to provide subclasses) ; (v) node successfully crawled.

 

At the end of the crawl the master should report (a) the list of nodes discovered (IP, agent, status), and (b) aggregate statistics: the number of nodes discovered (overall and in each of the above four classes and subclasses); the discovery rate (overall and in each class) in nodes/second;

 

 

[optional] In “full” crawling mode you collect everything you have collected for minimal plus: the number of files shared by the nodes, file’s size, and the files’ extensions.

 

At the end of the crawl the master should report, in addition to the reporting for “minimal” mode aggregate statistics: average/maximum number of files shared per node, average/minimum/maximum files size, and what are the 10 most frequent file extensions. You are welcome to collect more information, for example: the domain name of the nodes, file popularity distributions, or (harder!) geographical node location, and present elaborated statistics.

 

 

Notes:

·         We are running a Gnutella node on reala.ece.ubc.ca:5627 as a bootstrap node to provide you a stable entry point to the Gnutella network.

·         The command line for your crawler should be: java crawler <-full|-minimal> <timeout=XX> <host:port> <time> where ‘host:port’ is the bootstrap point and ‘time’ is the maximum time your crawler should run. Timeout parameter specifies the timeout for crawling one node (default 30s).

·         To simplify the implementation we provide sample code that connects to a node and extracts some of the required information. Our main goal coding this piece was to make it easy to understand – so feel free to improve as you see fit.  The code is available through this web page.  You can find a brief description of the Gnutella protocol here.

·         There is an OS dependent upper limit for the number of sockets (and implicitly TCP connections) active at any time. If at runtime you exceed this limit your code will generate exceptions – you’ll probably need to wait until your socket related resources are freed by the system so that you can reuse them.

·        Gnutella overlay topology is complex and includes cycles: make sure you do not crawl the same node multiple times.

·         You can implement calculation of aggregate statistics as a post-processing step after crawling is done.

 

 

Grading criteria

·         You will be graded on code correctness (e.g., report correct results, ability to avoid cycles) and quality (design).  Performance is not an issue at this stage.

 

 

An option

 

·         You are already familiar with blocking IO from previous assignments. While blocking IO makes coding simpler it limits scalability as programs have to allocate one thread for each open connection. On the other side, programming with non-blocking IO is complex but enables efficient resource usage. This is especially important for the final version of your crawler, as a large number of open sockets increase the number of Gnutella servents (nodes) it can explore in parallel. 

 

For the next phase of the project you will need to choose between using blocking and non-blocking IO. You can choose to implement using non-blocking I/O from this phase but I do not recommend it: deal with complex issues one at a time – first make a simple crawler work then optimize it for performance. You can find introductory references to Java non-blocking IO programming below.

 

 

Submissions:

Send to 202020m@gmail.com:

·         A brief readme file listing your name, student ID, the project phase number.

·         A list of main design choices and reasons to make them

·         Statistics gathered after running your crawler for 15 minutes. 

·         A ZIP or TAR file containing your source files.

·         A JAR file containing your compiled code.

·         To test correctness of your crawlers, the day of the submission I will send you a bootstrap point for my private (and small) Gnutella network you'll have to crawl.

 

 

References:

1-     Gnutella protocol information: http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html

2-     Non-blocking IO tutorials:

http://www.onjava.com/pub/a/onjava/2002/09/04/nio.html

http://java.sun.com/j2se/1.4.2/docs/guide/nio/example/index.html