Project
Phase 2: Crawling
the Gnuttela network from a single node
Late
policy: 33% point
deduction for each late day.
Story: You will implement a single-node
Gnutella network crawler (a program that automatically gathers
information on the nodes participating in the Gnutella network). Starting from a bootstrap node
your crawler should run for a specified period of time and collect information
about the nodes participating in the network and the files shared.
Your system should have two working modes
selectable by command line at startup: the “minimal”
and the “full” crawling modes.
In “minimal”
crawling mode you collect: for each Gnutella node: the IP, the ‘agent’, and its
status. Node status can be on of: (i) crawler discovered the node should exist but has
not tried (yet) to connect to it; (ii) crawler tried to initiate a TCP
connection but failed due to a timeout; (iii) crawler tried to initiate a TCP
connection but failed due to an exception different from timeout (multiple
exceptions are possible– e.g., unroutable IP address, connection rejected by
destination - try to provide some subclasses here); (iv) crawler succeeded
initiating a TCP connection but there was an error at the Gnutella processing
level (again, multiple exceptions are possible, try to provide subclasses) ;
(v) node successfully crawled.
At the end of the crawl the master should report
(a) the list of nodes discovered (IP, agent, status), and (b) aggregate
statistics: the number of nodes discovered (overall and in each of the above four
classes and subclasses); the discovery rate (overall and in each class) in
nodes/second;
[optional] In “full” crawling mode you collect everything you have collected for
minimal plus: the number of files shared by the nodes, file’s size, and the
files’ extensions.
At the end of the crawl the master should report, in
addition to the reporting for “minimal” mode aggregate statistics:
average/maximum number of files shared per node, average/minimum/maximum files
size, and what are the 10 most frequent file extensions. You are welcome to
collect more information, for example: the domain name of the nodes, file
popularity distributions, or (harder!) geographical node location, and present
elaborated statistics.
Notes:
·
We
are running a Gnutella node on reala.ece.ubc.ca:5627 as a bootstrap node to provide
you a stable entry point to the Gnutella network.
·
The
command line for your crawler should be: java crawler
<-full|-minimal> <timeout=XX> <host:port>
<time> where
‘host:port’ is the bootstrap point and ‘time’ is the
maximum time your crawler should run. Timeout parameter specifies the timeout
for crawling one node (default 30s).
·
To
simplify the implementation we provide sample code that connects to a node and
extracts some of the required information. Our main goal coding this piece was
to make it easy to understand – so feel free to improve as you see fit. The code is available through this web page. You can find a brief description of the
Gnutella protocol here.
·
There
is an OS dependent upper limit for the number of sockets (and implicitly TCP
connections) active at any time. If at runtime you exceed this limit your code
will generate exceptions – you’ll probably need to wait until your socket
related resources are freed by the system so that you can reuse them.
·
Gnutella
overlay topology is complex and includes cycles: make sure you do not crawl the same node multiple times.
·
You can implement calculation of aggregate statistics as a post-processing
step after crawling is done.
Grading
criteria
·
You
will be graded on code correctness (e.g., report correct results, ability to
avoid cycles) and quality (design).
Performance is not an issue at this stage.
An
option
·
You
are already familiar with blocking IO from previous assignments. While blocking
IO makes coding simpler it limits scalability as programs have to allocate one
thread for each open connection. On the other side, programming with
non-blocking IO is complex but enables efficient resource usage. This is especially
important for the final version of your crawler, as a large number of open
sockets increase the number of Gnutella servents (nodes) it can explore in
parallel.
For the next phase of the project you will need to choose
between using blocking and non-blocking IO. You can choose to implement using
non-blocking I/O from this phase but I do not recommend it: deal with complex
issues one at a time – first make a simple crawler work then optimize it for
performance. You can find introductory references to Java non-blocking IO
programming below.
Submissions:
Send to 202020m@gmail.com:
·
A
brief readme file listing your name, student ID, the project phase number.
·
A
list of main design choices and reasons to make them
·
Statistics
gathered after running your crawler for 15 minutes.
·
A
ZIP or TAR file containing your source files.
·
A
JAR file containing your compiled code.
·
To
test correctness of your crawlers, the day of the submission I will send you a
bootstrap point for my private (and small) Gnutella network you'll have to
crawl.
References:
1- Gnutella
protocol information: http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html
2-
Non-blocking
IO tutorials:
http://www.onjava.com/pub/a/onjava/2002/09/04/nio.html
http://java.sun.com/j2se/1.4.2/docs/guide/nio/example/index.html