Project Phase 3: A master/worker crawler for the Gnutella network.
Late
policy: no late
submissions accepted.
Story: You will extend you Gnutella
network crawler to harness multiple crawler nodes and employ a
master/worker architecture.
Starting
from the same bootstrapping point as in the previous project phase (reala.ece.ubc.ca:5627)
your crawler should attempt to crawl the Gnutella network as fast as possible
for a specified period of time and collect information about the nodes
participating in the network and the files shared.
Similar
to previous phase, your system should have two working modes selectable by
command line at startup: the “minimal” and the “full” crawling modes.
In “minimal” crawling mode you collect: for each Gnutella node: the
IP, the ‘agent’, whether the node is an untrapeer or
not, and its status. Please see the description of the previous project phase
for a classification of node statuses.
At the end of the crawl the master should report
(a) the list of nodes discovered (IP, agent, status, ultrapeer),
and (b) aggregate statistics: the number of nodes discovered (overall and,
by status, in each class/subclass); the discovery rate (overall and in each
class) in nodes/second.
To make results comparable your search should
prioritize discovering and crawling supperpeers.
A node is considered ‘crawled’ after you have tried
to connect it
[optional] In “full” crawling mode you collect everything you
have collected for the “minimal” mode plus: the number of files shared by the
nodes, file’s size, and the files’ extensions.
At the end of the crawl the master should report,
in addition to the reporting for “minimal” mode aggregate statistics:
average/maximum number of files shared per node, average/minimum/maximum files
size, and what are the 10 most frequent file extensions. You are welcome to
collect more information, for example: the domain name of the nodes, file popularity
distributions, or (harder!) geographical node location, and present elaborated
statistics.
Detailed
requirements and guidelines
For
full points:
·
Your
crawler nodes should use non-blocking IO for the worker node. For performance/scalability
you may also decide to use non-blocking for the master as well.
·
The
system should be able to deal with master node failures. (One way to deal with
this is a primary/backup replication scheme).
The
master node:
·
The
master should be designed in a way to manage an unspecified number of workers
and assign work to each of them.
·
Also
the master should accept new workers that may start late and gracegully deal with failures
·
If
the user decides to cancel the crawl, the master should have a way to stop the
crawling of all workers if not immediately then at least within a reasonable
time interval.
·
The
master should avoid assigning duplicate work to workers.
The
worker nodes:
·
The
worker nodes can be based on the node you implemented in phase 2 of the project
plus some additional code to communicate with the master. Note, however, that for full points workers
have to use non-blocking I/O
·
Workers
should avoid performing duplicate work.
Scaling
requirements: You
should aim to have your crawler working on at least 100 PlanetLab
nodes (for this you might need to deploy it on a few more nodes since some PlanetLab nodes are unstable). Also you should design this to scale to the
size of your target: the fnutella network has an
estimate of 1-2million nodes.
Keep in mind that you share the PlanetLab node with many other people, so these nodes tend
to be slow and have limited resources. Consequently you should minimize the
state kept by your workers and limit the number of open connections you use.
To reduce
the stress we generate on the PlanetLab platform,
and, more importantly, to make results comparable across projects your
implementation and deployment should follow the following constraints:
§ Deployment: all components of your system
(master and workers) should run on PlanetLab
§
Memory usage: The workers should limit their
memory usage to at most 64MB, and the master (or masters) should not use more
than 300MB of memory each. This limit can be enforced by setting the JVM
maximum heap size at startup by using:
java –Xmx32m
<your-crawler-command-line>
64MB of memory should be more than
enough for the worker nodes to maintain the necessary state. For the master (or
masters) 300MB should be enough to represent representing the minimal data that
needs to be kept ‘hot’, in-memory for a few million nodes. You can always dump
the rest of the data on disk and have it ready for post-processing.
§
Number of concurrent open sockets: the workers should never have
more than 200 open connections at any point in time.
§
TCP Timeouts: configure the socket created by
the crawler to have a default timeout of 30 seconds. The timeout for TCP
connections should be a command line argument.
Marking
·
You
will be marked on:
o
Report
describing your design choices
o
In-class
presentation of your design and performance results
o
Code
performance, code correctness (e.g., ability to avoid cycles, to stop the crawling,
to integrate workers that join or fail dynamically during the computation),
ability to deal with server failures, and code quality (code design).
·
The
performance criteria:
o
Scalability
and performance of the whole solution: number of distinct Gnutella nodes (superpeers and regular nodes) you discover per hour
(reported from a crawl that is at least 30min long), number of workers you
actually handle, ability to crawl a large part of the network, etc.
1. The primary performance metric we ask everyone to report (and you
should primarily optimize for): the number of Gnutella supperpeers
you discover and you actually crawl.
o
Performance
of an individual worker node: average number of IPs
discovered per worker node per hour
·
If
you choose to build your worker node using blocking IO, your code will be
marked out of 65%, while if you use non-blocking IO for your workers you will
be graded out of 100%. The master can
use either blocking or non-blocking IO.
·
If
your system does not survive master failures then you lose 10% of the mark for
this phase.
Submissions:
Send to 202020m@gmail.com:
·
A
brief readme file containing
o
Group
number, your names, student IDs
o
A
report outlining your main design choices and reasons to make them
·
Statistics
gathered after running your crawler for 30 minutes in “minimal” crawling mode
and in “full” crawling mode (if implemented).
·
A
ZIP or TAR file containing your source files.
·
JAR
files containing your compiled code.
References:
1- A
similar project:
Characterizing
Files in the Modern Gnutella Network
Daniel Stutzbach, Shanyu
Zhao, Reza Rejaie
More papers from the same group: http://www.cs.uoregon.edu/~reza/PUB/
2- Gnutella
protocol information: http://gnutella-specs.rakjar.de/index.php/Main_Page
3-
Non-blocking
IO tutorials:
http://www.onjava.com/pub/a/onjava/2002/09/04/nio.html
http://java.sun.com/j2se/1.4.2/docs/guide/nio/example/index.html