Crawling the Gnutella Network
The Gnutella network (in its latest instantiation, protocol version 0.6) has a hierarchical structure with two types of nodes: Ultrapeers and Leaves. The Ultrapeers form the core of the network and act as proxies to the Gnutella network for the Leaves connected to them. Each Leaf connects to one or more Ultrapeers and uses them to search for files.
Guidelines:
In order to crawl a Gnutella node you should do the following:
To get information about the node's Ultrapeers and Leaves:
To get information about the files shared at the node:
You will need to study and follow the exact message formats of the Gnutella 0.6 protocol. A summary of the required messages is given in the sections below.
Crawling the Network for a List of Nodes
To crawl the network, open a TCP connection to the node and send a <GNUTELLA CONNECT … Crawler> message. The node will reply with a list of peers. The exact message exchange is presented below:
| Crawler | Servent |
|---|---|
| GNUTELLA CONNECT/0.6<CR><LF>User-Agent: UBCECE (crawl) <CR><LF>X-Ultrapeer: False<CR><LF>Query-Routing: 0.1<CR><LF>Crawler: 0.1<CR><LF><CR><LF> | |
| | GNUTELLA/0.6 200 OK<CR><LF>User-Agent: BearShare<CR><LF>Leaves: 127.0.0.1:6346,127.0.0.2:6346, 127.0.0.3:6346<CR><LF>Peers: 127.0.0.4:6346,127.0.0.5:6346, 127.0.0.6:6346 <CR><LF> |
| GNUTELLA/0.6 200 OK<CR><LF> <CR><LF> | |
| Disconnect | Disconnect |
The node's response contains a list of its Leaves (if the node is an Ultrapeer) and a list of the other Peers connected to it. Peers are the non-leaf clients, including Gnutella 0.6 Ultrapeers and old Gnutella 0.4-style clients.
In the crawler handshake message, replace the UBCECE string with a string that identifies your own crawler.
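The crawl exchange described above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function names (`build_crawl_handshake`, `parse_crawl_response`, `crawl_node`) are hypothetical, and the parser assumes the servent returns `Leaves` and `Peers` as comma-separated `IP:port` lists, as in the sample exchange.

```python
import socket

CRLF = "\r\n"


def build_crawl_handshake(user_agent: str) -> bytes:
    """Build the crawler's opening message (header values taken from the sample exchange)."""
    lines = [
        "GNUTELLA CONNECT/0.6",
        f"User-Agent: {user_agent}",
        "X-Ultrapeer: False",
        "Query-Routing: 0.1",
        "Crawler: 0.1",
    ]
    # The header block ends with an empty line.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")


def parse_crawl_response(raw: bytes) -> dict:
    """Extract the status line and the Leaves / Peers address lists."""
    lines = raw.decode("ascii", errors="replace").split(CRLF)
    result = {"status": lines[0].strip(), "leaves": [], "peers": []}
    for line in lines[1:]:
        # Split on the first colon only; addresses contain colons too.
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key in ("leaves", "peers"):
            result[key] = [a.strip() for a in value.split(",") if a.strip()]
    return result


def crawl_node(host: str, port: int, user_agent: str, timeout: float = 10.0) -> dict:
    """Run the exchange against one node (requires a live servent; untested sketch)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_crawl_handshake(user_agent))
        raw = b""
        # Read until the header block ends or the servent closes the connection.
        while b"\r\n\r\n" not in raw:
            chunk = sock.recv(4096)
            if not chunk:
                break
            raw += chunk
        try:
            sock.sendall(b"GNUTELLA/0.6 200 OK\r\n\r\n")
        except OSError:
            pass  # the servent may already have disconnected
    return parse_crawl_response(raw)
```

Each address returned in `leaves` and `peers` can then be queued for crawling in turn.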
Browsing Host Content
To retrieve the list of files shared by a specific Gnutella node, open a TCP connection to the node and send an HTTP GET request. (This is the same as a regular HTTP GET, with a small change in the content types requested.)
A typical message exchange is presented below. Note that the content type exchanged is <application/x-gnutella-packets>.
| Crawler | Client |
|---|---|
| GET / HTTP/1.1<CR><LF>Host: Crawler_IP:PORT<CR><LF>User-Agent: UBCECE <CR><LF>Accept: text/html , application/x-gnutella-packets<CR><LF>Connection: close<CR><LF><CR><LF> | |
| | HTTP/1.1 200 OK<CR><LF>Server: LimeWire/x.y<CR><LF>Content-Type: application/x-gnutella-packets <CR><LF>Connection: close<CR><LF><CR><LF><Actual Gnutella Query Response> |
| Disconnect | |
The Gnutella node responds with an HTTP message containing a list of the files shared at that node. For all implementations except BearShare, the list is formatted as a sequence of Gnutella Query Hit messages (described in the next section). BearShare sends the list as an HTML page.
For this assignment you are required to process only the Query Hit responses, not the HTML responses. List BearShare nodes as sharing zero files.
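A minimal sketch of building the Browse Host request and splitting the reply into headers and body, assuming the whole reply has already been read from the socket into a byte string. The helper names are illustrative only, and `host_header` is copied unchanged into the `Host` line.

```python
CRLF = "\r\n"
GNUTELLA_PACKETS = "application/x-gnutella-packets"


def build_browse_request(host_header: str, user_agent: str) -> bytes:
    """Build the HTTP GET used to browse a host (headers as in the sample exchange)."""
    lines = [
        "GET / HTTP/1.1",
        f"Host: {host_header}",
        f"User-Agent: {user_agent}",
        f"Accept: text/html, {GNUTELLA_PACKETS}",
        "Connection: close",
    ]
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")


def split_http_response(raw: bytes):
    """Return (status_line, headers, body); header names are lower-cased."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split(CRLF)
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


def is_gnutella_packets(headers: dict) -> bool:
    """True when the body carries Query Hit messages rather than HTML."""
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    return content_type == GNUTELLA_PACKETS
```

A node whose reply is not `application/x-gnutella-packets` (for example, an HTML page from BearShare) would then be recorded as sharing zero files.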
Gnutella Standard Message Format:
Once a servent (Gnutella node) has connected successfully to the network, it communicates with other servents by sending and receiving Gnutella protocol messages. Each message is preceded by a Message Header with the byte structure given below.
Message Header
The message header is 23 bytes long and is divided into the following fields.
| Bytes | Description |
|---|---|
| 0-15 | Message ID/GUID (Globally Unique ID) |
| 16 | Payload Type |
| 17 | TTL (Time To Live) |
| 18 | Hops |
| 19-22 | Payload Length |
Message ID: A 16-byte string (GUID) uniquely identifying the message on the network.
Payload Type: Indicates the type of message. Gnutella servents MUST accept all of the following types:
| Type | Message |
|---|---|
| 0x00 | Ping |
| 0x01 | Pong |
| 0x02 | Bye |
| 0x40 | Push |
| 0x80 | Query |
| 0x81 | Query Hit |
TTL: Time To Live.
Hops: The number of times the message has been forwarded.
Payload Length: The length (in bytes) of the payload immediately following this header.
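The 23-byte header layout above can be unpacked with Python's `struct` module. This is a sketch under one stated assumption: Payload Length is read as a little-endian 32-bit integer, following the Gnutella specification's general byte-order rule rather than anything stated in this handout. The function name and dictionary keys are illustrative, and 0x00 is mapped to Ping as in the Gnutella specification.

```python
import struct

HEADER_LEN = 23

# Payload type codes and their message names.
PAYLOAD_TYPES = {
    0x00: "Ping",
    0x01: "Pong",
    0x02: "Bye",
    0x40: "Push",
    0x80: "Query",
    0x81: "Query Hit",
}


def parse_header(data: bytes) -> dict:
    """Decode the 23-byte message header found at the start of `data`."""
    if len(data) < HEADER_LEN:
        raise ValueError("need at least 23 bytes for a message header")
    # "<" = little-endian, no padding: 16-byte GUID, three single bytes,
    # then a 4-byte unsigned length (byte order is an assumption, see above).
    guid, payload_type, ttl, hops, length = struct.unpack(
        "<16sBBBI", data[:HEADER_LEN]
    )
    return {
        "guid": guid,
        "payload_type": payload_type,
        "type_name": PAYLOAD_TYPES.get(payload_type, "Unknown"),
        "ttl": ttl,
        "hops": hops,
        "payload_length": length,
    }
```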
Immediately following the message header is the payload. The message carrying the query response is the Query Hit message (0x81), so the discussion here focuses on that message only. (Note: BearShare uses HTML instead of this message format.)
In this assignment you will mainly be interested in extracting file names and sizes. A Query Hit message has the following fields:
| Bytes | Field name | Description |
|---|---|---|
| 0 | Number of Hits | The number of query hits in the result set. |
| 1-2 | Port | The port number on which the responding host can accept incoming HTTP file requests. This is usually the same port as is used for Gnutella network traffic, but any port MAY be used. |
| 3-6 | IP Address | The IP address of the responding host. Note: This field is in big-endian format. |
| 7-10 | Speed | The speed (in kb/second) of the responding host. |
| 11- | Result Set | A set of responses to the corresponding Query. This set contains Number_of_Hits elements. See below for the format. |
| x | Extended QHD | This block is not strictly required, but strongly recommended. It is sometimes called EQHD, or (incorrectly) just QHD. |
| x | Private Data | Undocumented vendor-specific data. This field continues until the Servent Identifier, which uses the last 16 bytes of the message. |
| Last 16 | Servent Identifier | A 16-byte string uniquely identifying the responding servent on the network. |
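As a rough illustration of the layout above, the fixed-position fields of a Query Hit payload can be read as follows. The sketch assumes Port and Speed are little-endian (the Gnutella default byte order, which this handout does not state explicitly), while the IP address is taken byte by byte in the big-endian order noted in the table. The function name and keys are hypothetical.

```python
import struct

RESULT_SET_OFFSET = 11   # the result set starts right after the Speed field
SERVENT_ID_LEN = 16      # the Servent Identifier is the last 16 bytes


def parse_query_hit_fixed(payload: bytes) -> dict:
    """Read the fixed fields at both ends of a Query Hit payload."""
    if len(payload) < RESULT_SET_OFFSET + SERVENT_ID_LEN:
        raise ValueError("payload too short for a Query Hit")
    # Byte 0: number of hits; bytes 1-2: port (assumed little-endian).
    hits, port = struct.unpack_from("<BH", payload, 0)
    # Bytes 3-6: IP address, most significant octet first.
    ip = ".".join(str(b) for b in payload[3:7])
    # Bytes 7-10: speed (assumed little-endian).
    (speed,) = struct.unpack_from("<I", payload, 7)
    return {
        "hits": hits,
        "port": port,
        "ip": ip,
        "speed": speed,
        "servent_id": payload[-SERVENT_ID_LEN:],
    }
```

Everything between offset 11 and the last 16 bytes holds the result set, followed by the optional Extended QHD and private data.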
Query Hit Result Item
Each item contained in the query hit result set is structured as follows:
| Bytes | Field name | Description |
|---|---|---|
| 0-3 | File Index | A number, assigned by the responding host, which is used to uniquely identify the file matching the corresponding query. |
| 4-7 | File Size | The size (in bytes) of the file whose index is "File Index". For large files whose size cannot be expressed with a 32-bit integer, a GGEP LF block can be used in the extensions block. |
| 8- | File Name | The name of the file whose index is "File Index". Terminated by a null byte (i.e. 0x00). |
| x | Extensions Block | Allowed extension types are HUGE, GGEP and plain text metadata. This field is terminated by a null (0x00), even if there are no extensions (resulting in a double null). Also, the extensions block itself MUST NOT contain any null bytes. |
You can ignore the contents of the extensions block for this assignment, but you must still know how to parse it so that you can skip over it and find the next result item.
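One way to walk the result set is sketched below. It relies on the two null terminators described above: the first ends the file name and the second ends the extensions block. The integer fields are assumed to be little-endian and file names are decoded as UTF-8 with replacement of invalid bytes; both choices, and the function name, are assumptions made for illustration.

```python
import struct


def parse_result_set(payload: bytes, count: int, offset: int = 11):
    """Parse `count` result items starting at `offset` in a Query Hit payload.

    Returns (items, end_offset), where each item is a tuple
    (file_index, file_size, file_name, raw_extensions).
    """
    items = []
    for _ in range(count):
        # Bytes 0-3: file index; bytes 4-7: file size (assumed little-endian).
        file_index, file_size = struct.unpack_from("<II", payload, offset)
        offset += 8
        # The file name runs up to the first null byte.
        name_end = payload.index(b"\x00", offset)
        file_name = payload[offset:name_end].decode("utf-8", errors="replace")
        # The extensions block runs up to the next null byte; it is empty
        # when the two nulls are adjacent (the "double null" case).
        ext_end = payload.index(b"\x00", name_end + 1)
        extensions = payload[name_end + 1:ext_end]
        items.append((file_index, file_size, file_name, extensions))
        offset = ext_end + 1
    return items, offset
```

The returned `end_offset` marks where the optional Extended QHD and private data begin, which a crawler interested only in names and sizes can ignore.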
Note that the HTTP GET reply may contain multiple Query Hit messages with no separator between them, so you must rely on the Payload Length field in each message header to find where one message ends and the next begins.
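Splitting the HTTP body into individual messages could look like the sketch below. It reads only the Payload Type (byte 16) and Payload Length (bytes 19-22, assumed little-endian) from each 23-byte header; the generator name is hypothetical.

```python
import struct

HEADER_LEN = 23
QUERY_HIT = 0x81


def iter_messages(body: bytes):
    """Yield (payload_type, payload) for each message packed back to back in `body`."""
    offset = 0
    # Stop when fewer than 23 bytes remain; such a tail cannot hold a header.
    while offset + HEADER_LEN <= len(body):
        payload_type = body[offset + 16]
        (length,) = struct.unpack_from("<I", body, offset + 19)
        start = offset + HEADER_LEN
        end = start + length
        if end > len(body):
            raise ValueError("message payload is truncated")
        yield payload_type, body[start:end]
        # The next header begins immediately after this payload.
        offset = end
```

For example, `sum(p[0] for t, p in iter_messages(body) if t == QUERY_HIT and p)` adds up the Number of Hits byte of every Query Hit payload to give the node's shared-file count.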