Crawling the Gnutella Network

 

The Gnutella network (in its latest instantiation, protocol version 0.6) has a hierarchical structure with two types of nodes: Ultrapeers and Leaves. The Ultrapeers form the core of the network and act as proxies to the Gnutella network for the Leaves connected to them. The Leaves connect to one or more Ultrapeers and use them to search for files.

 

Guidelines:

In order to crawl a Gnutella node you should do the following:

 

To get information about node’s Ultrapeers and Leaves:

 

To get information about the shared files at the node:

 

You will need to study and follow the exact message formats of the Gnutella 0.6 protocol. A summary of the messages you need is given in the following sections.

 

 

Crawling the Network for a List of Nodes

To crawl the network, open a TCP connection to the node and send a <GNUTELLA CONNECT … Crawler> message. The node will reply with a list of peers. The exact message exchange is presented below:

 

Crawler to Servent:

GNUTELLA CONNECT/0.6<CR><LF>
User-Agent: UBCECE (crawl)<CR><LF>
X-Ultrapeer: False<CR><LF>
Query-Routing: 0.1<CR><LF>
Crawler: 0.1<CR><LF>
<CR><LF>

Servent to Crawler:

GNUTELLA/0.6 200 OK<CR><LF>
User-Agent: BearShare<CR><LF>
Leaves: 127.0.0.1:6346,127.0.0.2:6346, 127.0.0.3:6346<CR><LF>
Peers: 127.0.0.4:6346,127.0.0.5:6346, 127.0.0.6:6346<CR><LF>

Crawler to Servent:

GNUTELLA/0.6 200 OK<CR><LF>
<CR><LF>

Both sides then disconnect.

 

The node's response contains a list of its Leaves (if the node is an Ultrapeer) and a list of the other Peers connected to it. Peers are the non-leaf clients, including Gnutella 0.6 Ultrapeers and old Gnutella 0.4 style clients.

 

In the crawler handshake message, replace the UBCECE string with a string that identifies your own crawler.
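The handshake above could be sketched in Python roughly as follows. This is a minimal sketch, not a reference implementation: the function names build_crawl_request and parse_crawl_response are hypothetical, the header lines are copied from the exchange shown above, and the parser assumes that Leaves and Peers are comma-separated host:port pairs, as in that example.

```python
def build_crawl_request(user_agent: str) -> bytes:
    """Build the crawler handshake request (headers taken from the example exchange)."""
    lines = [
        "GNUTELLA CONNECT/0.6",
        f"User-Agent: {user_agent}",
        "X-Ultrapeer: False",
        "Query-Routing: 0.1",
        "Crawler: 0.1",
        "",  # the two empty strings produce the terminating <CR><LF><CR><LF>
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_crawl_response(raw: bytes):
    """Return (status_line, leaves, peers) from a servent's handshake reply."""
    lines = raw.decode("latin-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        if not line:
            break  # an empty line ends the header block
        name, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            headers[name.strip().lower()] = value.strip()

    def split_addresses(value: str):
        # Assumption: addresses are comma-separated, with optional spaces.
        return [item.strip() for item in value.split(",") if item.strip()]

    leaves = split_addresses(headers.get("leaves", ""))
    peers = split_addresses(headers.get("peers", ""))
    return lines[0], leaves, peers
```

A crawler would send the bytes from build_crawl_request over the TCP connection, read the reply, and add the returned leaves and peers to its list of nodes still to visit.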

 

Browsing Host Content

To retrieve the list of files shared at a specific Gnutella node, open a TCP connection to the node and send an HTTP GET request. (This is the same as a regular HTTP GET, with a small change in the content types requested.)

 

A typical message exchange is presented below. Note that the content type exchanged is <application/x-gnutella-packets>.

 

Crawler to Client:

GET / HTTP/1.1<CR><LF>
Host: Crawler_IP:PORT<CR><LF>
User-Agent: UBCECE<CR><LF>
Accept: text/html, application/x-gnutella-packets<CR><LF>
Connection: close<CR><LF>
<CR><LF>

Client to Crawler:

HTTP/1.1 200 OK<CR><LF>
Server: LimeWire/x.y<CR><LF>
Content-Type: application/x-gnutella-packets<CR><LF>
Connection: close<CR><LF>
<CR><LF>
<Actual Gnutella Query Response>

Disconnect

 

 

The Gnutella node responds with an HTTP message containing the list of files shared at that node. For all implementations except BearShare, the list is formatted as one or more Gnutella Query Hit messages (described in the next section). BearShare sends the list as an HTML page instead.

 

For this assignment you are required to process only the Query Hit responses, not the HTML responses. List BearShare nodes as sharing zero files.
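The browse-host exchange could be sketched in Python as shown below. This is only an illustration under stated assumptions: the function names are hypothetical, the request headers are copied from the example exchange, and the response is assumed to have been read in full (the server closes the connection when it is done).

```python
def build_browse_request(host_value: str, user_agent: str) -> bytes:
    """Build the HTTP GET used to ask a node for its shared-file list."""
    lines = [
        "GET / HTTP/1.1",
        f"Host: {host_value}",  # value supplied by the caller, as in the example
        f"User-Agent: {user_agent}",
        "Accept: text/html, application/x-gnutella-packets",
        "Connection: close",
        "",  # the two empty strings produce the terminating <CR><LF><CR><LF>
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_browse_response(raw: bytes):
    """Split a complete HTTP response into (status_line, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


def is_gnutella_packets(headers: dict) -> bool:
    """True if the body holds Query Hit messages rather than HTML."""
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    return content_type == "application/x-gnutella-packets"
```

When is_gnutella_packets returns False (for example, a BearShare node replying with text/html), the crawler would record the node as sharing zero files, as required above.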

 

 

Gnutella Standard Message Format:

Once a servent (Gnutella Node) has connected successfully to the network, it communicates with other servents by sending and receiving Gnutella protocol messages. Each message is preceded by a Message Header with the byte structure given below.

 

Message Header

The message header is 23 bytes divided into the following fields.

 

Bytes     Description
0-15      Message ID/GUID (Globally Unique ID)
16        Payload Type
17        TTL (Time To Live)
18        Hops
19-22     Payload Length

 

Message ID: A 16-byte string (GUID) uniquely identifying the message on the network.

Payload Type: Indicates the type of message. Gnutella servents MUST accept all the following types:

 

Type      Message
0x00      Ping
0x01      Pong
0x02      Bye
0x40      Push
0x80      Query
0x81      Query Hit

 

TTL: Time To Live.

Hops: The number of times the message has been forwarded.

Payload Length: The length of the message immediately following this header.
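The 23-byte header layout described above maps directly onto a fixed-size unpack. The sketch below uses Python's struct module; the names Header, PAYLOAD_TYPES and parse_header are hypothetical. It assumes the Payload Length field is little-endian, which this document does not state, so verify the byte order against the protocol specification.

```python
import struct
from collections import namedtuple

HEADER_SIZE = 23  # bytes, as stated above

Header = namedtuple("Header", "guid payload_type ttl hops payload_length")

# Payload types listed in the table above.
PAYLOAD_TYPES = {
    0x00: "Ping",
    0x01: "Pong",
    0x02: "Bye",
    0x40: "Push",
    0x80: "Query",
    0x81: "Query Hit",
}


def parse_header(data: bytes) -> Header:
    """Decode the first 23 bytes of data as a Gnutella message header."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 23 bytes for a message header")
    # "<" = no padding, little-endian (ASSUMPTION for the 4-byte length field)
    # 16s = GUID, B = payload type, B = TTL, B = hops, I = payload length
    fields = struct.unpack("<16sBBBI", data[:HEADER_SIZE])
    return Header(*fields)
```

For example, a header whose byte 16 is 0x81 decodes to payload_type 0x81, which PAYLOAD_TYPES maps to "Query Hit".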

 

The payload immediately follows the message header. The message that carries the Query response is the Query Hit message (0x81), so the discussion here focuses on that message only. (Note: BearShare uses HTML instead of this message format.)

 

 

Query Hit (0x81)

Query Hit messages have the following fields. For this assignment you will mainly be interested in extracting file names and sizes.

 

Bytes     Field name           Description
0         Number of Hits       The number of query hits in the result set.
1-2       Port                 The port number on which the responding host can accept incoming HTTP file requests. This is usually the same port as is used for Gnutella network traffic, but any port MAY be used.
3-6       IP Address           The IP address of the responding host. Note: This field is in big-endian format.
7-10      Speed                The speed (in kb/second) of the responding host.
11-       Result Set           A set of responses to the corresponding Query. This set contains Number_of_Hits elements. See below for the format.
x         Extended QHD         This block is not strictly required, but strongly recommended. It is sometimes called EQHD, or (incorrectly) just QHD.
x         Private Data         Undocumented vendor-specific data. This field continues till the Servent Identifier, which uses the last 16 bytes of the message.
Last 16   Servent Identifier   A 16-byte string uniquely identifying the responding servent on the network.
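The fixed-position fields of a Query Hit payload (everything except the variable-length Result Set, Extended QHD and Private Data) could be extracted as in the sketch below. The function name parse_query_hit_fixed is hypothetical. The IP address is read as big-endian, as noted in the table; reading Port and Speed as little-endian is an assumption not stated in this document.

```python
import struct


def parse_query_hit_fixed(payload: bytes):
    """Return (number_of_hits, port, ip, speed, servent_id) from a Query Hit payload."""
    if len(payload) < 11 + 16:
        raise ValueError("payload too short for a Query Hit")
    number_of_hits = payload[0]
    # ASSUMPTION: Port and Speed are little-endian.
    (port,) = struct.unpack_from("<H", payload, 1)
    # The IP address is big-endian, so the bytes appear in dotted-quad order.
    ip = ".".join(str(byte) for byte in payload[3:7])
    (speed,) = struct.unpack_from("<I", payload, 7)
    # The Servent Identifier always occupies the last 16 bytes.
    servent_id = payload[-16:]
    return number_of_hits, port, ip, speed, servent_id
```

The Result Set starts at byte 11, so a parser for the individual result items would begin at that offset.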

 

Query Hit Result Item

Each item contained in the query hit result is structured as follows:

 

Bytes     Field name          Description
0-3       File Index          A number, assigned by the responding host, which is used to uniquely identify the file matching the corresponding query.
4-7       File Size           The size (in bytes) of the file whose index is "File Index". For large files whose size cannot be expressed with a 32-bit integer, a GGEP LF block can be used in the extensions block.
8-        File Name           The name of the file whose index is "File Index". Terminated by a null byte (i.e. 0x00).
x         Extensions block    Allowed extension types are HUGE, GGEP and plain text metadata. This field is terminated by a null (0x00), even if there are no extensions (resulting in a double null). Also, the extensions block itself MUST NOT contain any null bytes.

 

You can ignore the contents of the extensions block for this assignment, but you must know how to parse it so that you can skip over it and find the start of the next result item.
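One way to walk the result items is sketched below: read the two 4-byte integers, read the file name up to its null terminator, then skip the extensions block up to its own null terminator. The function name parse_result_set is hypothetical. Two assumptions are not stated in this document: File Index and File Size are read as little-endian, and file names are decoded as UTF-8 with invalid bytes replaced.

```python
import struct

RESULT_SET_OFFSET = 11  # the Result Set starts at byte 11 of the Query Hit payload


def parse_result_set(payload: bytes):
    """Return ([(file_index, file_size, file_name), ...], offset_after_last_item)."""
    number_of_hits = payload[0]
    offset = RESULT_SET_OFFSET
    results = []
    for _ in range(number_of_hits):
        # ASSUMPTION: both 32-bit integers are little-endian.
        file_index, file_size = struct.unpack_from("<II", payload, offset)
        offset += 8
        # The file name runs up to the first null byte.
        name_end = payload.index(b"\x00", offset)
        # ASSUMPTION: names are UTF-8; undecodable bytes are replaced.
        file_name = payload[offset:name_end].decode("utf-8", errors="replace")
        # The extensions block runs up to the next null byte (it may be empty,
        # which gives the double null described above).
        extensions_end = payload.index(b"\x00", name_end + 1)
        offset = extensions_end + 1
        results.append((file_index, file_size, file_name))
    return results, offset
```

The returned offset points at whatever follows the last item (Extended QHD, Private Data, or the Servent Identifier). A truncated or malformed payload raises struct.error or ValueError, which the caller may want to catch.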

 

Note that the HTTP GET reply may contain multiple Query Hit messages with no separator between them, so you must rely on the Payload Length field in each message header to find where one message ends and the next begins.
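Splitting the HTTP body into individual messages could look like the sketch below. The name iter_messages is hypothetical, and the Payload Length field is assumed to be little-endian, which this document does not state.

```python
import struct

HEADER_SIZE = 23  # message header size in bytes


def iter_messages(body: bytes):
    """Yield (payload_type, payload) for each message packed back to back in body."""
    offset = 0
    # Trailing bytes too short to hold a header are ignored.
    while offset + HEADER_SIZE <= len(body):
        payload_type = body[offset + 16]  # byte 16 of the header
        # ASSUMPTION: Payload Length (header bytes 19-22) is little-endian.
        (payload_length,) = struct.unpack_from("<I", body, offset + 19)
        start = offset + HEADER_SIZE
        end = start + payload_length
        if end > len(body):
            raise ValueError("message payload is truncated")
        yield payload_type, body[start:end]
        offset = end  # the next header starts right after this payload
```

A crawler would keep only the payloads whose type is 0x81 (Query Hit) and pass each one to its Query Hit parser.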

 

 

References:

1. http://gnutella-specs.rakjar.de/index.php/Main_Page

2. www.limewire.com