Ufetch: a Basic Web Client

The web has certainly become one of the most ubiquitous and important publishing spaces around. What defines the web, technically, is html and http. Html is the publishing format and http is the transport protocol.

Let's consider http. This protocol is so simple. Some would say simplistic — it certainly has it share of detractors. While it may not be the most sophisticated protocol around, it gets the job done.

It's a client/server protocol and, like many internet standards, is text-based. The client sends an http request and the server responds with a reply. There are only a few commands in the repertory: 'GET', 'PUT', 'POST', and a few others. See RFC2616 for all the details.

The bytes sent out across the network are composed of a header and the data (content). The header is nothing more than a few lines of simple text. The first line contains the command, the remaining lines contain 'key: value' pairs. If either server or client don't understand a particular key, it's ignored — this leaves quite a bit of leeway for fun.

Normally, you never see these headers. As a counter-example, consider email. Most email clients allow you to see the message headers if you want. This makes it more accessible to understand the mail protocol. But most web clients, such as web browsers, never let you see this stuff. Too bad — it can be interesting and informative.

I wrote a simple web client for just that purpose. Called ufetch, it's a command line utility that fetches data from web servers. For example, in a Terminal window type:

ufetch www.bebits.com

This will download the home page of BeBits and put it in a file called f.data. As it runs, it spits out various status info to the screen.

You may be familiar with an similar Unix utility called wget. wget is actually more powerful, as it will download from ftp servers as well. But ufetch is simpler, both in its design and its source code. I think this makes it a spiffy tool for learning about various details of the http protocol and web client/server communication.

ufetch was inspired by an old BeOS Newsletter article by Benoit Schillings called Mining the Net. Benoit created a sample C++ program called site_getter for fetching URL resources. I took the code, converted it to C, removed stuff I didn't need, added other stuff, tweaked, coddled, and massaged the code to my heart's content. It is so completely modified that I don't think there's one line of code remaining in ufetch from Benoit's original code. But it certainly was inspired by his work and his comments.

It's really not very hard to implement a web client. The simple text format of the headers makes them trivial to deal with. Most of the work in ufetch involves establishing connections to the web servers and sending/receiving data. Even this, however, is pretty simple because the sockets interface handles all the low-level grunge. If you are a member of the Haiku networking team, then you have the task of implementing the sockets interface. But as a network programmer, you needn't be concerned with the details and only need to know how to use the sockets themselves.

The sockets interface was originally designed by Unix programmers at Berkeley. Which is why they are often referred to as "berkeley sockets". This interface has been ported to other platforms such as Windows, often with many changes and alterations. The BeOS sockets interface is very close to the BSD module, but varies slightly (most notably in that sockets are not true file descriptors).

The semantics of socket operations is similar to file operations. You create a socket and then bind or connect it to a network address (similar to 'open' for a file). While connected, you cand send and receive data (like the 'read' and 'write' for files). When finished, you close the socket. You are required to know the IP address of a remote socket in order to connect, but there are database functions for determining the IP address when given a URL.

Walking thru an example

Ok, let's see how this works in practice. Consider the sample command line:

ufetch www.bebits.com

First, the URL is split into (protocol, host, port, resource). There is no "http://" in the URL, so 'http' will be assumed for the protocol. The host is 'www.bebits.com'. No port is specified, so it defaults to the standard web port 80. No resource is specified either, so it defaults to '/'.

Next, the IP address of host 'www.bebits.com' is looked up using the standard network function gethostbyname(). In this case, it returns 28.245.212.78 (in dotted format).

A socket is created using the socket() function. Then the socket is connected to the web server with the connect() function using port=80 and IP=28.245.212.78. If it's unable to connect, an error message is sent to the screen and the program exits.

The following request header is generated:

GET / HTTP/1.1
Host: www.bebits.com
User-Agent: ufetch
Accept: */*
Connection: close

The first line is the status line. 'GET' is the command and '/' is the requested resource. This is followed by the http version. The remaining lines are standard header tags. The 'Accept: */*' line says to accept anything (ufetch is not picky). The 'Connection: close' was added after some real world testing: HTTP 1.1 supports persistent connections (unlike 1.0), so you need this tag to avoid a delay in terminating the connection.

This request header is sent using the send() function. Then recv() is called to receive the reply. A block of memory is allocated to hold the incoming data. The reply header will be the first part of this data, followed by the data bytes for the resource. It's difficult (impossible?) to know exactly how big the header will be, but the end of header is always easy to find — the first blank line in the incoming stream marks the end.

Here's the header received from the BeBits server for this example:

HTTP/1.1 200 OK
Date: Fri, 14 Dec 2001 19:29:07 GMT
Server: Apache/1.3.9 (Unix) TARP/0.42-alpha PHP/4.0.4pl1 secured_by_Raven/1.4.2
X-Powered-By: PHP/4.0.4pl1
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

A web client might want to parse this header into all the different tags and make use of the info. For the most part, ufetch doesn't bother. The one exception is for redirects. Often a particular URL simply redirects to another URL. In this case, the response code is 301 or 302 and there's a tag called 'Location:' that identifies the URL to redirect to. This is the only header tag that ufetch cares about.

The header is sent to the screen so that it can be viewed along with the status info. The data bytes, however, are written out to a file called f.data. This makes it easy to find, but it also means that each run of the program will clobber the output of the previous run.

ufetch is certainly limited in what in can do. But it's loads of fun to use. You can see all kinds of interesting info being returned by web servers. Try it and see how many web sites are sending cookies you didn't know about. Or snoop on just what software the server is running. It might even be useful as a way to debug connections to certain troublesome servers. Expanding ufetch in any number of ways would not take too much effort. Have fun.

The sources of Ufetch are available fordownload from here