Ufetch: a Basic Web Client
The web has certainly become one of the most ubiquitous and important publishing spaces around. What defines the web, technically, is html and http. Html is the publishing format and http is the transport protocol.
Let's consider http. This protocol is so simple. Some would say simplistic, and it certainly has its share of detractors. While it may not be the most sophisticated protocol around, it gets the job done.
It's a client/server protocol and, like many internet standards, is text-based. The client sends an http request and the server responds with a reply. There are only a few commands in the repertory: 'GET', 'PUT', 'POST', and a few others. See RFC2616 for all the details.
The bytes sent out across the network are composed of a header and the data (content). The header is nothing more than a few lines of simple text. The first line contains the command; the remaining lines contain 'key: value' pairs. If either the server or the client doesn't understand a particular key, it's ignored. This leaves quite a bit of leeway for fun.
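To make the header layout concrete, here is a minimal sketch in C of how a client might assemble a request header. This is not ufetch's actual code: the function name build_get_request and the User-Agent value are made up for illustration, and the sketch assumes an HTTP/1.0 style request to keep things short.

```c
#include <stdio.h>
#include <string.h>

/* Build a minimal GET request header into buf.
 * Returns the number of bytes written, or -1 if buf is too small.
 * (Hypothetical helper for illustration; not taken from ufetch.) */
int build_get_request(char *buf, size_t size,
                      const char *host, const char *resource)
{
    int n = snprintf(buf, size,
                     "GET %s HTTP/1.0\r\n"                /* first line: the command */
                     "Host: %s\r\n"                       /* 'key: value' pairs follow */
                     "User-Agent: example-client/0.1\r\n" /* made-up value */
                     "\r\n",                              /* blank line ends the header */
                     resource, host);

    if (n < 0 || (size_t)n >= size)
        return -1;      /* formatting error or output truncated */
    return n;
}
```

Calling build_get_request(buf, sizeof buf, "www.bebits.com", "/") fills buf with a header whose first line is "GET / HTTP/1.0". Each line ends with a carriage return and line feed, and an empty line marks the end of the header.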
Normally, you never see these headers. As a counter-example, consider email. Most email clients allow you to see the message headers if you want, which makes the mail protocol easier to understand. But most web clients, such as web browsers, never let you see this stuff. Too bad, because it can be interesting and informative.
I wrote a simple web client for just that purpose. Called ufetch, it's a command line utility that fetches data from web servers. For example, in a Terminal window type:
ufetch www.bebits.com
This will download the home page of BeBits and put it in a file called f.data. As it runs, it spits out various status info to the screen.
You may be familiar with a similar Unix utility called wget. wget is actually more powerful, as it will download from ftp servers as well. But ufetch is simpler, both in its design and its source code. I think this makes it a spiffy tool for learning about various details of the http protocol and web client/server communication.
ufetch was inspired by an old BeOS Newsletter article by Benoit Schillings called Mining the Net. Benoit created a sample C++ program called site_getter for fetching URL resources. I took the code, converted it to C, removed stuff I didn't need, added other stuff, tweaked, coddled, and massaged the code to my heart's content. It is so completely modified that I don't think there's one line of code remaining in ufetch from Benoit's original code. But it certainly was inspired by his work and his comments.
It's really not very hard to implement a web client. The simple text format of the headers makes them trivial to deal with. Most of the work in ufetch involves establishing connections to the web servers and sending/receiving data. Even this, however, is pretty simple because the sockets interface handles all the low-level grunge. If you are a member of the Haiku networking team, then you have the task of implementing the sockets interface. But as a network programmer, you needn't be concerned with the details and only need to know how to use the sockets themselves.
The sockets interface was originally designed by Unix programmers at Berkeley, which is why they are often referred to as "Berkeley sockets". This interface has been ported to other platforms such as Windows, often with many changes and alterations. The BeOS sockets interface is very close to the BSD model, but varies slightly (most notably in that sockets are not true file descriptors).
The semantics of socket operations are similar to those of file operations. You create a socket and then bind or connect it to a network address (similar to 'open' for a file). While connected, you can send and receive data (like 'write' and 'read' for files). When finished, you close the socket. You are required to know the IP address of a remote socket in order to connect, but there are database functions for determining the IP address when given a host name.
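The create/connect/send/close sequence can be sketched in a few lines of C. This is only an illustration under stated assumptions, not ufetch's code: it assumes a POSIX-style system where sockets are ordinary file descriptors (so close() works on them, unlike on BeOS), it uses getaddrinfo() as the lookup function, and the helper names open_connection and send_all are invented here.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

/* Look up host, create a socket, and connect it.
 * Returns a connected socket, or -1 on failure.
 * (Hypothetical helper; assumes POSIX sockets.) */
int open_connection(const char *host, const char *port)
{
    struct addrinfo hints, *list, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;        /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* TCP */

    /* The "database" step: translate the host name into addresses. */
    if (getaddrinfo(host, port, &hints, &list) != 0)
        return -1;

    /* Try each address until one connects. */
    for (ai = list; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                      /* connected */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(list);
    return fd;
}

/* Send len bytes, looping because send() may accept only part of
 * the data on each call. Returns 0 on success, -1 on error. */
int send_all(int fd, const char *data, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, data, len, 0);
        if (n <= 0)
            return -1;
        data += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A client would call open_connection("www.bebits.com", "80"), pass the request to send_all(), read the reply with recv() until it returns 0, and then close the socket.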
Walking thru an example
Ok, let's see how this works in practice. Consider the sample command line:
ufetch www.bebits.com
First, the URL is split into (protocol, host, port, resource). There is no "http://" in the URL, so 'http' will be assumed for the protocol. The host is 'www.bebits.com'. No port is specified, so it defaults to the standard web port 80.
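The splitting step might look something like the following C sketch. It is not the code from ufetch: the struct url_parts type and the split_url function are hypothetical names, the buffer sizes are arbitrary, and defaulting the resource to "/" when none is given is an assumption made here. The defaults for protocol and port follow the description above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the four pieces of a URL. */
struct url_parts {
    char protocol[16];
    char host[256];
    int  port;
    char resource[1024];
};

/* Split url into (protocol, host, port, resource).
 * Returns 0 on success, -1 if the URL cannot be parsed. */
int split_url(const char *url, struct url_parts *out)
{
    const char *p = url;
    const char *sep = strstr(url, "://");
    size_t len;

    /* Protocol: text before "://", unless a '/' comes first. */
    if (sep != NULL && memchr(url, '/', (size_t)(sep - url)) == NULL) {
        len = (size_t)(sep - url);
        if (len == 0 || len >= sizeof out->protocol)
            return -1;
        memcpy(out->protocol, url, len);
        out->protocol[len] = '\0';
        p = sep + 3;
    } else {
        strcpy(out->protocol, "http");  /* assumed when absent */
    }

    /* Host: everything up to ':' or '/' or the end of the string. */
    len = strcspn(p, ":/");
    if (len == 0 || len >= sizeof out->host)
        return -1;
    memcpy(out->host, p, len);
    out->host[len] = '\0';
    p += len;

    /* Port: optional ":number", otherwise the standard web port. */
    out->port = 80;
    if (*p == ':') {
        char *end;
        long port = strtol(p + 1, &end, 10);
        if (end == p + 1 || port < 1 || port > 65535)
            return -1;
        out->port = (int)port;
        p = end;
    }

    /* Resource: the rest of the URL, or "/" if nothing is left
     * (the "/" default is an assumption of this sketch). */
    if (*p == '\0') {
        strcpy(out->resource, "/");
    } else if (*p == '/') {
        if (strlen(p) >= sizeof out->resource)
            return -1;
        strcpy(out->resource, p);
    } else {
        return -1;
    }
    return 0;
}
```

For "www.bebits.com" this yields protocol "http", host "www.bebits.com", port 80, and resource "/". A fuller URL such as "http://example.com:8080/a/b.html" gives port 8080 and resource "/a/b.html".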