Project 2 : Part I - Web Server (revised)

CS219 - Programming for the World Wide Web

Due : Friday, January 29, 1999. Midnight.

Note: The information about CGI has been updated. Please disregard the previous handout.

Introduction

In this project you will learn about the server side of the web by implementing a simple, yet functional web server and using it to construct a simple shopping cart application. Along the way, you will learn about the HTTP protocol, CGI, server side includes, and other server-related topics. You may use any programming language that you wish for this project although Java or Python are recommended.

Important Notice : This is a two-part project. You should plan to finish the first part of the assignment by January 22 (or plan on suffering a lot next week). The second part of this project will be distributed later this week. The web-server you develop will also be the basis for project 3.

Please read this entire handout before you begin work.

The Web Server

The first part of this project involves the creation of a simple web server that can deliver HTML documents and images as well as executing CGI scripts.

You will run your web-server as follows:

webserver docroot port

where "docroot" is the document root directory (containing all of your HTML documents and CGI scripts) and port is the TCP port number that clients will use to connect to the server. For example,

% webserver /home/beazley/html 10000
Web server listening on classes.cs.uchicago.edu port 10000

For testing purposes, you may want to run the server on a directory containing your own homepages. If you do not have a homepage you can test your server using the web pages available on classes. For example:

webserver /opt/local/www/http/docs/roots/www.classes.cs.uchicago.edu 10000

To test your server, use any browser (Netscape, Internet Explorer, Lynx, etc...) by entering a URL containing the host and port number of your server as follows:

http://classes.cs.uchicago.edu:10000/

The HTTP Protocol

Your server will implement a small, but useful subset of the HTTP 1.0 protocol. This protocol specifies how a client and server establish a connection, how a client requests a document, how the server responds to the request, and how the connection is closed. This protocol is described in RFC1945 and is supported by most web browsers.

Making a connection

To connect with the server, the client establishes a TCP connection (exactly like in Project 1). By default, web servers use port 80. However, for this project you will have to choose a port number that is greater than 1024 as shown above.

Requesting a document

When a browser wants to retrieve a document, it sends a GET message to the server. A typical GET message looks like this :

GET /index.html HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/4.05 [en] (X11; I; Linux 2.0.35 i686)
Host: localhost:10000
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
< blank line >

The first line of the request is the most important because it indicates the intended operation (GET), the document (index.html) and the HTTP protocol version (1.0). The rest of the request contains additional information about the client. In this case, the "User-Agent" indicates the type of browser being used (Netscape running on a Linux machine). "Host" indicates the hostname given to the browser. This can be used by a web-server for virtual hosting. The "Accept" properties indicate various capabilities of the browser including the types of images that are supported, the language preference, and character set. For this project, you won't have to worry about these properties too much. Finally, the GET request is terminated by a single blank line.
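As a rough illustration of the request format just described, the following Python 3 sketch splits a raw request into its operation, document, protocol version, and properties. The name parse_request and the choice to store property names in lower case are assumptions made for this example; they are not part of HTTP or of the assignment.

```python
def parse_request(raw):
    """Split a raw HTTP request (bytes) into (method, document, version, headers)."""
    # Accept either CR LF or a bare newline as the line terminator
    text = raw.decode("latin-1").replace("\r\n", "\n")

    # Everything up to the first blank line is the request header
    head = text.split("\n\n", 1)[0]
    lines = head.split("\n")

    # First line: operation, document, protocol version
    parts = lines[0].split()
    if len(parts) != 3:
        raise ValueError("malformed request line: %r" % lines[0])
    method, document, version = parts

    # Remaining lines: "Name: value" properties sent by the client
    headers = {}
    for line in lines[1:]:
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return method, document, version, headers
```

A server could call this on the bytes it has read up to the blank line and answer with a 400 code when ValueError is raised.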

There is one additional little detail to note about requests. If the filename requested by the client is really a directory like this,

GET /yourpage/html/ HTTP/1.0
User-Agent: Mozilla/4.05 [en] (X11; I; Linux 2.0.35 i686)
Host: localhost:10000
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8

the web-server usually tries to return a file named 'index.html' in that directory. If the 'index.html' file does not exist, a server may choose to report an error back to the client or may generate a listing of the files in the requested directory (the precise behavior is actually determined by a server configuration file and can be set on a directory by directory basis). Your server should look for "index.html" and report an error if it isn't found.
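The directory rule can be sketched in a few lines of Python 3. The function name resolve_document is hypothetical, and the sketch deliberately performs no security checks on the requested name; it only shows the index.html lookup.

```python
import os

def resolve_document(docroot, document):
    """Map a requested document name to a file on disk, or None if it is missing."""
    path = docroot + document
    # A request for a directory is answered with that directory's index.html
    if os.path.isdir(path):
        path = os.path.join(path, "index.html")
    # Report failure if the final file does not exist
    if not os.path.isfile(path):
        return None
    return path
```

A return value of None would then be turned into an error response for the client.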

The Server Response

After receiving a request, the server sends a response that looks like this:

HTTP/1.0 200 OK
Server: cs219/yourname
Content-type: text/html
Content-length: 1253

<html>
<head>
<title>
This is my HTML document
</title>
</head>
<body>
...
</body>
</html>

The first line of the response provides the HTTP version (1.0), a response code, and a response message. The following list shows the possible response codes:

2xx Successful      (all 200-299 codes are success codes)
200 OK              
201 Created         
202 Accepted
204 No Content

3xx Redirection     (all 300-399 codes tell the client to go elsewhere)
300 Multiple Choices
301 Moved Permanently
302 Moved Temporarily
304 Not Modified

4xx Client Error    (all 400-499 indicate errors on client end)
400 Bad Request
401 Unauthorized
403 Forbidden
404 Not Found

5xx Server Error    (500-599 indicate errors with server)
500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable

Your server will probably only generate a few of these codes (most notably, the 200, 400, and 404 codes).

After the response code, a number of attributes about the server can be given. These are optional, but a typical response might include the version of the server being used, file modification dates, and so forth.

The contents of the document are described by the "Content-type" and "Content-length" fields. The content type is specified as a MIME type. Typical values are as follows:


Content-type: text/plain                -  A plain text document
Content-type: text/html                 -  An HTML document
Content-type: image/gif                 -  A GIF image
Content-type: image/jpeg                -  A JPEG image
Content-type: application/java          -  A Java class file

The "Content-length" field only needs to be specified for binary data such as images and Java applets (it can be omitted for HTML and textual data).

The response header is terminated by a single blank line. After the blank line, the actual document data is transmitted. In the case of HTML files, you will see the HTML text describing the page. In the case of images and other binary data, the raw binary data is written. When writing binary data, make sure the Content-length field exactly matches the actual length of the sent data. Also, be careful not to send any extraneous data in the header (such as an extra blank line).
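One way to keep the header and the data consistent is to build the whole response in a single place. The Python 3 sketch below does that; the name build_response is an assumption for this example. Lines here end in CR LF, the line terminator that RFC 1945 specifies; the sample server in this handout sends a bare newline, which browsers generally accept as well.

```python
def build_response(code, message, content_type, body):
    """Return a complete HTTP/1.0 response (header, blank line, body) as bytes."""
    header = (
        "HTTP/1.0 %d %s\r\n" % (code, message)
        + "Server: cs219/yourname\r\n"
        + "Content-type: %s\r\n" % content_type
        + "Content-length: %d\r\n" % len(body)   # computed from the bytes actually sent
        + "\r\n"                                  # exactly one blank line ends the header
    )
    return header.encode("latin-1") + body
```

Because Content-length is computed from the same bytes object that is sent, the two cannot drift apart, and the body is passed through untouched, so binary data survives.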

Other HTTP requests

Besides the "GET" request, there are several other HTTP requests. A "POST" request is used to send form data to the server (most commonly used with CGI). A "HEAD" request is used to find out information about a file such as its modification date and size. A "PUT" request is used to upload a file to the server. Your server will only have to recognize GET and POST requests. The POST request is described shortly in the section on CGI scripting.

A Simple Web Server (Example)

Writing a simple web-server is actually quite easy. Here is how a server that only understands GET requests would work:

1. Open up a socket.
2. Listen for a connection.
3. See if we got a GET request (if not, generate an error).
4. Append the requested document name to the document root directory and see if it exists and is valid. If not, send an error message back.
5. Determine what kind of file the client has requested (HTML, GIF, JPEG, Java class file, etc...).
6. Construct an appropriate response message and send it back to the client.
7. Close the client connection.
8. Go back to step 2.

Here is a bare-bones web server (written in Python) that can serve HTML, GIF, and JPEG files. This server is a simplified version of what you will be writing (most notably, it lacks security, doesn't support concurrent connections, and it can't run CGI).

# Simple Python web-server (Python 3)

import sys
from socket import socket, AF_INET, SOCK_STREAM

# Check the command line arguments
if len(sys.argv) != 3:
    print("Usage : webserver docroot port")
    sys.exit(1)

# Set the document root and port
docroot = sys.argv[1]
port = int(sys.argv[2])

# Open up a socket
serversock = socket(AF_INET, SOCK_STREAM)
serversock.bind(("", port))              # bind() takes one (host, port) tuple
serversock.listen(5)

print("Web-server listening on port %d" % port)

while True:
    (conn, addr) = serversock.accept()   # Get a connection
    print("Connection from %s" % (addr,))

    # Read the first line of the request, one byte at a time
    request = b""
    c = conn.recv(1)
    while c and c != b"\n":              # Stop at end of line, or if the client hangs up
        request = request + c
        c = conn.recv(1)
    request = request.decode("latin-1").split()

    if len(request) < 2:
        # Not even "METHOD document" was sent
        conn.sendall(b"HTTP/1.0 400 Bad Request\n")
        conn.sendall(b"Content-type: text/html\n\n")
        conn.sendall(b"<h1> Bad Request</h1>")
    elif request[0] == "GET":
        # Form the full filename
        filename = docroot + request[1]

        # Try reading the file (binary mode, so images are not corrupted)
        try:
            f = open(filename, "rb")
            data = f.read()
            f.close()
        except OSError:
            conn.sendall(b"HTTP/1.0 404 Not Found\n")
            conn.sendall(b"Content-type: text/html\n\n")
            conn.sendall(b"<h1> File Not Found</h1>")
        else:
            conn.sendall(b"HTTP/1.0 200 OK\n")
            # Figure out the content type
            if filename.endswith(".html"):
                conn.sendall(b"Content-type: text/html\n")
            elif filename.endswith(".gif"):
                conn.sendall(b"Content-type: image/gif\n")
            elif filename.endswith(".jpg"):
                conn.sendall(b"Content-type: image/jpeg\n")
            else:
                conn.sendall(b"Content-type: text/plain\n")
            # Send the length, the blank line, and then the file itself
            conn.sendall(b"Content-length: %d\n" % len(data))
            conn.sendall(b"\n")
            conn.sendall(data)
    else:
        conn.sendall(b"HTTP/1.0 501 Not Implemented\n")
        conn.sendall(b"Content-type: text/html\n\n")
        conn.sendall(b"<h1> Unimplemented request type</h1>")
    conn.close()

CGI

CGI (Common Gateway Interface) is a method web servers use to provide a gateway to other services and programs. For example, CGI might be used to transmit form data submitted from a web-page to a database server living on another machine. Please refer to the class handouts for more details about the inner workings of CGI.

Your web server must support the CGI interface. This section describes a few of the more tricky implementation details.

CGI requests

CGI requests arrive at the server in two basic formats. First, a client may send a CGI GET request that looks like this:

GET /cgi-bin/foo.cgi?cmd=lookup&name=dave HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/4.05 [en] (X11; I; Linux 2.0.35 i686)
Host: localhost:10000
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
< blank line >

With GET requests, the CGI query string is appended to the filename following a question mark. Thus, the above request tells the server to run the CGI program "foo.cgi" with a query string of "cmd=lookup&name=dave".

The second format of CGI requests is as a POST request that looks like this:

POST /cgi-bin/foo.cgi HTTP/1.0
User-Agent: Mozilla/4.05 [en] (X11; I; Linux 2.0.35 i686)
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8

cmd=lookup&name=dave

A POST request is almost identical to the GET request except that the query string is passed as data following the request header (separated by a blank line). POST requests are most commonly used with forms that are transmitting large amounts of data to the server (since there is a limited amount of data that can be sent with a GET request).

The query string contains the form data entered by the user (which has been heavily "munged" by the browser). It is up to the CGI program to decode and interpret the query string. Thus, all your server has to do is capture the query string and make sure it gets properly passed to the CGI program.
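Capturing the query string in both formats might look like the Python 3 sketch below. The function names are hypothetical. The second function assumes the client sends a Content-length header with its POST data (RFC 1945 requires one on POST requests, although the sample request above does not show it), that header names have been stored in lower case, and that stream is a binary file-like object positioned just after the blank line.

```python
def split_query(document):
    """Split '/cgi-bin/foo.cgi?a=b' into ('/cgi-bin/foo.cgi', 'a=b')."""
    # Everything after the first '?' is the query string (empty if there is none)
    path, _, query = document.partition("?")
    return path, query

def read_post_body(stream, headers):
    """Read the POST data: exactly Content-length bytes following the header."""
    length = int(headers.get("content-length", "0"))
    return stream.read(length)
```

Reading exactly Content-length bytes avoids waiting for an end of file that the client may never send while it is still expecting a reply.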

The cgi-bin directory

CGI programs typically reside in a special directory called "cgi-bin" that is located in the top of the document tree. To know when to execute a CGI program, your server should check the name of the requested file to see if it is located in the cgi-bin directory. If so, it should attempt to run the program as a separate process (described momentarily). If the file does not exist or is not executable, the server should generate an internal error message.

Since the server is going to be running a separate program, extreme care needs to be given to the contents and use of the cgi-bin directory. In particular:

The server should never execute a program that lives outside of the cgi-bin directory.
The server should carefully check the filename of any CGI request. For example, it should not allow a request like this to work:
```
GET /cgi-bin/../../../../../../usr/bin/rm?-rf+/ HTTP/1.0
```
You should be very careful about what programs are put in the cgi-bin directory. In particular, you should never put system commands (rm, more, etc...) in cgi-bin, nor should you put any programming language interpreters such as Perl, Python, or Tcl.
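A common way to reject requests like the one above is to resolve the requested name to an absolute path and then confirm that it still lies inside the permitted directory. The Python 3 sketch below shows the idea; safe_join is a hypothetical helper name, and the query string is assumed to have been removed from the request already.

```python
import os

def safe_join(base, requested):
    """Join requested onto base, or return None if the result escapes base."""
    base = os.path.realpath(base)
    # Drop the leading '/' so join() treats the request as relative to base
    full = os.path.realpath(os.path.join(base, requested.lstrip("/")))
    # Accept only paths that are still inside base once '..' has been resolved
    if full != base and not full.startswith(base + os.sep):
        return None
    return full
```

With base set to the cgi-bin directory this guards CGI execution; with base set to the document root, the same test could guard ordinary file requests.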

Running a CGI program

CGI programs usually run externally to the web server and may be written in a variety of languages including Perl, Python, Tcl, C/C++, and Java. To run a CGI program, you have to play around with our hopefully now familiar fork() function and a few other system commands. Here's roughly what happens in the server.

Examine the request header to find the request method and query string (if available).
Set up two pipes that will be used to connect the server and the CGI program.
Perform a fork(). The child process will run the CGI program.
Connect the pipes to the standard input and standard output of the child process.
Set a number of environment variables (see below).
Perform an exec() on the CGI program.
The server sends POST data (if applicable).
The server reads data from the CGI program.
Wait for the CGI program to finish (by calling wait()).
Parse the CGI headers to see if there is anything interesting.
Send the CGI response to the client.
Close the client connection.

The trickiest part is the handling of the input and output streams of the CGI program. The following code shows how to call another program from Python while grabbing its input and output streams. Your CGI implementation will be quite similar:

# -----------------------------------------------------------------------------
# Executes another process using pipes (Python 3)
# -----------------------------------------------------------------------------
import os

# Here is some data
data = b"Hi there, this is a test of pipes"

# Create the pipes
(pinput, coutput) = os.pipe()    # Create a pipe from child to parent
(cinput, poutput) = os.pipe()    # Create a pipe from parent to child

# Run 'wc' on the data above
pid = os.fork()
if pid != 0:
    # I'm the parent
    os.close(coutput)
    os.close(cinput)
    os.write(poutput, data)
    os.close(poutput)            # Closing the pipe gives the child an end of file
    response = os.read(pinput, 1000)
    print("py:", response.decode())
    os.waitpid(pid, 0)           # Wait for child to exit
else:
    # I'm the child
    os.close(poutput)
    os.close(pinput)
    os.dup2(cinput, 0)           # Pipe from parent becomes standard input
    os.dup2(coutput, 1)          # Pipe to parent becomes standard output
    e = {}                       # Environment variables
    e['FOO'] = 'BAR'
    os.execve("/usr/bin/wc", ["wc"], e)    # Run 'wc'

Here is the same code written in Java:

// Disclaimer : This is a gross hack.   You can probably do better
import java.io.*;

public class Pipe {
  public static void main(String[] args) {
    try {
      String data = "Hi there, this is a test of exec";
      Runtime r = Runtime.getRuntime();

      // Create the command array
      // Note : Additional arguments would go in cmd[1],cmd[2], etc...
      String [] cmd = new String[1];
      cmd[0] = "/usr/bin/wc";

      // Create some environment variables
      String [] env = new String[2];
      env[0] = "FOO=BAR";
      env[1] = "SPAM=YES";

      // Run the command
      Process p = r.exec(cmd,env);

      // Grab the input and output streams
      OutputStream o = p.getOutputStream();
      InputStream i = p.getInputStream();

      // Send data to the process
      o.write(data.getBytes());

      // For a program reading from stdin, we need to close
      // the OutputStream to generate an end of file
      o.close();

      // Get the results back.  This is a hack.  You would
      // probably want to use BufferedReader or some other
      // input method.
      byte[] result = new byte[16000];    // Ugh.
      int n = i.read(result);             // Bytes actually read (-1 at end of file)

      // Wait for the child process to finish
      try {
        p.waitFor();
      } catch (InterruptedException e) { }

      // Print out only the bytes that were read
      System.out.println(new String(result, 0, Math.max(n, 0)));
    } catch (IOException e) {
      System.out.println("An error occurred.");
    }
  }
}

This sample code can be found in :

/usr/local/classes/current/CS219/projects/project2

The CGI environment

The following Unix environment variables are typically set for CGI program execution.

GATEWAY_INTERFACE             CGI version number
SERVER_NAME                   Hostname of the server
SERVER_SOFTWARE               Name of server program
SERVER_PROTOCOL               HTTP protocol the server is using
SERVER_PORT                   Port number the server is using
REQUEST_METHOD                The HTTP request method (GET, POST, etc..)
PATH_INFO                     Additional Path information
SCRIPT_NAME                   Name of the CGI script (/cgi-bin/program.cgi)
DOCUMENT_ROOT                 Top of the web document tree
QUERY_STRING                  Query information passed in the URL
REMOTE_HOST                   Hostname of the client
REMOTE_ADDR                   Remote IP address of the client
REMOTE_IDENT                  User making the request (may be unavailable)
CONTENT_TYPE                  Mime type of the query data
CONTENT_LENGTH                Length of the query data
HTTP_FROM                     E-mail of user making request 
HTTP_ACCEPT                   List of MIME types client can support
HTTP_USER_AGENT               The browser the client is using
HTTP_REFERER                  The URL of the document the client points to
                              before accessing the CGI program.

Your server should set the following environment variables

REQUEST_METHOD
SCRIPT_NAME
DOCUMENT_ROOT
QUERY_STRING
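As a small illustration, the four required variables can be collected into a dictionary of the kind that os.execve() accepts as its environment argument. This is a Python 3 sketch; the function name cgi_environment and its parameter names are assumptions for this example.

```python
def cgi_environment(method, script_name, docroot, query_string):
    """Build the environment dictionary handed to the CGI program."""
    return {
        "REQUEST_METHOD": method,        # "GET" or "POST"
        "SCRIPT_NAME": script_name,      # e.g. "/cgi-bin/foo.cgi"
        "DOCUMENT_ROOT": docroot,        # top of the web document tree
        "QUERY_STRING": query_string,    # e.g. "cmd=lookup&name=dave"
    }
```

In the child process the result would replace the `e` dictionary of the pipe example, for instance `os.execve(program, [program], cgi_environment(...))`.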

What to do

Now that you've seen a few details, you have to write a simple web server that supports the following features.

Connections

Your server should allow concurrent client connections exactly as in project 1. In other words, after receiving a client connection, you should either call fork() or start a new thread (if using Java) to handle the client connection. The parent process should go back to listening for more connections.
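The accept loop for a concurrent server can be quite short. The Python 3 sketch below uses one thread per connection rather than fork(); a fork()-based version has the same shape, with the child handling the client and the parent closing its copy of the connection. The handler shown is only a placeholder that sends a fixed reply, and both function names are assumptions for this example.

```python
import threading

def handle_client(conn, addr):
    """Placeholder handler: read the request line and send a fixed reply."""
    f = conn.makefile("rb")
    request_line = f.readline()          # a real handler would parse this line
    conn.sendall(b"HTTP/1.0 200 OK\r\nContent-type: text/plain\r\n\r\nhello\n")
    f.close()
    conn.close()

def serve(serversock):
    """Accept connections forever, handing each one to its own thread."""
    while True:
        conn, addr = serversock.accept()
        t = threading.Thread(target=handle_client, args=(conn, addr), daemon=True)
        t.start()                        # the main thread goes straight back to accept()
```

Here serversock is a socket that has already been bound and put into the listening state, as in the sample server.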

HTTP Methods

You only need to support the GET and POST methods. POST methods will only be used for CGI applications. If your server receives any other kind of HTTP request (HEAD or PUT), it should generate a "501 Not Implemented" error.

Document types

Your server needs to support the following file types. The file-type should be determined by the file suffix and be case-insensitive:

MIME type             File Suffix
-----------------------------------------------------
text/html             .html, .htm
image/gif             .gif
image/jpeg            .jpg, .jpeg
application/java      .class
text/plain            anything that's not listed above

Any file-type that can't be determined (the file has an unknown suffix) should be returned as type text/plain.
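The suffix table translates directly into a dictionary lookup. This Python 3 sketch lower-cases the suffix before the lookup so that the match is case-insensitive; the names MIME_TYPES and content_type are assumptions for this example.

```python
import os

# Suffix-to-type table taken from the list above
MIME_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".class": "application/java",
}

def content_type(filename):
    """Return the MIME type for filename based on its suffix, ignoring case."""
    suffix = os.path.splitext(filename)[1].lower()
    # Unknown or missing suffixes fall back to text/plain
    return MIME_TYPES.get(suffix, "text/plain")
```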

CGI

Any request for a file in the /cgi-bin/ directory should be treated as a CGI-request. Your server must verify the filename, set up the environment variables, and run the CGI as a separate process.

Logging

Your server should print a series of diagnostic messages as it runs to indicate what it's doing. The log should minimally contain the following information:

The type of HTTP request.
The requested document name.
The address of the incoming connection.

You may output more information than this if you want, however.
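A log line carrying the three required items might be produced as in the Python 3 sketch below. The exact layout, the timestamp, and the name log_request are choices made for this example, not a required format.

```python
import time

def log_request(addr, method, document):
    """Print one diagnostic line per request and return it."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    host, port = addr[0], addr[1]        # addr as returned by accept()
    line = "%s %s:%d %s %s" % (stamp, host, port, method, document)
    print(line)
    return line
```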

Getting started

Start by making a simple web server capable of delivering HTML documents and images. This should be entirely straightforward--in fact, you should be able to do this with maybe only a few hours of effort. Testing your server is easy: just point a browser at it and see if it does what it is supposed to do.

Once you have your basic server up and running, modify it to run concurrently by using fork() or threads to handle multiple client connections.

Add a few security checks to make sure the browser can't request bogus files or do anything weird. For example, your server probably shouldn't allow a user to download the system password file.

Once you have your server working pretty well with all of the above, add CGI support. Make your server look for the /cgi-bin/ directory and set it up to execute programs in that directory. This is the trickiest part of the server to write, but it should not involve a huge amount of code.

As always, don't hesitate to send me email or stop by if you have any questions.

Extra Credit

Now, a few people have asked me about survival after the fallout of Project 1. In light of this, there are a few opportunities for extra credit with this project. In particular, you can receive extra credit for modifying your server to support any of the following features (the more features, the more extra credit).

Server-side includes.
Servlets (See the Java Network Programming Book)
Make your server support the Connection:Keep-Alive attribute.
Add support for more HTTP methods (HEAD, PUT)
Virtual domains and multiple document roots.
Anything else you think might be interesting.

More details about these topics will be discussed in class.