
Server Log Files

The most useful tool to assist in understanding how and when your Web site pages and applications are being accessed is the log file generated by your Web server. This log file contains, among other things, which pages are being accessed, by whom, and when.

Each Web server will provide some form of log file that records who and what accesses a specific HTML page or graphic. A terrific site to get an overall comparison of the major Web servers can be found at http://www.webcompare.com/.

From this site you can see which Web servers follow the CERN/NCSA common log format that is detailed below. You can also find out which servers can customize their log files or write to multiple log files. You might also be surprised at the number of Web servers there are on the market.

Understanding the contents of the server log files is a worthwhile endeavor. And in this section, you'll see several ways that the information in the log files can be manipulated. However, if you're like most people, you'll use one of the log file analyzers that you'll read about in the section "Existing Log File Analyzing Programs" to do most of your work. After all, you don't want to create a program that others are giving away for free.

Note: This section about server log files is one that you can read when the need arises. If you are not actively running a Web server now, you won't be able to get full value from the examples. The CD-ROM that accompanies this book has a sample log file for you to experiment on, but it is very limited in size and scope.

Nearly all of the major Web servers use a common format for their log files. These log files contain information such as the IP address of the remote host, the document that was requested, and a timestamp. The syntax for each line of a log file is:

site logName fullName [date:time GMToffset] "req file proto" status length

Because that line of syntax is relatively meaningless, here is a line from a real log file:

204.31.113.138 - - [03/Jul/1996:06:56:12 -0800]
    "GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593

Even though the line is split in two here, remember that inside the log file it is a single line.
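To make the correspondence between the syntax line and the example concrete, here is a minimal Perl sketch that splits one entry into the eleven named fields. The subroutine name parse_log_line and the regular expression are illustrative assumptions, not code from the book; a server that deviates from the common log format would need a different pattern.

```perl
use strict;
use warnings;

# The example entry, joined back into the single line it occupies in the log.
my $line = '204.31.113.138 - - [03/Jul/1996:06:56:12 -0800] '
         . '"GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593';

# parse_log_line is a hypothetical helper: it returns a hash reference keyed
# by the field names from the syntax line, or undef if the line does not match.
sub parse_log_line {
    my ($entry) = @_;
    my @names = qw(site logName fullName date time GMToffset
                   req file proto status length);
    my @values = $entry =~ m{
        ^(\S+)\s+                # site
        (\S+)\s+                 # logName
        (\S+)\s+                 # fullName
        \[([^:]+):               # date
        (\d\d:\d\d:\d\d)\s+      # time
        ([+-]\d{4})\]\s+         # GMToffset
        "(\S+)\s+                # req
        (\S+)\s+                 # file
        ([^"]+)"\s+              # proto
        (\d{3})\s+               # status
        (\d+|-)\s*\z             # length; some servers write a dash here
    }x or return;
    my %field;
    @field{@names} = @values;
    return \%field;
}

my $field = parse_log_line($line)
    or die "line does not match the common log format\n";

# Print each field name next to the value found in the example line.
printf "%-10s %s\n", $_, $field->{$_}
    for qw(site logName fullName date time GMToffset
           req file proto status length);
```

Returning a hash keyed by field name, rather than a plain list, lets later code refer to $field->{status} or $field->{file} without having to remember field positions.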

Each of the eleven items listed in the above syntax and example is described in the following list.

site - either an IP address or the symbolic name of the site making the HTTP request. In the example line the site is 204.31.113.138.
logName - login name of the user who owns the account that is making the HTTP request. Most remote sites don't give out this information for security reasons. If this field is disabled by the host, you see a dash (-) instead of the login name.
fullName - full name of the user who owns the account that is making the HTTP request. Most remote sites don't give out this information for security reasons. If this field is disabled by the host, you see a dash (-) instead of the full name. If your server requires a user id in order to fulfill an HTTP request, the user id will be placed in this field.
date - date of the HTTP request. In the example line the date is 03/Jul/1996.
time - time of the HTTP request, in 24-hour format. In the example line the time is 06:56:12.
GMToffset - signed offset from Greenwich Mean Time. GMT is the international time reference. In the example line the offset is -0800, eight hours earlier than GMT.
req - HTTP command. For WWW page requests, this field will usually be the GET command, although other commands such as HEAD and POST also appear in logs. In the example line the request is GET.
file - path and filename of the requested file. In the example line the file is /PowerBuilder/Compny3.htm. There are three types of path/filename combinations:

Implied Path and Filename - accesses a file in a user's home directory. For example, /foo/ could be expanded into /user/foo/homepage.html. The /user/foo directory is the home directory for the user foo, and homepage.html is the default file name for any user's home page. Implied paths are hard to analyze because you need to know how the server is set up and because the server's setup may change.
Relative Path and Filename - accesses a file in a directory that is specified relative to a user's home directory. For example, /foo/cooking.html will be expanded into /user/foo/cooking.html.
Full Path and Filename - accesses a file by explicitly stating the full directory and filename. For example, /user/foo/biking/mountain/index.html.

proto - type of protocol used for the request. In the example line the protocol is HTTP/1.0.
status - status code generated by the request. In the example line the status is 200.
length - length of the requested document in bytes. In the example line the length is 5593.
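The date, time, and GMToffset fields together identify a single moment, which is easier to compare across servers once it is converted to GMT. The following is a small sketch of that arithmetic, assuming the field formats shown in the example line; the subroutine name log_time_to_epoch and the %month_index table are hypothetical helpers introduced only for illustration. It relies on the Time::Local and POSIX modules from the standard Perl distribution.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# Map the three-letter month names used in the log to the 0-based
# month numbers that timegm expects.
my %month_index = (Jan => 0, Feb => 1, Mar => 2, Apr => 3,
                   May => 4, Jun => 5, Jul => 6, Aug => 7,
                   Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# log_time_to_epoch is a hypothetical helper: it converts the date, time,
# and GMToffset fields into seconds since the epoch, measured in GMT.
sub log_time_to_epoch {
    my ($date, $time, $offset) = @_;
    my ($mday, $mon, $year) = split m{/}, $date;
    my ($hour, $min, $sec)  = split /:/, $time;
    die "unknown month: $mon\n" unless exists $month_index{$mon};
    my ($sign, $oh, $om) = $offset =~ /^([+-])(\d\d)(\d\d)$/
        or die "bad GMT offset: $offset\n";
    my $offset_seconds = ($oh * 3600 + $om * 60) * ($sign eq '-' ? -1 : 1);
    # Treat the logged clock time as if it were GMT, then subtract the
    # offset: a time logged at -0800 is eight hours behind GMT.
    return timegm($sec, $min, $hour, $mday, $month_index{$mon}, $year)
         - $offset_seconds;
}

my $epoch = log_time_to_epoch('03/Jul/1996', '06:56:12', '-0800');
print strftime('%Y-%m-%d %H:%M:%S', gmtime($epoch)), "\n";
# prints 1996-07-03 14:56:12
```

Working in epoch seconds makes it straightforward to sort entries or measure the interval between two requests, even when the entries come from servers in different time zones.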

Web servers can have many different types of log files. For example, you might see a proxy access log or an error log. In this chapter, we'll focus on the access log, where the Web server tracks every access to your Web site.


dave@cs.cf.ac.uk