Lab 4: Due Wednesday, October 10, 2002

 

Part I: Reading Files

Reading files is covered extensively in your textbook, but I will highlight some of the key ideas here. First, you need access to the part of the Java library that deals with files. You can get this by making the first statement of your program the following:

import java.io.*;

You also need to tell Java where you'll be reading from. In Java, input and output are handled as "streams." A file can be considered a stream of characters that flows to your program. In many respects this means that reading a file is not all that different from reading input from a keyboard. First you will make a file input stream for reading from the file. This can be done in your program with a command such as the following:

FileInputStream stream = new FileInputStream(filename);

For this lab "filename" will be "input.data" (with the quotes). Once that is done, you can make a variable of type InputStreamReader with the proper context as follows.

InputStreamReader reader = new InputStreamReader(stream);

The important bit here is that we have given the InputStreamReader constructor an extra parameter to work with corresponding to the file we are going to read.

There are several ways to read the stream. We are going to use a tokenizer class in order to break the stream up into smaller chunks called "tokens." You'll need the following declaration:

StreamTokenizer tokens = new StreamTokenizer(reader);

The tokenizer will be able to read individual strings, numbers, etc. as discrete chunks.

Finally, Java is rather finicky about letting people open files. Many things can go wrong during this process: the file can be misnamed, it might not exist, the user might not have permission to read it, etc. To combat this, Java forces you to wrap all of the file creation and reading operations within a try and catch combination. Ultimately, your code will end up looking something like this:

try {
    InputStream stream = new FileInputStream("input.data");
    InputStreamReader reader = new InputStreamReader(stream);
    StreamTokenizer tokens = new StreamTokenizer(reader);
    .
    .
    .
}
catch (IOException e) {
    System.err.println(e);
}

The "catch" part of the code looks for errors generated during the IO process and when it finds them prints them out as system errors.

There are several relevant parts of the StreamTokenizer class. The most important is the nextToken() method, which returns an integer indicating the type of token read (e.g. a number or a string) and sets fields within the instance. These fields include:

nval - if the token read was a number, it is stored in this public field as a double.

sval - if the token read was a string, it is stored in this public field as a String.

There are also several useful constant fields.

TT_EOF - The number returned by nextToken() when the end of the file was reached.

TT_WORD - The number returned by nextToken() when a String was read.

TT_NUMBER - The number returned by nextToken() when a number was read.

Given this, a typical reading loop (this is taken almost identically from page 159 of your book) would look something like this:

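    int next = 0;
    while ((next = tokens.nextToken()) != StreamTokenizer.TT_EOF) {
        // case labels must be compile-time constants, so the TT_ constants
        // are named through the class rather than through the variable
        switch (next) {
            case StreamTokenizer.TT_WORD:
                // whatever you do when you read a string (it is in tokens.sval)
                break;
            case StreamTokenizer.TT_NUMBER:
                // whatever you do when you read a number (it is in tokens.nval)
                break;
        }
    }
    stream.close();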

For starters, I'd suggest you write a program with that as your main, one that simply reads a file and prints out the tokens one by one.
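
As a starting point, the pieces from this part might be assembled as in the following sketch. The class name TokenDump and the helper describe are hypothetical names chosen for illustration, and the helper accepts any Reader (an assumption made so that it can be tried on a short string as well as on input.data).

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; a sketch, not the textbook's code.
class TokenDump {
    // Reads every token from the reader and returns one description per token.
    static List<String> describe(Reader reader) {
        List<String> seen = new ArrayList<>();
        try {
            StreamTokenizer tokens = new StreamTokenizer(reader);
            int next;
            while ((next = tokens.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (next) {
                    case StreamTokenizer.TT_WORD:
                        seen.add("word: " + tokens.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        seen.add("number: " + tokens.nval);
                        break;
                    default:
                        // otherwise the token type is a character code, such as '<'
                        seen.add("char: " + (char) next);
                        break;
                }
            }
        } catch (IOException e) {
            System.err.println(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        try {
            FileInputStream stream = new FileInputStream("input.data");
            for (String line : describe(new InputStreamReader(stream))) {
                System.out.println(line);
            }
            stream.close();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}
```

Keeping the loop in its own method means the same code can be exercised on a StringReader before input.data exists.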

Part II: An Introduction to Parsing and HTML Files

Our goal for this part is to read a file (or files) that use a simplified version of HTML (hypertext markup language), the language used to define web pages. To see an example of this, go to any webpage (like this one) and select the "View Source" menu option (or similar item depending on your browser). This will show the document used to generate the page. We are going to implement a subset of the HTML language. To learn more try here. We'll take this a piece at a time.

Titles

Most HTML documents have a title. For the one you are reading, the title is "Lab 4: Due Wednesday October 10, 2002". Titles are specified in HTML as follows:

<TITLE>The title goes here</TITLE>

The title will be shown as the title of the display window.

Images

Images are declared as follows:

<img src="imagename" width=x height=y>

In this case the image named "imagename" should be displayed. The width and height specify how big the image is. They are not strictly necessary, but can be used to speed up the display. In HTML, files are found relative to the current file, so if you specify just a filename, the file should be found in the same directory as the source file. The same will hold true in our project.

Links

Links to other documents are specified in two parts as follows:

Here is my <a href="http://www.bowdoin.edu">home page.</a>

It shows up like this: Here is my home page.

Notice in this case I linked to a page somewhere else by using its entire address. The text "home page" will be underlined to indicate that it is a link.

Text

Text is handled in a fairly straightforward way with a few exceptions. For now we will ignore most of those (like headings, using bold, etc.) to keep things relatively straightforward for the project. Later if we do a more realistic implementation we'll revisit this. However, one important feature of text is the paragraph marker. To generate a new paragraph, HTML files use a <p> notation. Optionally they will end with </p>.

Your Assignment

Your job will be to parse a file and store the information in a data structure. Later we will take the information in the data structure and display it in a window as a web browser would. Your goal will be to break the file into discrete chunks, each corresponding to the document's intended title, a piece of text, a pointer to an image, or a pointer to a file (a link). For example, consider a file with the following contents:

<TITLE>My title</TITLE>
This is the story of a boy and his dog.
Here is the boy
<img src="boy.jpg" width=60 height=80>
Our story begins with

This should be parsed into four discrete chunks. The first should be flagged as a title with the text "My title". The second is the text "This is the story of a boy and his dog. Here is the boy". The third is the specification of the image. The fourth is the text "Our story begins with". Your program should output these chunks in a way that is simple to understand.

Some hints about parsing. For this program the key thing will be to pick out two kinds of things: 1) Everything from where you are until the next ">" 2) Everything from where you are until the next "<". Write a class specifically to do all of the reading. The class should take a file name. It should also contain a method to return the next chunk. You will be much happier if you write two additional methods that do 1 and 2 above. Once you do this you might find it useful to specialize the reading somewhat (e.g. once you identify <TITLE> have a routine that grabs text until </TITLE>).
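
One possible shape for such a reading class is sketched below. All of the names (ChunkScanner, untilCloseBracket, untilOpenBracket, readUntil) are hypothetical, and the constructor takes a Reader rather than a file name, an assumption made so the sketch can be tried on a string. It glues tokens back together with single spaces, which is a simplification and not a requirement of the lab.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

// Hypothetical names throughout; a sketch of the two helper methods only.
class ChunkScanner {
    private final StreamTokenizer tokens;

    ChunkScanner(Reader reader) {
        tokens = new StreamTokenizer(reader);
        tokens.ordinaryChar('/'); // report '/' as a token of its own
    }

    // Hint 1: everything from here until the next '>'.
    String untilCloseBracket() {
        return readUntil('>');
    }

    // Hint 2: everything from here until the next '<'.
    String untilOpenBracket() {
        return readUntil('<');
    }

    // Joins tokens with single spaces; the stop character is consumed
    // but not included in the result.
    private String readUntil(int stop) {
        StringBuilder text = new StringBuilder();
        try {
            int next;
            while ((next = tokens.nextToken()) != StreamTokenizer.TT_EOF && next != stop) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                if (next == StreamTokenizer.TT_WORD || next == '"') {
                    text.append(tokens.sval);   // a word or a quoted string
                } else if (next == StreamTokenizer.TT_NUMBER) {
                    text.append(tokens.nval);   // a number, stored as a double
                } else {
                    text.append((char) next);   // a single character such as '='
                }
            }
        } catch (IOException e) {
            System.err.println(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        ChunkScanner in = new ChunkScanner(new StringReader("<TITLE>My title</TITLE>"));
        in.untilOpenBracket();                      // skip ahead to the first tag
        System.out.println(in.untilCloseBracket()); // the tag name
        System.out.println(in.untilOpenBracket());  // the text that follows the tag
    }
}
```

A method that returns the next chunk could then alternate between the two helpers, switching to a specialized routine once it recognizes a tag such as TITLE.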

In this lab the files will not have mistakes. This won't be the case in future labs.

The StreamTokenizer does not simply read things as strings or numbers. It can also read individual characters such as >. This means the "switch" construct from Part I will not work as well. Instead you'll need to check for some single characters too. The important ones for this lab are > and <. You can check for them by testing if (next == '>') and if (next == '<') respectively. There are a few other characters you will have to look for as well, including /, =, and ". Also, the StreamTokenizer will not normally recognize the / as a legitimate token (by default it treats / as the start of a comment). You'll need to issue the command

tokens.ordinaryChar('/');

to let it know that you care about it.
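
To see what the tokenizer actually hands back for a tag, a small experiment along the following lines may help. The class TagTokens and its tokenize method are hypothetical names, and the labels "word", "quoted", and "number" are just this sketch's way of printing what was read.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for experimenting with single-character tokens.
class TagTokens {
    static List<String> tokenize(String html) {
        List<String> seen = new ArrayList<>();
        StreamTokenizer tokens = new StreamTokenizer(new StringReader(html));
        tokens.ordinaryChar('/'); // without this, '/' would not come back as a token
        try {
            int next;
            while ((next = tokens.nextToken()) != StreamTokenizer.TT_EOF) {
                if (next == '<' || next == '>' || next == '/' || next == '=') {
                    seen.add(String.valueOf((char) next));
                } else if (next == '"') {
                    // a quoted string: the text between the quotes is in sval
                    seen.add("quoted " + tokens.sval);
                } else if (next == StreamTokenizer.TT_WORD) {
                    seen.add("word " + tokens.sval);
                } else if (next == StreamTokenizer.TT_NUMBER) {
                    seen.add("number " + tokens.nval);
                }
            }
        } catch (IOException e) {
            System.err.println(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        for (String token : tokenize("<img src=\"boy.jpg\" width=60 height=80>")) {
            System.out.println(token);
        }
    }
}
```

Note that a quoted value such as "boy.jpg" arrives as a single token whose type is the quote character, so the file name does not have to be pieced together by hand.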

A useful String method for this project is equalsIgnoreCase, which compares two strings for equality without regard to case (e.g. "thisString" is equal to "THISstring").
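
For instance, a tag name read by the tokenizer could be compared as in this short sketch, where TagNames and isTag are hypothetical names:

```java
// Hypothetical helper showing equalsIgnoreCase on tag names.
class TagNames {
    // True when the word names the given tag, whatever case it was typed in.
    static boolean isTag(String word, String tagName) {
        return word != null && word.equalsIgnoreCase(tagName);
    }

    public static void main(String[] args) {
        System.out.println(isTag("title", "TITLE")); // same tag, different case
        System.out.println(isTag("img", "TITLE"));   // different tags
    }
}
```

This lets the parser accept <TITLE>, <title>, and <Title> alike.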

Your project will need to know where to look for the file. The simplest way is to put the file to be read into the same folder as your executable code. This will be in the "build" folder within your project. I will put some test code into the course material folder.

One of the important things you'll need to do in this lab is create a data structure to hold the parsed information. I would recommend a data structure that builds on a combination of Associations and Vectors. For example, one "chunk" of the file might be a title and its corresponding text. You can flag the chunk as a title by using the "key" of an association, and you can store the text as a vector of strings in the "value" part of the association. Do not store the tag information (e.g. <TITLE>); it should be implicit in the data structure (for example in the key). To hold multiple chunks, the natural data structure is a Vector. So, you would have a Vector which contains Associations which in turn contain Vectors.
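
The nesting can be pictured with the sketch below. The Association class from your textbook is not part of the standard Java library, so this sketch substitutes java.util.AbstractMap.SimpleEntry as a stand-in key/value pair; that substitution, the class name ParsedPage, and the key strings "title", "image", and "text" are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Vector;

// Hypothetical sketch: SimpleEntry stands in for the textbook's Association.
class ParsedPage {
    // Builds one chunk: the key flags the kind of chunk, the value holds its strings.
    static Map.Entry<String, Vector<String>> chunk(String kind, String... values) {
        Vector<String> contents = new Vector<>();
        for (String value : values) {
            contents.add(value);
        }
        return new SimpleEntry<>(kind, contents);
    }

    // A Vector of key/value pairs, each of which holds a Vector of strings.
    static Vector<Map.Entry<String, Vector<String>>> sample() {
        Vector<Map.Entry<String, Vector<String>>> page = new Vector<>();
        page.add(chunk("title", "My title"));
        page.add(chunk("image", "boy.jpg", "60", "80"));
        page.add(chunk("text", "Our", "story", "begins", "with"));
        return page;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Vector<String>> chunk : sample()) {
            System.out.println(chunk.getKey() + " -> " + chunk.getValue());
        }
    }
}
```

Notice that no tag text such as <TITLE> is stored anywhere; the key alone records what kind of chunk it is.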

Put your project in the drop box with the usual naming conventions. We'll work on displaying things in the next lab.