Writing your own ‘fmt’ program

The original Bell Labs Unix system was based on a then-novel idea: Everything is a text file. Your documents? Text files. Your email? Text files. You might run a separate program to process those plain text files into a more suitable format, such as using nroff to prepare documents for printing on a TeleType printer, or using troff to format documents for printing on a phototypesetter – but in the end, everything was a text file.

Yet there’s an inherent problem when working with text files: If you continue to edit a text file in a plain editor like vi (or before that, ed) you will quickly reach a point where the lines are not the same length. While this isn’t a problem for processing documents using nroff or troff, this can make other files look less “nice” for humans to read.

Reformatting text files

Unix and Unix-like systems provide tools to help reformat a text document to make it look great for the rest of us. One such program is fmt, which first appeared in the Berkeley Software Distribution of Unix (commonly known as “BSD” Unix). GNU has a similar fmt program that does pretty much the same thing, although differs in a few implementation details such as handling nroff files.

fmt had a specific use case: split long lines to be shorter, and glue short lines together to be longer. fmt has some interesting peculiarities due to its history, but the simple description is the program makes text files easier to read by making each line about the same length.

But with a little programming, you can write your own version of fmt. Let’s write a simple version that collects words and fills paragraphs. We’ll call this program fill so it doesn’t get confused with fmt which does things a little differently.

Outlining the program

Before we write this program, let’s start with an outline. We’ll need a main function that will read a list of files. For each file, the program should call a function that collects words and fills paragraphs in the file. We’ll call that other function fill_file.

To process a file, it’s easiest to read one line at a time and process them. We can do that in another function, which we’ll call fill_line.

In the C programming language, the strtok function interprets a string and returns the next “token” based on a set of delimiters. If we use various whitespace characters such as space, tab, and newline as hte delimiters, the tokens are words. This allows us to read lines one a time, and collect words from them.

By keeping track of how much text we’ve printed as output, we can fill paragraphs up to a certain line length. We’ll keep this program as simple as possible and hard-code the target line length.

We can describe the program operation at a high level with this pseudo-code:

main() {
  for every file:
    fill_file(file)

  if no files:
    file_file(stdin)

  exit
}

fill_file(file) {
  for every line in the file:
    fill_line(line)

  return
}

fill_line(line) {
  if line is empty:
    print blank line
    return

  for every word in the line:
    if we can fit the word on the output:
      print word
    else:
      start a new line
      print word

  return
}

The implementation will require more details, but that is the overall structure we’ll use in our program.

The ‘main’ program

With the pseudo-code as a guide, we can construct a program to collect words and fill paragraphs. First, let’s start with the main program function:

int main(int argc, char **argv)
{
    int i;
    FILE *in;

    for (i = 1; i < argc; i++) {
        in = fopen(argv[i], "r");

        if (in) {
            fill_file(in, stdout);
            fclose(in);
        }
        else {
            fputs("cannot open file: ", stderr);
            fputs(argv[i], stderr);
            fputc('\n', stderr);
        }
    }

    if (argc == 1) {
        fill_file(stdin, stdout);
    }

    return 0;
}

This reads a list of files from the command line. The program opens the file, and passes it to the fill_file function for processing. If there are no files on the command line (argc == 1) then the program uses standard input as the input to fill_file.

I decided to write the fill_file function so it takes both an input and output file pointer. This doesn’t really add much complexity in printing the output, but it makes the program more flexible if we decide to add a command line parameter like -o file to save all output directly to a file instead of printing it to standard output.

Reading the file

If we leave the main function to manage opening and closing the input files, we can focus the fill_file function on reading lines one at a time from the input. A simple way to read a line of input is with the fgets function from the standard C library. This reads a line up to a certain length into memory. However, that leaves us stuck with a predetermined size of input lines; lines that are longer will get split, possibly in the middle of a word.

A more flexible approach uses getline, which reads an arbitrary amount of data into memory. If the memory is too small, getline automatically reallocates more room to fit the data. This makes the fill_file function a very brief one:

void fill_file(FILE *in, FILE *out)
{
    char *line = NULL;
    size_t linesize = 0;
    size_t length = 0;

    while (getline(&line, &linesize, in) != -1) {
        length = fill_line(line, length, out);
    }

    fputc('\n', out);           /* trailing newline */

    free(line);
}

Processing lines

The most complicated function in the program processes the input lines, splits them into words, and prints the output. Fortunately, the strtok function makes this somewhat easier: call strtok with the string to read the first word, then call strtok with a “zero” value (called NULL) to read the words that follow on the line. strtok returns NULL when there are no more words to find.

size_t fill_line(char *line, size_t length, FILE *out)
{
    char *word;
    size_t wordlen, linelen;

    if (is_empty(line, DELIM)) {
        if (length > 0) {
            fputc('\n', out);   /* break prev line */
        }

        fputc('\n', out);       /* add blank line */
        return 0;
    }

    linelen = length;
    word = strtok(line, DELIM);

    while (word) {
        wordlen = strlen(word);

        if ((linelen + 1 + wordlen) > MAX_LENGTH) {
            fputc('\n', out);
            linelen = 0;
        }

        if (linelen > 0) {
            fputc(' ', out);
            linelen++;
        }

        fputs(word, out);
        linelen += wordlen;

        word = strtok(NULL, DELIM);     /* get next token */
    }

    return linelen;
}

The key to this function is tracking how much has been written to the output. In the while loop, the function uses strlen to determine how many letters are in the word. If the next input word won’t fit on the current output line, it starts a new line before printing the word. The function also adds a single space between words, except at the start of a new line.

Finding empty lines

The fill_line function relies on a new function to determine if a line is empty. A too-simple approach would use strlen to determine if the string had a zero length, but this does not account for lines that are just one or more spaces or tabs.

To correctly determine if a line is empty, we need to write our own function. The is_empty function reads a string (a line of input) and a list of delimiters (the same delimiters used to split words) and returns a false value if any character in the string is not a delimiter. Only if the function reaches the end of the string will it return a true value; this is possible if the string contains only delimiters, or is a zero-length string.

int is_empty(char *str, const char *whitesp)
{
    char *s;

    s = str;

    while (s[0]) {
        if (strchr(whitesp, s[0]) == NULL) {
            return 0;
        }

        s++;
    }

    return 1;
}

Putting it all together

Those four functions are all we need to create a simple program that collects words and fills paragraphs. As we put it all together, we’ll also need to provide the programming overhead to specify the “include” files for the C programming language library functions, such as string.h for functions that work on strings and stdlib.h to use the functions that allocate and free memory.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LENGTH 66

#define DELIM " \t\n"

int is_empty(char *str, const char *whitesp)
{
  ...
}

size_t fill_line(char *line, size_t length, FILE *out)
{
  ...
}

void fill_file(FILE *in, FILE *out)
{
  ...
}

int main(int argc, char **argv)
{
  ...
}

This also uses #define to create a constant value (called a macro) for the maximum line length, at 66 characters. Because we set this as a predetermined value, you will need to update the program if you want to use a different line length. A more robust version of this program might scan the command line arguments (such as with getopt) to interpret the user’s preferred line length. But for this version of the program, we’ll keep it simple and use a hard-coded value.

Collect words and fill paragraphs

This new program (which we’ll call fill) will collect words from the input and fill paragraphs on the output. To generate a version of the program you can run, save the source code in a file called fill.c and use your system’s C compiler like this:

$ gcc -o fill fill.c

Let’s exercise the program with a test file. Create a plain text file on your system, one with different line lengths, and possibly extra spaces between words and lines. I’ve saved my test file as t.txt:

$ cat t.txt 

One blank line before this one.
This   is   a   test   file
with lines that are different lengths.

This is the start of a new paragraph,
which   should   start   on   a   new   line.
Some more text to cause a line break.


Two   blank   lines   before   this   one.
A-really-long-line-without-spaces-or-tabs-that-goes-beyond-80-columns,-at-84-columns

The fill program will read the input and collect words on each line, then will fill output lines, up to the specified length of 66 characters. The program starts a new paragraph when it finds empty lines. Note that multiple empty lines in the input file also get printed as empty lines in the output:

$ ./fill t.txt 

One blank line before this one. This is a test file with lines
that are different lengths.

This is the start of a new paragraph, which should start on a new
line. Some more text to cause a line break.


Two blank lines before this one.
A-really-long-line-without-spaces-or-tabs-that-goes-beyond-80-columns,-at-84-columns

The last line is quite long, and demonstrates why it’s important for programs to avoid arbitrary limits. If the fill program used fgets to read lines one a time, we risk splitting long lines because of the limits in fgets. In the worst case, we might have a line that consists of multiple words, but gets split in the middle of a word because the line is too long. Using getline reads the entire line, at the expense of a little extra memory. On modern systems with lots of memory, this is usually a safe trade-off.