A little programming goes a long way


Aside from my consulting work, I also teach a few university courses on technical writing. One class is about writing with digital technologies, where students learn to use the tools and technologies they’ll need as professional technical writers. For example, we start with HTML, then move on to CSS, Markdown, GitHub, LibreOffice, Scribus, XML, DITA, and other writing tools.

The final project in this course challenges students to format a longer, nontrivial document of their choice, using one of the digital writing tools or technologies we learned in class. Specifically, the assignment requires at least 2,000 words, with two or more images, and a variety of styles and formatting like headings, paragraphs, bold and italic text, and so on.

I think all the students did really well, but one assignment looked a little short. I wasn’t sure if it was the font size (which was a little small) or if the student just didn’t work with a long enough document. I needed to count the words in this submission, but wc -w alone wouldn’t work: the project was written in an XML-based markup, and all the XML elements and attributes would skew the word count. My quick solution was to write a one-off program to eliminate the XML tags, then run the result through wc -w to count the words.

Rules to find tags

Before I show you the program, let me first share an overview of how it works. The program would need to read an XML document, and scan it to remove any XML tags. A sample XML document, such as one formatted in Simplified Docbook, might look like this:

<article>
  <title>This is the title</title>
  <para>Simplified Docbook is a markup language based on XML.</para>
</article>

The XML elements are the “tags” wrapped in the “less than” and “greater than” symbols. For example, <article> declares this as a Simplified Docbook Article, <title> defines the document’s title, and <para> creates a normal paragraph. This sample XML document doesn’t use element attributes. An attribute is essentially an “argument” to an element; for example, the <p> element in HTML might carry an attribute like class="error" to print an error message.
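To make that concrete, a hypothetical HTML paragraph carrying such an attribute might look like this (this fragment is my own illustration, not part of the sample document):

```html
<p class="error">Cannot open the file.</p>
```

The attribute lives inside the tag, between the < and > symbols, which is why a tag-stripping program removes attributes along with the element names.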

To just print the text from an XML or HTML document, and eliminate the tags, my program needed to scan the document letter-by-letter and keep track of when it was “inside” a tag and when it was not. It’s a simple pair of rules:

  1. When the program encounters <, it knows it’s “inside” a tag
  2. When the program encounters >, it knows it’s “outside” a tag

Scanning a document

This pair of rules is easy to create in code. I used a switch statement in C. The switch statement is basically a kind of jump table: based on the value of a variable (the character it’s examining), the switch jumps to the case statement that matches that value.

For example, let’s say I’ve saved a single character in a variable called ch. I can use switch like this to set a variable called istext to “false” (0) if ch has the value <, and set istext to “true” (1) if ch has the value >. For any other value, we can jump to some other instruction (called default).

    switch (ch) {
    case '<':                         /* starting an HTML tag */
      istext = 0;
      break;
    case '>':                         /* closed an HTML tag */
      istext = 1;
      break;
    default:
      if (istext) {
        fputc(ch, out);
      }
    }

This is the heart of the function that prints only the text from an XML or HTML document, skipping all the tags. All that’s left is to write the function using a while loop to read each character from the input one character at a time:

void
unhtml(FILE *in, FILE *out)
{
  int ch;
  int istext = 1;

  while ((ch = fgetc(in)) != EOF) {
    switch (ch) {
    case '<':                         /* starting an HTML tag */
      istext = 0;
      break;
    case '>':                         /* closed an HTML tag */
      istext = 1;
      break;
    default:
      if (istext) {
        fputc(ch, out);
      }
    }
  }
}

The function has type void because it doesn’t return a value. Not every function in C programming needs to return a value.

Writing the full program

This unhtml function takes two arguments: a file to read as input and another file where it should print its output. I really only needed my program to print the results to the terminal, so I could have written this function to always print to stdout, the default “file” that means the terminal (or whatever is defined as the standard output). But good programmers take the extra step to solve the general case rather than a specific one; specifying the output file as a parameter makes the function easier to reuse in another program that might need to write to a file instead of printing to the terminal.
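As a sketch of that kind of reuse, a caller could wrap unhtml to save the stripped text to a named file instead of the terminal. The wrapper name unhtml_to_file below is my own invention for illustration, not part of the original program:

```c
#include <stdio.h>

/* same character-scanning logic as the unhtml function above */
void
unhtml(FILE *in, FILE *out)
{
  int ch;
  int istext = 1;

  while ((ch = fgetc(in)) != EOF) {
    switch (ch) {
    case '<':                         /* starting a tag */
      istext = 0;
      break;
    case '>':                         /* closed a tag */
      istext = 1;
      break;
    default:
      if (istext) {
        fputc(ch, out);
      }
    }
  }
}

/* hypothetical wrapper: strip tags from `in` and save the text to `path` */
int
unhtml_to_file(FILE *in, const char *path)
{
  FILE *out = fopen(path, "w");

  if (out == NULL) {
    return -1;                        /* could not open the output file */
  }

  unhtml(in, out);
  fclose(out);
  return 0;
}
```

Because unhtml writes to whatever FILE pointer it is given, the wrapper needs no changes to the scanning logic; it only supplies a different output stream.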

With this function, the main program just needs to read the command line, and open any files it finds before using unhtml to print the results. My full program looks like this:

#include <stdio.h>

void
unhtml(FILE *in, FILE *out)
{
  int ch;
  int istext = 1;

  while ((ch = fgetc(in)) != EOF) {
    switch (ch) {
    case '<':                         /* starting an HTML tag */
      istext = 0;
      break;
    case '>':                         /* closed an HTML tag */
      istext = 1;
      break;
    default:
      if (istext) {
        fputc(ch, out);
      }
    }
  }
}

int
main(int argc, char **argv)
{
  FILE *pfile;

  for (int i = 1; i < argc; i++) {
    pfile = fopen(argv[i], "r");

    if (pfile == NULL) {
      fputs("cannot open file: ", stderr);
      fputs(argv[i], stderr);
      fputc('\n', stderr);
    }
    else {
      unhtml(pfile, stdout);
      fclose(pfile);
    }
  }

  if (argc == 1) {
    unhtml(stdin, stdout);
  }

  return 0;
}

If I save this as unhtml.c, I can compile it as a program called unhtmlify and then run it.

$ gcc -o unhtmlify unhtml.c

Counting just the words

Let’s say I wanted to count just the words in the Simplified Docbook example from above, saved in a file called short.docbook. Using unhtml eliminates any XML tags, leaving just the text:

$ ./unhtmlify short.docbook

  This is the title
  Simplified Docbook is a markup language based on XML.

Notice that the output has some blank lines in there; those are the “new lines” after each XML tag. The first blank line is the “line feed” after <article> and the last blank line is the “line feed” after </article>. However, these extra characters don’t matter when I send the results to wc -w:

$ ./unhtmlify short.docbook | wc -w
13

An HTML document might contain more attributes, such as class= on an element, which is useful for styling a web page with CSS. If I use only wc -w on a web page, the “word count” will misidentify each attribute as another “word,” which I don’t want. Using unhtmlify removes the HTML tags, leaving just the plain text. Here’s a comparison using wc -w to count the words in the raw HTML, versus using unhtmlify to remove the HTML tags before counting words:

$ wc -w index.html
358 index.html

$ ./unhtmlify index.html | wc -w
307

Let’s go back to the original use case: the formatting assignment. This was supposed to be at least 2,000 words long, but it felt a little light. Was it just the smaller font size, or did the student format a document that was too short?

The student wrote their project using DITA, an open writing technology that’s based on XML. The project was spread across multiple files. Using wc -w on the source files (all with *.dita as the file extension) gives the raw count, including XML tags and attributes as “words,” which is an overcount:

$ find . -type f -name '*.dita' -exec cat {} \; | wc -w
2571

And using unhtmlify allows me to see the actual word count, without the extra stuff:

$ find . -type f -name '*.dita' -exec cat {} \; | ~/src/unhtmlify | wc -w
1569

And that shows that no, it wasn’t just the font. In fact, the student submitted a formatting project that was only 1,569 words long, well short of the required 2,000 words.
