{"id":5374,"date":"2024-05-21T03:00:00","date_gmt":"2024-05-21T07:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=5374"},"modified":"2024-05-19T12:56:28","modified_gmt":"2024-05-19T16:56:28","slug":"a-little-programming-goes-a-long-way","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=5374","title":{"rendered":"A little programming goes a long way"},"content":{"rendered":"<p>Aside from my consulting work, I also teach a few university courses on technical writing. One class is about <em>writing with digital technologies<\/em> where students learn how to use the tools and technologies they\u2019ll need as professional technical writers. For example, we start with HTML, then move on to CSS, Markdown, GitHub, LibreOffice, Scribus, XML, DITA, and other writing tools.<\/p>\n\n\n\n<p>The final project in this course challenges students to format a longer, nontrivial document of their choice, <em>using one of the digital writing tools or technologies<\/em> we learned in class. Specifically, the assignment requires at least 2,000 words, with two or more images, and a variety of styles and formatting like headings, paragraphs, bold and italic text, and so on.<\/p>\n\n\n\n<p>I think all the students did really well, but one assignment looked a little short. I wasn\u2019t sure if it was the font size (which was a little small) or if the student just didn\u2019t work with a long enough document. 
I needed to count the words in this submission, but <code>wc -w<\/code> alone wouldn\u2019t work: the document was written in an XML-based markup, and <code>wc -w<\/code> would count all the XML elements and attributes as words. My quick solution was to write a one-off program to eliminate the XML tags, and run the result through <code>wc -w<\/code> to count the words.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"rules-to-find-tags\">Rules to find tags<\/h2>\n\n\n\n<p>Before I show you the program, let me first share an overview of how it works. The program needs to read an XML document and scan it to remove any XML tags. A sample XML document, such as one formatted in Simplified Docbook, might look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;article&gt;\n  &lt;title&gt;This is the title&lt;\/title&gt;\n  &lt;para&gt;Simplified Docbook is a markup language based on XML.&lt;\/para&gt;\n&lt;\/article&gt;<\/code><\/pre>\n\n\n\n<p>The XML elements are the \u201ctags\u201d wrapped in the \u201cless than\u201d and \u201cgreater than\u201d symbols. For example, <code>&lt;article&gt;<\/code> declares this as a Simplified Docbook Article, <code>&lt;title&gt;<\/code> defines the document\u2019s title, and <code>&lt;para&gt;<\/code> creates a normal paragraph. This sample XML document doesn\u2019t use element attributes. An attribute is essentially an \u201cargument\u201d to an element; for example, the <code>&lt;p&gt;<\/code> element in HTML might carry an attribute like <code>class=\"error\"<\/code> to mark the paragraph as an error message.<\/p>\n\n\n\n<p>To print just the text from an XML or HTML document, and eliminate the tags, my program needed to scan the document character by character and keep track of when it was \u201cinside\u201d a tag and when it was not. 
It\u2019s a simple pair of rules:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>When the program encounters <code>&lt;<\/code>, it knows it\u2019s \u201cinside\u201d a tag<\/li>\n\n\n\n<li>When the program encounters <code>&gt;<\/code>, it knows it\u2019s \u201coutside\u201d a tag<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"scanning-a-document\">Scanning a document<\/h2>\n\n\n\n<p>This pair of rules is easy to create in code. I used a <code>switch<\/code> statement in C. The <code>switch<\/code> statement is basically a kind of <em>jump table<\/em>: based on the value of a variable (the character it\u2019s examining) the <code>switch<\/code> jumps to a <code>case<\/code> statement that matches the value.<\/p>\n\n\n\n<p>For example, let\u2019s say I\u2019ve saved a single character in a variable called <code>ch<\/code>. I can use <code>switch<\/code> like this to set a variable called <code>istext<\/code> to \u201cfalse\u201d (0) if <code>ch<\/code> has the value <code>&lt;<\/code>, and set <code>istext<\/code> to \u201ctrue\u201d (1) if <code>ch<\/code> has the value <code>&gt;<\/code>. For any other value, the <code>switch<\/code> jumps to the <code>default<\/code> case, which prints the character only if <code>istext<\/code> is true.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>    switch (ch) {\n    case '&lt;':                         \/* starting an HTML tag *\/\n      istext = 0;\n      break;\n    case '&gt;':                         \/* closed an HTML tag *\/\n      istext = 1;\n      break;\n    default:\n      if (istext) {\n        fputc(ch, out);\n      }\n    }<\/code><\/pre>\n\n\n\n<p>This is the heart of the function that prints only the text from an XML or HTML document, skipping all the tags. 
All that\u2019s left is to write the function using a <code>while<\/code> loop to read the input one character at a time:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>void\nunhtml(FILE *in, FILE *out)\n{\n  int ch;\n  int istext = 1;\n\n  while ((ch = fgetc(in)) != EOF) {\n    switch (ch) {\n    case '&lt;':                         \/* starting an HTML tag *\/\n      istext = 0;\n      break;\n    case '&gt;':                         \/* closed an HTML tag *\/\n      istext = 1;\n      break;\n    default:\n      if (istext) {\n        fputc(ch, out);\n      }\n    }\n  }\n}<\/code><\/pre>\n\n\n\n<p>The function has type <code>void<\/code> because it doesn\u2019t return a value. Not every function in C programming needs to return a value.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"writing-the-full-program\">Writing the full program<\/h2>\n\n\n\n<p>This <code>unhtml<\/code> function takes two arguments: a file to read as <em>input<\/em> and another file where it should print its <em>output<\/em>. I really only needed my program to print the results to the terminal, so I could have written this function to always print to <code>stdout<\/code>, which is the default \u201cfile\u201d that means \u201cthe terminal\u201d or whatever is defined as the <em>standard output<\/em>. But good programmers should always take that extra step to solve a <em>general case<\/em> rather than a specific one, so specifying the output file in the function makes this easier to re-use in another program that might need to write to a file instead of printing to the terminal.<\/p>\n\n\n\n<p>With this function, the main program just needs to read the command line and open each file it finds before using <code>unhtml<\/code> to print the results, or read from standard input if no files are named. 
My full program looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n\nvoid\nunhtml(FILE *in, FILE *out)\n{\n  int ch;\n  int istext = 1;\n\n  while ((ch = fgetc(in)) != EOF) {\n    switch (ch) {\n    case '&lt;':                         \/* starting an HTML tag *\/\n      istext = 0;\n      break;\n    case '&gt;':                         \/* closed an HTML tag *\/\n      istext = 1;\n      break;\n    default:\n      if (istext) {\n        fputc(ch, out);\n      }\n    }\n  }\n}\n\nint\nmain(int argc, char **argv)\n{\n  FILE *pfile;\n\n  \/* process each file named on the command line *\/\n  for (int i = 1; i &lt; argc; i++) {\n    pfile = fopen(argv&#91;i], \"r\");\n\n    if (pfile == NULL) {\n      fputs(\"cannot open file: \", stderr);\n      fputs(argv&#91;i], stderr);\n      fputc('\\n', stderr);\n    }\n    else {\n      unhtml(pfile, stdout);\n      fclose(pfile);\n    }\n  }\n\n  \/* no files given: read from standard input *\/\n  if (argc == 1) {\n    unhtml(stdin, stdout);\n  }\n\n  return 0;\n}<\/code><\/pre>\n\n\n\n<p>If I save this as <code>unhtml.c<\/code>, I can compile it into a program called <code>unhtmlify<\/code> and then run it.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ gcc -o unhtmlify unhtml.c<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"counting-just-the-words\">Counting just the words<\/h2>\n\n\n\n<p>Let\u2019s say I wanted to count just the words in the Simplified Docbook example from above, saved in a file called <code>short.docbook<\/code>. Using <code>unhtmlify<\/code> eliminates any XML tags, leaving just the text:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ .\/unhtmlify short.docbook\n\n  This is the title\n  Simplified Docbook is a markup language based on XML.\n<\/code><\/pre>\n\n\n\n<p>Notice that the output has some blank lines in there; those are the \u201cnew lines\u201d after each XML tag. The first blank line is the \u201cline feed\u201d after <code>&lt;article&gt;<\/code> and the last blank line is the \u201cline feed\u201d after <code>&lt;\/article&gt;<\/code>. 
However, these extra characters don\u2019t matter when I send the results to <code>wc -w<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ .\/unhtmlify short.docbook | wc -w\n13<\/code><\/pre>\n\n\n\n<p>An HTML document might contain more attributes, such as <code>class=<\/code> on an element &#8211; this is useful for styling a web page using CSS. If I use only <code>wc -w<\/code> on a web page, the \u201cword count\u201d will misidentify each attribute as another \u201cword,\u201d which I don\u2019t want. Using <code>unhtmlify<\/code> removes the HTML tags, leaving just the plain text. Here\u2019s a comparison using <code>wc -w<\/code> to count the words in the raw HTML, versus using <code>unhtmlify<\/code> to remove the HTML tags before counting words:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -w index.html\n358 index.html\n\n$ .\/unhtmlify index.html | wc -w\n307<\/code><\/pre>\n\n\n\n<p>Let\u2019s go back to the original use case: the formatting assignment. This was supposed to be at least 2,000 words long, but it felt a little light. Was it just the smaller font size, or did the student format a document that was too short?<\/p>\n\n\n\n<p>The student wrote their project using DITA, an open writing technology that\u2019s based on XML. The project was spread across multiple files. Using <code>wc -w<\/code> on the source files (all with <code>*.dita<\/code> as the file extension) gives the raw count, including XML tags and attributes as \u201cwords,\u201d which is an overcount:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find . -type f -name '*.dita' -exec cat {} \\; | wc -w\n2571<\/code><\/pre>\n\n\n\n<p>And using <code>unhtmlify<\/code> allows me to see the actual word count, without the extra stuff:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find . -type f -name '*.dita' -exec cat {} \\; | ~\/src\/unhtmlify | wc -w\n1569<\/code><\/pre>\n\n\n\n<p>And that shows that no, it wasn\u2019t just the font. 
In fact, the student submitted a formatting project that was only 1,569 words long, which was a little under the asked-for 2,000 words.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It helps to know some programming to solve these little problems.<\/p>\n","protected":false},"author":33,"featured_media":3514,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[83,150],"tags":[409,162,152,408],"class_list":["post-5374","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-problem-solving","category-programming","tag-html","tag-problem-determination","tag-programming","tag-xml"],"modified_by":"David Both","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5374","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5374"}],"version-history":[{"count":2,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5374\/revisions"}],"predecessor-version":[{"id":5378,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5374\/revisions\/5378"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/3514"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5374"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5374"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?r
est_route=%2Fwp%2Fv2%2Ftags&post=5374"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}