{"id":4958,"date":"2024-04-20T03:00:00","date_gmt":"2024-04-20T07:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=4958"},"modified":"2024-04-14T14:25:09","modified_gmt":"2024-04-14T18:25:09","slug":"writing-your-own-fmt-program","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=4958","title":{"rendered":"Writing your own \u2018fmt\u2019 program"},"content":{"rendered":"<p>The original Bell Labs Unix system was based on a then-novel idea: <em>Everything is a text file<\/em>. Your documents? Text files. Your email? Text files. You might run a <em>separate program<\/em> to process those plain text files into a more suitable format, such as using nroff to prepare documents for printing on a TeleType printer, or using troff to format documents for printing on a <em>phototypesetter<\/em> &#8211; but in the end, everything was a text file.<\/p>\n\n\n\n<p>Yet there\u2019s an inherent problem when working with text files: If you continue to edit a text file in a plain editor like <code>vi<\/code> (or before that, <code>ed<\/code>) you will quickly reach a point where the lines are not the same length. 
While this isn\u2019t a problem for processing documents using nroff or troff, this can make other files look less \u201cnice\u201d for humans to read.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"reformatting-text-files\">Reformatting text files<\/h2>\n\n\n\n<p>Unix and Unix-like systems provide tools to help <em>reformat<\/em> a text document to make it look great for the rest of us. <a href=\"https:\/\/technicallywewrite.com\/2024\/04\/05\/unixfmt\" data-type=\"link\" data-id=\"https:\/\/technicallywewrite.com\/2024\/04\/05\/unixfmt\">One such program is <code>fmt<\/code><\/a>, which first appeared in the Berkeley Software Distribution of Unix (commonly known as \u201cBSD\u201d Unix). GNU has a similar <code>fmt<\/code> program that does pretty much the same thing, although it differs in a few implementation details, such as how it handles nroff files.<\/p>\n\n\n\n<p><code>fmt<\/code> has a specific use case: split long lines to be shorter, and glue short lines together to be longer. <code>fmt<\/code> has some interesting peculiarities due to its history, but the simple description is that the program makes text files easier to read by making each line about the same length.<\/p>\n\n\n\n<p>But with a little programming, you can write your own version of <code>fmt<\/code>. Let\u2019s write a simple version that <em>collects words<\/em> and <em>fills paragraphs<\/em>. We\u2019ll call this program <code>fill<\/code> so it doesn\u2019t get confused with <code>fmt<\/code>, which does things a <em>little<\/em> differently.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"outlining-the-program\">Outlining the program<\/h2>\n\n\n\n<p>Before we write this program, let\u2019s start with an outline. We\u2019ll need a <code>main<\/code> function that will read a list of files. For each file, the program should call a function that <em>collects words<\/em> and <em>fills paragraphs<\/em> in the file. 
We\u2019ll call that other function <code>fill_file<\/code>.<\/p>\n\n\n\n<p>To process a file, it\u2019s easiest to read one line at a time and process each line as it arrives. We can do that in another function, which we\u2019ll call <code>fill_line<\/code>.<\/p>\n\n\n\n<p>In the C programming language, the <code>strtok<\/code> function scans a string and returns the next \u201ctoken\u201d based on a set of delimiters. If we use various whitespace characters such as space, tab, and newline as the delimiters, the tokens are words. This allows us to read lines one at a time, and collect words from them.<\/p>\n\n\n\n<p>By keeping track of how much text we\u2019ve printed as output, we can fill paragraphs up to a certain line length. We\u2019ll keep this program as simple as possible and hard-code the target line length.<\/p>\n\n\n\n<p>We can describe the program operation at a high level with this pseudo-code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>main() {\n  for every file:\n    fill_file(file)\n\n  if no files:\n    fill_file(stdin)\n\n  exit\n}\n\nfill_file(file) {\n  for every line in the file:\n    fill_line(line)\n\n  return\n}\n\nfill_line(line) {\n  if line is empty:\n    print blank line\n    return\n\n  for every word in the line:\n    if we can fit the word on the output:\n      print word\n    else:\n      start a new line\n      print word\n\n  return\n}<\/code><\/pre>\n\n\n\n<p>The implementation will require more details, but that is the overall structure we\u2019ll use in our program.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-main-program\">The \u2018main\u2019 program<\/h2>\n\n\n\n<p>With the pseudo-code as a guide, we can construct a program to collect words and fill paragraphs. 
First, let\u2019s start with the <code>main<\/code> program function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>int main(int argc, char **argv)\n{\n    int i;\n    FILE *in;\n\n    for (i = 1; i &lt; argc; i++) {\n        in = fopen(argv&#91;i], \"r\");\n\n        if (in) {\n            fill_file(in, stdout);\n            fclose(in);\n        }\n        else {\n            fputs(\"cannot open file: \", stderr);\n            fputs(argv&#91;i], stderr);\n            fputc('\\n', stderr);\n        }\n    }\n\n    if (argc == 1) {\n        fill_file(stdin, stdout);\n    }\n\n    return 0;\n}<\/code><\/pre>\n\n\n\n<p>This reads a list of files from the command line. The program opens the file, and passes it to the <code>fill_file<\/code> function for processing. If there are no files on the command line (<code>argc == 1<\/code>) then the program uses <em>standard input<\/em> as the input to <code>fill_file<\/code>.<\/p>\n\n\n\n<p>I decided to write the <code>fill_file<\/code> function so it takes both an <em>input<\/em> and <em>output<\/em> file pointer. This doesn\u2019t really add much complexity in printing the output, but it makes the program more flexible if we decide to add a command line parameter like <code>-o file<\/code> to save all output directly to a file instead of printing it to <em>standard output<\/em>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"reading-the-file\">Reading the file<\/h2>\n\n\n\n<p>If we leave the <code>main<\/code> function to manage opening and closing the input files, we can focus the <code>fill_file<\/code> function on reading lines one at a time from the input. A simple way to read a line of input is with the <code>fgets<\/code> function from the standard C library. This reads a line <em>up to a certain length<\/em> into memory. 
However, that leaves us stuck with a predetermined size of input lines; lines that are longer will get split, possibly in the middle of a word.<\/p>\n\n\n\n<p>A more flexible approach uses <code>getline<\/code>, which reads an arbitrary amount of data into memory. If the memory is too small, <code>getline<\/code> automatically reallocates more room to fit the data. This makes the <code>fill_file<\/code> function a very brief one:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>void fill_file(FILE *in, FILE *out)\n{\n    char *line = NULL;\n    size_t linesize = 0;\n    size_t length = 0;\n\n    while (getline(&amp;line, &amp;linesize, in) != -1) {\n        length = fill_line(line, length, out);\n    }\n\n    fputc('\\n', out);           \/* trailing newline *\/\n\n    free(line);\n}<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"processing-lines\">Processing lines<\/h2>\n\n\n\n<p>The most complicated function in the program processes the input lines, splits them into words, and prints the output. Fortunately, the <code>strtok<\/code> function makes this somewhat easier: call <code>strtok<\/code> with the string to read the first word, then call <code>strtok<\/code> with a \u201czero\u201d value (called <code>NULL<\/code>) to read the words that follow on the line. 
<code>strtok<\/code> returns <code>NULL<\/code> when there are no more words to find.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>size_t fill_line(char *line, size_t length, FILE *out)\n{\n    char *word;\n    size_t wordlen, linelen;\n\n    if (is_empty(line, DELIM)) {\n        if (length &gt; 0) {\n            fputc('\\n', out);   \/* break prev line *\/\n        }\n\n        fputc('\\n', out);       \/* add blank line *\/\n        return 0;\n    }\n\n    linelen = length;\n    word = strtok(line, DELIM);\n\n    while (word) {\n        wordlen = strlen(word);\n\n        \/* wrap only if the output line already has text on it *\/\n        if (linelen &gt; 0 &amp;&amp; (linelen + 1 + wordlen) &gt; MAX_LENGTH) {\n            fputc('\\n', out);\n            linelen = 0;\n        }\n\n        if (linelen &gt; 0) {\n            fputc(' ', out);\n            linelen++;\n        }\n\n        fputs(word, out);\n        linelen += wordlen;\n\n        word = strtok(NULL, DELIM);     \/* get next token *\/\n    }\n\n    return linelen;\n}<\/code><\/pre>\n\n\n\n<p>The key to this function is tracking how much has been written to the output. In the <code>while<\/code> loop, the function uses <code>strlen<\/code> to determine how many characters are in the word. If the next input word won\u2019t fit on the current output line, it starts a new line before printing the word. An output line that is still empty always takes the word, so a word longer than the limit does not produce a stray blank line. The function also adds a single space between words, except at the start of a new line.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"finding-empty-lines\">Finding empty lines<\/h2>\n\n\n\n<p>The <code>fill_line<\/code> function relies on a new function to determine if a line is empty. A too-simple approach would use <code>strlen<\/code> to determine if the string had a zero length, but this does not account for the trailing newline that <code>getline<\/code> keeps, or for lines that are just one or more spaces or tabs.<\/p>\n\n\n\n<p>To correctly determine if a line is empty, we need to write our own function. 
The <code>is_empty<\/code> function reads a string (a line of input) and a list of delimiters (the same delimiters used to split words) and returns a false value if any character in the string is <em>not<\/em> a delimiter. Only if the function reaches the end of the string will it return a true value; this is possible if the string contains only delimiters, or is a zero-length string.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>int is_empty(char *str, const char *whitesp)\n{\n    char *s;\n\n    s = str;\n\n    while (s&#91;0]) {\n        if (strchr(whitesp, s&#91;0]) == NULL) {\n            return 0;\n        }\n\n        s++;\n    }\n\n    return 1;\n}<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"putting-it-all-together\">Putting it all together<\/h2>\n\n\n\n<p>Those four functions are all we need to create a simple program that collects words and fills paragraphs. As we put it all together, we\u2019ll also need to provide the programming overhead to specify the \u201cinclude\u201d files for the C programming language library functions, such as <code>string.h<\/code> for functions that work on strings and <code>stdlib.h<\/code> to use the functions that allocate and free memory.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;stdlib.h&gt;\n#include &lt;string.h&gt;\n\n#define MAX_LENGTH 66\n\n#define DELIM \" \\t\\n\"\n\nint is_empty(char *str, const char *whitesp)\n{\n  ...\n}\n\nsize_t fill_line(char *line, size_t length, FILE *out)\n{\n  ...\n}\n\nvoid fill_file(FILE *in, FILE *out)\n{\n  ...\n}\n\nint main(int argc, char **argv)\n{\n  ...\n}<\/code><\/pre>\n\n\n\n<p>This also uses <code>#define<\/code> to create a constant value (called a <em>macro<\/em>) for the maximum line length, at 66 characters. Because we set this as a predetermined value, you will need to update the program if you want to use a different line length. 
A more robust version of this program might scan the command line arguments (such as with <code>getopt<\/code>) to interpret the user\u2019s preferred line length. But for this version of the program, we\u2019ll keep it simple and use a hard-coded value.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"collect-words-and-fill-paragraphs\">Collect words and fill paragraphs<\/h2>\n\n\n\n<p>This new program (which we\u2019ll call <code>fill<\/code>) will collect words from the input and fill paragraphs on the output. To generate a version of the program you can run, save the source code in a file called <code>fill.c<\/code> and use your system\u2019s C compiler like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ gcc -o fill fill.c<\/code><\/pre>\n\n\n\n<p>Let\u2019s exercise the program with a test file. Create a plain text file on your system, one with different line lengths, and possibly extra spaces between words and lines. I\u2019ve saved my test file as <code>t.txt<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cat t.txt \n\nOne blank line before this one.\nThis   is   a   test   file\nwith lines that are different lengths.\n\nThis is the start of a new paragraph,\nwhich   should   start   on   a   new   line.\nSome more text to cause a line break.\n\n\nTwo   blank   lines   before   this   one.\nA-really-long-line-without-spaces-or-tabs-that-goes-beyond-80-columns,-at-84-columns<\/code><\/pre>\n\n\n\n<p>The <code>fill<\/code> program will read the input and collect words on each line, then will fill output lines, up to the specified length of 66 characters. The program starts a new paragraph when it finds empty lines. Note that multiple empty lines in the input file also get printed as empty lines in the output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ .\/fill t.txt \n\nOne blank line before this one. This is a test file with lines\nthat are different lengths.\n\nThis is the start of a new paragraph, which should start on a new\nline. 
Some more text to cause a line break.\n\n\nTwo blank lines before this one.\nA-really-long-line-without-spaces-or-tabs-that-goes-beyond-80-columns,-at-84-columns<\/code><\/pre>\n\n\n\n<p>The last line is quite long, and demonstrates why it\u2019s important for programs to avoid arbitrary limits. If the <code>fill<\/code> program used <code>fgets<\/code> to read lines one at a time, it would risk splitting long lines because of the fixed buffer size that <code>fgets<\/code> requires. In the worst case, we might have a line that consists of multiple words, but gets split in the middle of a word because the line is too long. Using <code>getline<\/code> reads the entire line, at the expense of a little extra memory. On modern systems with lots of memory, this is usually a safe trade-off.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn a little programming to write tools that do things the way you want to do them.<\/p>\n","protected":false},"author":33,"featured_media":3683,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[106,150],"tags":[104,91,152],"class_list":["post-4958","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-history","category-programming","tag-command-line","tag-linux","tag-programming"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/4958","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4958"}],"version-history":[{"count":3,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/4958\/revisions"}],"predecessor-version":[{"id":4969,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/4958\/revisions\/4969"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/3683"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4958"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4958"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4958"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}