{"id":5534,"date":"2024-06-03T03:00:00","date_gmt":"2024-06-03T07:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=5534"},"modified":"2024-05-29T10:22:59","modified_gmt":"2024-05-29T14:22:59","slug":"using-awk-to-filter-text","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=5534","title":{"rendered":"Using \u2018awk\u2019 to filter text"},"content":{"rendered":"<p>I use Markdown to write drafts of technical articles. I find writing in Markdown makes it easy for me to stay focused on <em>what I\u2019m writing<\/em> rather than <em>what it will look like<\/em>.<\/p>\n\n\n\n<p>When I\u2019m writing an article, I also like to keep track of my word count. There\u2019s no magic \u201cword count\u201d for technical articles &#8211; they can be as long or as short as needed to cover the material &#8211; but I still like to keep most of my technical articles between 800 and 1,000 words. Articles that provide a \u201cdeep dive\u201d on a highly technical topic (such as programming) might be much longer, up to 2,000 words.<\/p>\n\n\n\n<p>I don\u2019t want to include the code in my word count; every bracket, parenthesis, \u2026 and generally <em>everything that\u2019s surrounded by at least one space<\/em> will be included in the \u201cword count.\u201d Yet the code is part of the Markdown file, so using the <code>wc<\/code> tool to count words will include all of my sample code. 
For example, this simple \u201chello world\u201d program has about 30 \u201cwords\u201d in it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n\nint main()\n{\n    int i;\n\n    for (i = 1; i &lt;= 10; i = i + 1) {\n        puts(\"Hello world\");\n    }\n\n    return 0;\n}<\/code><\/pre>\n\n\n\n<p>But how do you count words in an article when that article has lots of code samples? All it takes is knowing a little about using awk to filter text.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-basics-of-awk-scripts\">The basics of awk scripts<\/h2>\n\n\n\n<p>Awk is a simple yet powerful scripting language developed by Al Aho, Peter Weinberger, and Brian Kernighan of Bell Labs. In fact, the command name <code>awk<\/code> was formed from the first letter of each of their last names.<\/p>\n\n\n\n<p>Awk is perhaps best explained as a scripting language that takes actions based on matching conditions. Awk statements have the general form of:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>condition { actions }<\/code><\/pre>\n\n\n\n<p>In awk, a condition can be a regular expression inside slashes, such as <code>\/^a\/<\/code> to match any line that starts with the letter \u2018a\u2019, or a relational expression like <code>i==4<\/code> for when the variable <code>i<\/code> has the value 4, or a special pattern like <code>BEGIN<\/code>, which matches before awk reads any input, or <code>END<\/code>, which matches after awk has read all input. You can combine these basics to form more complex conditions.<\/p>\n\n\n\n<p>To make processing text files easier for you, awk also splits lines into <em>tokens<\/em> or <em>fields<\/em> that you can access as <code>$1<\/code>, <code>$2<\/code>, and so on. The field value <code>$0<\/code> indicates the entire line. 
Awk also provides variables that you can access from within scripts, such as <code>NR<\/code> as the number of \u201crecords\u201d or lines processed so far, or <code>NF<\/code> as the number of fields on the current line.<\/p>\n\n\n\n<p>Actions or <em>expressions<\/em> can be any series of awk instructions. Awk instructions are very similar to C programming instructions: if you know a little C, you can quickly learn awk. For example, let\u2019s say I wanted to set a variable called <code>aline<\/code> to 1 whenever we encounter a line that starts with the letter <code>a<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/^a\/ { aline = 1; }<\/code><\/pre>\n\n\n\n<p>The extra spaces within the curly braces aren\u2019t needed; I included them only to make this easier to read. You could also write that awk statement like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/^a\/ {aline=1;}<\/code><\/pre>\n\n\n\n<p>Or maybe I want to just increment the <code>aline<\/code> variable, such as to count the number of lines that start with the letter <code>a<\/code>. This is easy to do, as well. In awk, all variables start with a zero value, so I can write this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/^a\/ {aline=aline+1}<\/code><\/pre>\n\n\n\n<p>You can start to see how awk operates by recognizing a pattern (such as <code>\/^a\/<\/code> to match a regular expression) and then taking an action (like adding 1 to the <code>aline<\/code> variable). This simple pattern-action format makes awk both simple and flexible.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"using-awk-to-recognize-code-blocks\">Using awk to recognize code blocks<\/h2>\n\n\n\n<p>Markdown is a lightweight document markup system that uses plain text files as input. You usually convert Markdown into some other format, such as HTML. 
And that\u2019s exactly how I use Markdown to write my article drafts; I\u2019ll write a draft in Markdown, then convert it into an HTML document using the <code>pandoc<\/code> command.<\/p>\n\n\n\n<p>To insert a block of code, such as some sample code in a programming article, you surround the sample code with a \u201ccode fence\u201d of three \u201cbackticks.\u201d These \u201cbackticks\u201d make it easy to match the start and end of sample code using awk. In other words, I want awk to take action whenever it finds three \u201cbackticks\u201d in a Markdown file. I\u2019ll start by incrementing a variable called <code>text<\/code> every time we encounter the three \u201cbackticks\u201d delimiter:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/```\/ { text=text+1; }<\/code><\/pre>\n\n\n\n<p>Since we only need to add 1 to the <code>text<\/code> variable, we can instead use the <code>++<\/code> notation, like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/```\/ { text++; }<\/code><\/pre>\n\n\n\n<p>The first time we find three \u201cbackticks\u201d in a Markdown file, that marks the end of regular article text and the beginning of sample code. The sample code continues until the next series of three \u201cbackticks.\u201d This means that the variable <code>text<\/code> will always have an even value (0, 2, 4, 6, \u2026) for regular body text within a Markdown file, and an odd value (1, 3, 5, 7, \u2026) for sample code.<\/p>\n\n\n\n<p>An easy way to determine if a value is even or odd is to use <code>%<\/code> to calculate the <em>modulo<\/em>, or the remainder after dividing by another number. 
For example, <code>5%2<\/code> is \u201c5 divided by 2,\u201d or \u201c2 with a remainder of 1,\u201d so a modulo of 1.<\/p>\n\n\n\n<p>We can use this to only print lines from a Markdown file that are regular body text, when <code>text<\/code> has an even value:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>(text%2)==0 {print;}<\/code><\/pre>\n\n\n\n<p>In this case, the <em>pattern<\/em> is <code>(text%2)==0<\/code> which calculates the modulo of <code>text<\/code> with 2, to determine if the result is an even number (modulo is zero). If it is, then awk prints the line.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"counting-words-in-an-article\">Counting words in an article<\/h2>\n\n\n\n<p>Let\u2019s say I have this sample Markdown file called <code>hello.md<\/code>, which contains headings, paragraph text, and sample code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Hello world\n\nHere is how you can write your first \"Hello world\" program in C:\n\n```\n#include &lt;stdio.h&gt;\n\nint main()\n{\n  puts(\"Hello world\");\n  return 0;\n}\n```\n\nAnd now you're ready to learn programming!<\/code><\/pre>\n\n\n\n<p>This file contains 35 words, according to the <code>wc<\/code> command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -w hello.md\n35 hello.md<\/code><\/pre>\n\n\n\n<p>But this includes the sample code, which I don\u2019t want to include in the final word count. 
We can use this two-line awk script called <code>text.awk<\/code> to match lines with three \u201cbackticks\u201d and only print the parts of the article that are regular text:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/```\/ {text++;}\n(text%2)==0 {print;}<\/code><\/pre>\n\n\n\n<p>Now we can run the <code>awk<\/code> command with the <code>-f<\/code> option to specify the script file, and filter the Markdown file before passing the results to <code>wc<\/code> to count the words:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ awk -f text.awk hello.md | wc -w\n24<\/code><\/pre>\n\n\n\n<p>The count of 24 includes one \u201cword\u201d from the sample code: the closing code fence. When awk reaches that line, the first statement raises <code>text<\/code> back to an even value before the second statement tests it, so awk prints the closing \u201cbackticks\u201d line. This adds one extra word for each code block.<\/p>\n\n\n\n<p>For very short awk scripts like this, you can also provide the entire awk script as a single command line argument, usually enclosed in single quotes. When you use this method to run an awk script, you list the conditions and actions in pairs, such as <em>condition-action<\/em> <em>condition-action<\/em> <em>condition-action<\/em> <em>condition-action<\/em> and so on. This means we can rewrite the command line like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ awk '\/```\/ {text++;} (text%2)==0 {print;}' hello.md | wc -w\n24<\/code><\/pre>\n\n\n\n<p>In my real-world example, I had written a draft article in Markdown about programming, called <code>copyfile.md<\/code>. 
According to the <code>wc<\/code> command, this file had over 2,200 words, including source code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -w copyfile.md\n2274 copyfile.md<\/code><\/pre>\n\n\n\n<p>Using the short awk command to filter out the sample code, and running the result through the <code>wc<\/code> command to count words, tells me the file has about 1,800 words of actual text:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ awk '\/```\/ {text++;} (text%2)==0 {print;}' copyfile.md | wc -w\n1884<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Here\u2019s how to use awk to strip out sample code from a Markdown file.<\/p>\n","protected":false},"author":33,"featured_media":4654,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[100,5],"tags":[104,91],"class_list":["post-5534","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","tag-command-line","tag-linux"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5534","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5534"}],"version-history":[{"count":1,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5534\/revisions"}],"predecessor-version":[{"id":5535,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5534\/revisions\/5535"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/4654"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5534"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5534"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5534"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}