Both.org

News, Opinion, Tutorials, and Community for Linux Users and SysAdmins


Using ‘awk’ to filter text

Here’s how to use awk to strip out sample code from a Markdown file.
Jim Hall June 3, 2024 7 minutes read

I use Markdown to write drafts of technical articles. I find writing in Markdown makes it easy for me to stay focused on what I’m writing rather than what it will look like.

When I’m writing an article, I also like to keep track of my word count. There’s no magic “word count” for technical articles – they can be as long or as short as needed to cover the material – but I still like to keep most of my technical articles between 800 and 1,000 words. Articles that provide a “deep dive” on a highly technical topic (such as programming) might be much longer, up to 2,000 words.

I don’t want to include the code in my word count. The wc tool treats anything surrounded by whitespace as a “word,” including every bracket and parenthesis, and because the code is part of the Markdown file, wc counts all of my sample code too. For example, this simple “hello world” program has 26 “words” in it:

#include <stdio.h>

int main()
{
    int i;

    for (i = 1; i <= 10; i = i + 1) {
        puts("Hello world");
    }

    return 0;
}

But how do you count words in an article when that article has lots of code samples? All it takes is knowing a little about using awk to filter text.

The basics of awk scripts

Awk is a simple yet powerful scripting language developed by Al Aho, Peter Weinberger, and Brian Kernighan of Bell Labs. In fact, the command name awk was formed from the first letter of each of their last names.

Awk is perhaps best explained as a scripting language that takes actions based on matching conditions. Awk statements have the general form of:

condition { actions }

In awk, a condition can be a regular expression inside slashes, such as /^a/ to match any line that starts with the letter ‘a’, or a relational expression like i==4 for when the variable i has the value 4, or a special pattern like BEGIN, which matches once before awk reads any input, or END, which matches once after awk has read all of the input. You can form more complex conditions with those basics.
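
As a quick illustration of these kinds of conditions, here is a minimal sketch that puts them in one script. The function name show_conditions and the fruit names used as input are made up for this example. The relational expression tests NR, a built-in variable that holds the number of the current line.

```shell
# Several kinds of conditions in one awk script.
# show_conditions is a hypothetical helper name used only for this sketch.
show_conditions() {
    awk '
        BEGIN   { print "start" }          # runs once, before any input is read
        /^a/    { print "regex: " $0 }     # lines that start with the letter a
        NR == 4 { print "line 4: " $0 }    # relational expression on the line number
        END     { print "done" }           # runs once, after all input is read
    '
}

# Made-up sample input: four lines, one fruit per line.
printf '%s\n' apple banana avocado cherry | show_conditions
```

With this input, the regular expression matches the apple and avocado lines, and the relational expression matches only the fourth line, cherry.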

To make processing text files easier for you, awk also splits lines into tokens or fields that you can access as $1, $2, and so on. The field value $0 indicates the entire line. Awk also provides variables that you can access from within scripts, such as NR as the number of “records” or lines processed so far, or NF as the number of fields on the current line.
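
To see the fields and built-in variables in action, this small sketch prints NR, NF, $1, and $0 for each input line. The function name show_fields and the two input lines are assumptions made for the example.

```shell
# Print the line number, the number of fields, the first field, and the whole line.
# show_fields is a hypothetical helper name used only for this sketch.
show_fields() {
    awk '{ print NR ": " NF " fields, first is " $1 ", line is [" $0 "]" }'
}

# Made-up sample input: two short lines.
printf '%s\n' 'hello awk world' 'one two' | show_fields
# 1: 3 fields, first is hello, line is [hello awk world]
# 2: 2 fields, first is one, line is [one two]
```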

Actions or expressions can be any series of awk instructions. Awk instructions are very similar to C programming instructions: if you know a little C, you can quickly learn awk. For example, let’s say I wanted to set a variable called aline to 1 whenever we encounter a line that starts with the letter a:

/^a/ { aline = 1; }

The extra spaces within the curly braces aren’t needed; I included them only to make this easier to read. You could also write that awk statement like this:

/^a/ {aline=1;}

Or maybe I want to just increment the aline variable, such as to count the number of lines that start with the letter a. This is easy to do, as well. In awk, a variable that has not been set yet acts as zero when you use it as a number, so I can write this:

/^a/ {aline=aline+1}

You can start to see how awk operates by recognizing a pattern (such as /^a/ to match a regular expression) and then taking an action (like adding 1 to the aline variable). This simple pattern-action format makes awk both simple and flexible.
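
To turn the counting statement into something you can run, one option is to add an END rule that prints the total. This is a minimal sketch; the function name count_a_lines and the sample input are made up for illustration.

```shell
# Count the lines that start with the letter a and print the total at the end.
# count_a_lines is a hypothetical helper name used only for this sketch.
count_a_lines() {
    # aline is never set explicitly, so it acts as zero the first time it is used.
    # Adding 0 in the END rule prints 0 instead of an empty line when nothing matches.
    awk '/^a/ { aline = aline + 1 } END { print aline + 0 }'
}

# Made-up sample input: two of these three lines start with a.
printf '%s\n' apple banana avocado | count_a_lines    # prints 2
```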

Using awk to recognize code blocks

Markdown is a lightweight document markup system that uses plain text files as input. You usually convert Markdown into some other format, such as HTML. And that’s exactly how I use Markdown to write my article drafts; I’ll write a draft in Markdown, then convert it into an HTML document using the pandoc command.

To insert a block of code, such as some sample code in a programming article, you surround the sample code with a “code fence” of three “backticks.” These “backticks” make it easy to match the start and end of sample code using awk. In other words, I want awk to take action whenever it finds three “backticks” in a Markdown file. I’ll start by incrementing a variable called text every time we encounter the three “backticks” delimiter:

/```/ { text=text+1; }

Since we only need to add 1 to the text variable, we can instead use the ++ notation, like this:

/```/ { text++; }

The first time we find three “backticks” in a Markdown file, that marks the end of regular article text and the beginning of sample code. The sample code continues until the next series of three “backticks.” This means that the variable text will always have an even value (0, 2, 4, 6, …) for regular body text within a Markdown file, and an odd value (1, 3, 5, 7, …) for sample code.

An easy way to determine if a value is even or odd is to use % to calculate the modulo, or the remainder after dividing by another number. For example, 5%2 is the remainder of “5 divided by 2,” which is “2 with a remainder of 1,” so the modulo is 1. Any even number has a modulo of 0 when divided by 2.
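
You can check the modulo arithmetic directly in awk. This short sketch uses only a BEGIN rule, so awk does not wait for any input; the loop variable n is just a name picked for the example.

```shell
# Print each value of n from 0 to 5 next to its remainder after dividing by 2.
awk 'BEGIN { for (n = 0; n <= 5; n++) print n, n % 2 }'
# 0 0
# 1 1
# 2 0
# 3 1
# 4 0
# 5 1
```

The second column alternates between 0 for even values and 1 for odd values.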

We can use this to only print lines from a Markdown file that are regular body text, when text has an even value:

(text%2)==0 {print;}

In this case, the pattern is (text%2)==0 which calculates the modulo of text with 2, to determine whether text is an even number (the modulo is zero). If it is, then awk prints the line.
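
To watch the counter change, this sketch labels each line with the current value of text and whether the even test would treat it as body text or code. It is an illustration only: the fence shell variable, the trace_fences name, and the five sample lines are assumptions. The fence is built with printf octal escapes and passed in with -v so that the example does not need to contain a literal fence, and the dynamic pattern $0 ~ fence matches the same lines as the slash-delimited pattern.

```shell
# Three backticks, written as octal escapes.
fence=$(printf '\140\140\140')

# trace_fences is a hypothetical helper name used only for this sketch.
trace_fences() {
    # Rule 1 adds 1 to the counter on every fence line.
    # Rule 2 then prints the counter, a label, and the line itself.
    awk -v fence="$fence" '
        $0 ~ fence { text++ }
        { print text + 0, ((text % 2) == 0 ? "text" : "code"), $0 }
    '
}

# Made-up sample input: one code line between two fences.
printf '%s\n' 'Intro paragraph' "$fence" 'int x;' "$fence" 'Closing paragraph' | trace_fences
```

The opening fence line is labeled 1 and “code,” while the closing fence line is labeled 2 and “text,” because the counter is updated before the second rule sees the line.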

Counting words in an article

Let’s say I have this sample Markdown file called hello.md, which contains headings, paragraph text, and sample code:

# Hello world

Here is how you can write your first "Hello world" program in C:

```
#include <stdio.h>

int main()
{
  puts("Hello world");
  return 0;
}
```

And now you're ready to learn programming!

This file contains 35 words, according to the wc command:

$ wc -w hello.md
35 hello.md

But this includes the sample code, which I don’t want to include in the final word count. We can use this 2-line awk script called text.awk to match lines with three “backticks” and only print the parts of the article that are regular text:

/```/ {text++;}
(text%2)==0 {print;}

Now we can use the awk command with the -f option to specify the script file, to filter the Markdown file before passing the results to wc to count the words:

$ awk -f text.awk hello.md | wc -w
24
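
One detail worth knowing: awk applies its rules in order to each line, so on a closing fence the first rule raises text to an even value before the second rule runs, and the fence line itself is printed. That is why the count is 24 even though the body text of hello.md has 23 words. If that matters to you, a possible variant is to add next to the first rule so that fence lines are skipped entirely. This is a sketch under assumptions, not the script used in this article: the fence shell variable and the sample_md and body_text names are made up, and the fence is built with printf octal escapes so the example does not contain a literal fence.

```shell
# Three backticks, written as octal escapes.
fence=$(printf '\140\140\140')

# Recreate the hello.md sample from this article on standard output.
sample_md() {
    printf '%s\n' '# Hello world' '' \
        'Here is how you can write your first "Hello world" program in C:' '' \
        "$fence" '#include <stdio.h>' '' 'int main()' '{' \
        '  puts("Hello world");' '  return 0;' '}' "$fence" '' \
        "And now you're ready to learn programming!"
}

# Variant filter: next drops every fence line before the print rule sees it.
body_text() {
    awk -v fence="$fence" '
        $0 ~ fence      { text++; next }
        (text % 2) == 0 { print }
    '
}

sample_md | wc -w                # 35 words in the whole file
sample_md | body_text | wc -w    # 23 words of body text
```

For a long article the difference is small, one extra “word” per code block, so the original two-line script is still a good estimate.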

For very short awk scripts like this, you can also provide the entire awk script as a single command line argument, usually enclosed in single quotes. When you use this method to run an awk script, you list the condition-action pairs one after another, separated by spaces. This means we can rewrite the command line like this:

$ awk '/```/ {text++;} (text%2)==0 {print;}' hello.md | wc -w
24

In my real-world example, I had written a draft article in Markdown about programming, called copyfile.md. According to the wc command, this file had over 2,200 words, including source code:

$ wc -w copyfile.md
2274 copyfile.md

Using the short awk command to filter out the sample code, and running the result through the wc command to count words, tells me the file has just under 1,900 words of actual text:

$ awk '/```/ {text++;} (text%2)==0 {print;}' copyfile.md | wc -w
1884
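
Because awk already splits each line into fields, the counting could also be done inside awk by adding up NF, without piping to wc. This is only a sketch of that idea, assuming ordinary text where fields and words are both separated by whitespace. It keeps the same two rules as the script above, so it should report the same total as the wc -w pipeline. The fence shell variable, the count_words name, and the sample lines are made up for the example, and the fence is built with printf octal escapes so the example does not contain a literal fence.

```shell
# Three backticks, written as octal escapes.
fence=$(printf '\140\140\140')

# count_words is a hypothetical helper name used only for this sketch.
count_words() {
    awk -v fence="$fence" '
        $0 ~ fence      { text++ }           # same fence counter as before
        (text % 2) == 0 { words += NF }      # add the field count of each printed line
        END             { print words + 0 }  # print 0 rather than nothing for empty input
    '
}

# Made-up sample input: two body lines around a one-line code block.
printf '%s\n' 'Two words' "$fence" 'int x = 1;' "$fence" 'Three more words' | count_words
```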

License and AI Statements

Both.org aims to publish everything under a Creative Commons Attribution ShareAlike license. Some items may be published under a different license. You are responsible to verify permissions before reusing content from this website.

The opinions expressed are those of the individual authors, not Both.org.

You may not use this content to train AI.

 
