Extracting text with awk

The awk script interpreter is a very handy tool for systems administrators and anyone else who uses Linux at the command line. With awk, you can often solve a tricky problem with a one-line script.

Recently, I needed to extract content from an HTML document. The details of the HTML page are not important; I only wanted the body text from the document. In HTML, the body is delimited by the <body> and </body> tags, inside a larger document that might look like this:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>...</title>
    ...
  </head>
  <body>

  ...

  </body>
</html>

A sample HTML file

Let’s say I have a very simple HTML document that contains only a few lines of text in the body. I can simulate this with the pandoc command, which generates over 170 lines of text (mostly stylesheet information):

$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | wc -l
177

But the <body> section is at the end of that long file. We can preview it by sending the output through the tail command:

$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | tail
    .display.math{display: block; text-align: center; margin: 0.5rem auto;}
  </style>
</head>
<body>
<header id="title-block-header">
<h1 class="title">Hello there</h1>
</header>
<p>Hello there!</p>
</body>
</html>

Extracting the text

In my case, I wanted to print only the lines between the <body> and </body> tags. I decided to write a short awk script to do this. To write it, I applied a simple pattern-and-action pairing:

  1. When awk finds <body> or </body>, increment a counter variable called body
  2. This means the body variable will be zero for any content before <body> (awk treats an unset variable as zero)
  3. The body variable will have the value 1 for content between <body> and </body>
  4. The body variable will have the value 2 for any content after </body>

To print the content between <body> and </body>, the awk script needs to print lines when the variable body is equal to 1:

$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk '/<\/?body>/ {body++} body==1 {print}'
<body>
<header id="title-block-header">
<h1 class="title">Hello there</h1>
</header>
<p>Hello there!</p>
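
To see the counter at work, here is a minimal sketch that prints the value of body next to every line instead of filtering on it. The sample_html function is a hypothetical stand-in for the pandoc output, shortened to a few lines so the example runs without pandoc installed.

```shell
# sample_html is a hypothetical stand-in for the pandoc output.
sample_html() {
cat <<'EOF'
<html>
<head>
<title>Hello there</title>
</head>
<body>
<p>Hello there!</p>
</body>
</html>
EOF
}

# Same counter rule as the one-liner above, but every line is printed
# with the current value of body in front of it. The %d conversion
# shows an unset body as 0.
sample_html | awk '/<\/?body>/ {body++} {printf "%d: %s\n", body, $0}'
```

The lines prefixed with 1 are the ones that the body==1 pattern selects, which is why the opening tag is printed and the closing tag is not.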

For what I needed to do, that was good enough. I didn’t mind that the script did not print the closing </body> tag. Unfortunately, this counter-based version of the script can’t easily be changed to print both tags or neither tag. If we swap the body++ and print rules, the script skips the <body> tag but prints </body> at the end:

$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk 'body==1 {print} /<\/?body>/ {body++}'
<header id="title-block-header">
<h1 class="title">Hello there</h1>
</header>
<p>Hello there!</p>
</body>
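
If both tags are wanted, one option is awk's range pattern, which selects every line from a match of the first pattern through a match of the second. This is a different technique from the counter script, not a modification of it, and the sketch below assumes each tag sits on its own line. The sample_html function is a hypothetical stand-in for the pandoc output.

```shell
# sample_html is a hypothetical stand-in for the pandoc output.
sample_html() {
cat <<'EOF'
<html>
<head>
<title>Hello there</title>
</head>
<body>
<p>Hello there!</p>
</body>
</html>
EOF
}

# A range pattern with no action prints every line from the first
# line matching <body> through the next line matching </body>,
# including both of those lines.
sample_html | awk '/<body>/,/<\/body>/'
```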

If you don’t want to include the <body> and </body> tags, you’ll need to modify this awk script to detect the opening and closing tags separately. Let’s set body to 1 when the script finds the <body> line, and back to zero when it finds the </body> line. Placing the print rule after the closing-tag rule and before the opening-tag rule keeps both tags out of the output:

$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk '/<\/body>/ {body=0} body==1 {print} /<body>/ {body=1}'
<header id="title-block-header">
<h1 class="title">Hello there</h1>
</header>
<p>Hello there!</p>
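
The patterns in this script match the tags literally, so an opening tag that carries attributes, such as <body class="doc">, would not match /<body>/. The following is a sketch of one way to loosen the opening pattern, wrapped in a shell function for reuse. The name extract_body and the sample input are hypothetical, and the sketch still assumes each tag sits on its own line.

```shell
# extract_body is a hypothetical helper; it reads HTML on standard input
# and prints the lines between the opening and closing body tags.
extract_body() {
  # <body[ >] matches "<body>" as well as "<body " followed by attributes.
  awk '/<\/body>/ {body=0} body==1 {print} /<body[ >]/ {body=1}'
}

# Hypothetical sample input with an attribute on the opening tag.
printf '%s\n' '<html>' '<body class="doc">' '<p>Hello there!</p>' '</body>' '</html>' | extract_body
```

Any pipeline that produces HTML can feed the function the same way, for example by placing extract_body after the pandoc command in place of the awk one-liner.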

Awk is a versatile tool

This is just one way to use awk to extract text from a document. I find that awk is a versatile tool that I can apply to most problems that involve text. If I need to print or manipulate text based on a pattern, awk is usually the right tool for the job.
