
Extracting text with awk
The awk interpreter is a handy tool for systems administrators and anyone else who works at the Linux command line. With awk, you can often solve a tricky text-processing problem with a one-line script.
Recently, I needed to extract content from an HTML document. The details of the HTML page are not important, only that I wanted to get just the body text from the document. In HTML, the body is defined by the <body> and </body> tags, in a larger document that might look like this:
<!DOCTYPE html>
<html lang="en">
<head>
<title>...</title>
...
</head>
<body>
...
</body>
</html>
A sample HTML file
Let’s say I have a very simple HTML document that contains only a few lines of text in the body. I can simulate this with the pandoc command, which generates over 170 lines of text (mostly stylesheet information):
$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | wc -l
177
But the <body> section is at the end of that long file. We can preview it by sending the output through the tail command:
$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | tail
.display.math{display: block; text-align: center; margin: 0.5rem auto;}
</style>
</head>
<body>
<header id="title-block-header">
<h1 class="title">Hello there</h1>
</header>
<p>Hello there!</p>
</body>
</html>
Extracting the text
In my case, I wanted to print only the lines between the <body> and </body> tags. I decided to write a short awk script to do this. To write it, I applied a simple pattern-and-action pairing:
- When awk finds <body> or </body>, increment a counter variable called body
- This means the body variable will be zero for any content before <body>
- .. and body will have the value 1 for content between <body> and </body>
- .. and body will have the value 2 for any content after </body>
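To watch the counter change, here is a minimal sketch that prints the value of body in front of every input line. The trace_body name and the short hand-written sample are my own assumptions for illustration; they stand in for the much longer pandoc output.

```shell
# trace_body is a hypothetical helper name, used here only for illustration.
# It prefixes every input line with the current value of the body counter.
trace_body() {
  awk '
    /<\/?body>/ { body++ }             # count each <body> or </body> line
    { printf "%d: %s\n", body, $0 }    # show the counter next to the line
  '
}

# A small hand-written sample instead of the pandoc output
printf '%s\n' '<html>' '<head>' '</head>' '<body>' '<p>Hello there!</p>' '</body>' '</html>' | trace_body
```

With this sample, the lines before <body> are prefixed with 0, the <body> line and the paragraph with 1, and the </body> line and everything after it with 2. An awk variable that has never been assigned counts as zero in a numeric context, which is why the counter needs no setup.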
To print the content between <body> and </body>, the awk script needs to print lines when the variable body is equal to 1:
$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk '/<\/?body>/ {body++} body==1 {print}'
<body>
<header id="title-block-header">
<h1 class="title">Hello there</h1>
</header>
<p>Hello there!</p>
For what I needed to do, that was good enough. I didn’t mind that the script did not print the trailing </body> tag. Unfortunately, there’s no simple way to modify this version of the script to either print both tags, or print no tags. If we swap the order of the two rules, so that print runs before body++, we won’t print the <body> tag but will print </body> at the end:
$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk 'body==1 {print} /<\/?body>/ {body++}'
<header id="title-block-header">
<h1 class="title">Hello there</h1>
</header>
<p>Hello there!</p>
</body>
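If you do want both tags, one option is to set the counter aside and use an awk range pattern instead. This is a different technique from the counter script, not a modification of it, and the body_with_tags name and sample input below are assumptions for illustration only.

```shell
# body_with_tags is a hypothetical helper name, used here only for illustration.
# A range pattern "start,end" matches from the first line that matches start
# through the next line that matches end, inclusive. With no action given,
# awk prints each matching line.
body_with_tags() {
  awk '/<body>/,/<\/body>/'
}

# A small hand-written sample instead of the pandoc output
printf '%s\n' '<html>' '<body>' '<p>Hello there!</p>' '</body>' '</html>' | body_with_tags
```

For this sample, the function prints the <body> line, the paragraph, and the </body> line, and skips the surrounding <html> lines.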
If you don’t want to include the <body> and </body> tags, you’ll need to modify this awk script to detect the opening and closing tags separately. Let’s modify this script to set body to 1 when it finds the <body> line, and back to zero when it finds the </body> line. Because awk runs its rules in order for each line, we can place the print rule after the </body> check and before the <body> check so that neither tag line is printed:
$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk '/<\/body>/ {body=0} body==1 {print} /<body>/ {body=1}'
<header id="title-block-header">
<h1 class="title">Hello there</h1>
</header>
<p>Hello there!</p>
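The same one-liner is easier to follow when it is spread over several lines with a comment on each rule. This is only a sketch: the body_only name and the hand-written sample are assumptions for illustration, not part of the original command.

```shell
# body_only is a hypothetical helper name, used here only for illustration.
# The order of the three rules matters, because awk applies them top to
# bottom to every input line.
body_only() {
  awk '
    /<\/body>/ { body = 0 }   # closing tag: clear the flag before the print rule runs
    body == 1  { print }      # print only while the flag is set
    /<body>/   { body = 1 }   # opening tag: set the flag after the print rule ran
  '
}

# A small hand-written sample instead of the pandoc output
printf '%s\n' '<html>' '<body>' '<p>Hello there!</p>' '</body>' '</html>' | body_only
```

For this sample, only the paragraph line is printed. The flag is still unset when the print rule sees the <body> line, and it is already cleared when the print rule sees the </body> line.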
Awk is a versatile tool
This is just one way to use awk to extract text from a document. I find that awk is a versatile tool that I can apply to most problems that involve text. If I need to print or manipulate text based on a pattern, awk is usually the right tool for the job.