Regular Expressions #1: Introduction


Regular expressions don’t have to invoke anxiety and fear, although they do for many of us. The function of regular expressions is to provide a highly flexible tool for matching strings of characters in a stream of data. When a match is found, the program’s action can be as simple as passing the line containing the match on to STDOUT, or as complex as replacing the matched string with another before sending it on.

This article, part one of four, introduces you to the need for regular expressions and shows you how to create a simple REGEX with the grep command.

Why we need Regular Expressions

We have all used file globbing with wildcard characters like * and ? as a means to select specific files or lines of data from a data stream. These tools are powerful and I use them many times a day. Yet, there are things that cannot be done with wildcards.

Regular expressions (regexes or REs) provide us with more complex and flexible pattern-matching capabilities. Just as certain characters take on special meaning when using file globbing, REs also have special characters. There are two main types of regular expressions: Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs).

The first thing we need is some definitions. There are many definitions for the term regular expressions, but many are dry and uninformative. Here are mine.

Regular Expressions are strings of literal and metacharacters that can be used as patterns by various Linux utilities to match strings of ASCII plain text data in a data stream. When a match occurs, it can be used to extract or eliminate a line of data from the stream, or to modify the matched string in some way.

Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs) are not significantly different in terms of functionality. (See the grep info page’s Section 3.6, “Basic vs. Extended Regular Expressions.”) The primary difference is in the syntax used and how metacharacters are specified. In basic regular expressions, the metacharacters ?, +, {, |, (, and ) lose their special meaning. Instead, it is necessary to use the backslashed versions: \?, \+, \{, \|, \(, and \). The ERE syntax is believed by many to be easier to use.
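A quick way to see the syntax difference is to match an optional character both ways. This sketch assumes GNU grep, which supports the backslashed metacharacters in its BREs:

```shell
# BRE: '?' is an ordinary character, so \? is needed to make the
# preceding 'u' optional.
printf 'color\ncolour\n' | grep 'colou\?r'

# ERE (grep -E): '?' is a metacharacter by default, no backslash needed.
printf 'color\ncolour\n' | grep -E 'colou?r'

# Both commands print both lines.
```

The two patterns are equivalent; only the notation changes.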

When I talk about regular expressions, in a general sense I usually mean to include both basic and extended regular expressions. If there is a differentiation to be made I will use the acronyms BRE for basic regular expressions or ERE for extended regular expressions.

Regular expressions (REs) take the concept of using metacharacters to match patterns in data streams much further than file globbing, and give us even more control over the items we select from a data stream. REs are used by various tools to parse[1] a data stream to match patterns of characters in order to perform some transformation on the data.

Regular expressions have a reputation for being obscure and arcane incantations that only those with special wizardly sysadmin powers use. This single line of code in Figure 1 (that I used to transform a file that was sent to me into a usable form) would seem to confirm that.

$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/\]//g" -e "s/)//g" | awk '{print $1" "$2" <"$3">"}' > addresses.txt

Figure 1: A rather complex command pipeline like this one can seem obscure until we learn how regexes work.

This command pipeline appears to be an intractable sequence of meaningless gibberish to anyone without the knowledge of regex. It certainly seemed that way to me the first time I encountered something similar early in my career. As you will see, regexes are relatively simple once they are explained.

A Simple REGEX

If you use Unix or Linux on the command line, you use regular expressions whether you know it or not. One of the first tools most of us learn is the grep command.

The grep command is used to select lines that match a specified pattern from a stream of data. grep is one of the most commonly used filter utilities and can be used in some very creative and interesting ways. It is one of the few commands that can correctly be called a filter, because it filters out all of the lines of the data stream that you do not want, leaving only the lines that you do want.
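As a minimal illustration of grep acting as a filter, here is a three-line data stream reduced to the single line that matches the pattern:

```shell
# Only lines containing the pattern 'an' pass through the filter.
printf 'apple\nbanana\ncherry\n' | grep an
# prints: banana
```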

According to Seth Kenlon, reviewer for many of my books and articles, “One of the classic Unix commands, developed way back in 1974 by Ken Thompson, is the Global Regular Expression Print (grep) command. It’s so ubiquitous in computing that it’s frequently used as a verb (“grepping through a file”) and, depending on how geeky your audience, it fits nicely into real-world scenarios, too. (For example, “I’ll have to grep my memory banks to recall that information.”) In short, grep is a way to search through a file for a specific pattern of characters. If that sounds like the modern Find function available in any word processor or text editor, then you’ve already experienced grep’s effects on the computing industry.”[2]

We need to create a file with some random data in it. We can use a tool that generates random passwords but we first need to install it as root. I use dnf on my Fedora host.

# dnf -y install pwgen

Now, as a non-root user, let’s generate some random data and create a file with it. I suggest doing this in the /tmp directory, although you could use your home directory if you have enough space. The following command creates a stream of 5,000 lines of random data, each 75 characters long, and stores them in the random.txt file.

$ pwgen 75 5000 > random.txt

Considering that there are so many passwords, it is very likely that some character strings in them are the same. Look at the last ten passwords displayed on the screen, pick a couple of short strings from them, and use the grep command to locate the lines that contain them. I saw the strings “see” and “loop” in one of those ten passwords, so my command looked like this.

$ grep see random.txt
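If you only want to know how many passwords contain your string rather than see them all, grep’s -c option prints a count of matching lines instead of the lines themselves. A tiny inline data stream stands in for random.txt here so the example is self-contained:

```shell
# -c counts the matching lines rather than printing them.
printf 'Aisee7q\nB2loopX\nCseeloop\n' | grep -c see
# prints: 2
```

Against your own file, `grep -c see random.txt` works the same way.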

You can try that, but you should also pick some strings of your own to search for. Short strings of 2 to 4 characters work best. I also used grep to locate all of the lines in the output from dmesg that contain the string cpu. Note that grep is case-sensitive by default; add the -i option if you also want to match CPU. You need to be root to run the dmesg command.

# dmesg | grep cpu

Do a long listing of all of the directories in your home directory with this command.

$ ls -la | grep ^d

This works because each directory has a “d” as the first character of its line in a long listing. The caret ( ^ ) is used by grep and other tools to anchor the pattern being searched for to the beginning of the line.
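The effect of the anchor is easy to see with a small inline data stream: without the caret, grep matches the string anywhere on the line; with it, only lines that begin with the string survive.

```shell
# Unanchored: 'dog' matches anywhere on the line, so both lines print.
printf 'dog house\nhot dog\n' | grep 'dog'

# Anchored: only the line that begins with 'dog' matches.
printf 'dog house\nhot dog\n' | grep '^dog'
# prints: dog house
```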

To list all of the files that are not directories, reverse the meaning of the previous grep command with the -v option.

$ ls -la | grep -v ^d

Final Thoughts

We can only begin to touch upon all of the possibilities opened to us by regular expressions in a single article (even in a single series). There are entire books devoted exclusively to regular expressions, so we will explore the basics in this series of articles here on Both.org. By the end, you will know just enough to get started with tasks common to sysadmins. Hopefully, you’ll be hungry to learn more on your own after that.


Note: This series is a slightly modified version from Chapter 25 of Volume 2 of my Linux self-study trilogy, Using and Administering Linux: Zero to SysAdmin, 2nd Edition.

  1. One general meaning of parse is to examine something by studying its component parts. For our purposes, we parse a data stream to locate sequences of characters that match a specified pattern.
  2. Kenlon, Seth, a.k.a. Klaatu, Opensource.com, Practice using the Linux grep command, 18 Mar 2021