Regular Expressions #2: An example

“BXP135671” by tableatny is licensed under CC BY 2.0

Dive right into a regular expression example in this second of four articles about regular expressions.

In the previous article, Regular Expressions #1: Introduction, I covered what they are and why they’re useful. Now, we need a real-world example to use as a learning tool. Here is one I encountered several years ago.

This example highlights the power and flexibility of the Linux command line, especially regular expressions, for their ability to automate common tasks. I have administered several listservs during my career and still do. People send me email addresses to add to those lists. In more than one case, I have received a list of names and email addresses in a Microsoft Word format to be added to one of the lists.

The troublesome list

The list itself was not very long, but it was inconsistent in its formatting. An abbreviated version of that list, with name and domain changes, is shown in Figure 1.

Team 1	Apr 3 
Leader  Virginia Jones  vjones88@example.com	
Frank Brown  FBrown398@example.com	
Cindy Williams  cinwill@example.com	
Marge smith   msmith21@example.com 
 [Fred Mack]   edd@example.com	

Team 2	March 14
leader  Alice Wonder  Wonder1@example.com	
John broth  bros34@example.com	
Ray Clarkson  Ray.Clarks@example.com	
Kim West    kimwest@example.com	
[JoAnne Blank]  jblank@example.com	

Team 3	Apr 1 
Leader  Steve Jones  sjones23876@example.com	
Bullwinkle Moose bmoose@example.com	
Rocket Squirrel RJSquirrel@example.com	
Julie Lisbon  julielisbon234@example.com	
[Mary Lastware) mary@example.com

Figure 1: A sample taken from the problematic list.

It was obvious that I needed to manipulate the data in order to mangle it into an acceptable format for inputting to the list. It is possible to use a text editor or a word processor such as LibreOffice Writer to make the necessary changes to this small file. However, people send me files like this quite often, so it becomes a chore to use a word processor to make these changes. Despite the fact that Writer has a good search and replace function, each character or string must be replaced singly, and there is no way to save previous searches.

LibreOffice Writer does have a powerful macro feature, but I am not familiar with either of its two languages: LibreOffice Basic or Python. I do know Bash shell programming, so I did what comes naturally to a sysadmin: I automated the task. The first thing I did was to copy the address data to a text file so I could work on it using command-line tools. After a few minutes of work, I developed the Bash command-line program shown in the first article of this series and shown again in Figure 2.

$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/\]//g" -e "s/)//g" | awk '{print $1" "$2" <"$3">"}' > addresses.txt

Figure 2: The solution to my list problem involves some interesting regular expressions.

This code produced the desired output as the file addresses.txt. I used my normal approach to writing command-line programs like this by building up the pipeline one command at a time.

Let’s break this pipeline down into its component parts to see how it works and all fits together. All of the experiments in this series should be performed as a non-privileged user.

Getting started with the sample file

First, we need to create the sample file. Create a directory named testing on your local machine, and then copy the text from Figure 1 into a new text file named Experiment_6-1.txt, which contains the three team entries shown above.

Removing unnecessary lines with grep

The first steps are easy ones. Since the team names and dates are on lines by themselves, we can use the following command to remove the lines that contain the word “Team”:

[student@studentvm1 testing]$  cat Experiment_6-1.txt | grep -v Team

I won’t reproduce the results of each stage of building this Bash program, but you should be able to see the changes in the data stream as it shows up on STDOUT, the terminal session. We won’t save it in a file until the end.

In this first step in transforming the data stream into one that is usable, we use the grep command with a simple literal pattern, Team. Literals are the most basic type of pattern we can use as a regular expression, because there is only a single possible match in the data stream being searched, and that is the string Team.
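As a quick check of the literal match, here is a minimal sketch that feeds two lines from Figure 1 through grep -v Team. The variable names sample and result are only for illustration and are not part of the original pipeline.

```shell
# Two lines from Figure 1: a team header and a member line.
sample=$(printf 'Team 1\tApr 3\nFrank Brown  FBrown398@example.com\n')

# -v inverts the match, so only lines that do NOT contain "Team" pass through.
result=$(printf '%s\n' "$sample" | grep -v Team)
printf '%s\n' "$result"
```

Keep in mind that a literal matches anywhere in the line and is case-sensitive. A member line that happened to contain the string Team would be dropped as well; anchoring the pattern as ^Team is one way to narrow it.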

We need to discard empty lines, so we can use another grep statement to eliminate them. I find that enclosing the regular expression for the second grep command in quotes ensures that it gets interpreted properly:

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$"
Leader  Virginia Jones  vjones88@example.com
Frank Brown  FBrown398@example.com
Cindy Williams  cinwill@example.com
Marge smith   msmith21@example.com 
 [Fred Mack]   edd@example.com  
leader  Alice Wonder  Wonder1@example.com
John broth  bros34@example.com  
Ray Clarkson  Ray.Clarks@example.com
Kim West    kimwest@example.com 
[JoAnne Blank]  jblank@example.com
Leader  Steve Jones  sjones23876@example.com
Bullwinkle Moose bmoose@example.com
Rocket Squirrel RJSquirrel@example.com  
Julie Lisbon  julielisbon234@example.com
[Mary Lastware) mary@example.com
[student@studentvm1 testing]$

The expression "^\s*$" illustrates the use of anchors and of the backslash (\) as an escape character, which here changes a literal “s” into a metacharacter that matches any whitespace character, such as a space or a tab. The caret (^) anchors the match to the start of the line and the dollar sign ($) anchors it to the end, so the entire line must consist of whitespace to match. We cannot see these characters in the file, but it does contain some of them.

The asterisk, aka splat (*), specifies that we are to match zero or more of the whitespace characters. This addition would match multiple tabs, multiple spaces, or any combination of those in an otherwise empty line.
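The sketch below tries the expression on a tiny data stream that contains one truly empty line and one line holding only a space and a tab. The \s shorthand is a GNU grep extension, so I also show the POSIX character class [[:space:]] as an alternative; that second form is my addition, not part of the original pipeline, and the variable names are illustrative.

```shell
# Four lines: text, an empty line, a line with only a space and a tab, text.
sample=$(printf 'Kim West\n\n \t\nJoAnne Blank\n')

# \s is the GNU grep shorthand for a whitespace character.
gnu_result=$(printf '%s\n' "$sample" | grep -v '^\s*$')

# [[:space:]] is the POSIX character class for whitespace.
posix_result=$(printf '%s\n' "$sample" | grep -v '^[[:space:]]*$')

printf '%s\n' "$gnu_result"
```

Both forms should leave only the two lines that contain names.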

Viewing extra whitespace with Vim

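Next, I configured my Vim editor to display whitespace using visible characters. Do this by adding the line in Figure 3 to your own ~/.vimrc file, or to the global /etc/vimrc configuration file. The listchars option only defines which characters are used; the whitespace becomes visible when list mode is turned on, either with :set list inside Vim or with a set list line in the same configuration file.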

set listchars=eol:$,nbsp:_,tab:<->,trail:~,extends:>,space:+

Figure 3: Add this line to your own ~/.vimrc file, or to the global /etc/vimrc configuration file.

Then, start—or restart—Vim.

I have found a lot of bad, incomplete, and contradictory information on the internet in my searches for how to do this. The built-in Vim help has the best information, and the data line I created from that above is one that works for me.

Note: In the example below, regular spaces are shown as +; tabs are shown as >, <>, or <->, with more dashes as needed to fill the width that the tab covers. Trailing spaces are shown as ~. The end of line (EOL) character is shown as $.

The result, before any operation on the file, is shown in Figure 4.

Team+1<>Apr+3~$
Leader++Virginia+Jones++vjones88@example.com<-->$
Frank+Brown++FBrown398@example.com<---->$
Cindy+Williams++cinwill@example.com<--->$
Marge+smith+++msmith21@example.com~$
+[Fred+Mack]+++edd@example.com<>$
$
Team+2<>March+14$
leader++Alice+Wonder++Wonder1@example.com<----->$
John+broth++bros34@example.com<>$
Ray+Clarkson++Ray.Clarks@example.com<-->$
Kim+West++++kimwest@example.com>$
[JoAnne+Blank]++jblank@example.com<---->$
$
Team+3<>Apr+1~$
Leader++Steve+Jones++sjones23876@example.com<-->$
Bullwinkle+Moose+bmoose@example.com<--->$
Rocket+Squirrel+RJSquirrel@example.com<>$
Julie+Lisbon++julielisbon234@example.com<------>$
[Mary+Lastware)+mary@example.com$

Figure 4: Viewing “whitespace” in Vim.
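If Vim is not handy, sed can give a similar view directly on the command line. This is an alternative I am adding here, not the method used for Figure 4: the l command of sed prints each line in an unambiguous form, showing a tab as \t and marking the end of the line with $. The variable names are illustrative.

```shell
# One line from Figure 1 with a trailing tab after the address.
line=$(printf '[JoAnne Blank]  jblank@example.com\t')

# -n suppresses the normal output; l prints the line in an unambiguous form:
# a tab appears as \t and the end of the line is marked with $.
visible=$(printf '%s\n' "$line" | sed -n l)
printf '%s\n' "$visible"
```

Spaces are not marked by this command, so trailing spaces show up only as a gap before the $.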

Removing unnecessary characters with sed

You can see that there are a lot of whitespace characters that need to be removed from our file. We also need to get rid of the word “leader,” which appears three times and is capitalized twice. Let’s get rid of “leader” first. This time, we will use sed (stream editor) to perform this task by substituting a new string (a null string in our case) for the pattern it matches.

Adding sed -e "s/[Ll]eader//" to the pipeline does just what we want.

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//"

In this sed command, -e means that the quote-enclosed expression is a script that produces a desired result. In the expression, the s means that this is a substitution. The basic form of a substitution is s/<regex>/<replacement string>/, so [Ll]eader is our search string. The set [Ll] matches L or l, so [Ll]eader matches leader or Leader. In this case, the replacement string is null, which is why the expression ends with two forward slashes (//) with no characters or whitespace between them.
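To see the substitution in isolation, this small sketch runs it on one line with each capitalization. The variable names are illustrative only.

```shell
# One line for each capitalization of the word.
sample=$(printf 'Leader  Virginia Jones  vjones88@example.com\nleader  Alice Wonder  Wonder1@example.com\n')

# [Ll] matches either case of the first letter; the empty replacement deletes the match.
result=$(printf '%s\n' "$sample" | sed -e 's/[Ll]eader//')
printf '%s\n' "$result"
```

The two spaces that followed the word are still there at the start of each line. Without the g flag, sed replaces only the first match on each line, which is all we need here.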

Let’s also get rid of some of the extraneous characters like []() that will not be needed.

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" 

We have added four new expressions to the sed statement. Each one removes a single character, and the g flag makes the substitution apply to every occurrence on a line. The first of these additional expressions is a bit different, because the left square bracket ([) can mark the beginning of a set. We need to escape it with a backslash to ensure that sed interprets it as a regular character and not a special one.
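As an aside, the four single-character expressions can be collapsed into one set. This is an alternative sketch, not the form used in the pipeline: inside a set, a right square bracket placed first is taken literally, so [][()] matches any one of the four characters.

```shell
# The line from Figure 1 with mismatched delimiters.
line='[Mary Lastware) mary@example.com'

# [][()] is a set containing ] [ ( and ); the g flag removes every occurrence.
result=$(printf '%s\n' "$line" | sed -e 's/[][()]//g')
printf '%s\n' "$result"
```

The separate expressions are easier to read while you are building up the pipeline one step at a time; the combined set is simply more compact.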

Tidying up with awk

We could use sed to remove the leading spaces from some of the lines, but the awk command can do that, reorder the fields if necessary, and add the <> characters around the email address:

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'

The awk utility is actually a powerful programming language that can accept data streams on its STDIN. This fact makes it extremely useful in command-line programs and scripts. The awk utility works on data fields, and the default field separator is whitespace: any run of spaces or tabs. The data stream we have created so far has three fields separated by whitespace (<first>, <last>, and <email>).

awk '{print $1" "$2" <"$3">"}' 

This little program takes each of the three fields ($1, $2, and $3) and extracts them without leading or trailing whitespace. It then prints them in sequence, adding a single space between each as well as the <> characters needed to enclose the email address.
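Here is the awk step on its own, as a minimal sketch with illustrative variable names. The input line has a leading space, a run of spaces between fields, and a trailing tab, much like the Fred Mack line once its brackets have been removed.

```shell
# Leading space, extra spaces between fields, and a trailing tab.
line=$(printf ' Fred Mack   edd@example.com\t')

# With the default field separator, awk ignores leading and trailing
# whitespace and treats each run of spaces or tabs as one separator.
result=$(printf '%s\n' "$line" | awk '{print $1" "$2" <"$3">"}')
printf '%s\n' "$result"
```

This works because every line has exactly three fields. A name with a middle name or initial would shift the email address into $4, so a list like that would need a different awk program.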

Wrapping up

The last step here would be to redirect the output data stream to a file, but that is trivial, so I leave it with you to perform that step. It is not really necessary that you do so for this experiment.

I saved the Bash program in an executable file, and now I can run this program anytime I receive a new list. Some of those lists are fairly short, as is the one in this example. Others have been quite long, sometimes containing up to several hundred addresses and many lines of “stuff” that do not contain addresses to be added to the list.
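One way to package the pipeline is sketched below. The file name clean_list.sh and the function name clean_list are hypothetical choices of mine; the pipeline inside the function is the one built up in this article, with the input file passed as an argument instead of being read with cat.

```shell
#!/usr/bin/env bash
# clean_list.sh (hypothetical name): convert a messy name/email list
# into lines of the form "First Last <email>".

clean_list() {
    # $1 is the input file; the result goes to STDOUT.
    grep -v Team "$1" |
        grep -v '^\s*$' |
        sed -e 's/[Ll]eader//' -e 's/\[//g' -e 's/]//g' -e 's/)//g' -e 's/(//g' |
        awk '{print $1" "$2" <"$3">"}'
}

# Run the conversion only when a file name is given on the command line.
if [ "$#" -gt 0 ]; then
    clean_list "$1"
fi
```

After making the file executable with chmod +x clean_list.sh, it could be run as ./clean_list.sh Experiment_6-1.txt > addresses.txt to save the result.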


Note: This series is a slightly modified version from Chapter 25 of Volume 2 of my Linux self-study trilogy, Using and Administering Linux: Zero to SysAdmin, 2nd Edition.