Building a random text generator

There are many reasons you might need placeholder text. For example, if you are building a new website, you may not have all of the content ready while you’re creating the design; placeholder text helps you see what the design will look like once the content is in place.

For years, my “go-to” to generate sample content for documents has been the lipsum.com website, to insert Latin-like meaningless text. Most people are able to ignore the placeholder content if they immediately recognize that it’s just meaningless words, and “Lorem Ipsum” can do that very well. If I want placeholder text in English, I sometimes use other placeholder generators to do the same job, by inserting random content from Star Wars, Doctor Who, and Star Trek.

But there’s another way to create placeholder text without copying from a website: you can make your own text generator. I wrote my own Bash script on Linux to generate a few paragraphs of random text. Here’s how it works.

A list of words

Most Linux systems include a default dictionary of correctly spelled words, usually saved in /usr/share/dict/words; on some distributions, you may need to install it as a separate package. The list is sorted and contains both uppercase and lowercase words. If you use the head command to print the first ten lines of the words file, you may see “words” that start with numbers:

$ head /usr/share/dict/words
1080
10-point
10th
11-point
12-point
16-point
18-point
1st
2
20-point

The grep command is an old Unix command that finds text in a file. You can just give grep some plain text to find, or you can make your search more specific by using special markers that indicate the start of a line (^) or the end of a line ($). For example, you can use two grep commands to search for all lines that start with a lowercase letter a, and end with the letter e, and use head to display only the first ten examples:

$ grep '^a' /usr/share/dict/words | grep 'e$' | head
abacate
abacinate
abaisance
abaisse
abalienate
abalone
abampere
abandonable
abandonee
abase

You can do more with grep than just find plain words. Patterns built from those special markers are called regular expressions, and there’s a lot you can do with them. For example, you can match repeated text by using + to mean one or more, or * to mean zero or more, of the previous character; in grep’s default syntax, GNU grep expects the + to be written as \+. If you want to specify certain classes of characters, you can use special brackets like [[:upper:]] to mean the uppercase letters A to Z, or [[:lower:]] for the lowercase letters. This flexibility makes it possible to search for all kinds of text in a file. For example, to print all lines that start with an uppercase letter followed by one or more lowercase letters, you would use this regular expression:

$ grep '^[[:upper:]][[:lower:]]\+$' /usr/share/dict/words
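
To see what that expression keeps and rejects without scrolling through the whole dictionary, here is a small sketch that runs the same pattern against five sample words. The word list is made up for illustration, and the example assumes GNU grep:

```shell
# Made-up sample words. Only "Apple" and "Bob" are a single uppercase
# letter followed by one or more lowercase letters.
matches=$(printf '%s\n' Apple apple APPLE A1 Bob | grep '^[[:upper:]][[:lower:]]\+$')
echo "$matches"
```

The all-lowercase, all-uppercase, and letter-plus-digit entries are filtered out because the pattern is anchored at both ends of the line.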

However, grep can find some very long words, if they are in the words file. On my system, the longest words that start with an uppercase letter followed by one or more lowercase letters are Prorhipidoglossomorpha, Pseudolamellibranchia, and Pseudolamellibranchiata. Those are too long if I want to generate some random placeholder text for a website. I think good placeholder words are a reasonable length, maybe 3 to 7 letters long, whether they are lowercase words or words that start with an uppercase letter.

To limit the length of the words, I can send the output of the grep command to another classic Unix command called awk, implemented as gawk (GNU awk) on most Linux systems. The awk command takes pairs of patterns and actions; for each line that matches a pattern, it executes the action. In my case, to print just the words that start with an uppercase letter followed by one or more lowercase letters, and that are more than 2 letters and fewer than 8 letters long, I would use this command:

$ grep '^[[:upper:]][[:lower:]]\+$' /usr/share/dict/words | gawk 'length($0)>2 && length($0)<8 {print}'

That’s a long line, but it’s just a grep command that finds lines of text and sends them to the gawk command.

But a gawk pattern can also be a regular expression, using basically the same syntax as the grep command. That allows us to rewrite the command to search the /usr/share/dict/words file for all words that start with an uppercase letter followed by one or more lowercase letters, more than 2 letters and less than 8 letters, as a single gawk command:

$ gawk '/^[[:upper:]][[:lower:]]+$/ {if ((length($0)>2) && (length($0)<8)) {print}}' /usr/share/dict/words > upper.tmp

This moves the length test inside the action, using if to check that the word’s length is greater than 2 and less than 8. Apart from the redirection (>) that saves the output to a temporary file called upper.tmp, the result is the same, but everything happens inside gawk instead of in grep followed by gawk.
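
One variation, which is not the form the script uses, is to keep the length tests in the pattern itself by joining them to the regular expression with &&. The sketch below tries this on a made-up list of sample words. The awk_cmd variable and its fallback to plain awk are my own additions for systems without gawk:

```shell
# Use gawk if it is installed; otherwise fall back to the system awk
# (this fallback is an assumption, not part of the original script).
awk_cmd=$(command -v gawk || command -v awk)

# Made-up sample words. Only "Bob" (3 letters) and "Chester" (7 letters)
# match the pattern and fall inside the length limits.
kept=$(printf '%s\n' Al Bob Chester Dorothea eve |
  "$awk_cmd" '/^[[:upper:]][[:lower:]]+$/ && length($0)>2 && length($0)<8 {print}')
echo "$kept"
```

Both forms should select the same words; which one reads better is a matter of taste.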

Using loops

The script generates 5 paragraphs of text, each consisting of a random number of sentences, each with a random number of words. I do this with several for loops, to iterate over a set of values. For example, to print out the text “Hello” 4 times, I would write this for loop:

$ for word in 1 2 3 4; do echo "Hello"; done

If you type this at the Bash command line, or save it to a “script” file and run it, you should see “Hello” printed back to you 4 times:

$ for word in 1 2 3 4; do echo "Hello"; done
Hello
Hello
Hello
Hello

At every “pass” through the loop, the variable word is assigned the value 1, 2, 3, or 4. You can print out the value of the word variable by writing it with a “dollar sign” in front, like this to print the numbers 1, 2, 3, and 4:

$ for word in 1 2 3 4; do echo $word; done
1
2
3
4

You can also put one for loop “inside” another; this is called nested loops. It’s easiest to show nested loops by writing it in a script, where I can split up the lines to make the instructions more clear. For example, this prints the values A1, A2, B1, and B2 to the screen using nested loops:

for letter in A B ; do
  for number in 1 2 ; do
    echo $letter$number
  done
done

I’ve also added some extra spacing so you can see the nested loops in action, and to make clear what is “inside” each loop. When I write for loops like this, I usually write the ; with spaces on either side. This is just a personal style; you don’t need to use the extra space.

If you save this to a script and run it, you should see the values A1, A2, B1, and B2 printed to the screen. That’s because the “outer” loop iterates through the letters A and B; for each “letter” loop, the “inner” loop iterates through the numbers 1 and 2. The effect is the loop generates the four values in order:

A1
A2
B1
B2
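
Typing out every value works for short lists, but the finished script needs to loop a number of times that is only known when it runs. A minimal sketch of one way to do that is to let the seq command generate the list and insert its output into the loop with $( ). The variable names pass and count are only for illustration:

```shell
# $(seq 3) is replaced by the output of seq, "1 2 3",
# so the loop body runs three times.
count=0
for pass in $(seq 3) ; do
  echo "pass $pass"
  count=$((count + 1))
done
```

Because the number given to seq can itself be a variable, the same pattern can repeat a loop a random number of times.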

Printing random lines

To generate random words, either all lowercase words or words that start with an initial uppercase letter, we need to print random lines from a word file. We can use gawk to find the words we need; the next step is to pick random words from the temporary file.

Linux provides a command called shuf that shuffles its input and prints the lines in a random order. For example, let’s print the numbers 1, 2, 3, and 4 in a random order with the shuf command:

$ seq 4 | shuf
3
4
1
2

The seq command always prints 1, 2, 3, and 4 in that order, but adding the shuf command randomizes the order. Similarly, if you have a longer list, but only want to see the first few lines from the shuffled list, send the output to the head command. This prints only the first ten lines by default; use a hyphen with a number to print that many lines, such as this to shuffle a list of ten numbers but print only 4 lines of output:

$ seq 10 | shuf | head -4
5
9
10
2
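
The shuf command can also limit its own output with the -n option, which is the form the script at the end of this article uses. As a short sketch, this picks 4 random lines from the numbers 1 to 10 without needing head (the variable name picked is only for illustration):

```shell
# Shuffle the numbers 1 to 10 and keep only 4 of them.
picked=$(seq 10 | shuf -n 4)
echo "$picked"
```

The output changes on every run, but it should always be 4 different numbers, because shuf does not repeat an input line unless you ask it to.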

Putting it all together

With these Bash scripting commands, plus a few extra Bash features that I’ll show you, you can generate a few paragraphs of random text. Each paragraph contains a random number of sentences, between 3 and 7 sentences. Each sentence starts with one capitalized word followed by a random number of lowercase words, between 3 and 8 words.

#!/bin/bash

words=/usr/share/dict/words

lower=/tmp/lower.tmp
upper=/tmp/upper.tmp

gawk '/^[[:lower:]]+$/ {if ((length($0)>2) && (length($0)<8)) {print}}' $words > $lower
gawk '/^[[:upper:]][[:lower:]]+$/ {if ((length($0)>2) && (length($0)<8)) {print}}' $words > $upper

for para in $(seq 5) ; do
  s=$((RANDOM % 5 + 3))

  for sent in $(seq $s) ; do
    w=$((RANDOM % 6 + 3))
    ( shuf -n 1 $upper ; shuf -n $w $lower ) | tr '\n' ' ' | sed 's/ $/. /'
  done
  echo -e '\n'
done

rm -f $lower $upper

On my system, I saved this script to a file called mkwords.bash. Let’s look at this in more detail to understand how it works:

The first few lines save some values to a few variables; a variable is just a way to access a value later on. In this case, I’ve saved the path to the word list in a words variable, the path to a list of lowercase words in the lower variable, and a list of uppercase words in the upper variable. I can use these at any time in the Bash script with a “dollar sign” like $words to get the full path to the word list, at /usr/share/dict/words.

After that, the script runs the two gawk commands to generate the list of all-lowercase words and the list of words that start with an uppercase letter.

Then, the script uses a nested for loop to print 5 paragraphs. This also sets a variable called s that is a random number between 3 and 7. That’s because the $(( )) brackets create an arithmetic expansion, so Bash can do simple arithmetic. You probably know the basic arithmetic operators like add (+), subtract (-), multiply (*) and divide (/). You can also use % to mean modulo, or the remainder after division. For example, 9 % 4 is 1, because 9 divided by 4 is 2 with 1 left over. The arithmetic expansion to assign a value to s uses RANDOM to mean a random number, and taking the modulo of 5 will give a value in the range 0, 1, 2, 3, or 4. That means s can be in the range 3 (0 + 3) to 7 (4 + 3).
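
If you want to convince yourself of that range, one rough check is to draw a lot of values and remember the smallest and largest. This is only a sketch, and the variable names min and max are mine. In Bash, RANDOM gives an integer from 0 to 32767 each time it is read:

```shell
# Draw 1000 values of RANDOM % 5 + 3 and track the extremes.
min=99
max=0
for i in $(seq 1000) ; do
  s=$((RANDOM % 5 + 3))
  if [ "$s" -lt "$min" ] ; then min=$s ; fi
  if [ "$s" -gt "$max" ] ; then max=$s ; fi
done
echo "smallest: $min, largest: $max"
```

The values can never fall outside 3 to 7. Strictly speaking, the modulo makes some values very slightly more likely than others, but that hardly matters for placeholder text.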

The next loop generates that many random sentences, from 1 to s, using a similar trick to pick a random number of words (w) between 3 and 8.

The last line inside the “inner” loop uses two shuf commands to print 1 random word from the uppercase words, then the random number of words from the list of lowercase words. The random words are printed one per line, so I’ve added the tr command to translate the newline (\n) to a space. The sed command makes line-by-line edits to add a period to the end of the line. These commands generate a series of “sentences” that begin with an uppercase word followed by a random number of lowercase words, plus a period.
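
To watch the tr and sed steps without any randomness, this sketch feeds the same pipeline a fixed list of made-up words in place of the two shuf commands. The parentheses group the two commands so their combined output goes into the pipe:

```shell
# Fixed words stand in for the shuf output, one word per line.
sentence=$( ( echo "Alpha" ; printf '%s\n' beta gamma ) | tr '\n' ' ' | sed 's/ $/. /' )

# The brackets make the trailing space visible.
echo "[$sentence]"
# prints: [Alpha beta gamma. ]
```

The trailing space after the period is what separates one sentence from the next when the loop prints several in a row.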

After each paragraph, the script uses an echo command to print an extra newline. Actually, the echo command itself also generates a newline, so this command effectively prints 2 newlines: one to end the paragraph’s line, and one to leave a blank line before the next paragraph.
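
A quick way to check how many newlines that echo prints is to count them with wc. The printf line in this sketch is an alternative I am suggesting, not something the script uses; it spells out both newlines explicitly:

```shell
# Count the newline characters each command prints.
with_echo=$(echo -e '\n' | wc -l)
with_printf=$(printf '\n\n' | wc -l)
echo "echo -e: $with_echo newlines, printf: $with_printf newlines"
```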

The last line in the script cleans up my temporary files by deleting them.

A few samples

Whenever I need to generate some placeholder text for a project, I can just run this Bash script to print out a few paragraphs. Every time I run the script, it prints 5 paragraphs of a few sentences, each with a reasonable number of words. This is somewhat representative of text that I might include in a document.

The script prints each paragraph on a single line. To make it more readable, I’ll send the output through the fmt program to “wrap” the lines:

$ bash mkwords.bash | fmt
Attalie adicity sebate arecain. Jedthus azaleas ottos calor omniana. Alber
shammes talpa resoaks micmac ducs anchors vil frosty. Olalla gesling
nooses trashy downby gnosis pituri sambuca magmata. Borda durably salada
dubbin sanable femoral cubane. Gorizia pirojki viper mattins jitters
rongeur theos laciest cretic.

Vinie driven outgaze sleepry. Lepper lansing ogams trams cruiser italite
outstay. Niort slavers noecho tugriks swaddle. Fassold plagal vlei unioid
mellows bunty weals. Loy prahus stare rowable inlayed. Vally pigmy joeyes
zincify balada clethra pks tineine.

Inola perit peggy filled. Yarura oceanic taunt scrath rapids crusta
wyches. Aleus gleety bumphs staw caaba ratio cliffs. Stigler ortman
decay faucals.

Nyoro atoxic asses melvie. Blau insteep chaw couac. Boff clite sodless
arzan.

Gerhan feudary espinal shoad libra brunion debts rosing. Alvito fister
quested buxom pennant impower tabstop stylize outrick. Dupuis caffle
emerick neems hagbut equinox.

Every time I run the mkwords.bash script, it generates new random words, sentences, and paragraphs.

This script works well for me, but you can still improve it. For example, every time the script runs, it rebuilds the same two word lists from the /usr/share/dict/words file. Since the system’s word list doesn’t change very often, you can make this script run faster if you save the filtered word lists somewhere in your home directory, and only regenerate them if they are not there.
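
One possible way to sketch that idea is a small helper function that runs a command only when its output file is missing or empty. The function name cache_output and the cache path in the comments are hypothetical, chosen just for this example:

```shell
# Hypothetical helper: cache_output FILE COMMAND [ARGS...]
# Runs COMMAND and saves its output to FILE, but only when FILE
# is missing or empty (the -s test is true for a non-empty file).
cache_output() {
  local out=$1
  shift
  if [ ! -s "$out" ] ; then
    "$@" > "$out"
  fi
}

# Possible use in the script (the cache path is an assumption):
#   lower=$HOME/.cache/mkwords/lower.txt
#   mkdir -p "$(dirname "$lower")"
#   cache_output "$lower" gawk '/^[[:lower:]]+$/ {if ((length($0)>2) && (length($0)<8)) {print}}' /usr/share/dict/words
```

If you take this approach, you would also remove the rm command at the end of the script, so the saved lists are still there for the next run.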

Also, the /usr/share/dict/words file contains some words that are not work-friendly. So instead of using the system’s word list, you might make your own list of words to use. One way to create a list like this is to use the words from other documents you have already written, and use that word list as the starting point.
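
As a rough sketch of that idea, the function below pulls every distinct word out of one or more text files. The name extract_words and the file paths in the comments are hypothetical, and the approach assumes plain English text; a word like “don’t” is split at the apostrophe:

```shell
# Hypothetical helper: print every distinct word found in the text
# files given as arguments, one word per line.
# tr turns each run of non-letters into a single newline, grep drops
# any empty line, and sort -u sorts the words and removes duplicates.
extract_words() {
  cat "$@" | tr -cs '[:alpha:]' '\n' | grep -v '^$' | LC_ALL=C sort -u
}

# Possible use (the paths are assumptions):
#   extract_words "$HOME"/Documents/*.txt > "$HOME/mywords.txt"
# then point the script at the new list: words=$HOME/mywords.txt
```

Because the output keeps the original capitalization, the script’s two gawk commands can still split it into lowercase words and words that start with an uppercase letter.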

But if you just want to generate a few paragraphs of random-length sentences with random words, this script will do the job. And you can do it on your own with a Bash script.

This article is based on Generating your own random text by Jim Hall, and is republished with the author’s permission.
