3 ways to read files in C


When you’re just starting out with learning a new programming language, it’s good to stick to the basics until you have a more solid understanding of how the language works. With that foundation, you can move up to higher levels and more sophisticated algorithms to create more interesting programs.

That’s why when I write articles about learning how to write programs, I tend to stick to the basics. In these “entry level” articles, I don’t want to lose my audience, so I stick to simple programming methods that are easy to understand – even if they aren’t the most efficient way to do it. For example, to demonstrate how to write your own version of the cat program, I might use stream functions like fgetc to read a single character from the input and fputc to print a single character on the output.

While reading and writing one character at a time isn’t a very fast way to print the contents of a text file, it’s simple enough that most new programmers can see what’s going on. Let’s look at three different ways that you could write a cat program, at three different levels: easy but slow, simple and fast, and most efficient.

Starting a ‘cat’ program

The cat command reads multiple files and concatenates them to the output, such as printing the contents to the user’s terminal. To implement the basics, we need a program called cat.c that processes all the files on the command line, opens them, and prints their contents. Additionally, if the user didn’t list any files, we can read from standard input and copy that to standard output.

#include <stdio.h>

void cpytext(FILE * in, FILE * out);

int
main(int argc, char **argv)
{
  FILE *in;

  for (int i = 1; i < argc; i++) {
    in = fopen(argv[i], "r");

    if (in) {
      cpytext(in, stdout);
      fclose(in);
    }
    else {
      fputs("cannot open file: ", stderr);
      fputs(argv[i], stderr);
      fputc('\n', stderr);
    }
  }

  if (argc == 1) {
    /* no input files, read from stdin */
    cpytext(stdin, stdout);
  }

  return 0;
}

This is a very simple program that uses a for loop to iterate over the command line arguments, stored in the argv array. The first item in the array (element 0) is the name of the program itself, so the loop actually starts at element 1, the first command line argument. For each argument, the program opens that file and uses cpytext to print its contents to standard output.

We can write separate implementations of cpytext to create new versions of the cat program that use different methods to print the contents of text files.

Easy but slow: One character at a time

The stream functions in stdio.h present a simple way to read and write data. We can use the fgetc function to read one character at a time from a file, and fputc to print one character at a time to a different file. Writing a cpytext function with these functions is just a matter of reading with fgetc and writing with fputc until we reach the end of the file:

#include <stdio.h>

void
cpytext(FILE *in, FILE *out)
{
  /* copy one character at a time */

  int ch;

  while ((ch = fgetc(in)) != EOF) {
    fputc(ch, out);
  }
}

This method is easy to explain: The cpytext function takes two file pointers: one for the input and another for the output. cpytext then reads data from the input, one character at a time, and uses fputc to print it to the output. When fgetc encounters the end of the file, it stops.

If we save that file as cpy1.c then we can compile a new cat program called cat1 like this:

$ gcc -o cat1 cat.c cpy1.c

Simple and fast: One line at a time

Reading and writing one character at a time is easy to explain, but the method is slow. Every call to fgetc carries a little overhead, even though the C library buffers the underlying reads, and that overhead adds up when a file contains millions of characters. We can be more efficient by reading and writing more data at once, such as working with one line at a time.

The getline function from stdio.h will read an entire string into memory at once. This is similar to the fgets function, but with one important difference: where fgets reads data into a variable of a fixed size, getline can resize the array to fit the whole line into memory.

To use getline, you can allocate a buffer yourself and pass a pointer to it along with a variable holding its size. Or, set the pointer to NULL and the size to zero, and getline will allocate (and grow) the buffer on its own.

Using getline requires more memory than fgetc because it’s storing an entire line of text, but otherwise the basic algorithm is the same: Read a line of text from the input, then print that line to the output.

#include <stdio.h>
#include <stdlib.h>

void
cpytext(FILE *in, FILE *out)
{
  char *line = NULL;
  size_t size = 0;
  ssize_t len;

  while ((len = getline(&line, &size, in)) != -1) {
    fputs(line, out);
  }

  free(line);
}

Note that getline is meant to read text data, not copy data between files. But if the use case is to implement a cat program that prints the contents of text files, we should be okay.
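One reason is that fputs treats the line as a C string: if the input happened to contain a NUL byte, fputs would stop printing at that byte. Since getline returns the number of characters it actually read, a hypothetical variant (a sketch, not the version above) could use fwrite with that length instead:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

void
cpytext(FILE *in, FILE *out)
{
  char *line = NULL;
  size_t size = 0;
  ssize_t len;

  while ((len = getline(&line, &size, in)) != -1) {
    /* write exactly len bytes, so an embedded NUL byte
       in the input does not truncate the output */
    fwrite(line, sizeof(char), (size_t) len, out);
  }

  free(line);
}
```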

If we save that file as cpyline.c then we can compile a new cat program called catline like this:

$ gcc -o catline cat.c cpyline.c

Most efficient: Read a block of data

One problem with using getline to print the contents of a text file arises when the program encounters a large file that contains exactly one line. In that case, getline must read the entire file into memory before it can print anything. That's not a great way to use memory.

Instead, we can use the fread function to read a block of data from a file at once, then use fwrite to write the same block to a different file. To do this, we need to use the feof function to tell us when we’ve reached the end of the file. Otherwise, the general algorithm is the same: Read from the input, then write to the output.

#include <stdio.h>

#define BUFSIZE 128

void
cpytext(FILE *in, FILE *out)
{
  char buf[BUFSIZE];

  size_t numread;

  while (!feof(in)) {
    numread = fread(buf, sizeof(char), BUFSIZE, in);

    if (numread > 0) {
      fwrite(buf, sizeof(char), numread, out);
    }
  }
}

The fread function reads data into a buffer, called buf, which has a fixed size of 128. fread will read up to 128 characters from the input, and store them in buf, then return a count of how many characters it actually read. We store that in a variable called numread which we use with fwrite to copy the contents of the buffer to the output.

If we save that file as cpybuf.c then we can compile a new cat program called catbuf like this:

$ gcc -o catbuf cat.c cpybuf.c

How they compare

The basic algorithm remains the same across each implementation of cpytext, although the details change: Read data from one file, and print it to another file. However, each version performs quite differently.

Let’s demonstrate how quickly each method can run by using cat to copy the contents of a large text file. The /usr/share/dict/words file contains a long list of words, which can be used by spell-checking programs. On my Fedora Linux system, this is a 4.8 MB file that contains almost a half million words:

$ wc -l /usr/share/dict/words
479826 /usr/share/dict/words
$ ls -H -sh /usr/share/dict/words
4.8M /usr/share/dict/words

The time command will run a program and then print how much time that program needed to execute, broken down by “real” time (wall-clock time, from start to finish), “user” time (CPU time spent in the program’s own code) and “system” time (CPU time spent in the kernel on the program’s behalf). To time how long it takes to read the /usr/share/dict/words file with the /bin/cat command, and save the output to a temporary file called w, we can type this:

$ time /bin/cat /usr/share/dict/words > w

To verify that the file didn’t change as we copied it with cat, we can use the cmp program; cmp prints any differences between two files, and otherwise remains silent if they are the same. For example, to compare /usr/share/dict/words with the w file, type this:

$ cmp /usr/share/dict/words w

If cmp doesn’t print anything, we know the two files are the same.

To compare the run times of each implementation, we can write a script to run each version and report the times. I’ve added the /bin/cat program twice, at the start and at the end, because the operating system will “buffer” the contents of a file the first time we read it. We can then ignore the first /bin/cat time, and use the second time.

#!/bin/sh

words=/usr/share/dict/words

echo '/bin/cat..'
time /bin/cat $words > w
cmp $words w

echo 'cat1..'
time ./cat1 $words > w
cmp $words w

echo 'catline..'
time ./catline $words > w
cmp $words w

echo 'catbuf..'
time ./catbuf $words > w
cmp $words w

echo '/bin/cat..'
time /bin/cat $words > w
cmp $words w

If we save this script as runall, we can run it to compare each cat implementation at once:

$ ./runall 
/bin/cat..

real    0m0.007s
user    0m0.002s
sys 0m0.005s
cat1..

real    0m0.073s
user    0m0.047s
sys 0m0.016s
catline..

real    0m0.033s
user    0m0.015s
sys 0m0.008s
catbuf..

real    0m0.018s
user    0m0.004s
sys 0m0.006s
/bin/cat..

real    0m0.002s
user    0m0.000s
sys 0m0.002s

We can see that reading and writing one character at a time with fgetc and fputc (cat1) was the slowest method, requiring 73 milliseconds to copy the 4.8 MB text file. Reading a line at a time using getline (in catline) was noticeably faster, at 33 milliseconds. But reading and writing a block of data at a time using fread and fwrite (catbuf) was faster still, at only 18 milliseconds.

Our catbuf implementation read 128 characters at a time, which is good, but still quite small. The program can run faster with a larger buffer. The system /bin/cat program uses this method with a much larger buffer, and takes virtually no time at all: only 2 milliseconds to read 4.8 MB of text data.
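As a sketch of that idea, here's a variant of cpytext with a 64 KB buffer; the size is an arbitrary choice for illustration, and the best value depends on the system. It also loops on fread's return value rather than calling feof, which works because fread returns 0 at end of file:

```c
#include <stdio.h>

#define BIGBUF 65536  /* illustrative size; tune for your system */

void
cpytext(FILE *in, FILE *out)
{
  static char buf[BIGBUF];  /* static keeps the large buffer off the stack */
  size_t numread;

  /* fread returns 0 at end of file (or on error), so we can
     loop on its return value instead of calling feof */
  while ((numread = fread(buf, sizeof(char), BIGBUF, in)) > 0) {
    fwrite(buf, sizeof(char), numread, out);
  }
}
```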

Slowing it down

You might wonder: why bother, if the difference is so small? My quad-core Intel(R) Core(TM) i3-8100T CPU @ 3.10GHz is certainly fast, but consider the performance impact on slower systems.

Let’s run the same test on a slower system. I have a virtual machine running FreeBSD, which I use for testing. FreeBSD is actually a fast operating system, but since it’s running in a virtual machine, I can slow it down by running the virtual machine without KVM acceleration.

The /usr/share/dict/words file is smaller on FreeBSD than on Linux, at just over 236,000 words. The file itself is 2.4 MB in size:

$ wc -l /usr/share/dict/words 
  236007 /usr/share/dict/words
$ ls -s -H /usr/share/dict/words 
2496 /usr/share/dict/words

To make a more direct comparison between my fast Linux running on real hardware and my FreeBSD instance running on an artificially slow virtual machine, I’ll double the size of the text file by copying its contents twice to a new file called words in my working directory. The new file approaches half a million words, and is about 4.8 MB in size; both measurements are about the same as on Linux:

$ cat /usr/share/dict/words /usr/share/dict/words > words
$ wc -l words 
  472014 words
$ ls -lh words 
-rw-r--r--  1 jhall jhall  4.8M May 15 14:37 words

I’ve compiled the same source files on FreeBSD, and this is my output when running the virtual machine without using KVM:

$ ./runall 
/bin/cat..
        0.04 real         0.00 user         0.04 sys
cat1..
        2.85 real         2.71 user         0.11 sys
catline..
        0.67 real         0.59 user         0.07 sys
catbuf..
        0.15 real         0.08 user         0.05 sys
/bin/cat..
        0.03 real         0.00 user         0.03 sys

Running FreeBSD without KVM simulates a much slower system, where we can see a more dramatic difference between these programs. Reading and writing one character at a time (cat1) is quite slow, requiring 2.85 seconds to copy the 4.8 MB text file. But reading one line at a time with getline (as catline) is much better, at about 0.67 seconds of real time. Reading and writing 128 characters at a time (catbuf) is faster still, at only 150 milliseconds to copy the 4.8 MB text file. The system /bin/cat program uses the same method but with a larger buffer, so it needs only about 30 milliseconds to print the text file.

Teaching using simple methods

When I write “introductory” articles about how to get started in programming, I try to write my sample programs in a way that everyone can see what’s going on. Learning how to program for the first time is challenging enough without adding complicated algorithms on top. My approach to “writing your first program” is to learn the basics first; you can move on to more advanced methods later. So I might use fgetc and fputc to demonstrate how to write your own version of cat on Linux or TYPE on FreeDOS, even though there’s a better, faster way to do it.
