{"id":5327,"date":"2024-05-17T02:00:00","date_gmt":"2024-05-17T06:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=5327"},"modified":"2024-05-15T10:59:15","modified_gmt":"2024-05-15T14:59:15","slug":"3-ways-to-read-files-in-c","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=5327","title":{"rendered":"3 ways to read files in C"},"content":{"rendered":"<p>When you\u2019re just starting out with learning a new programming language, it\u2019s good to stick to the basics until you have a more solid understanding of how the language works. With that foundation, you can move up to higher levels and more sophisticated algorithms to create more interesting programs.<\/p>\n\n\n\n<p>That\u2019s why when I write articles about <em>learning how to write programs<\/em>, I tend to stick to the basics. In these \u201centry level\u201d articles, I don\u2019t want to lose my audience, so I stick to simple programming methods that are easy to understand &#8211; even if they aren\u2019t the most efficient way to do it. 
For example, to demonstrate how to write your own version of the <code>cat<\/code> program, I might use <em>stream<\/em> functions like <code>fgetc<\/code> to read a single character from the input and <code>fputc<\/code> to print a single character on the output.<\/p>\n\n\n\n<p>While <em>reading and writing one character at a time<\/em> isn\u2019t a very fast way to print the contents of a text file, it\u2019s simple enough that most new programmers can see what\u2019s going on. Let\u2019s look at three different ways that you could write a <code>cat<\/code> program, at three different levels: easy but slow, simple and fast, and most efficient.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"starting-a-cat-program\">Starting a \u2018cat\u2019 program<\/h2>\n\n\n\n<p>The <code>cat<\/code> command reads multiple files and <em>concatenates<\/em> them to the output, such as printing the contents to the user\u2019s terminal. To implement the basics, we need a program called <code>cat.c<\/code> that processes all the files on the command line, opens them, and prints their contents. Additionally, if the user didn\u2019t list any files, we can read from <em>standard input<\/em> and copy that to <em>standard output<\/em>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n\nvoid cpytext(FILE * in, FILE * out);\n\nint\nmain(int argc, char **argv)\n{\n  FILE *in;\n\n  for (int i = 1; i &lt; argc; i++) {\n    in = fopen(argv&#91;i], \"r\");\n\n    if (in) {\n      cpytext(in, stdout);\n      fclose(in);\n    }\n    else {\n      fputs(\"cannot open file: \", stderr);\n      fputs(argv&#91;i], stderr);\n      fputc('\\n', stderr);\n    }\n  }\n\n  if (argc == 1) {\n    \/* no input files, read from stdin *\/\n    cpytext(stdin, stdout);\n  }\n\n  return 0;\n}<\/code><\/pre>\n\n\n\n<p>This is a very simple program that uses <code>for<\/code> to iterate over the command line arguments, stored in the <code>argv<\/code> array. 
The first item in the array (element 0) is the name of the program itself, so the <code>for<\/code> loop actually starts at element 1 for the first command line argument. For each argument, the program opens the file and uses <code>cpytext<\/code> to print its contents to standard output.<\/p>\n\n\n\n<p>We can write separate implementations of <code>cpytext<\/code> to create new versions of the <code>cat<\/code> program that use different methods to print the contents of text files.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"easy-but-slow-one-character-at-a-time\">Easy but slow: One character at a time<\/h2>\n\n\n\n<p>The stream functions in <code>stdio.h<\/code> present a simple way to read and write data. We can use the <code>fgetc<\/code> function to read one character at a time from a file, and <code>fputc<\/code> to print one character at a time to a different file. Writing a <code>cpytext<\/code> function with these functions is just an exercise of reading with <code>fgetc<\/code> and writing with <code>fputc<\/code> until we reach the end of the file:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n\nvoid\ncpytext(FILE *in, FILE *out)\n{\n  \/* copy one character at a time *\/\n\n  int ch;\n\n  while ((ch = fgetc(in)) != EOF) {\n    fputc(ch, out);\n  }\n}<\/code><\/pre>\n\n\n\n<p>This method is easy to explain: The <code>cpytext<\/code> function takes two file pointers, one for the input and another for the output. <code>cpytext<\/code> then reads data from the input, one character at a time, and uses <code>fputc<\/code> to print it to the output. 
When <code>fgetc<\/code> reaches the end of the file, it returns <code>EOF<\/code> and the loop stops.<\/p>\n\n\n\n<p>If we save that file as <code>cpy1.c<\/code> then we can compile a new <code>cat<\/code> program called <code>cat1<\/code> like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ gcc -o cat1 cat.c cpy1.c<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"simple-and-fast-one-line-at-a-time\">Simple and fast: One line at a time<\/h2>\n\n\n\n<p>Reading and writing one character at a time is easy to explain, but the method is slow. Every time <code>fgetc<\/code> reads a single character from a file, the program has to do a little extra work for the function call, even though the C library buffers the underlying reads from the operating system. We can be somewhat more efficient by reading and writing more data at once, such as working with one <em>line<\/em> at a time.<\/p>\n\n\n\n<p>The <code>getline<\/code> function, declared in <code>stdio.h<\/code> on POSIX systems such as Linux and FreeBSD, will read an entire line into memory at once. This is similar to the <code>fgets<\/code> function, but with one important difference: where <code>fgets<\/code> reads data into a variable of a fixed size, <code>getline<\/code> can resize the buffer to fit the whole line into memory.<\/p>\n\n\n\n<p>To use <code>getline<\/code>, you first need to allocate memory with <code>malloc<\/code>, assign it to a <em>pointer<\/em>, and set a variable to indicate the size. 
Or, don\u2019t allocate memory (set the pointer to <code>NULL<\/code>) and <code>getline<\/code> will allocate memory on its own.<\/p>\n\n\n\n<p>Using <code>getline<\/code> requires more memory than <code>fgetc<\/code> because it\u2019s storing an entire line of text, but otherwise the basic algorithm is the same: Read a line of text from the input, then print that line to the output.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;stdlib.h&gt;\n\nvoid\ncpytext(FILE *in, FILE *out)\n{\n  char *line = NULL;\n  size_t size = 0;\n  ssize_t len;\n\n  while ((len = getline(&amp;line, &amp;size, in)) != -1) {\n    fputs(line, out);\n  }\n\n  free(line);\n}<\/code><\/pre>\n\n\n\n<p>Note that <code>getline<\/code> is meant to read text data, not copy data between files. But if the use case is to implement a <code>cat<\/code> program that prints the contents of text files, we should be okay.<\/p>\n\n\n\n<p>If we save that file as <code>cpyline.c<\/code> then we can compile a new <code>cat<\/code> program called <code>catline<\/code> like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ gcc -o catline cat.c cpyline.c<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"most-efficient-read-a-block-of-data\">Most efficient: Read a block of data<\/h2>\n\n\n\n<p>One problem with using <code>getline<\/code> to print the contents of a text file is when the program encounters a large file that has exactly one line. Then, the <code>getline<\/code> function must read the <em>entire file<\/em> into memory before it can print it. That\u2019s not a great way to use memory.<\/p>\n\n\n\n<p>Instead, we can use the <code>fread<\/code> function to read a block of data from a file at once, then use <code>fwrite<\/code> to write the same block to a different file. To do this, we need to use the <code>feof<\/code> function to tell us when we\u2019ve reached the end of the file. 
Otherwise, the general algorithm is the same: Read from the input, then write to the output.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n\n#define BUFSIZE 128\n\nvoid\ncpytext(FILE *in, FILE *out)\n{\n  \/* copy one block at a time *\/\n\n  char buf&#91;BUFSIZE];\n\n  size_t numread;\n\n  \/* stop at the end of the file, or if a read error occurs *\/\n\n  while (!feof(in) &amp;&amp; !ferror(in)) {\n    numread = fread(buf, sizeof(char), BUFSIZE, in);\n\n    if (numread &gt; 0) {\n      fwrite(buf, sizeof(char), numread, out);\n    }\n  }\n}<\/code><\/pre>\n\n\n\n<p>The <code>fread<\/code> function reads data into a buffer, called <code>buf<\/code>, which has a fixed size of 128. <code>fread<\/code> will read up to 128 characters from the input, and store them in <code>buf<\/code>, then return a count of how many characters it actually read. We store that in a variable called <code>numread<\/code> which we use with <code>fwrite<\/code> to copy the contents of the buffer to the output. The loop also checks <code>ferror<\/code> so that a read error ends the loop instead of repeating it forever.<\/p>\n\n\n\n<p>If we save that file as <code>cpybuf.c<\/code> then we can compile a new <code>cat<\/code> program called <code>catbuf<\/code> like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ gcc -o catbuf cat.c cpybuf.c<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-they-compare\">How they compare<\/h2>\n\n\n\n<p>The basic algorithm remains the same across each implementation of <code>cpytext<\/code>, although the details change: Read data from one file, and print it to another file. However, each version performs quite differently.<\/p>\n\n\n\n<p>Let\u2019s demonstrate how quickly each method can run by using <code>cat<\/code> to copy the contents of a large text file. The <code>\/usr\/share\/dict\/words<\/code> file contains a long list of words, which can be used by spell-checking programs. 
On my Fedora Linux system, this is a 4.8 MB file that contains almost a half million words:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -l \/usr\/share\/dict\/words\n479826 \/usr\/share\/dict\/words\n$ ls -H -sh \/usr\/share\/dict\/words\n4.8M \/usr\/share\/dict\/words<\/code><\/pre>\n\n\n\n<p>The <code>time<\/code> command will run a program and then print how much time that program needed to execute, broken down by \u201creal\u201d time (elapsed time from start to finish), \u201cuser\u201d time (CPU time spent running the program\u2019s own code) and \u201csystem\u201d time (CPU time the kernel spent working on the program\u2019s behalf). To time how long it takes to read the <code>\/usr\/share\/dict\/words<\/code> file with the <code>\/bin\/cat<\/code> command, and save the output to a temporary file called <code>w<\/code>, we can type this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ time \/bin\/cat \/usr\/share\/dict\/words &gt; w<\/code><\/pre>\n\n\n\n<p>To verify that the file didn\u2019t change as we copied it with <code>cat<\/code>, we can use the <code>cmp<\/code> program; <code>cmp<\/code> reports the first difference between two files, and remains silent if they are the same. For example, to compare <code>\/usr\/share\/dict\/words<\/code> with the <code>w<\/code> file, type this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cmp \/usr\/share\/dict\/words w<\/code><\/pre>\n\n\n\n<p>If <code>cmp<\/code> doesn\u2019t print anything, we know the two files are the same.<\/p>\n\n\n\n<p>To compare the run times of each implementation, we can write a script to run each version and report the times. I\u2019ve added the <code>\/bin\/cat<\/code> program twice, at the start and at the end, because the operating system will \u201cbuffer\u201d the contents of a file the first time we read it. 
We can then ignore the first <code>\/bin\/cat<\/code> time, and use the second time.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/bin\/sh\n\nwords=\/usr\/share\/dict\/words\n\necho '\/bin\/cat..'\ntime \/bin\/cat $words &gt; w\ncmp $words w\n\necho 'cat1..'\ntime .\/cat1 $words &gt; w\ncmp $words w\n\necho 'catline..'\ntime .\/catline $words &gt; w\ncmp $words w\n\necho 'catbuf..'\ntime .\/catbuf $words &gt; w\ncmp $words w\n\necho '\/bin\/cat..'\ntime \/bin\/cat $words &gt; w\ncmp $words w<\/code><\/pre>\n\n\n\n<p>If we save this script as <code>runall<\/code>, we can run it to compare each <code>cat<\/code> implementation at once:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ .\/runall \n\/bin\/cat..\n\nreal    0m0.007s\nuser    0m0.002s\nsys 0m0.005s\ncat1..\n\nreal    0m0.073s\nuser    0m0.047s\nsys 0m0.016s\ncatline..\n\nreal    0m0.033s\nuser    0m0.015s\nsys 0m0.008s\ncatbuf..\n\nreal    0m0.018s\nuser    0m0.004s\nsys 0m0.006s\n\/bin\/cat..\n\nreal    0m0.002s\nuser    0m0.000s\nsys 0m0.002s<\/code><\/pre>\n\n\n\n<p>We can see that reading and writing one character at a time with <code>fgetc<\/code> and <code>fputc<\/code> (<code>cat1<\/code>) was the slowest method, requiring 73 milliseconds to copy the 4.8 MB text file. Reading a line at a time using <code>getline<\/code> (in <code>catline<\/code>) was noticeably faster, at 33 milliseconds. But reading and writing a block of data at a time using <code>fread<\/code> and <code>fwrite<\/code> (<code>catbuf<\/code>) was faster still, at only 18 milliseconds.<\/p>\n\n\n\n<p>Our <code>catbuf<\/code> implementation read 128 characters at a time, which is good, but still quite small. The program can run faster with a larger buffer. 
The system <code>\/bin\/cat<\/code> program reads in blocks too, but with a much larger buffer, and takes virtually no time at all: only 2 milliseconds to read 4.8 MB of text data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"slowing-it-down\">Slowing it down<\/h2>\n\n\n\n<p>You might wonder why you should bother if the difference is so small. My quad-core Intel(R) Core(TM) i3-8100T CPU @ 3.10GHz is certainly very fast, but consider the performance impact on slower systems.<\/p>\n\n\n\n<p>Let\u2019s run the same test on a slower system. I have a virtual machine running FreeBSD, which I use for testing. FreeBSD is actually a fast operating system, but since it\u2019s running in a virtual machine, I can slow it down by running the virtual machine without KVM acceleration.<\/p>\n\n\n\n<p>The <code>\/usr\/share\/dict\/words<\/code> file is smaller on FreeBSD than on Linux, at just over 236,000 words. The file itself is 2.4 MB in size:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -l \/usr\/share\/dict\/words \n  236007 \/usr\/share\/dict\/words\n$ ls -s -H \/usr\/share\/dict\/words \n2496 \/usr\/share\/dict\/words<\/code><\/pre>\n\n\n\n<p>To make a more direct comparison between my fast Linux system running on real hardware and my FreeBSD instance running on an artificially slow virtual machine, I\u2019ll double the size of the text file by copying its contents twice to a new file called <code>words<\/code> in my working directory. 
The new file approaches half a million words, and is about 4.8 MB in size; both measurements are about the same as on Linux:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cat \/usr\/share\/dict\/words \/usr\/share\/dict\/words &gt; words\n$ wc -l words \n  472014 words\n$ ls -lh words \n-rw-r--r--  1 jhall jhall  4.8M May 15 14:37 words<\/code><\/pre>\n\n\n\n<p>I\u2019ve compiled the same source files on FreeBSD, and this is my output when running the virtual machine without using KVM:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ .\/runall \n\/bin\/cat..\n        0.04 real         0.00 user         0.04 sys\ncat1..\n        2.85 real         2.71 user         0.11 sys\ncatline..\n        0.67 real         0.59 user         0.07 sys\ncatbuf..\n        0.15 real         0.08 user         0.05 sys\n\/bin\/cat..\n        0.03 real         0.00 user         0.03 sys<\/code><\/pre>\n\n\n\n<p>Running FreeBSD without KVM simulates a much slower system, where we can see a more dramatic difference between these programs. Reading and writing one character at a time (<code>cat1<\/code>) is quite slow, requiring 2.85 seconds to copy the 4.8 MB text file. But reading one line at a time with <code>getline<\/code> (as <code>catline<\/code>) is much better, at about 670 milliseconds of real time. Reading and writing 128 characters at a time (<code>catbuf<\/code>) is faster still, at only 150 milliseconds to copy the 4.8 MB text file. The system <code>\/bin\/cat<\/code> program also reads in blocks but with a larger buffer, so it needs only 30 milliseconds to print the text file.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"teaching-using-simple-methods\">Teaching using simple methods<\/h2>\n\n\n\n<p>When I write \u201cintroductory\u201d articles about how to get started in programming, I try to write my sample programs in a way that everyone can see what\u2019s going on. Learning how to program for the first time is challenging enough without adding sophisticated algorithms. 
I approach \u201cwriting your first program\u201d as <em>learn the basics first<\/em> then you can move on to more advanced methods. So I might use <code>fgetc<\/code> and <code>fputc<\/code> to demonstrate how to write your own version of <code>cat<\/code> on Linux or <code>TYPE<\/code> on FreeDOS, even though there\u2019s a better, faster way to do it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There\u2019s the simple way, and there\u2019s the fast way. Let\u2019s compare.<\/p>\n","protected":false},"author":33,"featured_media":2970,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[109,150,80],"tags":[403,91,152],"class_list":["post-5327","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-curiosity","category-programming","category-tips-and-tricks","tag-freebsd","tag-linux","tag-programming"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5327","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5327"}],"version-history":[{"count":1,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5327\/revisions"}],"predecessor-version":[{"id":5328,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5327\/revisions\/5328"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/2970"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5327"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5327"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5327"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}