Tuesday, January 01, 2008

Difference between CR and LF

Sourced from here

The main difference is that CR and LF are characters, while EOF is more of a concept - though the end of a file can be denoted by a special marker character, too.

Generally speaking, under Windows and DOS, lines of text in a file end with the two-byte character sequence CR LF; these characters have the ASCII values 13 and 10, respectively. Under *nix, a line of text ends with just LF, ASCII character 10.
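As a rough illustration of the two conventions, the sketch below tallies the line endings found in a buffer of raw bytes. The struct and function names are hypothetical, chosen only for this example, and the buffer is assumed to hold ASCII text.

```c
#include <stddef.h>

/* Hypothetical helper type: tallies of the line-ending styles in a buffer. */
struct eol_counts {
    size_t crlf;  /* CR LF pairs (13, 10): DOS/Windows style */
    size_t lf;    /* LF (10) not preceded by CR: *nix style */
};

/* Counts each LF in buf, classifying it by whether a CR comes right before. */
struct eol_counts count_line_endings(const unsigned char *buf, size_t len)
{
    struct eol_counts counts = {0, 0};
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == 10) {
            if (i > 0 && buf[i - 1] == 13)
                counts.crlf++;
            else
                counts.lf++;
        }
    }
    return counts;
}
```

A file that reports only crlf counts was most likely written under DOS or Windows; one that reports only lf counts most likely came from a *nix system.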

Under DOS and Windows, a file might also contain an end of file marker character; the ASCII value of this character is 0x1A (26 decimal). This character was used more to speed things up when working with certain kinds of files (text files, as opposed to binary files) than anything.
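To make the idea of a marker character concrete, here is a minimal sketch that finds where the text "logically" ends in a buffer of raw bytes. The macro and function names are hypothetical; only the value 0x1A comes from the description above.

```c
#include <stddef.h>

#define DOS_EOF_MARKER 0x1A  /* ASCII 26 */

/* Hypothetical helper: returns the number of bytes before the first
   end-of-file marker, or len if the buffer contains no marker. */
size_t logical_text_length(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == DOS_EOF_MARKER)
            return i;
    }
    return len;
}
```

Anything stored after the marker is still physically part of the file; it's simply ignored by code that honours the marker.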

Exactly how you open a file, read from it, or write to it, will vary with the programming language or tool that you're using; there are a lot of commonalities between the majority of them, though. Hopefully, a few examples in C will help clear this up. A few important points first, though: a carriage return (CR) - ASCII character code 13 - is also represented by the escape sequence "\r" (not including the quote marks), and a line feed (LF) - ASCII character code 10 - is also represented by the escape sequence "\n" (again, not including the quote marks).
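The escape sequences can be used anywhere a character constant is allowed. The small sketch below, with a hypothetical function name, maps a character to a short label; the numeric values in the comments assume an ASCII system.

```c
#include <stddef.h>

/* Hypothetical helper: returns a short name for CR and LF, NULL otherwise. */
const char *eol_char_name(int c)
{
    switch (c) {
    case '\r': return "CR";  /* 13 on ASCII systems */
    case '\n': return "LF";  /* 10 on ASCII systems */
    default:   return NULL;
    }
}
```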

Under DOS or Windows - assuming that we're using the C run-time library (RTL) and not the Windows application programming interface (API), or DOS function calls - you have to tell the RTL how you would like it to treat the file. Your options are "as a text file" and "as a binary file". If you tell the RTL to treat a file as a text file, the following things will happen:

When the sequence CR LF is encountered in a file, it's replaced, "behind the scenes", by just LF, which is ASCII character 10 and matches the constant '\n'.

If the end-of-file character is encountered, the file's end-of-file flag is set: fgetc() returns the value EOF (defined in stdio.h); feof() will return true (non-zero), and so on.
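The two text-mode behaviours just described can be imitated on an in-memory buffer. This is only a sketch of the idea, not what any particular RTL actually does internally, and the function name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper that mimics text-mode reading on raw bytes.
   Copies in[0..len) to out, dropping the CR of each CR LF pair and
   stopping at the first 0x1A end-of-file marker. out must have room
   for len bytes. Returns the number of bytes written to out. */
size_t translate_text_mode(const unsigned char *in, size_t len,
                           unsigned char *out)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        if (in[i] == 0x1A)
            break;                      /* end-of-file marker: stop here */
        if (in[i] == '\r' && i + 1 < len && in[i + 1] == '\n')
            continue;                   /* drop the CR of a CR LF pair */
        out[n++] = in[i];
    }
    return n;
}
```

In this sketch a CR that isn't followed by an LF is passed through unchanged; a real RTL may or may not treat it the same way.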

However, if you tell the RTL to treat the file as a binary file, the data is returned to you "raw" - no end-of-line marker translation is done, so you get both the CR and the LF; and the end-of-file marker - if present - is treated just like any other byte in the file: its value is returned to you.

With most compilers, under Windows or DOS, in C, a file is opened as a binary file by using fopen() and specifying the letter "b" after the mode you wish to open the file in (r, w, a, etc.). For example, to open a file named "test.dat" in read-only mode, in binary mode (without any translation of the file's contents), one could write:

FILE *f;

/* the first argument to fopen() is the file's name, not the FILE pointer */
if (!(f = fopen("test.dat", "rb")))
{
    /* an error occurred */
}
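For a fuller picture, the sketch below writes a few bytes in binary mode and reads them back in binary mode, so that a CR, an LF and a 0x1A byte all come back untouched. The function name and its parameters are hypothetical, and error handling is kept to a minimum.

```c
#include <stdio.h>

/* Hypothetical helper: writes len bytes to path in binary mode, then reads
   the file back in binary mode into out (capacity cap). Returns the number
   of bytes read back, or -1 if the file could not be opened or written. */
long binary_round_trip(const char *path, const unsigned char *data,
                       size_t len, unsigned char *out, size_t cap)
{
    FILE *f;
    size_t got;

    if (!(f = fopen(path, "wb")))
        return -1;
    if (fwrite(data, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;

    if (!(f = fopen(path, "rb")))
        return -1;
    got = fread(out, 1, cap, f);
    fclose(f);
    return (long)got;
}
```

Called with the five bytes 'a', CR, LF, 0x1A, 'b', this should hand back the same five bytes in the same order, since nothing is translated in either direction.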

A note regarding porting code between DOS/Windows and *nix systems: the above code compiles and runs unchanged on a *nix system. The C standard requires fopen() to accept "rb" (as well as "wb", "ab", etc.), and POSIX specifies that the "b" has no effect: generally speaking, *nix systems make no distinction between "text" and "binary" files, so every file is effectively opened in "binary" mode. The mode string is interpreted by the RTL at run time, not by the compiler, so the "b" does not affect compilation either. Keeping the "b" in code that reads or writes binary data is therefore the portable choice.



One last note regarding end-of-file and translation of characters in "text" mode on DOS and Windows systems: while, in theory, the rules are simple, in practice, they're a lot more complicated. If you were to take a program that used text files in "text" mode, and compile it, on a DOS / Windows system, with different compilers from different vendors, you can expect that, to some extent, you'll get different results. One way they differ is in how the final CR LF pair in a file is treated. In "text" mode, sometimes, you'll get a final "\n", and other times you won't. In "binary" mode, you'll always get the entire contents of the file.
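One defensive habit, offered here as a suggestion rather than something the rules above require, is to strip whatever line ending happens to be present before working with a line, so the code behaves the same whether or not a final "\n" (or a stray "\r") shows up. The function name is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: removes one trailing line ending ("\r\n", "\n",
   or a lone "\r") from a NUL-terminated string in place and returns
   the new length. */
size_t strip_line_ending(char *line)
{
    size_t len = strlen(line);

    if (len > 0 && line[len - 1] == '\n')
        len--;
    if (len > 0 && line[len - 1] == '\r')
        len--;
    line[len] = '\0';
    return len;
}
```

It can be applied to each line returned by fgets(), for example, regardless of the mode the file was opened in.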