3.2. Language Types

Programming languages are categorized into several types:

3.2.1. Machine Language

A machine language is a set of binary codes which is used to direct the activities of a Central Processing Unit (CPU).

Everything a computer does is the result of running machine language programs. No matter what language you use to program, the computer hardware is ultimately running machine language in order to execute your program.

Individual codes in the machine language are known as instructions. Hence, a machine language is also known as an instruction set. Each instruction consists of an operation code (opcode for short) and possibly some operands. For example, an add instruction contains a binary opcode that causes the CPU to initiate a sequence of operations to add two numbers, and usually two or three operands that specify where to get the two numbers to be added, and where to store the result.

The CPU reads instructions from memory, and the bits in the instruction trigger "switches" in the CPU, causing it to execute the instruction. For example, when the MIPS microprocessor reads the machine instruction

	    00000000010000110001100000100000
	    

this causes the processor to add the contents of registers 2 and 3, and store the result back into register 3. The meaning of each bit in this instruction is depicted in Table 3.1, “Example MIPS Instruction”.

Table 3.1. Example MIPS Instruction

opcode  source1     source2     destination  unused  opcode continued
000000  00010       00011       00011        00000   100000
add     register 2  register 3  register 3   -       add
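
The field layout in Table 3.1 can be checked mechanically. The following C function is a minimal sketch that splits a 32-bit instruction word into the six fields shown in the table, using shifts and masks. The struct and function names are made up for this illustration and are not part of any MIPS toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The six fields of the instruction format shown in Table 3.1
   (hypothetical names chosen for this sketch) */
struct rtype_fields {
    unsigned opcode;        /* bits 31-26 */
    unsigned source1;       /* bits 25-21 */
    unsigned source2;       /* bits 20-16 */
    unsigned destination;   /* bits 15-11 */
    unsigned unused;        /* bits 10-6  */
    unsigned opcode2;       /* bits 5-0, "opcode continued" */
};

/* Split a 32-bit instruction word into its fields */
void decode_rtype(uint32_t word, struct rtype_fields *fields)
{
    assert(fields != NULL);
    fields->opcode      = (word >> 26) & 0x3F;   /* 6 bits */
    fields->source1     = (word >> 21) & 0x1F;   /* 5 bits */
    fields->source2     = (word >> 16) & 0x1F;   /* 5 bits */
    fields->destination = (word >> 11) & 0x1F;   /* 5 bits */
    fields->unused      = (word >>  6) & 0x1F;   /* 5 bits */
    fields->opcode2     =  word        & 0x3F;   /* 6 bits */
}
```

Decoding 0x00431820, which is the hexadecimal form of the instruction above, yields opcode 0, source registers 2 and 3, destination register 3, and a continued opcode of binary 100000.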


A CPU architecture is defined by its instruction set. For example, the Intel x86 family of architectures has a specific set of instructions, with variations that have evolved over time (8086, 8088, 80286, 80386, 80486, Pentium, Xeon, Core Duo, etc.). The x86 architectures have a completely different instruction set from the MIPS architecture, the ARM architecture, the PowerPC architecture, and so on.

The architecture is separate from the implementation. For example, Intel and AMD (Advanced Micro Devices) are two separate companies that make implementations of the x86 architectures. Some popular Intel implementations include the Core Duo, Xeon, Nehalem, etc. Some AMD x86 implementations include the Athlon, Duron, Sempron, etc. While the circuit diagrams of all these processors are very different from each other, they all implement the basic x86 instruction set.

The fact that machine language is specific to one architecture presents an obvious problem: Programs written for one architecture have to be completely rewritten in order to run on a different architecture, i.e., they are not portable. As a result, program development and maintenance are enormously time-consuming, and therefore expensive. The solution to this problem is discussed in Section 3.2.3, “High Level Languages (HLLs)”.

In addition to the lack of portability, machine language programs tend to be very long, since the machine instructions are quite primitive. Most machine instructions can only perform a single, simple operation such as adding two numbers. It may take a sequence of dozens of instructions to evaluate a simple polynomial.

3.2.2. Assembly Language

To program in machine language, one would have to memorize or look up binary codes for opcodes and operands in order to read or write the instructions. This process is far too tedious and error-prone to allow for productive (or enjoyable) programming.

One of the first things early programmers did to make the job easier was to create a mnemonic, or symbolic, form of machine language that is easier to read. For example, instead of writing

	    00000000010000110001100000100000
	    

a programmer could write

	    add     $3, $2, $3
	    

which is obviously much easier on the eyes and brain.

Assembly language also makes it possible for the programmer to use named variables instead of binary addresses, label program elements, and define macros. The assembler, which translates the assembly language to machine language, can also check the program for errors.

However, the CPU can't understand this mnemonic form, so it has to be translated, or assembled, into machine language before the computer can run it. Hence, it was given the name assembly language. The set of assembly language instructions is generally a close, but not exact, match to the machine language. When designing assembly languages, programmers often add some simple features to make programming a little easier than a 1-to-1 mapping to machine language would provide. For example, some machine instructions may be given more than one assembly instruction name to help make programs more readable, and some assembly instructions may actually translate to a sequence of machine instructions instead of just one. Assembly instructions that don't translate exactly to one machine instruction are known as pseudo-instructions.
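
To make the idea of assembling concrete, here is a minimal sketch in C of the work an assembler does for a single instruction: it packs the register numbers from "add $3, $2, $3" into the bit fields of a 32-bit machine instruction. The function name is hypothetical, and a real assembler also handles labels, many instruction formats, and error reporting.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble "add $rd, $rs, $rt" into a 32-bit MIPS machine instruction.
   rd is the destination register; rs and rt are the two sources. */
uint32_t assemble_add(unsigned rd, unsigned rs, unsigned rt)
{
    const uint32_t funct_add = 0x20;    /* binary 100000 in bits 5-0 */

    /* MIPS general-purpose registers are numbered 0 through 31 */
    assert(rd < 32 && rs < 32 && rt < 32);

    /* The opcode (bits 31-26) and unused field (bits 10-6) are zero */
    return ((uint32_t)rs << 21) | ((uint32_t)rt << 16) |
           ((uint32_t)rd << 11) | funct_add;
}
```

Calling assemble_add(3, 2, 3) returns 0x00431820, the same 32 bits as the binary instruction shown earlier.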

While assembly language is much easier to read and write than machine language, it still suffers from two major problems:

  • It is still specific to one architecture, i.e. it is not portable.
  • The instructions are still very primitive, so the programs are long and difficult to follow.

3.2.3. High Level Languages (HLLs)

Overview

Early programmers quickly realized that it should be feasible to automate the process of writing certain types of machine or assembly code. Programs often contain sequences of instructions that evaluate mathematical expressions, perform repetitive execution, make decisions, print out numbers and text, etc.

Wouldn't it be nice if we had a program that could take a mathematical expression and write the machine code to evaluate it for us? What if it could also write the necessary code to input and output numbers?

In the 1950s, a team at IBM led by John Backus set out to do just that, and their efforts produced the first major high-level language, FORTRAN. The program that performed the translation to machine language was named a compiler.

FORTRAN made it much easier to write programs, since we could now write a one-line algebraic expression and let the compiler convert it to a long sequence of machine instructions. We could write a single print statement instead of the hundreds of machine instructions that it represents.

In addition to making our programs much shorter and easier to understand, FORTRAN paved the way for another major benefit: portability. We could now write programs in FORTRAN, and by modifying the compiler to output code for different CPU architectures, we could run the same program on any computer without significant modification.

Program Execution

Languages are often classified according to how they are executed. There are two major categories, although many languages don't fit either category perfectly. These categories and several example languages are discussed below.

Compiled

A program written in a compiled language is first translated to machine language by another program called a compiler. The compiler parses each statement in the program and outputs an equivalent sequence of machine or assembly instructions. For example:

		    # Compiler input (high-level language source code)
		    y = a + b * c - d / a
		    
		    # Compiler output (assembly language)
		    mul     product1, b, c
		    div     quotient1, d, a
		    add     sum1, a, product1
		    sub     y, sum1, quotient1
		    

Parsing a statement involves first breaking it into tokens, which are small pieces such as variable names, keywords, and operators. The compiler then examines the sequence of tokens to figure out what the statement means.
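
The first stage of parsing can be sketched in a few lines of C. The function below is a simplified illustration, not taken from any real compiler: it treats each run of letters, digits, and underscores as one token and every other non-blank character as a one-character operator token. The names tokenize and MAX_TOKEN_LEN are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_TOKEN_LEN 32

/*
 * Break a statement into tokens.  Stores at most max_tokens tokens as
 * null-terminated strings and returns the number of tokens stored.
 * Names longer than MAX_TOKEN_LEN - 1 characters are split.
 */
size_t tokenize(const char *src, char tokens[][MAX_TOKEN_LEN],
                size_t max_tokens)
{
    size_t count = 0;

    assert(src != NULL && tokens != NULL);
    while (*src != '\0' && count < max_tokens) {
        size_t len = 0;

        if (isspace((unsigned char)*src)) {
            ++src;              /* White space only separates tokens */
            continue;
        }
        if (isalnum((unsigned char)*src) || *src == '_') {
            /* A name or number: take the whole run of characters */
            while ((isalnum((unsigned char)*src) || *src == '_')
                   && len < MAX_TOKEN_LEN - 1)
                tokens[count][len++] = *src++;
        } else {
            /* Anything else is a one-character operator */
            tokens[count][len++] = *src++;
        }
        tokens[count][len] = '\0';
        ++count;
    }
    return count;
}
```

Given the statement "y = a + b * c - d / a", this function produces 11 tokens: six variable names and five operators. Deciding what the sequence means, such as applying the multiplication before the addition, is the job of the later stages of the parser.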

The machine language is stored in another file called an executable, or binary file. The executable file can be loaded into memory and executed directly by the CPU of any computer using the same architecture and operating system as the one used to compile the program.

Most Unix commands are actually the names of executable files stored in a "bin" directory such as /bin or /usr/bin. (The directory name "bin" is short for binary.) In DOS and Windows, the names of executable files usually have a ".exe" or ".com" extension.

Compile-time and run-time

Interpreted

A program written in an interpreted language is never translated to machine language. Instead, the program is interpreted and executed by another program called an interpreter.

Like a compiler, an interpreter parses each statement in the source program, but instead of translating it to machine language, it simply executes the statement as soon as it has determined the meaning.

Comparison of Compiled and Interpreted Languages

Since a compiled program is translated in advance, compiled programs have the following advantages:

  • The executables run as fast as possible, since they are executed natively (directly) by the CPU.

    All of the expensive parsing of the source code is done before the program begins executing, and need only be done once. After that, the machine language produced by the compiler runs with no further need for the source code.

  • There is no need to have a compiler on every computer that runs the executable. The compiler need only be present on the developer's computer, and the executables produced can be run on any computer with the same architecture and operating system.

The interpreter performs much of the same work as a compiler. For each statement in a program, a compiler and interpreter both parse the code.

At this point, the compiler and the interpreter diverge. Once the meaning of a statement is determined, the compiler outputs equivalent machine language to another file to be executed after compilation is finished. The interpreter, on the other hand, does not translate to machine language, but immediately executes the statement.
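
The divergence can be illustrated with a toy example in C. Both functions below share the same parsing step for statements of the form "2 + 3". The interpreter then performs the operation immediately, while the compiler writes an equivalent instruction as text to be executed later. This is only a sketch under simplifying assumptions: the function names, the one-operator statement form, and the assembly-like output syntax are all invented for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a statement of the form "<integer> <operator> <integer>".
   Returns 1 on success, 0 if the statement is malformed. */
static int parse(const char *stmt, int *left, char *op, int *right)
{
    return sscanf(stmt, "%d %c %d", left, op, right) == 3;
}

/* Interpreter: parse, then carry out the operation immediately.
   Returns 1 and stores the value in *result on success, 0 on failure. */
int interpret(const char *stmt, int *result)
{
    int left, right;
    char op;

    assert(stmt != NULL && result != NULL);
    if (!parse(stmt, &left, &op, &right))
        return 0;
    switch (op) {
    case '+': *result = left + right; return 1;
    case '-': *result = left - right; return 1;
    case '*': *result = left * right; return 1;
    default:  return 0;     /* Unsupported operator */
    }
}

/* Compiler: parse, then write an equivalent instruction for later use.
   Returns 1 on success, 0 on failure. */
int compile(const char *stmt, char *out, size_t out_size)
{
    int left, right;
    char op;
    const char *mnemonic;

    assert(stmt != NULL && out != NULL);
    if (!parse(stmt, &left, &op, &right))
        return 0;
    switch (op) {
    case '+': mnemonic = "add"; break;
    case '-': mnemonic = "sub"; break;
    case '*': mnemonic = "mul"; break;
    default:  return 0;     /* Unsupported operator */
    }
    snprintf(out, out_size, "%s     result, %d, %d", mnemonic, left, right);
    return 1;
}
```

For the statement "2 + 3", interpret() stores 5 in its result, while compile() produces the text "add     result, 2, 3" and computes nothing. The parsing cost is the same in both cases; the difference is whether it is paid while the program is running.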

Programs written in interpreted languages run far slower than equivalent programs written in compiled languages. This is due to the fact that an interpreted program is being parsed while it is executing, whereas the compiler does all the parsing before execution begins. The sequence of events in an interpreted program is:

		    | parse | execute | parse | execute | ...
		    

For a compiled program, the sequence is:

		    | parse | translate | parse | translate | ... | execute |
		    

Parsing a statement is a complex process that often takes much longer than executing the statement, so more time is actually spent interpreting the program than doing the work it's meant for. Many statements in the interpreter must be executed in order to execute one statement in the interpreted program it is running.

Another disadvantage of interpreted languages is that the interpreter must be installed on every computer that runs the program. Compilers need only be installed on the machine where the program is compiled. Since the executable contains machine language, it will run on any computer with the same architecture and a binary-compatible operating system. Some x86-based Unix operating systems can run each other's executables, even though they have a slightly different format. Some can also run certain Windows executables directly, provided the WINE compatibility layer is installed.

The main advantage that is often cited for interpreted programs is that you don't have to wait for them to compile before you can test them. This advantage is vastly overrated, however, since compilation generally doesn't take very long. Modern computers can compile many thousands of lines of code per second. In addition, a well-designed software project need only recompile a small portion of the program in order to generate a new executable after minor changes. Hence, the startup time for a compiled program isn't much different from that of an interpreted program.

Interpreters do have another advantage, however. An interpreter can act as an "overseer", and perform complex housekeeping and debugging operations that can be difficult to insert into a compiled executable. For this reason, there have historically been some C development suites that provided both a C compiler and a C interpreter for debugging.

Also, execution speed often doesn't matter much. So-called "scripting languages", which are used to write short programs to automate execution of other programs, don't need to be fast, since they are short, and most of the work is actually done by the other programs they execute.

Table 3.2, “Selection Sort of 50,000 Integers” summarizes the difference in execution speed between several languages.

Table 3.2. Selection Sort of 50,000 Integers

Language          Execution method       Time (seconds)
GNU C             Compiled               4.01
GNU C++           Compiled               4.07
GNU Fortran       Compiled               5.23
Java+JIT          Mixed                  6.14
Matlab            Interpreted            44.39
Java without JIT  Byte-code Interpreted  64.74
Perl              Interpreted            589
C-shell           Interpreted            178,500 (extrapolated)

Execution times can vary significantly with different compilers and different algorithms. The sample of times above should only be viewed as a rough estimate.
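
The source code used to produce these timings is not shown here. For reference, a selection sort in C might look like the following sketch, which is an assumption about the general shape of such a benchmark and not the actual program that was timed. Selection sort performs n(n-1)/2 comparisons for a list of n elements, so sorting 50,000 integers requires about 1.25 billion comparisons, which is why differences in per-statement overhead show up so clearly.

```c
#include <assert.h>
#include <stddef.h>

/* Sort list[0] through list[count-1] into ascending order */
void selection_sort(int list[], size_t count)
{
    assert(list != NULL || count == 0);
    for (size_t start = 0; start + 1 < count; ++start) {
        /* Find the smallest element in the unsorted portion */
        size_t min = start;
        for (size_t c = start + 1; c < count; ++c)
            if (list[c] < list[min])
                min = c;

        /* Swap it into the first unsorted position */
        int temp = list[start];
        list[start] = list[min];
        list[min] = temp;
    }
}
```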

Generally speaking, if execution speed matters, use a compiled language. If it doesn't, then use the most convenient language for the job.

Most interpreted languages have built-in routines that can perform common tasks efficiently. For example, the Perl interpreter is written in C, and has a built-in sort function that is also written in C and compiled into the Perl package. Hence, a good Perl programmer would not implement a sort function in Perl. Matlab has many built-in functions for processing matrices. These functions are written in compiled languages, mostly C and Fortran, and hence run at near optimal speed.

However, no language can provide more than a small fraction of all the functions needed by all users, so if you're coding in an interpreted language, you will eventually need to implement algorithms in it, or find a way to incorporate code written in a compiled language. The latter is complicated, and you may find it easier to implement the entire project in a compiled language. Furthermore, most programmers never become fully aware of the built-in routines available in a language, and end up implementing equivalent functionality that is far less efficient.

Choosing the wrong language for a large project can be extremely costly.

If 1000 people use the software, and each of them wastes 10 minutes a day waiting for the software unnecessarily, then 1000 * 10 minutes = 10,000 minutes, or about 166.7 man-hours, are wasted each day because of poor software performance. If the average user is paid $25 per hour including benefits, then 166.7 * $25, or roughly $4,167, is lost each day.
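
The same estimate can be written as a small C function, which makes it easy to try other numbers. This is only a sketch; the function and parameter names are chosen for this illustration.

```c
#include <assert.h>

/* Estimated cost per day of the time users spend waiting on software */
double daily_waiting_cost(int users, double minutes_per_user,
                          double hourly_rate)
{
    double hours_wasted;

    assert(users >= 0 && minutes_per_user >= 0.0 && hourly_rate >= 0.0);

    /* Convert total minutes wasted per day to hours */
    hours_wasted = users * minutes_per_user / 60.0;
    return hours_wasted * hourly_rate;
}
```

With 1000 users, 10 minutes per user, and $25 per hour, the function returns about 4166.67, i.e. 10,000 minutes is about 166.7 hours, and 166.7 hours at $25 is a little under $4,170 per day.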

If you find that performance is unacceptable, you may end up rewriting the entire program in a compiled language.

This happens frequently with Matlab programs, which are often rewritten in C, C++, or Fortran. Most such programs are small, so this is usually more of a nuisance than a catastrophe. Nevertheless, it causes big delays in product release or research publication. Matlab is much more than a programming language, and provides a wealth of useful tools for engineering and scientific research. However, the interpreted language included with the Matlab system is orders of magnitude slower than a compiled language such as Fortran.

RPM (the Red Hat Package Manager, now called RPM Package Manager) is a prime example. This is a large and sophisticated software installation tool that was originally written in Perl, well-developed and tested through several major versions, and then completely rewritten in C to improve performance.

Rewriting software is a colossal waste of precious man-hours that could be used to develop new code. There is already (and probably always will be) a severe shortage of good programmers, so it is important to ensure that their valuable time is well spent.

The rewrite of RPM was done by volunteers donating their time to an open source project. Rewriting software while getting paid raises concerns of competency and ethics as well. Should programmers be paid to rewrite software if they made a poor choice of language the first time around? Managers must trust computer professionals to make technical decisions that they themselves are not qualified for. It is therefore the responsibility of the computer professional to become knowledgeable (on their own) before making important decisions for the company.

Languages that Don't Completely Fit Either Model

The Java language doesn't quite fit the definition of either compiled or interpreted. Although Java programs must be "compiled" to a .class file before they can be executed, the .class file does not contain actual machine language. Instead, it contains Java byte code, which is interpreted by the Java Virtual Machine (JVM). The Java byte code is a sort of pseudo machine language, which can be interpreted very efficiently. This is why Java, even without the just-in-time (JIT) compiler, is much faster than most other interpreted languages.

Without the JIT compiler, Java falls cleanly into the category of an interpreted language. With the JIT compiler enabled, strange and complicated things happen when a Java program is executed. The first time each element of a Java program is executed, the JIT compiler converts the Java byte code to the native machine language of the machine running the JVM. This causes the execution to go even slower, since the JVM is now parsing, executing, and translating the code before moving on to the next element. However, the next time the same element is executed (assuming it's inside a loop), the native machine code is executed, so it runs at roughly the same speed that it would had it been written in a compiled language. Note that the JIT compiler never outputs a machine code executable. The compilation is performed while the program is being interpreted, and the resulting machine code is kept in memory.

Many other interpreted languages are actually crunched to a simpler binary format that is more efficient to interpret, before execution begins. Strings such as "for", "while", and "switch" in a Perl program are reduced to binary integers, which can be identified with a single compare instruction, whereas identifying the original string form requires a loop to compare all the characters. This reduces run time significantly, but as the table above demonstrates, it still does not come close to a compiled language for execution speed.
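
The following C sketch illustrates the general idea. It is not Perl's actual internal format; the enum values and function names are invented for this example. The string comparisons happen once, when the program is crunched, and every later test of the keyword is a single integer comparison.

```c
#include <assert.h>
#include <string.h>

/* Integer codes standing in for keyword strings (values are arbitrary) */
enum keyword { KW_NONE, KW_FOR, KW_WHILE, KW_SWITCH };

/* Done once, before execution begins: identify a word by comparing
   strings, which loops over the characters of each candidate. */
enum keyword keyword_code(const char *word)
{
    static const char *const names[] = { "for", "while", "switch" };
    static const enum keyword codes[] = { KW_FOR, KW_WHILE, KW_SWITCH };

    assert(word != NULL);
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); ++i)
        if (strcmp(word, names[i]) == 0)
            return codes[i];
    return KW_NONE;     /* Not a keyword */
}

/* Done many times during execution: one integer comparison */
int is_while(enum keyword code)
{
    return code == KW_WHILE;
}
```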

3.2.4. Open Standard vs. Proprietary

Open standard languages are usually preferable for the long term.

Open standard languages include C, C++, Fortran, Perl, PHP, Python, Ruby, and many more. There are usually multiple vendors as well as free open source implementations for compilers and interpreters, and they can be used on most common hardware (PowerPC, MIPS, x86, etc.) and operating systems (BSD, Linux, Mac, Windows, etc.).

Proprietary languages include Matlab, Labview, PIC Basic, etc. Proprietary languages are often attractive in the short term due to specific features that meet a particular need, but they have some major drawbacks.

They can only be purchased from one vendor. Support and pricing from that vendor will vary over time. While support and cost may be reasonable now, the company may decide in the future that the product is not profitable, and support for new versions of the operating system could be poor or non-existent. The company could go out of business, and the product could be completely discontinued.

Once you have invested many man-hours in developing software in a proprietary language, you are at the mercy of the vendor in order to keep the software current.

Proprietary languages can only be used on hardware and operating systems that the vendor deems profitable. Most vendors only support one or two platforms, and the quality of support decays with the popularity of the platform. E.g., many companies offer full support for Windows, limited support for Mac, minimal support for Linux, and no support for anything else. If you need to run some Windows-only software, and some Mac-only software, you will need to maintain both Windows and Mac installations.

3.2.5. Criteria for Selecting a Language

The computer science field is full of religious devotion to languages. Some programmers will insist that Python is better than Perl, C++ is better than Java, or vice versa.

When selecting a language, use objective criteria, and ignore hype from devotees. Many of the arguments you will hear are subjective, but there are objective criteria for language selection:

  • Is it compiled or interpreted?
  • What is the general execution speed?
  • Is it proprietary or open standard?
  • Is it portable across different hardware and operating systems, or will you be locked into using one platform in order to keep your code running?

3.2.6. Creeping Feature Syndrome

During the 1960s and 1970s, many new languages evolved, and the push was to add more and more capabilities. This led to languages such as Ada, PL/I, and COBOL, and eventually the notion of creeping feature syndrome, which is the proliferation of too many features that don't add significant value to the product. (We see this in cars today, which are full of gadgets that drive up the price and really aren't much help to the driver.)

Around 1970, Dennis Ritchie at Bell Labs developed the C language. The design philosophy was minimalist, aimed at avoiding creeping feature syndrome. As a result, C includes only features that could not be provided as subprograms. For example, C has no built-in I/O statements, and only minimal built-in support for strings. Simple operations like string comparison and string assignment are carried out by library functions which are not part of the C language, and in fact are written in C:

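	    /* See if name is "Bob" */
	    if ( strcmp(name, "Bob") == 0 )
	    {
	        /* Code to run when the names match goes here */
	    }
	    
	    /* Assign "Bob" to name (strlcpy() is a BSD extension, not ISO C) */
	    strlcpy(name, "Bob", NAME_MAX);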
	    

Ritchie noted that providing support for features like I/O and strings complicates the language grammar significantly, and provides only superficial advantage to the programmer. The ability to write

	    if ( name == "Bob" )
	    {
	    }
	    

instead of using strcmp() serves only to make the code a little prettier. It's impossible to create a language with all the features desired by every programmer, so ultimately it's more important to provide extensibility than intrinsic features. The ability to create libraries of subprograms is the common solution across most languages. Library functions can be fixed, enhanced, and replaced without the need to upgrade the compiler or interpreter.
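
To see that such library functions really can be written in C itself, here is a minimal sketch of a string comparison function that behaves like strcmp(). It is an illustration only, not the source of any particular C library, and real implementations are usually more heavily optimized. The name string_compare is made up to avoid clashing with the standard function.

```c
#include <assert.h>
#include <stddef.h>

/* Compare two strings the way strcmp() does: return zero if they are
   equal, a negative value if s1 sorts before s2, and a positive value
   if s1 sorts after s2. */
int string_compare(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);

    /* Advance while the characters match and s1 has not ended */
    while (*s1 != '\0' && *s1 == *s2) {
        ++s1;
        ++s2;
    }

    /* Characters are compared as unsigned char, as strcmp() does */
    return (unsigned char)*s1 - (unsigned char)*s2;
}
```

Because the function is ordinary C code in a library, it can be fixed or replaced with a faster version without any change to the compiler.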

Unfortunately, the lessons of the 1960s appear to have been lost on the next generation after Ritchie's, and we're now seeing another proliferation of feature-loaded languages being developed. Be skeptical about languages that promise to cut your development time in half. It may work out for small projects, but as the project grows, you will begin to discover which critical features are not there.

A high-level tool such as Matlab is not a replacement for general purpose programming languages. Most researchers will eventually need to do general software development, so it would be wise to maintain your general programming skills in a compiled language. For this very reason, Matlab has the ability to integrate C and Fortran code into Matlab programs, so that users can write efficient extensions adding features that Matlab doesn't provide. You may want to think about whether it's better to build a large project using a mixture of two languages, or keep it cleaner and simpler by using the general-purpose language for all of it.