Software: Thoughts on Reliability and Randomness

Overview

Software Reliability and Randomness are slippery concepts that may be conceptually easy to understand, but hard to pin down. As programmers, we can write the equivalent of ‘hello world’ in dozens of languages on hundreds of platforms and once the program is functioning – it is reliable. It will produce the same results every time it is executed. Yet systems built from thousands of modules and millions of lines of code function less consistently than our hello world programs – and are functionally less reliable.

As programmers we often look for a source of randomness in our programs, and it is hard to find. Fundamentally we see computers as deterministic systems without any inherent entropy (for our purposes – randomness). For lack of true random numbers we generate Pseudo Random Numbers (PRNs), which are not really random. They are used to drive simulations and to generate session keys for secure connections, and this lack of true randomness in computer-generated PRNs has been the source of numerous security vulnerabilities.
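
To make the point concrete, here is a minimal Python sketch (the seed value is arbitrary): two pseudo-random generators seeded identically produce identical output, which is exactly why a guessable seed can undermine a session key.

```python
import random

# Two pseudo-random generators seeded with the same value produce identical
# "random" output – the sequence is fully determined by the seed.
gen_a = random.Random(42)
gen_b = random.Random(42)

print([gen_a.random() for _ in range(5)])
print([gen_b.random() for _ in range(5)])   # identical to the line above
```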

In this post I am going to discuss how software can be “unreliable”, deterministic behavior, parallel systems / programming, how modern computer programs / systems can be non-deterministic (random), and how that is connected to software reliability.

Disclaimer

The topics of software reliability, deterministic behavior, and randomness in computers make up a field that is massively deep and complex. The discussions in this blog are high level and lightweight, and I make some broad generalizations and assertions that are mostly correct (if you don’t look too closely) – but hopefully still serve to illustrate the discussion.

I also apologize in advance for this incredibly dry and abstract post.

Software Reliability

Hardware reliability – more precisely, hardware failure – most often occurs when some device in a system breaks (the smoke comes out) and the system no longer functions as expected. Software failures do not involve broken hardware or devices. They are based on the concept that there is a semi-infinite number of paths (or states) through a complex software package, and the vast majority of them result in the software acting and functioning as expected. However, some paths through the code result in the software not functioning as expected. When this happens, the software and system are doing exactly what the code tells them to do – so from that perspective, there is no failure. From the user’s perspective, however, the software is not doing what is expected – which we interpret as a software failure, and which gives us a path to understanding the concept of software reliability.

Deterministic Operation

Deterministic operation in software means that a given program with a given set of inputs will function in exactly the same manner every time it is executed – without any unexpected behaviors. For the most part, this characteristic is what allows us to effectively write software. If we carry this further and look at software on simple (8/16-bit) microprocessors and microcontrollers, where the software we write runs exclusively on the device, operation is very deterministic.

In contrast – on a modern system, our software sits at a relatively high level on top of APIs (application programming interfaces), libraries, services, and a core operating system – and in most cases this is a multitasking / multi-threaded / multi-core environment. In the world of old-school 8/16-bit microprocessors and microcontrollers, none of these layers exist. When we program for that environment, our program is compiled down to machine code that runs exclusively on that device.

In this context, our program not only operates deterministically in how the software functions, but the timing and interactions external to the microprocessor are also deterministic. In the context of modern, complex computing systems, this is generally not the case. In any case, the very deterministic operation of software on a dedicated microprocessor makes it ideal for real-world interactions and embedded controllers. This is why this model is used for toasters, coffee pots, microwave ovens and other appliances. The system is closed – meaning its inputs are limited to known and well-defined sources, and its functions are fixed and static – and generally these systems are incredibly reliable. After all, how often is it necessary to update the firmware on an appliance?

If this were our model for the world of software and software reliability, we would be ignoring much of what has happened in the world of computing over the last decade or two. More importantly, we need to understand that this model is an endpoint, not the whole story, and to understand where we are today we need to look further.

Parallel Execution

One of the most pervasive trends in computing over the last decade (or so) is the transition from increasingly fast single-threaded systems to increasingly parallel systems. This parallelism is accomplished through multiple computing cores on a single device and through multiple hardware threads on a single core – both mechanisms that increase the amount of work a processor can produce by supporting concurrently running programs. A typical laptop today can have two to four cores and support two hardware threads per core, resulting in up to eight relatively independent processes running at the same time. Servers with 16 to 64 cores, which would have qualified as (small) supercomputers a decade ago, are now available off the shelf.
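
For a rough feel of this on whatever machine you have handy, a short Python sketch (the worker function and workload size are arbitrary) reports the logical CPU count – cores times hardware threads – and fans identical work out across them:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def burn(n):
    # Purely CPU-bound busywork so each worker genuinely occupies a core.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    logical_cpus = os.cpu_count()   # cores x hardware threads per core
    print(f"logical CPUs: {logical_cpus}")

    # Fan the same workload out across all logical CPUs in parallel.
    with ProcessPoolExecutor(max_workers=logical_cpus) as pool:
        results = list(pool.map(burn, [2_000_000] * logical_cpus))
    print(f"completed {len(results)} parallel tasks")
```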

Parallel Programming: the Masochistic Way

Now – back in the early 80s as an intern at Cray, my supervisor spent one afternoon trying to teach me how Cray computers (at that time) were programmed in parallel. As some of the first parallel processing systems, and as systems where every cycle was expensive, much of the software was parallel programmed in assembly code. The process is exactly what you would imagine. There was a hardware scheduler that would transfer data to/from each processor to main memory every so many cycles. In between these transfers the processors would execute code. So if the system had four processors, you would write assembly code for each processor to execute some set of functions, time-synchronized every so many machine cycles, with NOPs (no operation) occasionally used to pad the time. NOPs were considered bad practice, since cycles were precious and not to be wasted on a NOP. At the time, it was more than I wanted to take on, and I was shuffled back to hardware troubleshooting.

Over time I internalized this experience and learned something about scalability. It was easy to imagine somebody getting very good at writing two (maybe even three or four) dissimilar, time-synchronous parallel programs. Additionally, since many programs rely on very similar parallel functions, it was also easy to imagine somebody getting good at writing programs that do the same thing across a large number of parallel processors. However, it is much harder to imagine somebody getting very good at writing dissimilar, time-synchronous parallel programs effectively over a large number of parallel processors. This is in addition to the lack of scalability inherent in assembly language.

Parallel Programming – High Level Languages

Of course, in the 80s and even the 90s, most computer programmers did not need to be concerned with parallel programming; most desktop operating systems were single threaded, and the argument of the day was cooperative versus preemptive multitasking. Much like the RISC vs. CISC argument of the prior decade, these issues were rendered irrelevant by the pace of processor hardware improvements. Now many of us walk around with the equivalent of that Cray supercomputer in our pockets.

In any case, the issue of parallel programming was resolved in two parts. The first was the idea of a multitasking operating system with a scheduler – the core function that controls which programs are running (and how long they run) in parallel at any one time. The second was the development of multi-threaded programming in higher level languages (without the time synchronization of the early Crays).

Breaking Random

Finally getting back to my original point… The result today is that every modern operating system has a privileged block of code – the kernel – running continuously, along with a number of other privileged services that run the OS, including the memory manager and the task scheduler.

The key to this whole story is that these privileged processes manage access to shared resources on the computer. Of these, the task scheduler is the most interesting – mostly due to the arcane system attributes it uses to determine which processes have access to which core / thread on the processor. This is one of the most complex aspects of a multitasking / multi-core / multi-threaded (hardware) system. The attributes the scheduler looks at include affinity flags that processes use to indicate core preference, priority flags, resource conflicts and hardware interrupts.
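
As a small illustration (a Linux-specific Python sketch, not a description of any particular scheduler's internals), a process can inspect and adjust two of these attributes – its CPU affinity and its priority – for itself:

```python
import os

# A rough sketch (Linux-only): inspect two of the attributes the scheduler
# weighs for the current process – its CPU affinity and its priority.
pid = os.getpid()

affinity = os.sched_getaffinity(pid)             # set of cores we may run on
priority = os.getpriority(os.PRIO_PROCESS, pid)  # "nice" value

print(f"pid {pid}: may run on cores {sorted(affinity)}, priority {priority}")

# Ask to be pinned to core 0 – a preference the scheduler honors for placement,
# though it still decides *when* we run.
os.sched_setaffinity(pid, {0})
```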

The net result is that if we take any set of processes on a highly parallel system, there are some characteristics of that set which are sufficiently complex, and impacted by enough unknown external elements, that they are random – truly random. For example, suppose we create three separate processes that each generate a pseudo-random number set from a seed (a unique seed in each), and point all of them at a shared memory location, where the value is read as input and the output is written back. Since the task scheduler makes the order of execution of these three threads completely arbitrary, it is not possible to determine the resulting sequence deterministically – the result is something more random than a PRNG. A not so subtle (and critical) assumption is that the system has other tasks and processes it is managing, which directly impact the scheduler and introduce entropy into the system.
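
Here is a minimal Python sketch of that thought experiment (the seeds, iteration count, and fold operation are arbitrary choices): each thread is deterministic on its own, yet the combined result typically varies from run to run because the scheduler chooses the interleaving.

```python
import random
import threading

# Three threads, each with its own fixed seed, repeatedly read-modify-write a
# shared value. Each thread alone is fully deterministic; the interleaving the
# scheduler picks is not, so the final value typically differs between runs.
shared = {"value": 0}
lock = threading.Lock()

def worker(seed):
    prng = random.Random(seed)          # deterministic in isolation
    for _ in range(100_000):
        with lock:
            # A non-commutative fold, so the result depends on interleaving order.
            shared["value"] = (shared["value"] * 31 + prng.getrandbits(16)) % 2**32

threads = [threading.Thread(target=worker, args=(s,)) for s in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["value"])   # same seeds, yet the output varies between runs
```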

Before we go on, let's take a closer look at this. If some piece of software functions the same way (internally and externally) every time it executes, it is deterministic. If that same piece of software functions differently based on external factors unrelated to the software itself, that is non-deterministic. Since kernel level resource managers (memory, scheduler, etc.) respond to system-wide factors and to factors from each and every running process, certain environmental factors are, from the perspective of any one software package, non-deterministic (i.e. random). In addition to the scheduling and sequencing aspects identified above, memory allocations will also be granted or moved in a similar way.

Of course, this system level random behavior is only half the story. As software packages are built to take advantage of gigabytes of RAM and lots of parallel execution power, they are becoming functional aggregations of dozens (to hundreds) of independently functioning threads or processes, which introduces a new level of sequencing and interdependencies that depend on the task scheduler.

Bottom Line – Any sufficiently complex asynchronous and parallel system will have certain non-deterministic characteristics, based on the number of independent sources that influence access to and use of shared system resources. Layer on the complexity of parallel high level programming, and certain aspects of program operation become very non-deterministic.

Back to Software Reliability

Yes, we have shown that both multitasked parallel hardware and parallel programmed software contribute some non-deterministic behavior in operation, but we also know that, for the most part, software is relatively reliable. Some software is better and some is worse, so there is clearly some other set of factors in play.

The simple and not very useful answer is “better coding” or “code quality”. A slightly more insightful answer is that code that depends on or uses some non-deterministic feature of the system is probably going to be less reliable. An obvious example is timing loops. Back in the days of single threaded programs on single threaded platforms, programmers would introduce relatively stable timing delays with empty timing loops. This practice was easy, popular and produced fairly consistent timing – deterministic behavior. As system hardware and software have evolved, the assumptions these coding practices rely on have become less and less valid. Try writing a timing loop program on a modern platform and the results can be workable much of the time, but they can also vary by orders of magnitude – in a very non-deterministic manner. There are dozens of programming practices like this that used to work just fine but no longer do – they don’t completely break, they just operate a little bit randomly. In many cases, the behavior is close enough to “correct” that the program appears to function, but not very reliably.
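
A minimal Python sketch of such a timing loop (the iteration count is arbitrary) shows the effect – run it a few times, ideally while the machine is doing other work, and the spread becomes obvious:

```python
import time

# An old-style "timing loop": burn a fixed number of iterations and hope the
# elapsed wall-clock time is stable. On a modern multitasking system the same
# loop can take noticeably different amounts of time from run to run.
def busy_wait(iterations=5_000_000):
    start = time.perf_counter()
    count = 0
    while count < iterations:
        count += 1
    return time.perf_counter() - start

samples = [busy_wait() for _ in range(5)]
print([f"{s:.3f}s" for s in samples])   # identical loops, varying timings
```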

Another coding practice that used to work on single threaded systems was to call some function and expect the result to be available on the next line of code. It worked on single threaded systems because execution was handed off to that function and did not return until it was complete. Fast forward to today, and if this is written as a parallel program, the expected data may not be there when your code thinks it should be. There is a lesson here – high level parallel programming languages make writing parallel code fairly easy, but that does not mean that writing robust parallel programs is easy. Parallel interdependency issues can be just as ugly as parallel assembly code on a Cray system.
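
A minimal Python sketch of this pitfall (the worker function and its delay are stand-ins for real work): the result is read "on the next line" before the parallel worker has produced it.

```python
import threading
import time

# The caller assumes the result is ready "on the next line", but the worker
# runs in parallel and may not have produced it yet.
result = {}

def fetch():
    time.sleep(0.01)            # stand-in for real work (I/O, computation, ...)
    result["value"] = 42

t = threading.Thread(target=fetch)
t.start()
print(result.get("value"))      # often None – the data is not there yet
t.join()                        # only after an explicit join/wait...
print(result.get("value"))      # ...is the value reliably 42
```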

Summary

A single piece of code running exclusively on a dedicated processor is very deterministic, but parallel programmed software on a multitasking, parallel hardware system can be very non-deterministic – and difficult to test. Much of software reliability comes down to how little a given software package depends on these non-deterministic features. Managing software reliability and failure mechanisms requires that programmers understand the system beyond the confines of their own program.

Android Systems Engineering: A Quick Look Under the Hood

Overview

Most of us have become comfortable with the understanding that Android is a very specialized and customized version of Linux. That level of customization, however, renders the OS sufficiently different that most general purpose Linux systems engineering expertise is of limited value. We also need to acknowledge that this simply means we need to get back to the foundations and explore / relearn from the bottom up (or top down – depending on your perspective).

If we take a quick survey of the multitudes of Android devices available, it is clear that there are both similarities and significant differences. Some of these are cosmetic and some are not, and one way to understand which is which is by dissecting the construction of the OS and identifying where these modifications were introduced. From the bottom up we have:

  • The Android Kernel: The Android kernel is the engine at the bottom of the Android software stack that is responsible for everything that happens after the bootloader hands off control.
  • Core Android Services: When you hear about functions named Binder or AshMem, these are two of the core services responsible for managing memory, communication and process launching. Like the kernel, these services are generally a key part of everything that happens in Android.
  • System Services/APIs: These provide more specialized functions and programmatic interfaces to the Android platform. This is the layer that applications start to touch.
  • Proprietary Device Drivers: A set of binary drivers that enable the very general interfaces in the OS to talk to specialized hardware devices. These are not open source.
  • Dalvik Virtual Machine (VM): A specialized Java-like virtual machine that isolates (or virtualizes) the APIs so that applications can be device independent.
  • AOSP User Interface/Apps: The standard user interface and core applications provided as part of the Android Open Source Project. This is a very small set of applications.
  • Google Applications: A larger set of Google specific applications, which include Gmail, Google Maps, Google Play Market, Google Talk, Google Voice, and Chrome.
  • Device Vendor Theming / Applications: A look and feel overlay on the user interface (also known as ‘skinning’) combined with a set of vendor specific applications. Nexus devices are not “skinned”.
  • Telecom Vendor Applications: A set of applications specific to a telecom carrier, which often duplicate functionality found elsewhere in the stack. Nexus devices do not have telecom vendor apps.

From an Android Systems Engineering viewpoint, a few of these merit further discussion. Specifically we will expand on the Android Kernel, AOSP, and the proprietary drivers.

Android Kernel

The Android kernel is based on the Linux kernel, but carries a number of Android-specific patches that improve performance and functionality on a mobile platform. These patches are generally very specific to how Android manages memory, tasks, interrupts and timers, providing better performance and battery life than a more traditional Linux kernel. There has also been an evolving effort to reduce the amount of OS code running as a privileged user (i.e. root), which has been relatively successful and has driven some kernel architectural changes. This is why Android privilege escalation attacks are so uncommon.

AOSP: Android Open Source Project

The Android Open Source Project, or AOSP, is the open source part of the Android system. The homepage for AOSP is at http://source.android.com/. AOSP includes the Android Kernel, a number of key OS services, the Android API functions and the Dalvik Virtual Machine.

Another way to look at this is that AOSP is the open source foundation from kernel to User Interface – excepting the proprietary apps, drivers, and ‘skinning’. To clarify – skinning is a process that layers a user interface theme on top of the standard Android look and feel. This is generally done by product vendors to create a brand style or appearance. Examples include Motorola MotoBlur, HTC Sense and Samsung TouchWiz.

The most important thing to know about AOSP is that it is open source (distribution, but not development) and it is capable of generating a fully functional Android Operating System without any special or purchased tools.

Android Proprietary Drivers

Although the Android OS is fairly generic across platforms, it is still necessary to have device drivers and a hardware abstraction layer (HAL) that map a generic API interface to the device. Unlike the PC Linux movement, there is not a significant effort to develop open source drivers for Android hardware devices, and for the most part these are proprietary pieces of code that get installed into the OS image.

If your device is a Nexus device, drivers can be found at https://developers.google.com/android/nexus/drivers. For other Android devices the process is more complicated, and drivers are generally extracted from the factory device firmware.

Beyond Skinning

As the Android marketplace evolves (I hesitate to call it maturing), device vendors are increasingly working to develop product differentiation that would enable them to carve off their own walled gardens in the Android world. Initially this was achieved by changing the look and feel of the User Interface (UI) through skinning – which is a fairly cosmetic process. The upside to skinning is that it is also a relatively safe process with minimal risk of introducing vulnerabilities.

Over the last few years this process has progressed, and Samsung in particular has introduced and applied broad sets of patches to the AOSP source code that add significant capabilities – such as Multi-Window on the Galaxy devices. These changes are fairly significant and end up touching many parts of the system, from the kernel to the UI, with increasing risk of bugs and vulnerabilities. Of course, the real question is whether this has actually happened.

Implications for Stability

My favorite device differentiator is stability, and the following statements make some relatively safe, but somewhat anecdotal, assertions about stability and vendor support on smartphones.

  • Apple iPhones are fairly stable, but introducing a major new OS version each year that is compatible with three generations of hardware may be having some impact. Just for fun, Google “iphone camera crashes” and see what you discover.
  • Google Nexus / Developer devices are more stable because of their minimalism. I have had a Nexus tablet for over a year and a Nexus phone for about 6 months. I have never had to power cycle either of them except for firmware patching / updates. The only app crash I have ever had is Facebook – and it happened once. Based on discussions with other Nexus owners, this is fairly typical.
  • Google Nexus / Developer devices generally receive OS updates within days of announcement. This is because Google delivers these updates directly and does not go through the device / telecom vendor channels (who have no business interest in maintaining a device that has already been sold and obsoleted out of the sales chain).
  • Non-Nexus Android devices range from slightly flaky to very crashy (and barely functional), require regular reboots, and if they are updated at all, they can be up to a year behind the current version. Vendors typically lock the bootloader and discourage any third party OS development (AOKP and/or CyanogenMod, for example).

I also understand that the plural of anecdote is not fact – but several of these assertions are supported by a recent academic paper with some interesting findings on code provenance and vulnerabilities. More simply – the paper confirms that some Android phones are significantly more vulnerable / buggy than others, and identifies who is to blame.

The vendors surveyed included Google (Nexus), LG, HTC, Samsung and Sony. I will let you read the paper, but there are a few tidbits worth sharing.

  • Google Nexus phones were used as the baseline since they have no third party software – and they did have the lowest number of vulnerabilities.
  • Sony’s customizations produced significantly fewer vulnerabilities than those of the other three vendors.
  • Of the LG, HTC and Samsung devices, between 65% and 85% of identified vulnerabilities came from their third party code and customizations.
  • The Samsung Galaxy S2 and S3 take the dubious honor of having the greatest number of vulnerabilities (see table 5 – the S3 had 40 vulnerabilities versus the Nexus 4 with 3).

Conclusions I personally drew from this paper include: a) whatever Sony is doing in their code development QA process, they should keep doing it; b) whatever Samsung is doing in their code development QA process, they should stop doing it and fix it; and c) this data supports some of my anecdotal assertions above and actually adds detail.

Anecdotes are interesting, and as storytellers we can relate to them more easily than to raw data. However, occasionally data can support the anecdotes, and then the combination becomes useful.

Bottom Line

For the aspiring Android Systems Engineer, this quick look under the hood is intended to provide a few reference points you can relate back to your Linux Systems Engineering understanding.

For the truly ambitious, the next step is to download the AOSP source and dig in…
