
POWER4 and shared memory synchronisation


by R. William Hay (billhay@us.ibm.com)
IBM senior architect, processor and compiler architecture
Gary R. Hook (ghook@us.ibm.com)
IBM senior technical consultant, Web Servers Solutions Integration
April 2002

The p690 and p670 servers are built on the POWER4 processor, which delivers unprecedented scalability and performance. To get the results you expect, you need to understand how the processor handles shared memory synchronisation. Applications developed for previous pSeries servers may have behaved as expected on those systems; but if they don't comply with the PowerPC architecture and lack the appropriate storage synchronisation instructions, they can produce unexpected results on POWER4 systems. The authors explain what's going on, what to look for in your applications, and what to do.

Introduction

The new POWER4 processor was first available in the IBM eServer p690, and now it's also in the p670 server. This next-generation product continues to draw on the PowerPC architecture while implementing decidedly cutting-edge technology in the processor subsystems.

Some behavioural characteristics of the POWER4 processor might not be familiar to you; storage synchronisation in particular deserves attention. You need to understand how it works so you can make sure your application programs include the appropriate storage synchronisation instructions and deliver the expected results.

To achieve high performance on multiprocessor systems, many applications use multi-threaded or multi-process code that communicates through storage. The memory model of pSeries systems is weakly consistent, which allows storage operations to be parallelized among the internal processing units, the cache, and main storage. The result is speed: throughput is optimized, and a pipeline "backup" for data on its way out of the chip is avoided. However, a consequence of all this is that the processor might perform memory accesses in a different order than the load and store instructions appear in the program. In fact, as processor and memory subsystem designs become more and more optimized, it becomes more and more likely that memory accesses will be performed in an order different from the one implied by the program sequence.

When memory accesses must be performed in the order specified by the program, you must make sure that any programs that share memory with other programs -- or with I/O devices using DMA operations to access memory -- include the appropriate synchronisation instructions. This paper describes the storage synchronisation mechanisms and programming considerations for creating applications for pSeries platforms. The good news is that very few applications written for AIX 5L will require any coding changes to behave properly.

This paper doesn't discuss memory-mapped I/O or other issues that apply solely to device drivers.

Start here

Here are some questions to consider when developing your application for the POWER4 processor:

Does your application depend upon serialized access to any data?
        If there's no need to serialize access to data among multiple, concurrent threads/processes of execution, then your application will behave correctly.

Does your application consist of multiple, cooperative processes that share data via a shared memory/semaphore mechanism? Is your application multi-threaded, wherein lightweight processes send and receive data amongst themselves?
        If no, then your application will behave correctly.

Does your application rely solely on AIX-provided locking mechanisms (such as the pthreads library)?
        If yes, then your application will behave correctly.
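The third question is where most applications land: if all shared-memory access goes through AIX-provided locking such as the pthreads library, the library calls already contain the required synchronisation instructions. A minimal sketch (hypothetical function names; standard POSIX calls) of two threads sharing a counter under a pthreads mutex:

```c
#include <pthread.h>

/* Shared state protected by a POSIX mutex; the lock and unlock calls
 * contain the required barrier instructions internally, so the
 * application itself needs no explicit sync/isync. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    int i;
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);    /* acquire side of the barrier */
        counter++;
        pthread_mutex_unlock(&counter_lock);  /* release side of the barrier */
    }
    return NULL;
}

/* Run two workers concurrently and return the final count. */
long run_counter_demo(void)
{
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Because the mutex serializes the increments, the result is deterministic on any PowerPC (or other) system, with no application-level barrier code.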

What to do next?

  1. Get an assessment by an application architect. There's no "signature" for code that might exhibit storage synchronisation anomalies. The architect's task is to consider the potential for code paths to make assumptions about the readiness of one datum to be read at the time another datum is manipulated; for example, a structure in memory might be updated before a lock is released to allow others to read that structure. Timing is crucial in this type of situation, and knowledge of the hardware is required in order to properly write code to perform these tasks.


  2. Test your application on a POWER4 system. Testing it on a prior generation system, such as a POWER3, isn't adequate, nor is a brief, high-level assessment.


  3. Study the following information to understand the issue in more detail and gain a working knowledge of the types of situations to consider, and then follow the steps necessary to modify the code.

PowerPC storage synchronisation

The PowerPC Instruction Set Architecture (ISA) describes how the processor instructions access and modify storage. The architecture describes a "weakly consistent storage" model. (It also includes visible caches whose behaviour you must take into account whenever instructions are modified.)

The processor is also permitted, with certain restrictions, to execute instructions in an order different from the program order.

Storage synchronisation is necessary in a uniprocessor system when generating or modifying instructions. We don't discuss generating or modifying instructions in this article, because POWER4 shows no new behaviour in this regard. Programs running on a uniprocessor system, or programs that don't share memory with other programs or threads, will produce the intended results without the need for storage synchronisation. This is a consequence of the architecture's requirement that all memory accesses performed by a given processor appear to be performed in program order with respect to that processor, and it holds on all processor implementations.

A multiprocessor pSeries system can be constructed so that the observed order might be different from the order in which the stores were executed. Here, "processor" means a PowerPC processor (CPU) or an I/O device that uses direct memory access (DMA) to move data between memory and a device. The stores can be observed as reordered for any of several reasons. For example, a cache can be organised as a number of banks, each of which can access memory, or make data available to a processor, independently of the other banks; this can lead to one bank responding sooner than another because it happens to have fewer stores directed to it.

The PowerPC processor architecture provides several instructions specifically to control the order in which stores perform their changes to memory and, thus, to control the order in which another processor observes the stores; to control the order in which instructions are executed; and for accessing shared storage. These instructions are:

sync 		Synchronise
lwsync		Lightweight Sync (a new instruction in POWER4)
eieio		Enforce In-Order Execution of I/O
lwarx		Load Word And Reserve
ldarx		Load Doubleword And Reserve
stwcx.		Store Word Conditional
stdcx.		Store Doubleword Conditional
isync		Instruction Synchronise

These instructions can be used to construct synchronisation points in the program so that the operations on shared memory behave in a well-defined fashion and produce consistent results on all PowerPC-based systems. The AIX operating system provides a collection of basic functions that use these instructions to perform important operations such as atomic updates of variables in shared memory (such as locks and semaphores).
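The lwarx/stwcx. pair is how those atomic updates are built. As a sketch (not the AIX routines themselves — AIX exposes this through functions such as fetch_and_add()), a GCC-compatible compiler builtin generates, on PowerPC, the same load-reserve/store-conditional retry loop:

```c
/* Atomic fetch-and-add in the style of the AIX fetch_and_add()
 * primitive.  On PowerPC the builtin below compiles to a
 * lwarx/stwcx. retry loop: load with reservation, add, then store
 * conditionally, repeating if another processor broke the reservation. */
static int shared_counter;

static int my_fetch_and_add(int *p, int delta)
{
    return __sync_fetch_and_add(p, delta);  /* lwarx/stwcx. loop on PowerPC */
}

/* Demonstration: the old value is returned and the new value is stored. */
int run_atomic_demo(void)
{
    shared_counter = 5;
    int old = my_fetch_and_add(&shared_counter, 3);
    return (old == 5 && shared_counter == 8);
}
```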

Most applications that run on pSeries systems already use the various system libraries to manage shared memory communication safely. Some examples are:

  • AIX pthreads library
  • ESSL SMP library
  • SMP runtime library provided with IBM compiler products

Nevertheless, some application code needs to ensure directly that proper synchronisation occurs. For example, one thread might need to set a shared flag variable to tell other threads that it has completed some part of its computation. The setting of this flag must not be observable by another processor until after all the computed data are observable by that processor; otherwise, some other thread (on another processor) could see the flag being set and access data locations that have not yet been updated. To prevent this, put a sync or lwsync between the data stores and the flag store.

When code appears to work correctly on earlier pSeries systems but fails on POWER4 systems, the most likely cause is missing, but required, sync instructions.

There are no tools available to help you identify where the sync instructions are missing. Neither the compiler nor the processor hardware can detect that a particular sequence of instructions "needs" synchronisation instructions to communicate properly with another processor. Only careful review of the application's data flow and inspection of the application (source) code can determine whether the required synchronisation instructions are present.

A brief review of the synchronisation instructions

The synchronisation instructions are described here only briefly. See the PowerPC architecture for the full and exact details.

sync Creates a memory barrier. On a given processor, any load or store instructions ahead of the sync instruction in the program sequence must complete their accesses to memory first, and then any load or store instructions after sync can begin.
lwsync Creates a memory barrier that provides the same ordering function as the sync instruction, except that a load caused by an instruction following the lwsync may be performed before a store caused by an instruction that precedes the lwsync, and the ordering does not apply to accesses to I/O memory (memory-mapped I/O).
eieio Creates a memory barrier that provides the same ordering function as the sync instruction except that ordering applies only to accesses to I/O memory.
isync Causes the processor to discard any prefetched (and possibly speculatively executed) instructions and refetch the next following instructions. It is used in locking code (e.g. __check_lock()) to ensure that no loads following entry into a critical section can access data (because of aggressive out-of-order and speculative execution in the processor) before the lock is acquired.

Note that lwsync is a new variant of the sync instruction and is interpreted by older processors as a sync. This instruction, as its name implies, has much less performance impact than sync, and is recommended for synchronisation of most memory (but not I/O) references.

Some examples of synchronisation

Example 1. Set a global flag to indicate that a block of (global) data has been stored.

        <compute and store data>
                :
        lwsync/sync
        <store flag>

This sequence shows the effective instruction order. You could put each of the three parts in a separate procedure or function. In fact, the lwsync or sync could be placed in a small function that's called whenever the corresponding memory barrier is needed.

Example 2. Wait for a flag to signal that a block of data has been stored

        Loop: load global flag
              has flag been set?
              No:  goto Loop
              <yes: continue>
              isync

              <use data from block>

Here, the isync prevents the processor from using stale data from the block. Without the isync, such use could occur if the test of the flag succeeds and the loads that access the computed data were executed speculatively before the load that tests the flag and early enough to get the stale data.

You could use a sync, but isync is less expensive in terms of effect upon performance for this example.
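Examples 1 and 2 can be sketched together in portable C. The article predates C11, so this is an analogue rather than the period mechanism: the release fence plays the role of the lwsync/sync in Example 1, and the acquire fence plays the role of the isync in Example 2.

```c
#include <stdatomic.h>
#include <pthread.h>

static int block[4];        /* ordinary (non-atomic) shared data        */
static atomic_int flag;     /* global flag guarding the block           */

/* Example 1: compute and store the data, then set the flag.  The
 * release fence is the analogue of the lwsync/sync barrier. */
static void *producer(void *arg)
{
    int i;
    for (i = 0; i < 4; i++)
        block[i] = i + 1;                        /* <compute and store data> */
    atomic_thread_fence(memory_order_release);   /* lwsync/sync              */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* <store flag>  */
    return NULL;
}

/* Example 2: spin until the flag is set, then use the data.  The
 * acquire fence is the analogue of the isync barrier. */
static int consume(void)
{
    int sum = 0, i;
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                        /* No: goto Loop            */
    atomic_thread_fence(memory_order_acquire);   /* isync                    */
    for (i = 0; i < 4; i++)
        sum += block[i];                         /* <use data from block>    */
    return sum;
}

/* Run the pair; the consumer must observe 1+2+3+4 = 10, never stale zeros. */
int run_flag_demo(void)
{
    pthread_t p;
    atomic_store(&flag, 0);
    pthread_create(&p, NULL, producer, NULL);
    int sum = consume();
    pthread_join(p, NULL);
    return sum;
}
```

Without the two fences, the consumer could see the flag set yet read stale values from the block — exactly the failure described above.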

A more complex example

Now consider a case in which a sync is required and an lwsync is insufficient. Starting with the following code fragments:

        struct {
                lock_t lock;
                int data;
        } cpu1, cpu2;

A long time ago:

		 cpu1.data = 1;
		 cpu2.data = 1;

On an MP system two of the CPUs are running comparable code. On the first processor (CPU1):

        lock(&cpu1.lock);
        cpu1.data = 0;
        unlock(&cpu1.lock);

        if (cpu2.data != 0)
                sleep();

On the second processor (CPU2) the corresponding activity occurs:

        lock(&cpu2.lock);
        cpu2.data = 0;
        unlock(&cpu2.lock);

        if (cpu1.data != 0)
                sleep();

In this example, the check of the "other" data from each processor is done without acquiring a lock. The code illustrates a race condition wherein it's expected that one thread will win and not go to sleep; the other thread (or threads) would detect that they need to go to sleep and wait their turn.

The lock and unlock functions are responsible for executing the appropriate synchronisation instructions, but here the selection of the required instruction for the unlock() function isn't clear without careful analysis. If the unlock() function is implemented with an lwsync, a deadlock can occur: the memory barrier created by the lwsync instruction doesn't order subsequent loads (in this case, the load of the other processor's data variable) with respect to preceding stores (here, the store of the processor's own data variable). Both threads of execution could go to sleep because each based its decision upon stale data and never saw the updated data fields. In this case, no one wins.

If the unlock() function is implemented with the sync instruction, however, the release of each lock will delay the access to cpuX.data until the store caused by unlock() has been performed with respect to other processors. Here, "performed" means that when another processor performs a load from that location it will return the value stored by unlock() and not a stale value that existed prior to that store. In this example, CPU1 would not attempt to examine cpu2.data until after both it (cpu2.data) and the lock (cpu2.lock) had been fully updated by CPU2 (presuming simultaneous execution of both threads). If sync is used, the storage accesses will occur in the intended order. In this case, it's never possible for both threads to go to sleep at the same time and deadlock can't occur.
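This store-then-load shape can be sketched in portable C11 terms (an analogue — the article's mechanism is the sync instruction itself). Only a sequentially consistent fence, the analogue of sync, forbids the "both sleep" outcome; a release fence, the analogue of lwsync, would not order each thread's store before its following load.

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_int flag_a, flag_b;
static int b_seen_by_a, a_seen_by_b;

/* Each side publishes its own flag, executes a full barrier, then reads
 * the other side's flag.  The seq_cst fence plays the role of the sync
 * in unlock(); with only a release fence (lwsync), both sides could
 * read a stale 0 and "go to sleep". */
static void *side_a(void *arg)
{
    atomic_store_explicit(&flag_a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* sync, not lwsync */
    b_seen_by_a = atomic_load_explicit(&flag_b, memory_order_relaxed);
    return NULL;
}

static void *side_b(void *arg)
{
    atomic_store_explicit(&flag_b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* sync, not lwsync */
    a_seen_by_b = atomic_load_explicit(&flag_a, memory_order_relaxed);
    return NULL;
}

/* With full barriers, at least one side must observe the other's store,
 * so the deadlock outcome is impossible: this always returns 1. */
int run_sync_demo(void)
{
    pthread_t a, b;
    atomic_store(&flag_a, 0);
    atomic_store(&flag_b, 0);
    pthread_create(&a, NULL, side_a, NULL);
    pthread_create(&b, NULL, side_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return (b_seen_by_a || a_seen_by_b);
}
```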

Generating synchronisation instructions

The current C and C++ compilers from IBM provide built-in functions to generate the sync and eieio instructions inline. New releases of these compilers are planned to provide a richer set of built-ins which will permit convenient generation of lwsync and isync (among several other instructions).

Assembler-coded functions can be written that contain just the required instruction and a return.

The compilers have a facility that permits generation of any instruction inline. This facility can be used to replace a call to an external assembler-coded function with the instruction(s) contained in the function, without changing the program semantics of the call. This is known as "mc_func" (machine-code function), and it requires specification of the instructions in hex.

A brief description of this facility is provided in the appendix. Examples of generating an lwsync and an isync instruction are given here.

Generate an lwsync

Define a function that, when called, executes a lwsync. Here, it's called "do_lwsync".

		void do_lwsync (void);	/* prototype */

		#pragma mc_func do_lwsync {"7c2004ac"}
		#pragma reg_killed_by do_lwsync

These lines should be placed before any function which refers to "do_lwsync".

Generate an isync

As above, except for the name and (hex) bits for the instruction:

		void do_isync (void);	/* prototype */

		#pragma mc_func do_isync {"4c00012c"}
		#pragma reg_killed_by do_isync
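On compilers without the mc_func pragma (an assumption — GCC-compatible toolchains rather than the IBM compilers described here), a compiler builtin gives a similar effect. The builtin emits a full memory barrier, which on PowerPC is a sync: stronger than lwsync or isync, but always sufficient.

```c
/* Portable stand-in for the do_lwsync()/do_isync() wrappers on
 * GCC-compatible compilers.  A full barrier is emitted, so this is a
 * conservative (stronger than strictly necessary) substitute. */
static void do_full_barrier(void)
{
    __sync_synchronize();               /* full barrier; sync on PowerPC */
}

/* Example 1 in miniature: store data, barrier, store flag. */
int barrier_demo(void)
{
    static volatile int data, flag;
    data = 42;
    do_full_barrier();                  /* in place of lwsync */
    flag = 1;
    return flag ? data : -1;
}
```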

Conclusion

Shared memory synchronisation is, we know, a complex topic. The good news is that most applications can use system-provided functions to accomplish all their required synchronisation. Once you've identified all the synchronisation points and generated the required code, your application will run successfully on any PowerPC system.

Appendix - Generating machine instructions inline

The IBM C and C++ compilers provide a mechanism for generating machine instructions inline. This is done with a pair of pragmas which provide the necessary information to the compiler.

The intended use of this mechanism is to replace a call to an external assembler-coded function that contains a single instruction, or a small number of instructions, with the desired instructions in place. The function call is still apparent to the compiler, and the semantics of the embedded code are the same as an ordinary call to an external function.

The embedded instruction sequence is specified as a series of 32-bit strings, in hex notation.

The embedded instruction(s) must adhere to the standard register conventions. If the embedded instructions need arguments, like an ordinary function, the compiler supplies them as for any call, according to the linkage conventions (GPR3, GPR4, etc.). Results returned from the instructions must be placed in the usual result register (GPR3). Details of the linkage conventions are documented in the User's Guides for the compilers.

As a performance improvement, the reg_killed_by pragma can be used to inform the compiler that some (or all) of the normally "killed" registers are actually preserved by the embedded code.

Specifying the embedded instruction(s):

The mc_func pragma tells the compiler which instructions to embed in the compiled code, and for calls to which function:

		#pragma mc_func fcn-name {hex string}

For example, suppose an assembler function to execute a lwsync instruction has been written and is called "do_lwsync". To embed a lwsync instruction instead of calling the function, you would use the following:

		#pragma mc_func do_lwsync {"7c2004ac"}

This pragma must precede any function containing calls to "do_lwsync".

Specifying registers killed:

The reg_killed_by pragma specifies which of the machine registers that are normally considered "volatile" (killed) by the linkage conventions are actually altered. This permits the compiler to retain values in these registers.

If this pragma is used, then only the listed registers are considered killed. For example, the lwsync instruction uses no registers and alters none. Thus

		#pragma reg_killed_by do_lwsync

tells the compiler that the specified mc_func function alters no registers.

Registers actually killed are listed by name following the function name. The following names indicate registers that are ordinarily considered killed. All other registers are "non-volatile".

	GPRS:	gr0, gr3, ..., gr12
	FPRS:	fp0, fp1, ..., fp13
	CRs:	cr0, cr1, cr5, cr6, cr7
	Link:	lr

Thus

		#pragma reg_killed_by foo gr3, gr5-gr7

tells the compiler that mc_func function "foo" alters GPRs 3, 5, 6, 7 and no other registers.

References

Additional information is available on the new POWER4 systems and the PowerPC architecture. Note that some of the material in the referenced documents is not fully current; for example, the definition of sync has been modified (making sync create a memory barrier instead of ensuring that all preceding storage accesses have been globally performed) since the last two documents were published, and lwsync has been added.

POWER4 System Microarchitecture - White Paper
This paper describes what drove the design of the POWER4 microprocessor and takes a closer look at the components of the resultant systems from a microarchitecture perspective.

POWER4 System Microarchitecture
From the IBM Journal of Research and Development, five papers on the design of the IBM POWER4 Microprocessor and its use in the IBM eServer p690 "Regatta" system.

PowerPC Microprocessor Family: The Programming Environments
This book covers 64-bit implementations of the PowerPC architecture, system implementation issues, and synchronisation.

Programming Environments Manual For 32-Bit Implementations of the PowerPC Architecture
Focusing on the 604E PowerPC processor.

About the authors

Bill Hay is a senior technical staff member at IBM. He joined IBM in 1984 in Toronto, Canada, as a professional hire. He has worked in the compiler development group in Toronto since then, and has worked on the POWER and PowerPC architectures since 1986. He is a senior architect for the optimising compilers produced in Toronto and is currently completing a 5-year assignment in Austin, Texas, where he has been a member of the team which produced the POWER4 processor. You can contact him at billhay@us.ibm.com

Gary R. Hook is a senior technical consultant at IBM, providing application development, porting, and technical assistance to independent software vendors. Mr. Hook's professional experience focuses on Unix-based application development. Upon joining IBM in 1990, he worked with the AIX Technical Support center in Southlake, Texas, providing consulting and technical support services to customers, with an emphasis upon AIX application architecture. Residing in Austin since 1995, Mr. Hook recently transitioned from AIX Kernel Development, specializing in the AIX linker, loader, and general application development tools. You can contact him at ghook@us.ibm.com.
