S-114.240 Parallel Computing in Computational Science

Compositional C++

 

Based on the CC++ tutorial by

Paul A.G. Sivilotti and Peter A. Carlin

Teppo Miettinen

 

 

 

 

 

CONTENTS

CC++ Basics

Parallel threads

par

parfor

spawn

Pitfalls

Examples

Synchronisation

Example

Atomicity

Example

Processor objects

Processor object declaration

Global pointers

Data transfer functions

Example

Performance considerations

CC++ Library

Specifying sets of nodes

Channel communication

Summary

More information on the Internet

 

CC++ Basics

Compositional C++ is a strict superset of C++ developed at Caltech (California Institute of Technology). It was designed to alleviate the frustrations of parallel programming by extending C++ with the following six basic abstractions: threads, sync variables, atomic functions, processor objects, global pointers, and transfer functions.

The CC++ compiler is basically a source-to-source translator: it transforms CC++ code into C++ code containing embedded calls to the CC++ runtime library, which in turn is built on the Nexus runtime library and an operating-system-specific thread library.

CC++ targets task parallelism. Future work focuses on High Performance C++ (HPC++), a joint project between the CC++ group and the data-parallel pC++ project, whose goal is to offer both data and task parallelism.

The latest release of the CC++ compiler is version 0.4 (a beta release), but due to its simplicity it is already quite robust. The compiler is freely available as prebuilt binaries and as source code, but a successful installation requires a specific combination of host operating system, C++ compiler, and thread package. The currently supported platforms are:

Operating system: Solaris 2.3 and 2.4 
C++ compiler: sunCC SC3.0.1 (Sun's new native C++; recommended) 
C++ compiler: G++ 2.6.3 (and libg++-2.6.2.tar.gz) 
thread package: Solaris threads (bundled with Solaris 2.3) 

Operating system: AIX 3.2.5 (on IBM RS6000 and IBM SP2) 
C++ compiler: xlC (IBM AIX XL C++ Compiler/6000) 
C++ compiler: C Set++ (IBM's latest C++ compiler) 
thread package: IBM's DCE threads (available from IBM for about $100)

Operating system: SunOS 4.1.3 
C++ compiler: G++ 2.6.3 (and libg++-2.6.2.tar.gz) 
thread package: FSU's pthreads and threadsafe malloc

Operating system: HP/UX 9.05 
C++ compiler: HP-C++ 3.5 
thread package: HP's DCE threads (libdce.a)

 

Parallel threads

A thread of control is a single sequential flow of execution. The start of a thread of control is the root from which independent dynamic action within a system occurs; a given system may have several simultaneous threads of control, some of which may dynamically come into existence and then cease to exist. Systems executing across multiple CPUs allow for truly concurrent threads of control, whereas systems running on a single CPU can only achieve the illusion of concurrency through fair interleaving. On a shared-memory computer, parallel programs can therefore be constructed solely by using threads.

There are three mechanisms in CC++ by which parallel threads of control can be created: par, parfor, and spawn.

par

Par defines a block in which statements are executed in parallel by independent threads of control. All legal CC++ statements can be used inside a par block except variable declarations and statements that cause nonlocal changes in the flow of control, such as goto and break.

The order in which the statements inside a par block execute is not defined. Thread execution may be interleaved, possibly concurrent, and in some cases even sequential; the language does not define how often, or even how soon, instructions from a particular thread will be executed. CC++ does, however, guarantee that all threads eventually get a chance to execute. The execution of a parallel block finishes only after all the statements inside the block have finished executing. The end of a par block can thus be seen as a barrier that synchronises the program execution.

parfor

Parfor is a parallel loop construct that executes the iterations as separate threads, while the body of each iteration remains sequential. The loop is unrolled by assigning a constant value to the loop control variable within each iteration. The prerequisite for this is that the values of the loop variable must be obtainable without executing the body of any iteration.

A parfor statement completes only when all the iterations have completed, but as with the par statement, nothing can be said about the order of execution of the different threads. It is only guaranteed that when program execution proceeds past the parfor block, all the threads inside the parfor have finished.

spawn

Spawn creates a single, completely independent thread of control for a function that executes in parallel (or interleaved) with the thread that executed the spawn. A spawned function cannot return a value and is in no way synchronised with other threads of control. Thus the spawn construct offers no mechanism for determining the state of the spawned function; synchronisation must be programmed explicitly.

Pitfalls

As the order of execution of parallel threads of control is not defined, it should be clear that sharing or modifying data between threads is dangerous. CC++ offers ways to share data safely, using atomic functions and sync variables, but it does not enforce their use. It is therefore the programmer's duty to make sure that data sharing between threads is handled safely.

Examples

par {
   {  // this sequential block forms one thread of control
      ans1 = func(params1);
      result = func2(ans1);
   }
   par {  // the nested par runs its two statements in parallel with each other
      ans2 = func(params2);
      ans3 = func(params3);
   }
}



int A[N];
int B[N][N];
parfor (int i=0; i<N; i++) {
   A[i] = i;
   parfor (int j=0; j<N; j++)
      B[i][j] = j;
}



spawn function(A,B);





/* Here is an example of the dangers of 
   variable sharing. This creates 
   unpredictable results! */

int i=0;
par{
   i=10;
   for(int j=0;j<10;j++)
      i++;
} //i may be 10, 20 or any other value!!

 

Synchronisation

Simple synchronisation between parallel threads in CC++ is achieved with sync variables. They resemble constants in that they are initially in an undefined state and can be assigned only once; they are therefore called single-assignment variables. Single-assignment variables provide a means of synchronisation because of the following rule:

If a thread of control attempts to read a single-assignment variable that has not yet been defined, that thread suspends execution until that single-assignment variable has been defined.

This rule also opens the door to trouble, because other threads may attempt to read the single-assignment variable as soon as it has been declared. Therefore a sync variable should be assigned only after it is certain that all actions that need synchronisation have completed successfully. Sync variables can also create deadlocks, in which all threads are waiting for the others to assign their synchronisation variables. The compiler does not detect possible deadlock situations; avoiding them is the programmer's duty.

Sync variables are declared in the same way as normal C/C++ constants. For example, pointers and arrays can also be declared sync.
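As a minimal sketch (the variable names are illustrative), sync follows the same placement rules as the const qualifier:

sync int done;        // a single-assignment integer
sync int *p;          // a (modifiable) local pointer to a sync integer
int *sync q;          // a sync pointer: the pointer itself is single-assignment
sync int flags[4];    // an array of four sync integers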

Example

void some_function (sync int* b) {
   ... //Function executes
   *b = 1; //Sync variable is assigned
   ... //some_function may continue execution
}

int main()
{
   sync int sync_var;
   spawn some_function(&sync_var);
   ...  //main executes simultaneously with some_function()
   if (sync_var == 1) //spawning thread waits here until
   //some_function() has assigned some value to sync_var
   {;}
   ... //Now we know that some_function() has 
       //reached the _assignment_ of sync_var.
}

Atomicity

With parallel threads there is no way of knowing the order in which separate statements execute. Different threads may try to call the same service simultaneously, which can cause unpredictable results. Simple synchronisation can be achieved with sync variables, but CC++ also provides a more versatile way of managing concurrent service calls.

A class can have one or more of its member functions declared atomic. This declaration specifies that the actions of such a function will not be interleaved with the actions of any other atomic function of the same object. This can be seen as a queue, where the first call is fully executed while other calls to any atomic function of the same object wait for their turn.

Example

class value_store {
private:
   int x;
public:
   atomic void store (int i)
      { x = i; }
};

void f(void){
   class value_store vs;
   par { 
      vs.store(1);
      vs.store(2);
   }
   //Now either 1 or 2 is stored in vs
   //but both have been stored in vs for at least an instant.
}

Processor objects

A processor object is a collection of data and computation that defines a single address space. Processor objects can be thought of as virtual address spaces, because several of them may be located in the same physical address space.

Processor objects can be freely distributed among the available computing resources, thus separating the physical and logical parallelisation of the problem. If the computing resources change, it is not necessary to change the structure of the program; remapping the processor objects is enough. Of course a program can be made faster by modifying the code to fit the resources, but for less critical applications processor objects and virtual address spaces offer great flexibility.

Processor object declaration

A processor object type is declared by adding the global keyword to a class or structure declaration. The declaration specifies the interface to objects of that type. Normal C++ access rules also apply to processor objects: other processor objects can access only the public member functions and data, and only through a global pointer to that processor object. Private and protected members are accessible only from member functions of that processor object, or of objects derived from it.
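As an illustrative sketch (the type and member names are made up), a processor object type might be declared like this:

global class Worker {
private:
   int local_state;          // visible only inside this processor object
public:
   int compute(int input);   // callable from other processor objects
                             // through a global pointer
};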

Processor objects are allocated with the C++ new operator, which is given a placement argument of the implementation-defined type proc_t. A proc_t value specifies the host on which the processor object is to run. As an example,

{
   proc_t placement("nodename");
   probject *global probject_ptr= new (placement) probject(constructor-arguments);
}

creates a new processor object of type probject on host nodename.

Deallocation is done by applying the delete operator to the global pointer that points to the processor object. The actual implementation of processor object allocation can vary depending on the system configuration and is handled by the Nexus runtime library.
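For example, the processor object allocated above would be deallocated with

delete probject_ptr;   // destroys the processor object created above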

 

Global pointers

When computation is distributed over several address spaces, global pointers can be used to communicate data between processor objects. Global pointers are similar to local ("normal") pointers, but they can reference addresses in any processor object in the computation, whereas local pointers can only reference memory in the processor object they are defined in.

When a global pointer is dereferenced, the value it references is returned. If the global pointer references an object, member functions can be invoked through the pointer. In both cases, if the referenced data resides in another processor object, an implicit communication is performed to fetch the value or to call the function. This makes global pointers more expensive to use than local pointers; storing a global pointer also takes more memory.

Global pointers are declared using the keyword global in the pointer declaration, and they can reference basic types and user-defined structures. As an example,

int *global gip;    // declares gip as a global pointer to an integer
int * *global gpp;  // declares gpp as a global pointer to a local pointer to an integer
C *global gcp;      // declares gcp as a global pointer to an object of type C

Currently global pointers to functions are not supported.
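As a sketch of typical use (reusing the illustrative Worker type from the processor object section), a member function call through a global pointer turns into an implicit remote call:

proc_t placement("nodename");
Worker *global wp = new (placement) Worker();
int result = wp->compute(42);   // implicit communication: the call executes
                                // inside the remote processor object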

 

Data transfer functions

When a function with arguments is invoked through a global pointer, the arguments are copied and the function is invoked using these copies. Similarly, return values are copied back to the processor object that invoked the function.

While transferring the arguments is simple if they are basic types, it is more complex when they are user-defined structures, particularly if these contain local pointers. A local pointer can be transferred as a pointer value, but it has no meaning in any processor object other than the one it was created in.

To define how a type is transferred between processor objects, CC++ uses data transfer functions that can be defined separately for every data type. For example

CCVoid& operator<<(CCVoid&,const TYPE& obj_in);

defines how TYPE should be packed when it needs to be transferred to another processor object. Similarly, the function

CCVoid& operator>>(CCVoid&,TYPE& obj_out);

defines how TYPE should be unpacked when received. The program need not call these functions explicitly; they are invoked automatically whenever data of this type is transferred.

The compiler can usually generate a transfer function automatically if all data members are basic types, global pointers, or user-defined structures. It cannot generate correct transfer functions for types containing local pointers or arrays, and in such cases it produces a warning message. For user-defined data types it is always advisable to write the appropriate transfer functions, to be sure that the transfer is done correctly.

Example

class Vector {
   int length;
   double* elements;
   friend CCVoid& operator<<(CCVoid&,const Vector&);
   friend CCVoid& operator>>(CCVoid&,Vector&);
};

// Pack: send the length first, then the elements one by one.
CCVoid& operator<<(CCVoid& v,const Vector& input){
   v << input.length;
   for (int i=0; i<input.length; i++)
      v << input.elements[i];
   return v;
}

// Unpack: read the length, allocate local storage, then read the elements.
CCVoid& operator>>(CCVoid& v,Vector& output){
   v >> output.length;
   output.elements = new double[output.length];
   for (int i=0; i<output.length; i++)
      v >> output.elements[i];
   return v;
}

Performance considerations

CC++ uses remote procedure calls (RPCs) to interact across address space boundaries. In its most general form, an RPC specifies the data that is to be transferred and the remote operation that is to be performed with that data. Because RPCs are designed for flexibility and use standard communication channels (e.g. pipes, streams, or TCP sockets), an RPC is inherently slower than low-level messaging systems such as MPI. RPCs are also more complex, which creates more overhead due to the added marshalling at both ends, but on the other hand makes the interface simple and flexible.

In CC++, RPCs within a single physical address space are generally cheaper than remote calls, but the actual difference depends heavily on the system implementation. As a general rule that applies to virtually all parallel programming, the number of data transfers should be kept small by transferring more data in a single call.
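As an illustration of this rule (wp is the global Worker pointer sketched earlier; get_element, get_all, and n are hypothetical names), fetching a data set element by element causes one communication per element, whereas fetching it in one call causes only one:

// Fine-grained: one remote call, and one data transfer, per element
double total = 0.0;
for (int i = 0; i < n; i++)
   total += wp->get_element(i);

// Coarse-grained: a single remote call transfers the whole Vector at once
Vector v = wp->get_all();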

Some performance degradation is also caused by thread spawning and thread interleaving. If a CC++ program is intended to achieve the best possible performance, it should have only one thread of control per processor. Direct use of remote data should also be avoided because of the communication burden of global pointers; channel communication is more efficient.

Programs created with CC++ are clearly slower than those based on low-level message passing. In many cases, however, the simplicity of the RPC abstraction, along with software-engineering benefits such as modularity and program composition, outweighs the performance gap. In particular, the ability to run on a heterogeneous and changing set of resources can be a valuable asset.

 

CC++ Library

The CC++ library is a set of routines intended to make it easier to write CC++ programs. It is freely distributed as source code and can therefore be modified to meet the user's requirements. Below is a brief summary of some of the more interesting features included in the library.

Specifying sets of nodes

To simplify the management of sets of nodes, the CC++ library provides a locations class, which is essentially an array of proc_t structures that can be initialised from a file listing the available hosts. The proc_t objects in a locations object are accessed via the array indexing operator, which wraps around so that you can never index off the end of a locations object. The user can then modify the host file to define where the computation should be mapped, without modifying the actual program.
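A sketch of how this might be used (the host file name, the Worker type, and the exact constructor form are assumptions based on the description above):

locations nodes("hosts");               // read the available hosts from a file
Worker *global workers[8];              // illustrative: eight processor objects
for (int i = 0; i < 8; i++)
   workers[i] = new (nodes[i]) Worker();   // indexing wraps around the host list,
                                           // so i may exceed the number of hosts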

Simple load balancing can be achieved by mapping more processor objects onto faster computers and by specifying an appropriate number of nodes for SMPs.

Channel communication

A channel provides a point-to-point communication structure: there is a sending side and a receiving side. Data is placed into the channel at the sending side and can be removed, in order, at the receiving side.

The CC++ library defines a channel communications class that provides an efficient, typed, unidirectional, point-to-point communication facility between two endpoints. It supports both synchronous and asynchronous transfer.

The channel communications class is mainly a set of primitives that create a low-level communication channel, which is more efficient but also more difficult to use than normal RPCs.

 

Summary

Compositional C++ offers a simple interface for programming parallel computation structures. Although CC++ adds only a few basic extensions to C++, it preserves all of the powerful features of the underlying language. The result is a rich and flexible programming language that promotes the reuse of code.

The combination of threads and processor objects makes CC++ well suited for clusters of shared-memory computers and for heterogeneous environments. Allocating the computation to different resources is also easy and can be done even without code modifications.

 

More information on the Internet

CC++ home page globus.isi.edu/ccpp/

The Nexus home page www.globus.org/nexus

An introduction to C++ www.acm.org/crossroads/xrds1-1/ovp.html

C++ tutorial www.swcp.com/~dodrill/

Performance evaluation of MPMD communication http://www.supercomp.org/sc97/proceedings/TECH/CHANG/INDEX.HTM