Distributed Memory Architecture
Azhar Akmal Fahreza, Alwi Ramadhan

Over time, the demand for increased computational power ushered in the age of massively parallel systems.
In the 1990s, supercomputers with thousands of processors began to appear, and by the end of the 20th century massively parallel supercomputers with tens of thousands of "off-the-shelf" processors were the standard.
Supercomputers of the 21st century can use over 100,000 processors connected by fast interconnects. Since it is difficult to fit such a large number of CPU cores into a single machine, today's supercomputers all take the same approach to assembling a large system: take a collection of separate computers and link them with a fast network. In distributed-memory machines, the multiprocessor operating system constructs are physically distributed in order to offer efficient access to the global operating system functionalities.
Keywords: distributed memory, supercomputer, architecture.

One benchmark program uses part of the database from Warren and Pereira's Chat natural language query system. Atlas: this program is also related to the Chat system.
Map: this solves the problem of colouring a map with four colours, such that no two neighbours have the same colour. Cubes: this consists of stacking N coloured cubes in a column, so that no colour appears twice within any given side of the column. Some of these benchmark programs were evaluated through simulation by Kish Shen [17, 18, 21]. He found that, even under the assumption of no overheads, Chat and Houses have a low level of or-parallelism.
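To make concrete the kind of search these benchmarks perform, the following is a minimal sketch, not taken from the benchmark suite itself, of a backtracking four-colouring search of the sort Map solves; the region adjacency data is an invented toy example.

```python
COLOURS = ["red", "green", "blue", "yellow"]

NEIGHBOURS = {          # invented toy adjacency map, not the benchmark's data
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}

def colour_map(regions, assignment=None):
    """Depth-first search with backtracking: each colour choice for a region
    is an alternative branch of the search tree."""
    assignment = assignment or {}
    if len(assignment) == len(regions):
        return assignment
    region = next(r for r in regions if r not in assignment)
    for colour in COLOURS:
        if all(assignment.get(n) != colour for n in NEIGHBOURS[region]):
            result = colour_map(regions, {**assignment, region: colour})
            if result:
                return result
    return None  # dead end: backtrack to the previous choice

print(colour_map(list(NEIGHBOURS)))
```

Each alternative colour for a region is one branch of the or-tree, and it is exactly these alternatives that an or-parallel system such as Dorpp can explore in parallel.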
Atlas and Map have a medium amount of parallelism, while Cubes5, Cubes4 and Queens8 have higher amounts of parallelism. The comparison shows that SICStus is between 2 and 4 times faster. The overheads of parallelism are analysed in detail later in this section.
The performance of Dorpp with multiple workers is presented in Table 1, which gives the execution times in milliseconds for the benchmark programs, with speedups relative to the 1-worker case given in parentheses. The benchmarks are divided into three groups according to the speedup shown: group H (high speedup), group M (medium speedup) and group L (low speedup).

[Table 1: Dorpp execution times in msecs, and speedups, with 1, 2, 4, 8 and 16 workers, for each benchmark program and group.]

The programs in group H have rather large search spaces, and are therefore amenable to the execution of coarse-grained tasks (see the discussion on granularity in section 5).
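Since the numeric entries of Table 1 are not reproduced above, the short sketch below only illustrates how the speedup figures in parentheses are derived, namely the 1-worker time divided by the n-worker time; the timings used are invented placeholders, not Dorpp measurements.

```python
# Each speedup in Table 1 is the 1-worker time divided by the n-worker time.
# The timings below are invented placeholders for illustration only.
times_ms = {"ExampleBench": {1: 8000, 2: 4100, 4: 2150, 8: 1200, 16: 700}}

for program, by_workers in times_ms.items():
    base = by_workers[1]
    row = ", ".join(f"{w}w: {t} ms ({base / t:.1f}x)"
                    for w, t in sorted(by_workers.items()))
    print(f"{program}: {row}")
```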
A striking result here is the increase in speedup for Cubes5 relative to Cubes4, which is mainly due to the increase in granularity. Atlas and Map are in an intermediate performance group, group M, which shows relatively good speedups up to 8 workers. We shall now analyse the impact of each of these factors on the benchmarks. Our scheduling scheme was designed to take into account the existence of the EDS machine's remote-sector copying mechanism, by ensuring that workers search for new tasks that are closely related to their previous tasks.
Table 2 provides a breakdown of the types of memory references that were made during program execution. The column Local gives the percentage of references to local memory. The column Miss gives the percentage of remote references which resulted in a remote sector being copied.
The remaining references are made to locally cached copies of remote sectors and are thus referred to as Hits. The table shows very good results for locality of reference; the question now is why such good results are obtained. This makes the layout of the data structure, in terms of sectors, optimal.
Moreover, whenever any remote worker needs to access the data structure it just has to copy the data once. That is, it has to copy only a minimal number of sectors.
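As a rough illustration of how the Local, Hit and Miss percentages of Table 2 can be obtained from a reference trace, here is a small sketch: a reference is Local when it falls in the worker's own memory, a Hit when the remote sector is already cached locally, and a Miss when the sector must first be copied over the network. The sector size and the trace are invented for illustration and are not EDS parameters.

```python
SECTOR_SIZE = 1024  # bytes per sector (placeholder value, not the EDS figure)

def classify(trace, local_range, cache):
    """Classify each address as a local reference, a cache hit on a remote
    sector, or a miss that forces the sector to be copied and cached."""
    counts = {"local": 0, "hit": 0, "miss": 0}
    lo, hi = local_range
    for addr in trace:
        sector = addr // SECTOR_SIZE
        if lo <= addr < hi:
            counts["local"] += 1
        elif sector in cache:
            counts["hit"] += 1
        else:
            counts["miss"] += 1
            cache.add(sector)   # the sector is copied and cached locally
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}

# Invented trace: one local word, two references to the same remote sector,
# one more local word -> 50% Local, 25% Hit, 25% Miss.
print(classify([10, 900000, 900050, 2048], (0, 4096), set()))
```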
It was expected that programs of this type would build distributed data structures in parallel before they were copied to other processors. Hence 64 MB is the limit for local and remote cache memory. Pages are allocated on demand for either remote caching or local memory.
Table 3 summarizes the values observed for locality of reference. The Miss results reported in Table 2 are good up to 16 workers for the programs in groups H and M. However, for those in group L the miss rate is rather high, which indicates that there will be major communication overheads. Among these factors are the granularity of tasks, which should be high enough to compensate for the overhead of executing tasks remotely, and the number and size of network messages, which should be kept as low as possible to reduce the related execution overheads.
These benchmarks were chosen from each speedup group (high, medium and low speedup) as detailed earlier. The results show that the decrease in the granularity of tasks is correlated with the increase in the overheads due to communication.
The likely explanation is that workers run out of work more frequently and therefore not only have to search for work more often but also have to broadcast more work-load messages.
More messages, however, contribute to further reducing the average granularity. This behaviour is more noticeable with programs in group L, such as Chat, which explains the poor speedups attained when running it with increasing numbers of workers.
We found that, in order for programs to achieve good speedups, the average granularity should be greater than a certain minimum number of instructions. This is also due to the eagerness of workers in broadcasting work-load changes. Experiments have been made to control the rate of work-load propagation and to determine its impact on the performance of the benchmark programs; the details are given in a later section. We have further investigated the relationship between these performance parameters and the activities of Dorpp's scheduler.
For programs with lower-grain parallelism, such as Chat, the overheads due to scheduling increase steeply with the number of workers.
The two main sources of this overhead were the activities related to searching for work and the messages broadcasting the work-load of workers. Workers run out of work more frequently if the tasks they execute are small, hence increasing the number of work-load messages (other workers need to know that there is no more work in this worker's work-queue) and increasing the overheads due to work-search.
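The following is a simplified, single-threaded sketch of this behaviour; the class and method names are invented and no real message passing takes place, but it shows how an empty work-queue triggers both a low-work-load broadcast and a search for a victim to steal from, which is where the scheduling overhead accumulates.

```python
from collections import deque

class Worker:
    def __init__(self, name, tasks=()):
        self.name = name
        self.work_queue = deque(tasks)
        self.messages_sent = 0

    def broadcast(self, status, others):
        self.messages_sent += 1              # each broadcast is a network message
        for other in others:
            other.last_seen_load = (self.name, status)

def run(worker, others):
    """Execute local tasks; when the queue runs dry, broadcast a low
    work-load and steal a task from a busy worker, until no work is left."""
    steals = 0
    while True:
        if worker.work_queue:
            worker.work_queue.popleft()      # "execute" the task
            continue
        worker.broadcast("low-work-load", others)
        victim = next((w for w in others if w.work_queue), None)
        if victim is None:
            return steals                    # no work left anywhere
        worker.work_queue.append(victim.work_queue.pop())
        steals += 1

a = Worker("a", range(3))
b = Worker("b", range(10))
print("tasks stolen by a:", run(a, [b]))
```

The smaller the tasks, the more often the loop falls through to the broadcast-and-steal path, which is exactly the correlation between granularity and message overhead described above.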
These activities are clearly related to the increase in network messages and thread suspensions, as well as to the decrease in granularity, as was shown in Table 4. The progress of workers' activities during execution was also traced by recording the times at which each worker changed activity. The activities traced are: searching for work (level 1), broadcasting low-work-load messages (level 2), actively working (level 3) and broadcasting high-work-load messages (level 4).
They are depicted in the graphs by four sequences of points belonging to four horizontal lines, with the sequence for level 1 at the top and the sequence for level 4 at the bottom. A straight line at level 3 thus represents a period during which a worker is actively working.
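A minimal sketch of such tracing is given below; the level numbering follows the text, while the class name and the example sequence of activity changes are invented.

```python
import time

# Levels as described in the text: 1 = searching for work,
# 2 = broadcasting low-work-load, 3 = actively working,
# 4 = broadcasting high-work-load.
LEVELS = {"search": 1, "low-load-msg": 2, "working": 3, "high-load-msg": 4}

class ActivityTrace:
    def __init__(self):
        self.events = []                      # list of (ms since start, level)
        self.start = time.perf_counter()

    def switch(self, activity):
        elapsed_ms = (time.perf_counter() - self.start) * 1000.0
        self.events.append((elapsed_ms, LEVELS[activity]))

trace = ActivityTrace()
for activity in ["working", "low-load-msg", "search", "working", "high-load-msg"]:
    trace.switch(activity)
print(trace.events)   # can be plotted as four horizontal point sequences
```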
After the initial period, the overhead activity of Dorpp goes down which corresponds to a high amount of useful work being performed. Approaching the end of computation, fewer and smaller tasks are available to keep the workers busy. Therefore more scheduling and work-load broadcasting is done by the workers. Because this period is short, overall the performance is good.
With 16 workers (graphs (c) and (d)), the overheads have increased sharply because there are more workers competing for the same amount of work. Graph (d) shows that the granularity has decreased immensely, with workers having to search for work more frequently during execution.
We have also collected similar results and graphs for the Atlas and Chat benchmarks with 8 and 16 workers [19]. The increase in overheads when going from 8 to 16 workers is more noticeable on these programs, as one would expect. The idea behind this scheme is that, initially, the workers build up a reserve of local work until the upper threshold is reached. At this point the worker informs the other workers that its work load is high, hence allowing idle workers to steal work from its work-queue.
When the amount of work in the work-queue falls to the lower threshold, the worker informs the other workers that its work load is low, hence stopping them from stealing work from its work-queue. The selection of the correct threshold values is crucial to performance. For example, if U is set too high, work is exposed much more slowly and we may have a situation in which there is enough parallel work but workers remain idle, waiting for the work to be published.
The threshold L can be set to zero, corresponding to the situation in which workers only search for work when they become idle, that is, when there is no work left in their work-queue. These values represent the number of choice-point entries in the respective work-queue. However, counting the number of choice-points in the work-queue is only a crude measure of the amount of parallel work available. An alternative estimate is to count the number of parallel branches in the work-queue, as used, for example, in Muse [11].
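The sketch below illustrates this two-threshold scheme under invented names: the work-load of the queue is measured either by choice-point counting or by branch counting, a high-work-load message is broadcast once the measure exceeds the upper threshold U, and a low-work-load message is broadcast once it falls back to the lower threshold L.

```python
def work_load(work_queue, measure="choice-points"):
    """Crude measure: one unit per choice-point entry; finer measure: the
    number of untried branches of those entries (as in Muse)."""
    if measure == "choice-points":
        return len(work_queue)
    return sum(cp["untried_branches"] for cp in work_queue)

class LoadAnnouncer:
    def __init__(self, lower, upper, measure="choice-points"):
        self.L, self.U, self.measure = lower, upper, measure
        self.exposed = False   # have we told the others our load is high?

    def update(self, work_queue, broadcast):
        load = work_load(work_queue, self.measure)
        if not self.exposed and load > self.U:
            self.exposed = True
            broadcast("high-work-load")   # idle workers may now steal from us
        elif self.exposed and load <= self.L:
            self.exposed = False
            broadcast("low-work-load")    # stop others stealing from us

# Two choice-points with 3 and 2 untried branches; with L=0, U=3 and branch
# counting the load (5) exceeds U, so a high-work-load message is broadcast.
queue = [{"untried_branches": 3}, {"untried_branches": 2}]
announcer = LoadAnnouncer(lower=0, upper=3, measure="branches")
announcer.update(queue, broadcast=print)
```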
In general the results show an overall improvement for upper threshold values greater than one. The performance of the benchmark programs with poor speedups, such as Chat and Houses, and also of Map, improved notably; Chat in particular doubled its speedup when the parameters (0, 5) and branch counting were used.
These gains in performance are a consequence of restricting the exposure of parallel work, which in turn imposes some control on the eagerness of the workers in searching for work and in broadcasting work-load messages. By allowing a worker to build up a reserve of local work before broadcasting its work-load, the propagation of parallel work is slowed down. This turns out to be a good thing for these programs, as they have plenty of parallel tasks which are rather small in size. This contrasts with what happened when the parameters (0, 1) were used.
The granularity has increased enormously just by adjusting the thresholds. A promising area of research might be to automatically adjust the thresholds depending on characteristics of the execution, for example, a worker could measure its granularity and use this to tune the thresholds.
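One possible shape for such automatic tuning is sketched below; the target granularity and the step sizes are invented, not values taken from the experiments. The idea is simply to nudge the upper threshold up when recently executed tasks are small (hold work back longer) and down when they are large (expose work sooner).

```python
TARGET_GRANULARITY = 5000   # instructions per task (placeholder value)

def tune_upper_threshold(upper, recent_task_sizes, lo=1, hi=16):
    """Adjust the upper threshold U from the measured granularity of the
    worker's recent tasks; all constants here are illustrative only."""
    if not recent_task_sizes:
        return upper
    avg = sum(recent_task_sizes) / len(recent_task_sizes)
    if avg < TARGET_GRANULARITY:
        upper = min(hi, upper + 1)   # tasks too small: build a bigger reserve
    elif avg > 2 * TARGET_GRANULARITY:
        upper = max(lo, upper - 1)   # tasks large: publish work more eagerly
    return upper

print(tune_upper_threshold(5, [1200, 800, 950]))   # small tasks -> U becomes 6
```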
Comparing the two measures used for the work-load in the work-queue, choice-point counting and branch counting, branch counting does slightly better on most of the benchmarks. For benchmarks such as Atlas, with a rather fat and shallow or-tree (a very small number of nodes, but many alternative branches per node), choice-point counting is inappropriate, as it allows a situation in which there are idle workers while, at the same time, potentially parallel tasks are not exposed. The Atlas program with choice-point counting and upper-threshold values of 3 or 5 was executed sequentially by just one worker.
The worker never reached the point of broadcasting a high work-load to the other workers, so they never tried to steal work. With branch counting, on the other hand, the speedups were better and improved slightly with those threshold values. To substantiate this belief, we have designed and implemented the Dorpp or-parallel Prolog system and evaluated it on a parallel simulator for the EDS machine.
Under lower-grain parallelism, the speedups obtained are not so good and scheduler overheads become important. A very high locality of reference was achieved for this set of benchmarks. Performance could be much better on a machine where these overheads were lower. There are some optimizations that should be considered in order to further improve Dorpp's performance. One important optimization relates to the Prolog engine, which should incorporate the shallow backtracking optimization, as this helps to delay the creation of choice-points, thereby optimizing shallow failure and reducing the cost of backtracking.
This would certainly reduce the gap between the serial performance of Dorpp and that of other Prolog compilers. More experiments with scheduling should also be considered in order to tune the system and further reduce the scheduling overheads. In particular, experiments to improve the strategy of propagating the work load to other workers are needed, for example by incorporating the concept of processor neighborhoods.
Acknowledgements
The authors would like to thank John Sargeant and Nigel Paver for the contribution and support they gave to this work by writing and maintaining the EDS parallel simulator. This work was carried out while the authors were at Manchester University.
References
[1] K. Ali and R. Karlsson. The Muse or-parallel Prolog model and its performance.
K. Ali and R. Karlsson. Scheduling or-parallelism in Muse. In 8th International Conference on Logic Programming.
K. Ali, R. Karlsson, and S. Mudambi.
A. Calderwood and P. Szeredi.
Binding environments for parallel logic programs in non-shared memory multiprocessors.
V. Santos Costa, D. Warren, and R. Yang.
G. Gupta and B. Jayaraman. On criteria for or-parallel execution of logic programs. The MIT Press.
P. Istavrinos and L. Borrmann. A process and memory model for a distributed-memory machine.

The presentation layer is the topmost level of the application, the one that users access directly, such as a web page or an operating system GUI (graphical user interface). The primary function of this layer is to translate tasks and results into something the user can understand. The application tier coordinates the application, processes commands, makes logical decisions and evaluations, and performs calculations.
It also moves and processes data between the two surrounding layers. In the data tier, information is stored in and retrieved from the database or file system; the information is then passed back for processing and then back to the user. This tier includes the data persistence mechanisms (database servers, file shares, etc.). The three-tier approach gives better performance than a thin-client approach and is simpler to manage than a thick-client approach.
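A minimal sketch of these three tiers, with invented names and an in-memory dictionary standing in for the database, might look as follows.

```python
class DataTier:
    """Stands in for the database server / file share of the data tier."""
    def __init__(self):
        self._rows = {"alice": 42}

    def fetch(self, key):
        return self._rows.get(key)

class ApplicationTier:
    """Applies the (invented) business logic between the other two tiers."""
    def __init__(self, data):
        self.data = data

    def account_balance(self, user):
        raw = self.data.fetch(user)          # moves data between the tiers
        return 0 if raw is None else raw * 1.05

class PresentationTier:
    """Translates results into something the user can understand."""
    def __init__(self, app):
        self.app = app

    def render(self, user):
        return f"Balance for {user}: {self.app.account_balance(user):.2f}"

ui = PresentationTier(ApplicationTier(DataTier()))
print(ui.render("alice"))   # -> Balance for alice: 44.10
```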
The broker architectural style is a middleware architecture used in distributed computing to coordinate and enable communication between registered servers and clients. Here, object communication takes place through a middleware system called an object request broker (a software bus). The client and the server do not interact with each other directly; each has a direct connection to its proxy, which communicates with the mediator (the broker). A server provides services by registering and publishing its interfaces with the broker, and clients can request services from the broker statically or dynamically by look-up.
The broker is responsible for coordinating communication, such as forwarding and dispatching results and exceptions. It can be either an invocation-oriented service or a document- or message-oriented broker to which clients send a message. It is responsible for brokering service requests, locating a proper server, transmitting requests, and sending responses back to clients.
It provides APIs for clients to make requests, for servers to respond, for registering or unregistering server components, for transferring messages, and for locating servers. Stubs are generated at static compilation time and are then deployed to the client side, where they are used as proxies for the client.
The client-side proxy acts as a mediator between the client and the broker and provides additional transparency between them; to the client, a remote object appears like a local one. The proxy hides the IPC (inter-process communication) at the protocol level and performs marshaling of parameter values and un-marshaling of results from the server. The skeleton is generated by compiling the service interface and is then deployed to the server side, where it is used as a proxy for the server.
The server-side proxy encapsulates low-level, system-specific networking functions and provides high-level APIs that mediate between the server and the broker. It receives requests, unpacks them, unmarshals the method arguments, calls the suitable service, and marshals the result before sending it back to the client.
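Putting the pieces together, here is a minimal in-process sketch of the broker pattern as described: a registry-keeping broker, a client-side stub that marshals calls, and a server-side skeleton that unmarshals them and invokes the service. All names are invented, JSON stands in for the wire format, and no real networking is involved.

```python
import json

class Broker:
    def __init__(self):
        self.registry = {}                    # interface name -> skeleton

    def register(self, interface, skeleton):
        self.registry[interface] = skeleton

    def dispatch(self, request_bytes):
        request = json.loads(request_bytes)   # locate the proper server
        skeleton = self.registry[request["interface"]]
        return skeleton.handle(request_bytes)

class Skeleton:                               # server-side proxy
    def __init__(self, servant):
        self.servant = servant

    def handle(self, request_bytes):
        request = json.loads(request_bytes)   # unmarshal the arguments
        result = getattr(self.servant, request["method"])(*request["args"])
        return json.dumps({"result": result})  # marshal the result

class Stub:                                   # client-side proxy
    def __init__(self, broker, interface):
        self.broker, self.interface = broker, interface

    def call(self, method, *args):
        request = json.dumps({"interface": self.interface,
                              "method": method, "args": args})
        reply = self.broker.dispatch(request)  # the IPC is hidden from the client
        return json.loads(reply)["result"]

class WeatherService:                         # an invented example servant
    def forecast(self, city):
        return f"sunny in {city}"

broker = Broker()
broker.register("Weather", Skeleton(WeatherService()))
client = Stub(broker, "Weather")
print(client.call("forecast", "Manchester"))  # -> sunny in Manchester
```

To the client, the stub makes the remote object look local; to the server, the skeleton hides how the request arrived, which is precisely the transparency the broker style aims for.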