
HOTPAR '10 LUNCHTIME DISCUSSION SUMMARIES

Topics:

"Will major user applications ever actually make use of extensive parallelism, or is this stuff just for specialized applications in sciences?"

"Why hasn't MPI been replaced by a parallel programming language?"

"Will applications ever schedule themselves? Who says we need new schedulers anyway?"

"Will there ever be 'ample resources' that the OS (not the application) can afford to waste?"

"Will we ever be able to retrofit good solutions into our existing code, or is it just time to give up and rewrite everything?"

"When should parallelism be introduced into the curriculum, and how?"

"What feature would most improve the experience of debugging parallel programs?"

"Do we need to do anything differently to autotune for power/energy?"

"What new high-volume consumer applications will demand extreme parallelism?"

"Will major user applications ever actually make use of extensive parallelism, or is this stuff just for specialized applications in sciences?" (Dave Patterson, leader)

The answer on both days: Yes and Yes.

We assumed that someone would solve the problems of providing thousands of cores on a chip within a reasonable power/energy budget and of supplying sufficient memory bandwidth for those cores at a reasonable cost; those were not problems for our table.

The key was thinking about apps of the future vs. the apps of the past, as is done in benchmark suites like SPEC, PARSEC, and STAMP. (There are more interesting apps in the world than gcc, bzip, and perlbench.) The second talk at HotPar, "Opportunities and Challenges of Parallelizing Speech Recognition" by Adam Janin, set the tone by saying that speech could use all the resources hardware could ever make available. It was easy to believe that would be true for music and graphics as well. Hence, it seemed that media apps would use extensive parallelism.

The other way to split apps was in the cloud vs. on mobile clients. Clearly, cloud apps are already extensively parallel: search, social networking sites, sales, auctions, and so on already use tens of thousands of computers to provide services to millions of customers at a time. Clearly, cloud providers would welcome manycore hardware.

The iPhone currently offers 225,000 apps, with 5 billion downloads. Two popular categories are games and entertainment. It seems that these media-oriented apps need the rapid response time that could justify 1,000 cores, even if only occasionally, to sustain the user experience. For example, perhaps a game normally needs only 20 cores during less computationally intensive periods, but in particularly exciting situations it needs them all.

For mobile clients, the questions were what to do locally and what the latency/energy costs of communicating were, as well as how the app could still function in situations where there was limited connectivity. For example, we'd like Google Goggles, with augmented reality on the display labeling what it sees via the camera, to keep labeling quickly no matter what the connectivity, even if it didn't always do it equally well.

Finally, we thought that increased I/O devices for mobile clients could use up more processors. One way would be to employ more sensors: for example, lots of microphones might help speech and multiple cameras could help vision. Increased input would need more processing, as the mobile phone evolves into the Star Trek Tricorder. We would imagine that goggles or special sunglasses could increase the output demand, which could also use extensive parallelism, especially if lots of simple cores were more energy-efficient.

Comments:

from Alexandra (Sasha) Fedorova:
I've been reading about recent technology developed by neuroscientists, which enables teaching a blind person to see. The person has to wear a device with lots of cameras. Cameras convert images to some format which is then used to "transmit" visual signals to the brain through means other than a retina—for example, through the skin or the tongue. The incredibly plastic human brain then learns how to interpret these signals in the right way, channeling them to the visual cortex. It's really quite amazing! I imagine that having a parallel device that converts images will enable a higher resolution and more colourful vision. They also have similar devices for teaching deaf people how to hear. Although this won't generate as high a volume of sales as a popular first-person-shooter, just think about the impact on a particular individual. That's really mind-boggling!

from Andrew Brownsword:
You seem to be assuming that this could only replace eyesight. Assuming the technology could be made unobtrusive (e.g., woven into clothing or something) and non-visual spectrum (IR, UV, radar, etc.), then I'd be interested in having an extra pair of eyes on the back of my head. If cheap enough, sales could be very significant.

from Alexandra (Sasha) Fedorova:
Hmm, I'd be curious to see what this does to your brain ;) — processing visual information from so many sources. It'll probably be okay — I have a tremendous trust in the endless abilities of human brains.

"Why hasn't MPI been replaced by a parallel programming language?" (Geoff Lowney, discussion leader)

MPI provides "portable performance" across a wide range of platforms, from supercomputers to commodity clusters. It is also a general mechanism for expressing parallelism. There is no other portable, general solution. Note that HPC applications are not "embarrassingly parallel"; a simple netbatch is not sufficient. The distributed programming model must support communication efficiently.
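The message-passing model MPI embodies — independent processes that share nothing and communicate only by explicit sends and receives — can be sketched in a few lines. This is an illustrative stand-in using Python's multiprocessing module, not MPI itself (mpi4py would be the real analogue); the partial-sum workload is a made-up example.

```python
# Message-passing sketch: each "rank" computes a partial result privately
# and sends it to the coordinator over an explicit channel, as in MPI.
from multiprocessing import Process, Pipe

def worker(conn, rank):
    # Each rank owns its slice of the data; nothing is shared.
    partial = sum(range(rank * 100, (rank + 1) * 100))
    conn.send((rank, partial))   # explicit communication, MPI_Send-style
    conn.close()

def run(n_ranks=4):
    conns, procs = [], []
    for rank in range(n_ranks):
        parent, child = Pipe()
        p = Process(target=worker, args=(child, rank))
        p.start()
        conns.append(parent)
        procs.append(p)
    # Coordinator receives one message per rank, MPI_Recv-style.
    total = sum(conn.recv()[1] for conn in conns)
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(run())  # sum of 0..399
```

Because all communication is explicit, the same program structure maps onto shared-memory machines, clusters, and supercomputers alike — which is exactly the "portable performance" argument made above.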

Big SMP machines were used for supercomputing in the past, but they are too expensive and are now only used for problems that require large shared memory.

There have been a number of point solutions that do not cover the entire HPC space: PGAS languages, domain-specific languages on top of MPI (e.g., Cactus), Charm++ from Illinois.

There is a new high-level alternative being introduced. Microsoft purchased Interactive Supercomputing last year. Interactive Supercomputing had a product that would map a MATLAB program over a network. Microsoft is rumored to be doing the same for Excel. You will be able to write an HPC program in Excel and have it mapped over a distributed network.

The scientific community is more concerned about the increasing parallelism in an individual node, due to multi-core processors and GPUs. This requires the introduction of a new layer of parallelism in their programming model. In future supercomputers, most of the increase in parallelism will come from more parallelism within a node, not from more nodes in the network.

A particular concern is that "mainstream" solutions to multicore and GPU parallelism will not meet the needs of HPC. One example is that the supercomputer community would like a GPU to be a direct participant in the MPI network, not an attached processor where all communication must first be sent to the host and then the host relays it to the GPU.

Perhaps message passing will move from clusters to client computing. HTML5 uses message passing for parallelism. Problems with the memory model on today's multicore processors may make message passing more attractive.

"Will applications ever schedule themselves? Who says we need new schedulers anyway?" (Alexandra (Sasha) Fedorova, discussion leader)

We started the discussion by posing a controversial question: in the era of core abundance, we will be able to give each thread its own core, so the scheduler, which traditionally multiplexes cores among threads, will no longer be needed.

This argument was quickly overturned based on the following observations:

1. If we are successful in parallel computing research and if we use our imagination to design new exciting apps, we should have more threads than cores. That will likely be the case on mobile devices, which will have dozens but not hundreds of cores in the near future (10–15 years), and will certainly be the case in datacenters, where there are always more total threads than the number of cores on any particular server (and consolidating multiple apps on one machine is very attractive).

2. Even if we have systems with a larger number of cores than we can use, we may not be able to keep all cores on at the same time (that's Dave Patterson's observation). Then the job of the scheduler will be to decide which cores should be on and which cores should be off—a slightly different twist on the classical scheduling problem.

3. Even if you don't believe the first two arguments, here is the third one that's hard to disagree with. Given recent trends in hardware, a scheduler must be viewed as a RESOURCE MANAGER. That is, it must schedule resources other than just cores: memory, memory bandwidth, cache space, network, disk, power, temperature. The view that the scheduler only schedules CPUs is very outdated—it is an artifact of the old times when the CPU was the most "expensive" resource. Now things are different. Cores are abundant, but things like caches and memory bandwidth are scarce. So the job of the scheduler is to multiplex those resources effectively. So YES, WE STILL NEED SCHEDULERS, and OLD SCHEDULERS ARE BROKEN.

So, how do we define a next-generation scheduling problem? Before providing the definition, I must briefly summarize our discussion on the other topic: Will applications ever schedule themselves?

This turned out to be a very non-contentious issue, as we quickly agreed that applications already do schedule themselves. Almost every parallel programming environment that was presented at HotPar included some kind of task scheduler. Intel TBB has one, Cilk has one, OpenMP has one, commercial game engines have one. The list can go on. As of now, these self-scheduling runtimes typically assume that they "own" all of the machine's resources (for example, an OpenMP manual suggests defaulting the maximum thread count to the number of system CPUs). However, if multiple applications are running simultaneously then someone (hint: the OS scheduler) must multiplex system resources among these runtime environments.

So, given that application runtimes would want to do their own resource scheduling within some type of "fixed resource containers," the next-generation scheduling problem can be defined as: Effectively multiplex system resources (resources in a broad sense, not just cores) among application containers.

This of course raises the question: what does it mean to "effectively multiplex"? And this leads us to another summary point.

We concluded that systems these days are fundamentally flawed in how they communicate with applications and in how they answer their needs. The application shouldn't have to tell the system that it needs a certain amount of cache or memory bandwidth. Rather, it should be able to state its QoS requirements, that is, how fast it wants to run in terms of application-level metrics: frames per second, transactions per second, etc. Then the OS will periodically check with the application to see if it's running quickly enough, and if not, the OS should figure out which of the resources is the bottleneck and how to give the application more of that resource.
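The feedback loop described above can be sketched in a few lines. Everything here is a hypothetical toy — the container class, the resource names, and the assumption that throughput scales with the scarcest granted resource are illustrations, not a real OS interface.

```python
# Toy sketch of QoS-driven resource management: the app states a target
# (frames per second); the scheduler periodically checks it and, on a miss,
# grants more of whichever resource is currently the bottleneck.

class AppContainer:
    def __init__(self, target_fps):
        self.target_fps = target_fps
        # Hypothetical resource grants, in abstract units.
        self.grants = {"cores": 1, "cache_mb": 1, "mem_bw": 1}

    def measured_fps(self):
        # Toy model: throughput is limited by the scarcest resource.
        return 10 * min(self.grants.values())

def scheduler_tick(app):
    """One periodic check: if QoS is met, do nothing; otherwise find the
    bottleneck resource and grant the app more of it."""
    if app.measured_fps() >= app.target_fps:
        return
    bottleneck = min(app.grants, key=app.grants.get)
    app.grants[bottleneck] += 1

app = AppContainer(target_fps=30)
while app.measured_fps() < app.target_fps:
    scheduler_tick(app)
print(app.grants)  # all three resources raised until the target is met
```

The point of the sketch is the division of labor: the application never names a resource, only a target; diagnosing and fixing the bottleneck is entirely the scheduler's job.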

Of course, this is a very difficult problem: what if an application asks for more than it really needs? (Perhaps charging it for resource consumption can mitigate this problem.) Is there a need to perform admission control, to avoid oversubscription? How to balance application requests with other goals: fair sharing of resources among applications, staying within a certain power budget, and others?

All in all, there are lots of interesting questions for us to answer and much exciting scheduling research to do!

"Will there ever be 'ample resources' that the OS (not the application) can afford to waste?" (Michael Swift, discussion leader)

We addressed this question fairly quickly: it is likely that there will very often be many idle cores, but it is not clear that the OS can afford to waste them.

First, on a mobile system and in a data center, power is a primary concern, and "waste" implies that the work being done is not efficient.

Second, there will be some applications that can use all the cores, so the OS should not depend on them being available for critical tasks, such as security checks. Rather, they should be considered for best-effort tasks, such as cleaning or reorganization.

However, we felt it is feasible to dedicate cores to the OS. Just as current systems dedicate memory for the OS to use, it is reasonable, when there are many cores, to reserve some for the system, for example for handling I/O interrupts.

We discussed several other topics related to operating systems and manycore systems.

Virtualization: Recent manycore OS projects have not brought up virtualization. In fact, they are moving away from virtualizable systems by embedding more dependencies on the specific hardware (such as the cache sizes and communication latencies) into the OS. One opinion was that with a better OS, users won't need virtualization, so this is not a problem. A conflicting opinion was that there would always be value in having a layer below the operating system for things like dynamic partitioning, multi-tenancy, and migration.

Platforms: Most manycore OS research targets desktop machines. However, smartphone sales are approaching desktop sales, and it is possible that the majority of computer users will migrate to smartphones, thin clients, or tablets. It was also recognized that a good reason to investigate desktops is they run the most heterogeneous mix of workloads (batch, realtime, interactive) and are the easiest to work with, as they are open platforms with open source software. In contrast, smartphone and cloud computing platforms are much more opaque.

"Will we ever be able to retrofit good solutions into our existing code, or is it just time to give up and rewrite everything?" (Mendel Rosenblum, discussion leader)

The consensus answer to this question appeared to be, "It depends on the existing code." An analogy was made with making buildings environmentally efficient. Sometimes it is possible to retrofit an existing building and improve its energy usage, but sometimes it may be necessary to demolish and rebuild to get a big win.

The community is putting significant research effort into building tools that make retrofitting easier; to the extent that this research is successful, it will make the retrofitting solution viable for more applications.

Retrofitting existing code becomes a performance optimization exercise that is made difficult by the uncertainties inherent in writing parallel code. Traditionally, when doing performance optimization the developer attempts to estimate the benefits and costs of the optimization before proceeding. This is hard, because the state of the art in predicting the performance of parallel code is primitive. That a successful attempt at parallelizing a nontrivial program is a well-respected, publishable result is evidence of how poor our models are for the expected benefit of retrofitting parallelism.

Besides uncertainty of the benefit, there is uncertainty about the costs of the operation. In addition to the obvious software development costs for retrofitting parallelism into components of a complex system, there may be hard-to-quantify software engineering costs. For example, frequently parallelism requires changes to module interfaces and possibly new abstractions. The cost of these changes in maintaining the program over its lifetime can be difficult to estimate.

The uncertainties of both the costs and the potential benefits make retrofitting a risky endeavor that should be attempted only if not doing so would have more dire results.

"When should parallelism be introduced into the curriculum, and how?" (Michael McCool, discussion leader)

The key thoughts that arose from the discussion were:

Dependencies are a crucial core concept.

The curriculum should offer multiple tracks for different types of programmers, e.g., embedded, Web, enterprise.

Developing a good mental model is important, and the target is the "tool," which may be a deep systems stack.

The courses should be aimed at undergrads: second year, abstract; third, concrete; fourth, specialized.

Patterns are useful, but need to be grounded in concrete implementations such as Cilk's fork-join primitives.

Data locality is important, but we still need to teach parallelism first and treat data locality as an optimization (serial early, parallel later).

Data races and the like are really hard even for instructors to understand.

Computational parallelism needs to be clearly distinguished from systems parallelism.

It's important not to ghettoize parallelism as a senior-level study.

"What feature would most improve the experience of debugging parallel programs?" (David A. Padua, discussion leader)

Several classes of debugging tools were identified during the discussions:

1. The most basic tools discussed were breakpoint debuggers. Today's programmers know little about breakpoint debuggers specifically designed for parallel programs, although several such debuggers are already available (e.g., http://software.intel.com/en-us/articles/idb-linux/).

2. The ability to reproduce (or replay) the sequence of events leading to an error was considered an important feature to facilitate the debugging of parallel programs; in fact, the first discussion group was unanimous in selecting this feature as the most important of all. Deterministic replay is important even in the case of determinate programs, since the fact that a program is externally determinate—that it always produces the same outcome—does not mean that it is internally determinate: that is, it does not mean that the state of the program (values of all variables, the program counter, etc.) always changes in the same order.
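The external/internal distinction is easy to demonstrate. In this toy example (the program and its recording of update order are illustrative assumptions), several threads increment a shared counter under a lock: the final value is always the same (externally determinate), but the order in which threads perform the updates — the sequence of internal states — can differ from run to run, which is exactly what deterministic replay must capture.

```python
# Externally determinate, internally nondeterminate: the outcome is fixed,
# but the interleaving of updates is not.
import threading

def run_once(n_threads=4, increments=1000):
    counter = 0
    order = []  # records which thread performed each update
    lock = threading.Lock()

    def work(tid):
        nonlocal counter
        for _ in range(increments):
            with lock:
                counter += 1
                order.append(tid)

    threads = [threading.Thread(target=work, args=(t,))
               for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return counter, order

c1, o1 = run_once()
c2, o2 = run_once()
assert c1 == c2 == 4000   # external outcome is always the same
# o1 and o2 will usually differ: the internal state evolved in a
# different order, so reproducing a bug may require replaying o1 exactly.
```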

3. Tools to detect races were also mentioned, although not everyone thought these were of great importance. Enforcing determinacy at the language level was discussed as a way to avoid the need for race detection.

4. There will always be the need to develop algorithms and programs that are non-determinate. The first example that came to mind was Dekker's algorithm. For such programs there is no clear idea of what class of tools would facilitate debugging and reasoning about them.

5. Tools for debugging performance were also considered necessary, given the tremendous difficulty we have today in identifying the reasons for low performance and lack of scalability of parallel programs.

Comments:

from Andrew Brownsword:
I have one comment to add: thread-aware debuggers aren't enough if the programming model is task-based. When debugging in a task-based or other higher-than-thread-level system, the debugger needs to understand this system intrinsically. Being able to do such things as put breakpoints in a specific instance of a particular task and not have it break when other threads or other tasks hit it would be invaluable, particularly in data parallel situations where hundreds, thousands, or even millions of instances are running. Similarly, a "single slice" view of a particular instance's data is important. Intel's breakpoint debugger work is a step in that direction, but I don't think it has gone far enough yet.

"Do we need to do anything differently to autotune for power/energy?" (Richard (Rich) Vuduc, discussion leader)

High-level conclusion: Probably not much. However, since neither lunch had a "deep" embedded systems research contingent present, we felt we could not draw strong conclusions. We did identify and discuss at least two interesting issues, summarized below, that we felt were important in addressing the larger question about autotuning for power/energy.

Background: Central to our discussion is what we mean by "autotuners" and "autotuning." As a working definition, we said that autotuning is program optimization driven by automated experiments, i.e., generating and running code. This approach is distinct from traditional profile- and feedback-based compilation in two ways. First, we are not necessarily transforming a program artifact (source code). Instead, an autotuner might generate code from, say, a declarative specification or using a special parameterized generator. This distinction further implies that an autotuner could in principle tune more than just discrete/continuous optimization parameters and knobs; it might instead be selecting from among entirely different algorithms or data structures. The second distinction is that the autotuner directs the experiments. That is, it in principle decides what the next experiment (optimization and/or measurement) should be.
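A minimal sketch of this working definition — optimization driven by automated experiments over whole candidate implementations, not transformation of one source artifact — might look as follows. The candidate functions and the tuning harness are hypothetical illustrations; only the structure (candidates, experiments, metric) reflects the definition above.

```python
# Minimal autotuner sketch: run each candidate implementation as an
# experiment, score it with a metric, and select the best. Note the search
# is over *different algorithms*, not knobs on a single source file.
import timeit

def sum_loop(xs):            # candidate 1: explicit loop
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):         # candidate 2: C-implemented builtin
    return sum(xs)

CANDIDATES = {"loop": sum_loop, "builtin": sum_builtin}

def autotune(data, metric=None, repeats=5):
    """Return the name of the best candidate. `metric` defaults to time;
    for power/energy it would become f(time, energy) — see Issue 1."""
    metric = metric or (lambda t: t)
    best_name, best_score = None, float("inf")
    for name, fn in CANDIDATES.items():
        # The autotuner directs the experiment: generate a run, measure it.
        t = min(timeit.repeat(lambda: fn(data), number=10, repeat=repeats))
        if metric(t) < best_score:
            best_name, best_score = name, metric(t)
    return best_name

print(autotune(list(range(10000))))  # usually "builtin" on CPython
```

Swapping the default metric for a function of time and measured energy is the only change Issue 1 would require of this loop — provided energy can be measured as easily as time.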

The paper at HotPar '10 most relevant to our lunch topic is the work by Korthikanti and Agha of UIUC. (Korthikanti was present at the first lunch.) Their theoretical models predict, for a number of examples, that tuning the frequency of different parts of a parallel computation can reduce overall power/energy. However, on the practical side, Hong, Kim (GT), and Luk (Intel) have also shown experimentally for some class of computations that this kind of tuning might not matter much. (In particular, they observe that when minimizing, say, energy-delay, the optimal tuning configuration is not much different from that when minimizing time.) So, the potential gains in simultaneous performance and energy efficiency seemed to us unclear.

Issue 1. Measurement: Recall our working definition: autotuning is program optimization driven by automated experiments, i.e., generating code and measuring performance. As such, an autotuner optimizing with respect to some time/power-energy criterion needs to be able to measure power/energy with the same relative ease with which it can use hardware performance counters today.

Given a way to make such measurements, the user of an autotuner would simply replace "time" as the optimization metric with some other function of time and energy. (But see Issue 2, below.)

Note that such measurement mechanisms appear to be available in mobile devices, e.g., Android mobile OS's attribution of battery usage to applications and system components. However, it was not clear whether they were reliable measurements or merely very roughly inferred estimates based on indirect measurement. It was also unclear to us whether such mechanisms existed for desktop systems, though there may be several proposals to add them.

Issue 2. Are new algorithmic techniques needed? By our working definition, an autotuner considers a space of candidate implementations, which might include a suite of different algorithms. For example, an autotuner for sorting might be choosing from among a candidate set of asymptotically similar sorting algorithms, where different algorithms trade off time/storage/parallelism/locality in different ways for a given machine.

Other than changing the optimization metric from time to f(time, energy or power, ...), does the autotuner need to change in some fundamental way? Put another way, is there a fundamental difference between the design of algorithms with good time/space/accuracy properties and those with good time/power/energy properties?

Consider, for instance, that in a first course on parallel algorithms, we learn to design algorithms that simultaneously (a) are "work-optimal," i.e., do not asymptotically perform more operations than the best sequential algorithm; (b) have low "depth" (also, "span" or "critical path length") relative to the amount of work; and (c) avoid communication. Taken together, algorithms designed under these principles have a good shot at being scalable in theory and usable in practice.
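Criteria (a) and (b) can be made concrete with the standard divide-and-conquer reduction. The sketch below (an illustrative toy; the work/depth bookkeeping is ours, not from the discussion) counts work and depth analytically for a parallel-style sum: work is O(n), matching the sequential loop, and depth is O(log n) if the two halves run in parallel.

```python
# Work/depth accounting for a divide-and-conquer sum. We count analytically
# rather than timing: work adds across the two branches (both execute),
# while depth takes the longer branch (they run concurrently) plus the
# combine step.
def par_sum(xs):
    """Return (result, work, depth) for a parallel-style reduction."""
    if len(xs) == 1:
        return xs[0], 1, 1
    mid = len(xs) // 2
    left, wl, dl = par_sum(xs[:mid])
    right, wr, dr = par_sum(xs[mid:])
    return left + right, wl + wr + 1, max(dl, dr) + 1

result, work, depth = par_sum(list(range(1, 9)))
print(result, work, depth)  # 36, 15, 4  (n=8: work ~2n-1, depth ~log2(n)+1)
```

An energy-aware design question is then whether to shrink depth (more parallelism, more cores lit up) or shrink work (fewer total operations) — the trade-off the Korthikanti and Agha example below turns on.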

Given these criteria as algorithm design principles, do we need new principles to design algorithms that are efficient with respect to both time and energy? After all, the above principles lead to computations that tend to have high degrees of parallelism but, by work-optimality, little operational waste compared to the best sequential code. These seem like good properties for saving energy as well.

If the answer is no, we do not need new principles, then we also will not need to change autotuners or tuning methodologies, provided we can address the measurement issue (Issue 1, above). If the answer is yes, we will need both new algorithms and fundamental changes to the "space" of candidate implementations that an autotuner should consider.

With respect to time and communication, we know of course that it may be profitable to increase (redundant) computation in order to reduce communication. From an energy perspective, if computation is expensive in time it is probably also expensive in energy. In this sense, it would seem that the natural time/communication trade-off is the same if we want to save energy.

However, Korthikanti and Agha offer an intriguing counter-example in their HotPar '10 poster. They claim that, given a fixed time budget, a parallel quicksort with a sequential pivot step might use less energy than a fully parallel quicksort. It would be interesting to validate this kind of prediction experimentally.

"What new high-volume consumer applications will demand extreme parallelism?" (John Nickolls, discussion leader)

We brainstormed a wide-ranging list of possible consumer applications that would require extreme parallelism in a consumer device or in the cloud or both. Most apps involved real-time human interaction with interactive media such as video, images, graphics, audio, and speech. Many required fast search and access of large or distributed data sets, or fast simulations of large computational models, which we expected would require large-scale parallel computing in the cloud.

Parallel consumer application categories that seem promising are:

Media: games, video, images, vision, audio, speech, music. Real-time interactions need parallel performance. Parallel computation can make the application more valuable (better results) and easier to use.

Speech: recognition, translation, meeting notes, speaker identification, hands-free applications. Parallelism improves accuracy, vocabulary, noise tolerance, latency, and enables personal tuning.

Vision: identifying faces, objects, and places; image-based apps; security cameras; augmenting images with additional information; augmented reality, e.g., in glasses, microscopes, telescopes, sensors; fast medical imaging for moving objects; gesture recognition for games and user interfaces; smart cars that could follow the road, follow the route, receive collision warnings, and avoid collisions; robot vision; video conferencing and teaching.

Ambient AI: low latency, intelligent services, security, speech recognition, tagging photos, user interfaces, parallel search, query auto-completion, robots. Parallelism improves results.

Augmented Reality (AR): augmented vision, heads-up display, adding info to display (where am I? how do I get there?), automotive interfaces. Parallelism gives faster and better results.

Search: mobile parallel and server parallel/distributed cloud; filtering social media, collaboration filters, making recommendations, mapping photos to shopping info, price comparisons.

Last changed: 25 June 2010 jel