Data Locality on Manycore Architectures

Dissertation

This is the manuscript that I submitted for my Ph.D. defense. I successfully passed the viva on 18 July 2016.

Abstract

The continuous evolution of computer architectures has been an important driver of research in code optimization and compiler technologies. A trend in this evolution that can be traced back over decades is the growing ratio between the available computational power (IPS, FLOPS, ...) and the corresponding bandwidth between the various levels of the memory hierarchy (registers, cache, DRAM). As a result, reducing the amount of memory communication that a given code requires has been an important topic in compiler research. A basic principle for such optimizations is the improvement of temporal data locality: grouping all references to a single data point as close together as possible, so that the data is only needed for a short duration and can then be moved back to distant memory (DRAM) without causing any further memory communication.
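As a minimal illustration of this principle (a hypothetical C sketch, not code from the manuscript), fusing two loops that traverse the same array groups both references to each element together:

```c
#include <stddef.h>

#define N 1000000

/* Poor temporal locality: a[i] is touched in the first loop, eventually
 * evicted from cache, and fetched from DRAM again in the second loop. */
void two_passes(double *a, double *b, double s) {
    for (size_t i = 0; i < N; i++) a[i] += s;
    for (size_t i = 0; i < N; i++) b[i] = 2.0 * a[i];
}

/* Good temporal locality: both references to a[i] are grouped together,
 * so each element needs to be loaded from DRAM only once. */
void fused(double *a, double *b, double s) {
    for (size_t i = 0; i < N; i++) {
        a[i] += s;
        b[i] = 2.0 * a[i];
    }
}
```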

Another architectural evolution has been the advent of the multicore era and, in recent years, of the first generation of manycore designs. These architectures have considerably raised the amount of parallelism available to programs and algorithms, but this is again limited by the bandwidth available for communication between the cores. This brings issues that were previously the sole preoccupation of distributed computing into the world of compilation and code optimization.

In this document we present a first dive into a new optimization technique, which we refer to as generalized tiling, that promises both a high-level model for data reuse and a large field of potential applications. It has its roots in the well-known loop tiling technique, which has been applied with success to improve data locality for both registers and cache memory in the case of nested loops. This new "flavor" of tiling has a much broader scope and is not limited to nested loops. It is built on a new representation, the memory-use graph, which is tightly linked to a new model of both memory usage and communication requirements and which can be used for all forms of iterative code.
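For readers unfamiliar with classical loop tiling, the textbook example it generalizes is the tiled matrix product. The following C sketch (with an arbitrary tile size, not taken from the manuscript) reorders the iterations so that each block of the matrices is reused many times while it still resides in cache:

```c
#define N 1024
#define B 64  /* tile size, chosen so that a few BxB tiles fit in cache */

/* Classical loop tiling: iterate over BxB blocks so that each tile of
 * a, b and c stays in cache while it is being reused. */
void matmul_tiled(const double a[N][N], const double b[N][N], double c[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* Work on one tile: each element of b's tile is reused
                 * B times before it can be evicted. */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        for (int k = kk; k < kk + B; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```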

Generalized tiling expresses data locality as an optimization problem for which multiple solutions are proposed. With the abstraction introduced by the memory-use graph it is possible to solve this optimization problem in different environments. As experimental evaluation we show how this new technique can be applied in the context of loops, nested or not, as well as to programs expressed in a dataflow language. Anticipating the use of generalized tiling to distribute computations over the cores of a manycore architecture, we also provide some insight into methods that can be used to model communications and their characteristics on such architectures.

As a final point, and in order to show the full expressiveness of the memory-use graph and, even more, of the underlying memory usage and communication model, we turn to the topic of performance debugging and the analysis of execution traces. Our goal is to provide feedback on the evaluated code and its potential for further improvement of data locality. Such traces may contain information about memory communications during an execution and show strong similarities with the previously studied optimization problem. This brings us to a short introduction to the algorithmics of directed graphs and to the formulation of new heuristics for the well-studied topic of reachability and the much less well-known problem of convex partitioning.

Resources

  • The full manuscript can be found here.
  • Slides for the viva.

Volunteer Computing on the BOINC system

The BOINC system was developed in 2002 to help process and analyze the immense quantity of data produced by the SETI@home project. The main idea is to allow any person on earth with a computer and Internet access to contribute their unused processor time to the advancement of scientific computational projects. A BOINC project is essentially a server with a (very) large amount of data that is subdivided into equally sized computational tasks. A volunteer participating in the project then cycles through the following steps:

  1. Contact the server to ask for new tasks
  2. Do the given computations
  3. Contact the server to send back the obtained results
In this way, each task can be characterized by a small set of variables: data size, computational size, start date, and completion date.
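Expressed as a C structure, this task model is tiny (a sketch; the field names are my own and not part of the actual BOINC API):

```c
#include <stddef.h>
#include <time.h>

/* Minimal model of a BOINC task as characterized above.
 * Field names are illustrative, not the real BOINC data structures. */
struct task {
    size_t data_size;        /* bytes of input data shipped to the host */
    double flops;            /* estimated computational size            */
    time_t start_date;       /* when the host fetched the task          */
    time_t completion_date;  /* when the result was sent back           */
};
```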

Nowadays there are many different projects using the BOINC system and the diversity of use cases is growing rapidly. My research, within the scope of my final MSc internship, focuses on a specific use case of the BOINC system where a single server hosts multiple computational sub-projects. Such so-called "umbrella" projects have appeared over the last few years; a real-world example is the World Community Grid. As in many computing situations, the multiplicity of projects calls for a scheduling mechanism to share the available resources between the users (i.e., the sub-projects).

However, classical scheduling mechanisms exhibit some major deficiencies in the specific case of a BOINC server. The main characteristics responsible for these weaknesses are:

  • heterogeneous resources (speed, memory, network connectivity)
  • heterogeneous reliability (reliable grid-based hosts, unreliable private hosts, ...)
  • multi-objective optimization criteria (job throughput, completion date)

The new challenge here is to develop a fairness criterion to measure how efficiently the available resources are shared among the different sub-projects. This criterion should also take into account the fact that each sub-project may have a different optimization objective. Last but not least, it should allow for both short- and long-term fairness by taking past scheduling decisions into account when computing the current schedule. Our current research revolves around two main tracks:

  • a criterion based on a "regret" value reflecting the unhappiness of a sub-project with regard to its scheduling over the last time period (see the sketch after this list)
  • a criterion based on the "pressure" a given sub-project is under with regard to its computational progress, deadline, and resource allocation at the current moment
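As a purely hypothetical sketch of the first track (the function and the fair-share formulation are my own illustration, not the exact definition from the report), a regret value can be computed as the shortfall between the resources a sub-project was entitled to and those it actually received, blended with its past regret to obtain long-term fairness:

```c
/* Hypothetical regret criterion: how "unhappy" a sub-project is with
 * its allocation over the last time period.  A project that received
 * at least its fair share has zero new regret. */
double regret(double fair_share,   /* fraction of resources promised, in [0,1] */
              double received,     /* fraction of resources actually granted   */
              double past_regret,  /* regret carried over from earlier periods */
              double decay)        /* in [0,1): weight given to history        */
{
    double shortfall = fair_share - received;
    if (shortfall < 0.0)
        shortfall = 0.0;
    /* Blending in past regret provides the long-term fairness
     * mentioned above: old grievances fade but are not forgotten. */
    return shortfall + decay * past_regret;
}
```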

The results of this internship are gathered in my report and were presented for examination for my Master's degree at the École Normale Supérieure de Lyon. Both the slides and the report are available below:

Applying Bit-Slicing to Cryptographic Functions

Over the last twenty years, computers and information technologies have invaded the everyday life of most modern societies. This increasing quantity of circulating data has brought numerous new issues and challenges for researchers and engineers. One consequence is that research now also focuses on the data being transferred and not merely on the methods used to accomplish the transfer. Data integrity and confidentiality, for example, have become essential to many uses of the Internet, and the need for digital cryptography has grown accordingly.

The Advanced Encryption Standard, defined in 2001 [4], has become a widely used cipher. It is used by IPsec for data encryption and is included in the Secure Sockets Layer protocol for Internet communication. As with other ciphers, there is a demand for efficient and fast implementations of AES, as servers, network routers and security interfaces all need maximum throughput of encrypted data. At the same time, the electrical power used for this task should be as low as possible, for both economical and ecological reasons.

Published techniques for fast and efficient AES encryption range from hardware acceleration to highly parallelized software implementations on various processor architectures [2, 3, 6, 7]. The method known as "bit-slicing" was first developed for the Data Encryption Standard by Eli Biham [1] and achieved performance levels unknown until then through effective data rearrangement. In a 2006 paper, C. Rebeiro, D. Selvakumar and A.S.L. Devi presented the idea of applying a similar method to the Advanced Encryption Standard, even though its mathematical structure is totally different [6].
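To make the idea of bit-slicing concrete, the following toy C sketch (not AES itself) transposes 64 independent 8-bit blocks so that bit i of every block lives in a single 64-bit word; after that, each bitwise instruction acts as 64 parallel one-bit gates:

```c
#include <stdint.h>

/* Bit-slicing in miniature: prepare 64 independent 8-bit blocks for
 * parallel processing.  slice[i] holds bit i of all 64 blocks, one
 * block per bit position. */
void transpose_8x64(const uint8_t block[64], uint64_t slice[8]) {
    for (int i = 0; i < 8; i++) {
        slice[i] = 0;
        for (int b = 0; b < 64; b++)
            slice[i] |= (uint64_t)((block[b] >> i) & 1) << b;
    }
}

/* A toy "round": one sequence of bitwise operations evaluates this
 * boolean function simultaneously on all 64 blocks.  The same principle
 * lets a bitsliced AES S-box evaluate 64 substitutions in parallel as a
 * boolean circuit, without any table lookups. */
void toy_round(uint64_t s[8], const uint64_t k[8]) {
    for (int i = 0; i < 8; i++)
        s[i] = (s[i] & k[i]) ^ s[(i + 1) % 8];  /* 64 blocks per op */
}
```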

During my internship at Kalray my work focused on an efficient implementation of AES on their new processor architecture. Special attention was given to exploiting the specific features of the architecture to obtain the best possible energy efficiency in encryption and decryption. The bit-slicing method was retained as a potential candidate and was efficiently implemented.

One part of this energy-efficient implementation involved hybrid specialization [5], which resulted in the following research paper (unfortunately never published).

Citations:

  1. Eli Biham, A Fast New DES Implementation in Software, Technion - Israel Institute of Technology, Haifa, 1997
  2. Joppe W. Bos, Dag Arne Osvik, Deian Stefan, Fast Implementation of AES on Various Platforms, EPFL, Lausanne, Switzerland, 2009
  3. David Canright, A Very Compact Rijndael S-Box, Naval Postgraduate School, California, 2005
  4. NIST, Advanced Encryption Standard (AES), Federal Information Processing Standards Publication 197, USA, 2001
  5. M.A. Khan, H.P. Charles, D. Barthou, An Effective Automated Approach to Specialization of Code, Heidelberg, Germany, 2008
  6. C. Rebeiro, D. Selvakumar, A.S.L. Devi, Bitslice Implementation of AES, Centre for Development of Advanced Computing, India, 2006
  7. Vincent Rijmen, Efficient Implementation of the Rijndael S-Box, Katholieke Universiteit Leuven, Belgium, 2000

Complexity analysis of the historical attack on the MD4 hash algorithm

Cryptography has always been a special interest of mine, so when I had to do a year-long solo project on a subject of my own choice I went for it. Among the great diversity of cryptographic algorithms, hash algorithms have the specific property of focusing on the protection of data against modification rather than on the privacy of the data itself. There have already been several generations of hash algorithms, as the earlier ones have been broken by efficient cryptographic attacks. Designing a new algorithm would have been an impossible task; analysing one of the attacks on an algorithm used in the past, however, was not. In 1997 Hans Dobbertin presented the first known attack on the MD4 message digest algorithm, which had been designed a few years earlier by Ronald Rivest. Since then, faster and more efficient attacks on MD4 have been discovered, but a fine-grained analysis of Dobbertin's attack was nowhere to be found: so started my project.

During a course on Data Security and Protection in my second MSc year, I was asked to give a two-hour lesson on the cryptanalysis of the MD4 hash algorithm. While the described attack itself is outdated, the goal was to present some of the main ideas and techniques used during the cryptanalytic study of a cryptographic function. My full report and the slides I used for the lesson can be found below.

Associated resources: