## Bytecode Profiling

First, HydraVM accepts program bytecode and converts it to architecture-specific machine code. We consider the program as a set of \emph{basic blocks}, where each basic block is a sequence of non-branching instructions that ends either with a branch instruction (conditional or non-conditional) or a return. Thus, any program can be represented by a graph in which nodes represent basic blocks and edges represent the program control flow -- i.e., an execution graph. Basic blocks can be determined at compile-time. However, our main goal is to determine the context and frequency of reachability of the basic blocks -- i.e., when the code is revisited through execution. To collect this information, we modify Jikes RVM's baseline compiler to insert additional instructions (in the program bytecode) at the edges of the basic blocks (e.g., branching, conditional, return statements) that detect whenever a basic block is reached. Additionally, we insert instructions into the bytecode to 1) statically detect the set of variables accessed by the basic blocks, and 2) mark basic blocks with input/output operations, as they need special handling in program reconstruction. This code modification doesn't affect the behavior of the original program. We call this version of the modified program, profiled bytecode.

   for (Integer i = 0; i < DIMx; i++) {
for (Integer j = 0; j < DIMx; j++) {
for (Integer k = 0; k < DIMy; k++) {
X[i][j] += A[i][k] * B[k][j];
}
}
}


## Superblock detection

With the profiled bytecode, we can view the program execution as a graph with basic blocks and variables represented as nodes, and the execution flow as edges. A basic block that is visited more than once during execution will be represented by a different node each time. The benefits of execution graph are multifold: 1) Hot-spot portions of the code can be identified by examining the hot paths of the graph, 2) static data dependencies between blocks can be determined, and 3) parallel execution patterns of the program can be identified.

To determine superblocks, we use a string factorization technique: we represent each basic block by a character that acts like an unique identifier for that block. Now, an execution of a program can be represented as a string. For example, the code above shows a matrix multiplication code snippet. An execution of this code for a 2x2 matrix can be represented as the string abjbhcfefghcfefghijbhcfefghcfefghijk. We factorize this string into its basic components or factors, using a variant of Main's string factorization algorithm. This factorization converts the matrix multiplication string into ab(jb(hcfefg)2hi)2jk. Using this representation, combined with grouping blocks that access the same memory locations, we divide the code into a set of nested calls, where each call execute a group of basic blocks, which becomes a superblock.

Thus, we divide the code, optimistically, into independent parts called superblocks that represent subsets of the execution graph. Each superblock doesn't overlap with other superblocks in accessed variables, and represents a long sequence of instructions, including branch statements, that commonly execute in this pattern. Since a branch instruction has taken and not taken paths, the superblock may contain one or both of the two paths according to the frequency of using those paths. For example, in biased branches, one of the paths is often considered; so it is included in the superblock, leaving the other path outside the superblock. On the other hand, in unbiased branches, both paths may be included in the superblock. Therefore, a superblock has multiple exits, according to the program control flow during its execution. A superblock also has multiple entries, since a jump or a branch instruction may target one of the basic blocks that constructs it. The parallelizer module orchestrates the construction of superblocks and distributes them over parallel threads. However, this may potentially lead to out-of-order execution of the code, which we address through STM concurrency control. I/O instructions are excluded from superblocks, as changing their execution order affects the program semantics.

This work is supported in part by AFOSR under grant FA9550-14-1-0187. Any opinions, findings, and conclusions or recommendations expressed in this site are those of the author(s) and do not necessarily reflect the views of AFOSR.