# Software Development: Structured Programming

As the Internet is full of introductory material, this section contains only a few remarks that might give the material at other sites a slightly different flavor.

# Loops

## Translatability

Any common loop construct can be rewritten by using any other common loop construct. For example, the following for-loop

    for (i = 0; i < 42; i++) {
        do_something();
    } // for


is equivalent to the following while-loop:

    i = 0;
    while (i < 42) {
        do_something();
        i++;
    }


which is equivalent to the following (Ruby) code:

    i = 0
    begin
      if i < 42
        do_something()
        i += 1
      end # if
    end while i < 42


which in turn can be rewritten as the for-loop at the start of this example.

## Generalisations

Citing Андрей Николаевич Терехов (Andrey Nikolaevich Terekhov), a comprehensive way to describe a loop is:

    FOR i FROM a BY b TO c WHILE d
    DO f OD
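As a sketch, the generalised form above can be rendered in JavaScript. The names `a`, `b`, `c`, `d`, `f` are the placeholders from the pattern, and the function name `generalised_loop` is made up for this illustration:

```javascript
// One possible JavaScript rendering of
// "FOR i FROM a BY b TO c WHILE d DO f OD".
// a = start value, b = step, c = upper bound,
// d = an extra continuation condition, f = the loop body.
function generalised_loop(a, b, c, d, f) {
    for (var i = a; i <= c && d(i); i += b) {
        f(i);
    }
}

// Usage example: sum 1..5 while the running total stays below 100.
var total = 0;
generalised_loop(1, 1, 5,
    function (i) { return total < 100; },
    function (i) { total = total + i; });
console.log(total);
```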


## Loop Parallelization

If separate iterations of a loop do not depend on each other, then the result of the loop does not depend on the execution order of its iterations.

    function the_output_of_this_function_does_not_depend_on_the_rand_value(x) {
        var ar_data = [42, 77, 55];
        // 0.0 <= Math.random() < 1.0
        var b_choice_1 = (0.5 <= Math.random());
        if (b_choice_1) {
            for (var i = 0; i < 3; i++) {
                ar_data[i] = ar_data[i] + x;
            } // for
        } else {
            if (0.5 <= Math.random()) {
                // The key phrase here is: "loop unrolling".
                // If loops are short "enough", then
                // unrolling can be used as an assembler
                // level optimization technique that is based
                // on the idea that by unrolling a loop there is
                // no need to allocate a CPU register for
                // the loop iteration variable, in this case
                // "the i", and the "i = 0" assembler instruction
                // can also be skipped.
                ar_data[0] = ar_data[0] + x;
                ar_data[1] = ar_data[1] + x;
                ar_data[2] = ar_data[2] + x;
            } else {
                if (0.5 <= Math.random()) {
                    // The same unrolled iterations in reverse order.
                    ar_data[2] = ar_data[2] + x;
                    ar_data[1] = ar_data[1] + x;
                    ar_data[0] = ar_data[0] + x;
                } else {
                    // Yet another execution order.
                    ar_data[1] = ar_data[1] + x;
                    ar_data[2] = ar_data[2] + x;
                    ar_data[0] = ar_data[0] + x;
                } // else
            } // else
        } // else
        return ar_data;
    } // the_output_of_this_function_does_not_depend_on_the_rand_value


Loop parallelization that takes advantage of the CPU SIMD (Single Instruction, Multiple Data) instructions is called vectorization. As of 2015, the command line arguments for switching on vectorization in GCC and LLVM are:

    S_THE_REST=" -ftree-vectorize "
    -mtune=native $S_THE_REST  # optimizes for the current CPU, but inserts extra code
                               # for backwards compatibility with older CPU types
    -march=native $S_THE_REST  # optimizes for the current CPU, but
                               # does NOT insert the extra code for
                               # backwards compatibility with older CPU types


Loop parallelization can also be used for executing different iterations of a loop on different computers in a cluster. In that case program execution continues after the results of all of the iterations of the loop have been received from the various computers in the cluster. The keyword is MapReduce.

If, instead of a cluster, the iterations are run on different CPU cores of a single computer, then one possible algorithm for spreading the iterations between the CPU cores is to load the iteration tasks into a queue and have the CPU cores fetch tasks from that queue at their own pace. More detailed descriptions of that algorithm can be found by searching for the phrase "work stealing", but a key thing to remember is that the activity of distributing the tasks between CPU cores can itself be too expensive for some tasks, and starting an operating system process is even more expensive.
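The shared-queue pattern can be sketched as follows. In real code the workers would be separate threads or processes (for example Node.js `worker_threads`); here they are `async` functions sharing one event loop, which is enough to show the task distribution logic. The names `run_worker` and `run_all` are made up for this sketch:

```javascript
// N workers pull tasks from one shared queue at their own pace.
async function run_worker(queue, results) {
    while (queue.length > 0) {
        var task = queue.shift();  // fetch the next task from the queue
        results.push(task * task); // "execute" the task
        await Promise.resolve();   // yield so the other workers can pull too
    }
}

async function run_all(n_workers, tasks) {
    var queue = tasks.slice();
    var results = [];
    var workers = [];
    for (var i = 0; i < n_workers; i++) {
        workers.push(run_worker(queue, results));
    }
    // Like in the cluster case: continue only after
    // all of the iterations have completed.
    await Promise.all(workers);
    return results;
}

// Usage example: square 5 numbers on 3 "workers".
run_all(3, [1, 2, 3, 4, 5]).then(function (r) {
    console.log(r.sort(function (a, b) { return a - b; }));
});
```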

### Amdahl's law

The speed-up that can be achieved by parallelizing an algorithm has a theoretical limit, which can be explained by an example where the algorithm consists of 3 function calls.

    function the_algorithm() {
        func_part_1(); // if sequential, takes 0.2 time units
        func_part_2(); // if sequential, takes 0.5 time units
        func_part_3(); // if sequential, takes 0.3 time units
        // 0.2 + 0.5 + 0.3 = 1.0
    } // the_algorithm


If the time spent on the execution of func_part_2() were halved by parallelizing func_part_2() and executing its iterations on 2 CPU cores in parallel, then the whole execution time of func_part_2() would be

$$\cfrac{0.5\ \text{time units}}{2} = 0.25\ \text{time units}$$

and the time spent on the execution of the_algorithm() would be

$$0.2\ \text{time units} + 0.25\ \text{time units} + 0.3\ \text{time units} = 0.75\ \text{time units}$$

If func_part_1() were executed on 100 CPU cores in parallel while func_part_2() and func_part_3() were executed on a single CPU core, then the time spent on the execution of the_algorithm() would still be

$$\cfrac{0.2\ \text{time units}}{100} + 0.5\ \text{time units} + 0.3\ \text{time units} = 0.002\ \text{time units} + 0.8\ \text{time units} = 0.802\ \text{time units}$$
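The examples above are instances of the general formula of Amdahl's law: if a fraction $p$ of the algorithm's sequential running time can be parallelized over $n$ CPU cores, then the achievable speed-up is

$$S(n) = \cfrac{1}{(1 - p) + \cfrac{p}{n}}$$

For the last example, $p = 0.2$ and $n = 100$ give $S(100) = \cfrac{1}{0.8 + 0.002} = \cfrac{1}{0.802} \approx 1.25$, and even as $n \to \infty$ the speed-up can never exceed $\cfrac{1}{1 - p} = 1.25$, because the sequential parts still take $0.8$ time units.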

INCOMPLETE. TODO: add a profiling chapter, containing the graph-cutting algorithm used in scientific computing, and the discussion of declaring variables outside of loops, which stems from avoiding situations that keep the JIT of interpreted languages from switching on. The profiling chapter should also include a part about instrumenting binaries with GCC and LLVM.