# Day 1 - BioNix Workshop

Let's start by defining *computational reproducibility* as always
obtaining the same output from a computation given the same inputs. In
other words, computational reproducibility is about making computations
*deterministic*.  In the research context, this is important as
reproducibility allows others (and ourselves) to verify and build upon
what we have done in future.

# A functional view of things and why Nix is needed

What makes reproducibility difficult is the management of *state*, or
the context within which a computation takes place. State manipulation is
widespread: how many app updates or system updates do you recall
being installed automatically over the past year? Do you think your
analysis today will give the same result in one year's time if your
software stack has changed?

One way to deal with this problem is to make computations *pure* by forbidding
the use of anything that is not explicitly stated as an input. This is
the same idea as in pure functional programming, only applied at the higher
level of executing software.

Nix effectively enforces purity for software execution by ensuring the software
cannot access anything outside of the specified inputs. In this way, it can
guarantee a very high degree of reproducibility. Nix is a general build engine
most commonly used for building software today, but as we will see a bit later
it can also execute computational biology workflows in a pure manner with a small
library called BioNix.

# Pipelines in BioNix

```
# This is an example pipeline specification to do multi-sample variant calling
# with the Platypus variant caller. Each input is preprocessed by aligning
# against a reference genome (defaults to GRCh38), fixing mate information, and
# marking duplicates. Finally platypus is called over all samples.
{ bionix ? import <bionix> { }
, inputs
, ref ? bionix.ref.grch38.seq
}:

with bionix;
with lib;

let
  preprocess = flip pipe [
    (bwa.align { inherit ref; })
    (samtools.sort { nameSort = true; })
    (samtools.fixmate { })
    (samtools.sort { })
    (samtools.markdup { })
  ];

in
platypus.call { } (map preprocess inputs)
```
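
The `flip pipe [ ... ]` idiom in the pipeline above builds a single function out of a list of stages. The following is a minimal sketch of that idea in plain Nix, using arithmetic stages in place of the BioNix tools. The local `pipe` and `flip` definitions are assumptions written for illustration; they are meant to behave like the helpers of the same name that `with lib;` brings into scope.

```
let
  # Illustrative stand-ins for the library helpers used in the pipeline.
  # pipe threads a value through a list of functions, left to right.
  pipe = val: functions: builtins.foldl' (x: f: f x) val functions;
  # flip swaps the first two arguments of a function.
  flip = f: a: b: f b a;

  # A two-stage "pipeline": add one, then double.
  preprocess = flip pipe [
    (x: x + 1)
    (x: x * 2)
  ];

in
map preprocess [ 1 2 3 ] #=> [ 4 6 8 ]
```

Because `preprocess` is an ordinary function, `map` can apply it to every element of a list, which is how the pipeline above preprocesses each input before handing the results to `platypus.call`.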

# Nix the language

We will start by learning Nix the language, which is used for
specifying workflows. If you are familiar with JSON, it is very similar
in terms of available data types but has one very important addition:
functions. Let's cover the basic data types and their syntax:

- Booleans: `true` and `false`
- Strings: `"this is a string"`
- Numbers: `0`, `1.234`
- Lists: `[ 0 1.234 "string" ]`
- Attribute sets: `{ a = 5; b = "something else"; }`
- Comments: `# this is a comment`
- Functions: `x: x + 1`
- Variable binding: `let x = 5; in x #=> 5`
- Function application: `let f = x: x + 1; in f 5 #=> 6`
- File paths: `/path/to/file`
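
As a small sketch of how these pieces combine, the expression below binds an attribute set and a function with `let`, then uses both. The names `sample` and `describe` are hypothetical and chosen only for illustration. The function takes an attribute set as its argument and gives `suffix` a default value with `?`, the same pattern used in the first lines of the pipeline shown earlier.

```
let
  # An attribute set mixing several of the data types above.
  sample = {
    name = "sample1";
    paired = true;
    depth = 30;
    files = [ "r1.fastq" "r2.fastq" ];
  };

  # A function whose argument is an attribute set; suffix is optional.
  describe = { name, suffix ? ".bam" }: name + suffix;

in
{
  # inherit (sample) name; is shorthand for name = sample.name;
  output = describe { inherit (sample) name; }; #=> "sample1.bam"
  nfiles = builtins.length sample.files; #=> 2
}
```

Attributes are accessed with a dot, as in `sample.files`, and list elements are separated by whitespace rather than commas.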

Some common operators:
- Boolean conjunctions and disjunctions: `true || false #=> true`, `true && false #=> false`
- Ordering: `3 < 3 #=> false`, `3 <= 3 #=> true`
- Conditionals: `if 3 < 4 then "a" else "b" #=> "a"`
- Addition and subtraction: `3 + 4 #=> 7`, `3 - 4 #=> -1`
- Multiplication and division: `3 * 4 #=> 12`, `3.0 / 4 #=> 0.75`
- String concatenation: `"hello " + "world" #=> "hello world"`
- String interpolation: `"hello ${"world"}" #=> "hello world"`, `"1 + 2 = ${toString (1 + 2)}" #=> "1 + 2 = 3"`
- Attribute set unions: `{ a = 5; } // { b = 6; } #=> { a = 5; b = 6; }`
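
Several of these operators often appear together. The sketch below, with hypothetical names `defaults` and `overrides`, merges two attribute sets and builds a string from the result. Note that when both sides of `//` define the same attribute, the right-hand side wins.

```
let
  defaults = { threads = 1; aligner = "bwa"; };
  overrides = { threads = 4; };

  # threads is taken from overrides, aligner from defaults.
  config = defaults // overrides;

in
"${config.aligner} with ${toString config.threads} thread${if config.threads > 1 then "s" else ""}"
#=> "bwa with 4 threads"
```

Numbers must be converted with `toString` before interpolation, and any expression, including a conditional, can appear inside `${ }`.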

# About this interface

This workshop uses [A tour of
nix](https://github.com/nixcloud/tour_of_nix) with some altered content
for the purposes of learning enough of Nix the language to write
workflows in BioNix during the second part. Click next to continue to
the exercises.