Advanced Development Notes¶
Overview of Make¶
Existing pipelines within this project are in the language make; my condolences!
make lacks many of the features of modern workflow languages, so some of the usage
may be unusual or quirky, or lacking entirely in features you’d expect from snakemake
or other similar languages.
Rule Structures¶
If you’re not familiar with make at all, you should definitely skim the make manual
first. I’ll just cover some extremely brief thoughts on make usage in these pipelines.
rules rely heavily on text replacement and substitution for building input files from outputs; see make function documentation for an overview
pattern rules are great where possible, but sometimes are incompatible or just don’t fit into my brain. feel free to swap them in where I’ve missed opportunities to do so
most pipelines begin with some form of enumeration of the possible outputs, and restriction of those outputs to ones with valid corresponding inputs. a lot of this relies on
findto detect thingsmakehas no direct compatibility withyaml, which is a real pain. that’s been on the to-do list for ages. thus the global configuration parameters inMakefile.configremain in rawmakeinstead of in a niceyamlfile. please do patch in support as you feel like itthe most complex DAG construction happens with the analysis modules in
shared-makefiles/; for that, I built a custom preprocessor that actually deals withyamlnatively and emits targets in a formatmakecan handle. seeshared-makefiles/Makefile.boltlmmas an example. my apologies that it’s not more generalized
Job Tracking/Logging¶
All rules in the make pipelines are wrapped in utility functions that emit tracking files
(by default appended with .success, configurable with $(TRACKING_SUCCESS_SUFFIX)). Log data
are emitted to output files prefixed uniquely based on the rule. So everything is there, but it
can be a bit of a mess to find given how many files are emitted.
Integration with Clusters¶
make has no integrated support for cluster job submission. These pipelines have their calls wrapped
in two simple utility functions which either dispatch jobs via a qsub interface (for cgems), or run
the job in the main process but log the results with tracking files. This is a total mess. Other languages
obviously offer actual support, so further development should make use of that. For extension of the current
pipelines, support for slurm in particular needs to get patched into an appropriate interface/monitor program.
Conda Environments¶
Again, no intrinsic make support, and make runs things in subshells that make launching separate
environments rather complicated. It should be possible to wrap individual rules in activated conda
environments. However, given the comparative simplicity of the workspace as a whole, the top-level conda
environment should be sufficient for all modules, at least as currently written. Just make sure the top-level
environment is active before launching any step of the analysis.
Thread Safety/Directory Locking¶
Once again, make has no direct support for this, and I’ve never gotten around to actually implementing it.
This is probably the biggest silent weakness of the pipeline as written. Please keep this in mind when writing
and running pipelines. You can have bolt/saige/other analysis tools running at the same time; but don’t dispatch
multiple bolt jobs at the same time! It will collide with nasty consequences.
Result Directory and Filename Conventions¶
Adding Analysis Modules¶
In theory, it is very straightforward to add additional analysis
modules to the project by writing individual pipelines targeted
at specific tools (e.g. SNPTEST, PLINK), placing them in shared-makefiles,
and adding a global rule in Makefile. In practice, presumably no
future developers will want to write any such module in make, so
there need to be some thoughts to integrating modules from other languages.
I’m going to try to present the interface specification for these analysis modules; with that, you should be able to write a module in any language, and it should slot in pretty seamlessly.
Inputs¶
The following pertinent files will be available to the analysis pipeline
on run start. Note that all files are assumed to be prefixed with the installation
directory of the pipeline. Variables $(VARIABLE_NAME) are defined in Makefile.config.
Chip data:
each input platform and ancestry combination will have the following files, so long as that combination has sufficient subjects for analysis:
$(CLEANED_CHIP_OUTPUT_DIR)/{ancestry}/{platform}.step1.maf.geno.hwe.{bed,bim,fam}: plink format genotypes with mild cleaning applied
$(CLEANED_CHIP_OUTPUT_DIR)/{ancestry}/{platform}.step2.pruning.{in,out}: plink format output from –indep for variant {inclusion, exclusion}
$(CLEANED_CHIP_OUTPUT_DIR)/{ancestry}/{platform}.step3.pruning.{in,out}: plink format output from –indep-pairwise for variant {inclusion, exclusion}
$(CLEANED_CHIP_OUTPUT_DIR)/{ancestry}/{platform}.step4.het.remove: plink format subject removal file for heterozygosity outliers
Imputed data:
each input platform and ancestry combination will have the following files, so long as that combination has at least 100 subjects:
$(BGEN_OUTPUT_DIR)/{platform}[/batch{#}]/{ancestry}/chr{#}-filtered.bgen
$(BGEN_OUTPUT_DIR)/{platform}[/batch{#}]/{ancestry}/chr{#}-filtered.bgen.bgi
$(BGEN_OUTPUT_DIR)/{platform}[/batch{#}]/{ancestry}/chr{#}-filtered-noNAs.sample
Inputs are fixed at bgen version 1.2 dosages due to compatibility with bolt-lmm, saige, fastGWA
Modified sample file is due to format discrepancy in raw output from plink2
Phenotype data:
plain-text, tab-delimited phenotype file
header row
single column with unique subject ID (name is configurable as $(PHENOTYPE_ID_COLNAME))
conventionally, missing values are stored as
NA
Outputs¶
Results directory structure:
all results should be emitted under $(RESULT_OUTPUT_DIR) as follows:
$(RESULT_OUTPUT_DIR)/$(analysis_prefix)/{ancestry}/{software}
above, $(analysis_prefix) is the variable taken from
config/*.yaml
Output filenames:
all intermediate and output files should be prefixed in the results directory as follows:
$(phenotype).$(platform)[_batch{#}].{software}
$(phenotype) is the variable taken from
config/*.yaml
Required output files and formats:
the following files are those used downstream by existing pipeline components:
$(phenotype).$(platform)[_batch{#}].{software}.tsv.gz
results file per platform/batch
format is tab-delimited, columns as follows (with header as listed):
CHR: chromosome of variant
POS: physical position of variant, in GRCh38
SNP: variant ID (see note below)
Tested_Allele: coded allele (corresponding to effect direction of BETA)
Other_Allele: non-coded allele
Freq_Tested_Allele_in_TOPMed: allele frequency (see note below)
BETA: regression coefficient (binary traits: logOR) for variant
SE: standard error of test
P: association p-value
N: actual sample size tested for variant
Ncases: binary results only: actual number of cases tested for variant
Ncontrols: binary results only: actual number of controls tested for variant
SNP defaults to “chr:pos:ref:alt” codes from TOPMed. This needs to be replaced with rsIDs when requested with the
config/*.yamloptionid_mode: rsid.Freq_Tested_Allele_in_TOPMed defaults to reference IDs, approximate frequencies from the imputation reference subjects, to avoid issues with identifiability of subject samples. These should instead be replaced with actual subject allele frequencies when requested with the
config/*.yamloptionfrequency_mode: subject.
$(phenotype).$(platform)[_batch{#}].{software}.rawids.tsv
the format of this file is the same as the above, except SNP must contain unique IDs, in this case the “chr:pos:ref:alt” IDs from the TOPMed reference data
this file is canonically actually an upstream intermediate that leads to the above output file
note the lack of compression. this can be patched to behave differently
as things are currently configured, this file is required by
shared-makefiles/Makefile.metal, the meta-analysis pipeline. this is because the rsID mapping requested byid_mode: rsidand used for the “Atlas” website creates duplicate sites in a very few cases, which causes issues formetalwhen trying to unambiguously link variants to one another across platformsthis is an extremely messy behavior, and one I’d love to see patched out somehow in the future
Adding Other Pipelines¶
In addition to the above, other pipelines will likely be needed if this project is to continue.
For example, bgen v1.2 format has worked well for the PLCO “Atlas” project, but will likely
need to be replaced or augmented in the future.
Most of the project’s pipelines live in a dedicated subdirectory of the appropriate name. They are
called from a dedicated rule in the top-level Makefile, and dispatch themselves based on variables
they import from Makefile.config. This process can be repeated for other necessary backend pipelines.
Note that, in particular for later pipelines operating on ancestry-split data, there needs to be
the capacity to dynamically restrict the DAG to combinations of platform and ancestry that exist
in the actual data, not just the full enumeration of platform and ancestry combinations. The make
pipelines do this by assuming upstream pipelines run to completion and detecting whatever output files
happen to be present from those pipelines, and working from there. Other languages have more elegant
support for this kind of DAG restriction. Just make sure you do it: there is never any guarantee
that any particular input combination will be present, and in fact for many ancestries given US sampling
criteria, it’s almost guaranteed they will be absent.
Extension to Other Languages¶
No one will want to write any further pipelines in make. However, it should be reasonably
straightforward to create modules in other languages. Make sure the modules conform to the above
interface specification, or possibly modify it while maintaining back compatibility.
The only major issue comes up around job dispatch. You can write a snakemake call into
the top-level Makefile dispatcher; however, that will not straightforwardly handle process
monitoring in the way recursive make usually does, and it loses out on a bunch of snakemake’s
convenient features.
The best solution then should be to create a language-specific dispatcher that handles module calls
within the language of the module. So, write a top-level Snakefile that covers snakemake analysis
modules. As ever, care must be taken to be sure upstream pipelines have run to completion before analysis.
However, the way the make pipelines are structured, that’s the case regardless, so the added
burden should be minimal.