Installation

Overview

This document covers the installation process for this pipeline. Note that the document itself is a work in progress that will be finalized by 05 February 2021; until then, it should be considered incomplete and not necessarily suitable for use.

The pipeline is configured for installation with conda. Other installation methods are possible but not supported.

Short Version (for experts)

  • Install or activate conda

  • If needed, install git and git-lfs, and activate git-lfs

  • Clone the analysis pipeline repository

  • Navigate into the repository directory

  • Add the CGR conda channel to your .condarc

  • Create the conda environments specified by environment.yaml and environment-ldsc.yaml

  • Activate the appropriate environment (plco-analysis-ldsc for the ldsc and ldscores pipelines; plco-analysis for everything else)

  • Update Makefile.config to point to your copies of the following:

    • PLCO chip freeze

    • PLCO imputed data freeze

    • PLCO phenotype data

  • You are now ready to start running the pipeline. Good luck!

Long Version (for interested parties)

Environment Management with conda

conda is a package management system used for installing software packages across different systems. This is how the run environments for this pipeline are maintained. There are two environments in use: one for most applications and with a copy of python3, and one for the limited set of tools (currently only ldsc) that require end-of-life python2.

If you have not used conda before, you will need to install it for your user account. You can follow these conda installation instructions for more detailed information. Be sure to add the channels they list under “2. Set up channels”.

Check your ~/.condarc; add the channels r and https://raw.githubusercontent.com/NCI-CGR/conda-cgr/default/conda-cgr to the top of your list of channels if they are not already present.

Warning

At the end of this step, your ~/.condarc should contain, in its channels block, the following entries in order:

  • https://raw.githubusercontent.com/NCI-CGR/conda-cgr/default/conda-cgr

  • r

  • bioconda

  • conda-forge

  • defaults

If it does not, the environment resolution below will almost certainly fail.
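As a quick sanity check on the channel order listed above, something like the following sketch may help. The helper names (list_channels, expected_channels, check_channels) and the CONDARC_PATH override are invented here for illustration and are not part of the pipeline. The check is deliberately strict: it assumes exactly the five channels above and nothing else, and it only understands a simple block-style list in the channels block.

```shell
# Hypothetical helpers, not part of the pipeline.

# Print the entries of the "channels:" block of a condarc file, one per line.
# Only handles a simple block-style YAML list ("  - name" lines).
list_channels() {
    awk '
        /^channels:/ { in_block = 1; next }              # start of the channels block
        in_block && /^[ \t]*-[ \t]*/ {                   # a list entry inside the block
            sub(/^[ \t]*-[ \t]*/, ""); sub(/[ \t]+$/, ""); print; next
        }
        in_block && /^[^ \t#]/ { in_block = 0 }          # next top-level key ends the block
    ' "$1"
}

# The channel order this document asks for.
expected_channels() {
    printf '%s\n' \
        'https://raw.githubusercontent.com/NCI-CGR/conda-cgr/default/conda-cgr' \
        'r' 'bioconda' 'conda-forge' 'defaults'
}

# Succeed only if the file lists exactly the expected channels, in order.
check_channels() {
    [ "$(list_channels "$1")" = "$(expected_channels)" ]
}

# CONDARC_PATH is an assumed override for illustration; the default is ~/.condarc.
condarc="${CONDARC_PATH:-$HOME/.condarc}"
if [ -f "$condarc" ]; then
    if check_channels "$condarc"; then
        echo "channel order matches"
    else
        echo "channel order differs; found:"
        list_channels "$condarc"
    fi
fi
```

If the check reports a difference, editing ~/.condarc by hand is probably the least surprising fix.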

Getting a Copy of the Pipeline

Before getting the pipeline, we need command line tools to appropriately download it, and we’ll use conda for that. Install required software packages into your base environment (or a development environment if you prefer):

conda install mamba git git-lfs

Note

If you already have git and git-lfs available on your system, installing them here is unnecessary. mamba, however, is pretty essential on most systems.
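If you are not sure which of these tools you already have, a small check along these lines may save an unnecessary install. This is only a sketch; missing_tools is a hypothetical helper name, and it checks nothing beyond whether each command is found on your PATH.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools conda mamba git git-lfs)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "not found on PATH:" $missing
else
    echo "conda, mamba, git, and git-lfs are all on PATH"
fi
```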

Then activate your base environment (or proceed with your development environment if that’s the path you’re taking):

conda activate base

If you have never used git-lfs before, activate it one time only:

git lfs install

Now, navigate somewhere on your system where you want to install a copy of the pipeline. Then clone a copy of the pipeline repository (due to large reference files, this clone operation may take up to 30 seconds on some systems):

git clone https://github.com/NCI-CGR/plco-analysis

Warning

If you do not have git-lfs installed correctly, this clone operation will fail with messages regarding lfs not operating correctly.
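If you suspect the lfs-managed files did not come down properly, one way to check is to look for files that are still git-lfs pointer stubs (small text files beginning with a git-lfs version line) rather than real data. The following is a sketch under assumptions: is_lfs_pointer is a hypothetical helper, and the loop assumes you run it from the directory that contains your plco-analysis clone.

```shell
# Hypothetical helper: succeed if FILE looks like an unresolved git-lfs pointer.
# Pointer files are small text files whose first line is this version string.
is_lfs_pointer() {
    head -c 42 "$1" 2>/dev/null |
        grep -q -x -F 'version https://git-lfs.github.com/spec/v1'
}

# List lfs-tracked files in the clone that are still pointers.
# Skipped entirely if the clone or git-lfs is not present.
if [ -d plco-analysis/.git ] && command -v git-lfs >/dev/null 2>&1; then
    (
        cd plco-analysis &&
        git lfs ls-files -n | while IFS= read -r f; do
            if is_lfs_pointer "$f"; then
                echo "still a pointer: $f"
            fi
        done
    )
fi
```

No output from the loop suggests the lfs-managed files were fetched as real content.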

Warning

At the time of first writing, the large reference backend files for this pipeline are stored on GitHub, due to a lack of publicly-exposed alternatives. If sufficiently many people download these files in a short span of time, GitHub prevents further use of lfs-managed files for the calendar month, since CGR is evidently using a free GitHub account. The bandwidth limit refreshes monthly, so if you hit the cap, you can just wait. But also, please don’t clone multiple copies of this pipeline; once you have a copy, you can make other copies on a local system with cp -R.

Now, navigate into the pipeline directory:

cd plco-analysis

Build conda Environments

Create the two conda environments used by the pipeline using the environment specification files included in the pipeline repository:

mamba env create -f environment.yaml
mamba env create -f environment-ldsc.yaml

Note

The environment specified by environment.yaml will be named plco-analysis by default. This is a python3 environment and has many dependencies; depending on your system and the state of your environment cache (if you don’t know what that is, don’t worry about it), this can take tens of minutes to complete.

The environment specified by environment-ldsc.yaml will be named plco-analysis-ldsc by default. This is a python2 environment, and is very small, governing exclusively the operation of the LD score regression software ldsc. As python2 has reached end of life, this environment should never be expanded unless absolutely necessary, and ideally should be removed when ldsc achieves python3 compatibility (lol).

Warning

conda environments can be finicky. The plco-analysis pipeline in particular is somewhat delicate. It works (as of 30 January 2021). However, given the way conda is structured, it may well break at a future date. I will record here some thoughts on debugging the environment if you end up getting errors from mamba env create.

  • See the above discussion of conda channels. They all need to be present. It’s possible having extra channels not listed may create issues, so if you happen to have more, try temporarily removing them and see if that fixes it. Also note that the order of channels matters in resolving conflicting versions of the same package between channels.

  • If you get errors about an environment already existing, it’s possible you have an environment named plco-analysis or plco-analysis-ldsc already present in your miniconda installation. That’s bad lol. You can check your existing environments with conda info --envs (or simply list the contents of the directory /path/to/miniconda3/envs). If indeed there is an existing environment, perhaps you’ve already done this process before? Otherwise, you can override the name of the environment you’re creating now by instead using mamba env create -f environment.yaml -n different_name or by changing the entry in environment.yaml.

  • If you’re getting truly bizarre errors (conflicting paths in packages, missing package files, etc.), it’s possible your cache has become corrupted. Don’t even ask me how this happens. It can (I have seen it) create inscrutable errors that simply vanish when you clean up the cache. A traditional method for doing this is just deleting and reinstalling conda entirely; that’s certainly a time-honored approach. But it’s more aggressive than you may need. You can instead try running conda clean --all, or simply recursively deleting the contents of /path/to/miniconda3/pkgs.

  • I’ll note here that specific errors regarding boost-cpp=1.70 are more troublesome. The packages bolt-lmm, r-saige, and some not-yet-tracked-down dependencies of r-saige were built specifically against boost-cpp=1.70 and block newer versions. I’ve thus built the plco-analysis internal packages annotate_frequency, combine_categorical_runs, initialize_output_directories, merge_files_for_globus, and qsub_job_monitor against boost-cpp=1.70 as well. If this breaks in the future, or if/when boost-cpp=1.70 leaves conda, there’s going to be trouble. My apologies to Future Person who has to deal with this nonsense.
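For the "environment already exists" case above, a check along these lines can be run before mamba env create. It is only a sketch: env_in_listing is a hypothetical helper that reads the output of conda info --envs on standard input and looks for an exact match on the environment name in the first column.

```shell
# Hypothetical helper: read `conda info --envs` output on stdin and succeed
# if an environment with exactly the given name is listed.
env_in_listing() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Only runs if conda is actually available.
if command -v conda >/dev/null 2>&1; then
    for env in plco-analysis plco-analysis-ldsc; do
        if conda info --envs | env_in_listing "$env"; then
            echo "environment $env already exists"
        fi
    done
fi
```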

Environment Usage

I’ve said it above and I’ll say it again here so that when this inevitably causes problems, you’ll hopefully see it somewhere:

  • activate plco-analysis-ldsc when you are running the ldsc pipeline in ldsc/Makefile with make ldsc; or when you are running the ldscore regression pipeline in shared-makefiles/Makefiles.ldscores with make ldscores:

    conda activate plco-analysis-ldsc

  • activate plco-analysis for all other pipelines:

    conda activate plco-analysis
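If you want a guard against running a pipeline in the wrong environment, something like the following sketch could be put in front of your make calls. require_env is a hypothetical helper, not something the pipeline provides. It relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; if you activated an environment by path rather than by name, that variable may hold the path instead, and this simple comparison will not match.

```shell
# Hypothetical guard: succeed only if the named conda environment is active.
require_env() {
    if [ "${CONDA_DEFAULT_ENV:-}" != "$1" ]; then
        echo "expected conda environment '$1', but '${CONDA_DEFAULT_ENV:-none}' is active" >&2
        return 1
    fi
}

# Usage, following the targets named above:
#   require_env plco-analysis-ldsc && make ldsc
#   require_env plco-analysis-ldsc && make ldscores
```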

Updating Project Configuration

At the time of this writing, project-wide configuration (primarily the location of genotypes and phenotypes) is controlled by variables in the file plco-analysis/Makefile.config. The extent to which you need to update variables in this file depends on where you’re trying to install your copy of the pipeline, and what directory permissions you have. Defaults suitable for cgems/ccad are already set. Each variable also has a commented explanation in-file, so you should read those for more details or examples.

You will likely need to change the following:

  • PROJECT_BASE_DIR: installation path of your pipeline, including the directory plco-analysis.

  • CHIP_FREEZE_INPUT_DIR: path to your PLCO chip freeze files. By default it expects PLCO_GSA.{bed,bim,fam}, and equivalent files for OmniX, Omni25, Omni5, and Oncoarray.

  • EXTERNAL_FILE_INPUT_DIR: this is a site for future development pulling in external metadata files; for the moment, it is merely the presumed location of the cross-platform subject deduplication file, by default named PLCO_final_subject_list_Ancestry_UniqGenotypePlatform_04132020.txt

  • FILTERED_IMPUTED_INPUT_DIR: path to your PLCO imputation freeze files. This folder should contain post-Rsq-QC, non-redundant subjects files in minimac4 format. For DUPS requests, the relevant folder is typically named something like Non_redundant_PLCO/Imputed/Post_Imputation_QCed/latest

  • PHENOTYPE_FILENAME: path to and name of phenotype file for the study. The format is described briefly in Makefile.config: plain-text, tab-delimited, single header row. Note that the Atlas analysis configuration files expect augmented covariate columns describing certain possible batch effects as binary indicator variables. This functionality can be disabled by removing the relevant rows from the configuration files plco-analysis/config/*config.yaml
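Before running anything, it may be worth confirming that the directory you set for CHIP_FREEZE_INPUT_DIR actually contains what the pipeline expects. The sketch below checks only for PLCO_GSA.{bed,bim,fam}; check_chip_freeze is a hypothetical helper, and the equivalent files for the other platforms are left out because their exact names are not spelled out above.

```shell
# Hypothetical helper: report any of PLCO_GSA.{bed,bim,fam} missing from DIR.
# Returns nonzero if at least one file is missing.
check_chip_freeze() {
    rc=0
    for ext in bed bim fam; do
        if [ ! -f "$1/PLCO_GSA.$ext" ]; then
            echo "missing: $1/PLCO_GSA.$ext" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example, with a placeholder path standing in for your CHIP_FREEZE_INPUT_DIR:
#   check_chip_freeze /path/to/chip/freeze
```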