Installation
============

Overview
--------

This document covers the installation process for this pipeline.
Note that this document in particular is a work in progress that will be
finalized by 05 February 2021; until then it should be considered incomplete
and not necessarily suitable for use.

The pipeline is configured for installation with conda_. Other installation
methods are possible but not supported.

.. _conda: https://docs.conda.io/en/latest/

Short Version (for experts)
---------------------------

*  Install or activate conda_
*  If needed, install git and git-lfs, and activate git-lfs
*  Clone the `analysis pipeline repository`_
*  Navigate into the repository directory
*  Add the `CGR conda channel`_ to your **.condarc**
*  Create the conda_ environments specified by **environment.yaml** and **environment-ldsc.yaml**
*  Activate the environments (ldsc for the ``ldsc`` and ``ldscores`` pipelines; the other for everything else)
*  Update **Makefile.config** to point to your copies of the following:

   *  PLCO chip freeze
   *  PLCO imputed data freeze
   *  PLCO phenotype data

*  You are now ready to start running the pipeline, good luck!

.. _`analysis pipeline repository`: https://github.com/NCI-CGR/plco-analysis
.. _`CGR conda channel`: https://raw.githubusercontent.com/NCI-CGR/conda-cgr/default/conda-cgr

Long Version (for interested parties)
-------------------------------------

Environment Management with `conda`_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`conda`_ is a package management system used for installing software packages across different systems.
This is how the run environments for this pipeline are maintained. There are two environments in use:
one with a copy of python3 for most applications, and one for the limited set of tools (currently only
`ldsc`_) that require end-of-life python2.

.. _`ldsc`: https://github.com/bulik/ldsc

If you have not used `conda`_ before, you will need to install it for your user account. You can follow
`these conda installation instructions`_ for more detailed information. Be sure to add the channels they
list under "2. Set up channels".

.. _`these conda installation instructions`: https://bioconda.github.io/

Check your ``~/.condarc``; add the channels ``r`` and ``https://raw.githubusercontent.com/NCI-CGR/conda-cgr/default/conda-cgr``
to the top of your list of channels if they are not already present.

.. warning::
   At the end of this step, your ``~/.condarc`` should contain, in its **channels** block, the following entries in order:

   * ``https://raw.githubusercontent.com/NCI-CGR/conda-cgr/default/conda-cgr``
   * ``r``
   * ``bioconda``
   * ``conda-forge``
   * ``defaults``

   If it does not, the environment resolution below will almost certainly fail.
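
One way to arrive at this order is to prepend the channels in reverse, since each
``conda config --prepend`` call places its channel at the top of the list. The following
is a minimal sketch, not something shipped with the pipeline; the function name
``set_plco_channels`` is made up for illustration.

.. code-block:: bash

   # Hypothetical helper (not part of the pipeline): prepend the channels in
   # reverse order so that the last one prepended ends up first in ~/.condarc.
   set_plco_channels() {
       for channel in defaults conda-forge bioconda r \
           "https://raw.githubusercontent.com/NCI-CGR/conda-cgr/default/conda-cgr"; do
           conda config --prepend channels "$channel"
       done
   }

After calling ``set_plco_channels`` once, ``conda config --show channels`` should list the
five entries at the top, in the order given in the warning above. Channels that were already
present are moved to the top rather than duplicated.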

Getting a Copy of the Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before getting the pipeline, we need command line tools to download it, and we'll use conda to install them.
Install the required software packages into your base environment (or a development environment if you prefer):

.. code-block:: bash

   conda install mamba git git-lfs

.. note::
   If you already have ``git`` and ``git-lfs`` available on your system, installing them here is unnecessary. ``mamba``,
   however, is pretty essential on most systems.

Then activate your base environment (or proceed with your development environment if that's the path you're taking):

.. code-block:: bash

   conda activate base

If you have never used ``git-lfs`` before, activate it one time only:

.. code-block:: bash

   git lfs install
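
Before cloning, it can be worth confirming that both tools are actually reachable from
your shell. The following is a small sketch under the assumption that ``git`` and ``git-lfs``
are exposed as executables on your ``PATH``; ``check_clone_tools`` is a hypothetical name,
not a command provided by the pipeline.

.. code-block:: bash

   # Hypothetical pre-flight check: succeed only if both git and git-lfs
   # can be found on PATH; otherwise name the first missing tool on stderr.
   check_clone_tools() {
       for tool in git git-lfs; do
           if ! command -v "$tool" >/dev/null 2>&1; then
               echo "missing: $tool" >&2
               return 1
           fi
       done
   }

If ``check_clone_tools`` reports a missing tool, revisit the ``conda install`` step above
before attempting the clone.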

Now, navigate somewhere on your system where you want to install a copy of the pipeline. Then clone a copy of the pipeline repository
(due to large reference files, this clone operation may take up to 30 seconds on some systems):

.. code-block:: bash

   git clone https://github.com/NCI-CGR/plco-analysis

.. warning::

   If you do not have ``git-lfs`` installed correctly, this clone operation will fail with messages regarding ``lfs`` not operating
   correctly.

.. warning::

   At the time of first writing of this pipeline, the large reference backend files for this pipeline are stored on GitHub, due
   to a lack of publicly-exposed alternatives. If sufficiently many people download these files in a short span of time, GitHub
   prevents further use of ``lfs`` managed files for the calendar month, since CGR is evidently using a free GitHub account.
   Among other possible solutions, the bandwidth limit is evidently refreshed monthly, so if you hit the cap, you can just wait.
   But also, please don't try to clone multiple copies of this pipeline; once you have a copy, you can make other copies on a local
   system with **cp -R**.

Now, navigate into the pipeline directory:

.. code-block:: bash

   cd plco-analysis

Build conda Environments
~~~~~~~~~~~~~~~~~~~~~~~~

Create the two `conda`_ environments used by the pipeline using the environment specification files included in the pipeline repository:

.. code-block:: bash

   mamba env create -f environment.yaml
   mamba env create -f environment-ldsc.yaml

.. note::

   The environment specified by ``environment.yaml`` will be named ``plco-analysis`` by default. This is a python3 environment and
   has many dependencies; depending on your system and the state of your environment cache (if you don't know what that is, don't worry
   about it), this can take tens of minutes to complete.

   The environment specified by ``environment-ldsc.yaml`` will be named ``plco-analysis-ldsc`` by default. This is a python2 environment,
   and is very small, governing exclusively the operation of the LD score regression software `ldsc`_. As python2 has reached end of life,
   this environment should never be expanded unless absolutely necessary, and ideally should be removed when `ldsc`_ achieves python3
   compatibility (lol).

.. warning::
   `conda`_ environments can be finicky. The ``plco-analysis`` pipeline in particular is somewhat delicate. It works (as of 30 January 2021).
   However, the way `conda`_ is structured, it may well break at a future date. I will record here some thoughts on debugging the environment
   if you end up getting errors from ``mamba env create``.

   * See the above discussion of `conda`_ channels. They all need to be present. It's possible having extra channels not listed may create issues,
     so if you happen to have more, try temporarily removing them and see if that fixes it. Also note that the *order* of channels matters in
     resolving conflicting versions of the same package between channels.
   * If you get errors about an environment already existing, it's possible you have an environment named ``plco-analysis`` or ``plco-analysis-ldsc``
     already present in your miniconda installation. That's bad lol. You can check your existing environments with ``conda info --envs`` (or
     simply list the contents of the directory ``/path/to/miniconda3/envs``). If indeed there is an existing environment, perhaps you've already
     done this process before? Otherwise, you can override the name of the environment you're creating now by instead using
     ``mamba env create -f environment.yaml -n different_name`` or by changing the entry in ``environment.yaml``.
   * If you're getting truly bizarre errors (conflicting paths in packages, missing package files, etc.), it's possible your cache has become
     corrupted. Don't even ask me how this happens. It can (I have seen it) create inscrutable errors that simply vanish when you clean up the cache.
     A traditional method for doing this is just deleting and reinstalling `conda`_ entirely; that's certainly a time-honored approach. But it's
     more aggressive than you may need. You can instead try running ``conda clean --all``, or simply recursively deleting the contents of
     ``/path/to/miniconda3/pkgs``.
   * I'll note here that specific errors regarding ``boost-cpp=1.70`` are more troublesome. The packages ``bolt-lmm``, ``r-saige``, and some
     not-yet-tracked-down dependencies of ``r-saige`` were built specifically against ``boost-cpp=1.70`` and block newer versions. I've thus
     built the ``plco-analysis`` internal packages `annotate_frequency`_, `combine_categorical_runs`_, `initialize_output_directories`_,
     `merge_files_for_globus`_, and `qsub_job_monitor`_ against ``boost-cpp=1.70`` as well. If this breaks in the future, or if/when ``boost-cpp=1.70``
     leaves `conda`_, there's going to be trouble. My apologies to Future Person who has to deal with this nonsense.
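
The "environment already exists" case above can be detected before running ``mamba env create``.
The following is a minimal sketch rather than part of the pipeline: ``conda_env_exists`` is a
hypothetical helper, and it assumes the environment name appears in the first column of the
``conda info --envs`` output.

.. code-block:: bash

   # Hypothetical helper: exit 0 if an environment with exactly this name is
   # listed by `conda info --envs`, and exit 1 otherwise.
   conda_env_exists() {
       conda info --envs |
           awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
   }

For example, ``conda_env_exists plco-analysis && echo "already present"`` prints a message only
when an environment of that name is found, in which case the ``-n different_name`` override
described above may be the easier way forward.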

.. _`annotate_frequency`: https://github.com/NCI-CGR/annotate_frequency
.. _`combine_categorical_runs`: https://github.com/NCI-CGR/combine_categorical_runs
.. _`initialize_output_directories`: https://github.com/NCI-CGR/initialize_output_directories
.. _`merge_files_for_globus`: https://github.com/NCI-CGR/merge_files_for_globus
.. _`qsub_job_monitor`: https://github.com/NCI-CGR/qsub_job_monitor

     
Environment Usage
~~~~~~~~~~~~~~~~~

I've said it above and I'll say it again here so that when this inevitably causes problems, you'll hopefully see it somewhere:

* activate ``plco-analysis-ldsc`` when you are running the **ldsc** pipeline in ``ldsc/Makefile`` with ``make ldsc``; or when
  you are running the **ldscore regression** pipeline in ``shared-makefiles/Makefiles.ldscores`` with ``make ldscores``:

  ``conda activate plco-analysis-ldsc``

* activate ``plco-analysis`` for **all other pipelines**:

  ``conda activate plco-analysis``
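
The rule above can be encoded in a small shell helper so the choice is not left to memory.
This is a sketch only: ``env_for_target`` is a hypothetical function, not something the
pipeline provides, and it assumes ``ldsc`` and ``ldscores`` are the only make targets that
need the python2 environment, as stated above.

.. code-block:: bash

   # Hypothetical helper: print the name of the conda environment that
   # should be active for a given make target.
   env_for_target() {
       case "$1" in
           ldsc|ldscores) echo "plco-analysis-ldsc" ;;  # python2, ldsc only
           *)             echo "plco-analysis" ;;       # everything else
       esac
   }

A possible use is ``conda activate "$(env_for_target ldscores)"`` immediately before
invoking ``make ldscores``.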


Updating Project Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At the time of this writing, project-wide configuration (primarily location of genotypes and phenotypes)
is controlled by variables in the file ``plco-analysis/Makefile.config``. The extent to which you need
to update variables in this file depends on where you're trying to install your copy of the pipeline,
and what directory permissions you have. Defaults suitable for ``cgems/ccad`` are provided. Each variable
also has a commented explanation in-file, so you should read those for more details or examples.

You will likely need to change the following:

* ``PROJECT_BASE_DIR``: installation path of your pipeline, including the directory ``plco-analysis``.
* ``CHIP_FREEZE_INPUT_DIR``: path to your PLCO chip freeze files. By default it expects ``PLCO_GSA.{bed,bim,fam}``,
  and equivalent files for OmniX, Omni25, Omni5, and Oncoarray.
* ``EXTERNAL_FILE_INPUT_DIR``: this is a site for future development pulling in external metadata files; for the moment,
  it is merely the presumed location of the cross-platform subject deduplication file, by default named
  ``PLCO_final_subject_list_Ancestry_UniqGenotypePlatform_04132020.txt``.
* ``FILTERED_IMPUTED_INPUT_DIR``: path to your PLCO imputation freeze files. This folder should contain post-Rsq-QC,
  non-redundant subjects files in `minimac4`_ format. For DUPS requests, the relevant folder is typically named
  something like ``Non_redundant_PLCO/Imputed/Post_Imputation_QCed/latest``.
* ``PHENOTYPE_FILENAME``: path to and name of phenotype file for the study. The format is described briefly
  in ``Makefile.config``: plain-text, tab-delimited, single header row. Note that the ``Atlas`` analysis configuration
  files expect augmented covariate columns describing certain possible batch effects as binary indicator variables.
  This functionality can be disabled by removing the relevant rows from the configuration files ``plco-analysis/config/*config.yaml``.
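
Once ``CHIP_FREEZE_INPUT_DIR`` is set, a quick check that the expected files are actually
there can save a failed run later. The sketch below is not part of the pipeline:
``check_plink_fileset`` is a hypothetical helper, and only the ``PLCO_GSA`` prefix comes from
the description above. File prefixes for the other platforms should be taken from your own
freeze and passed as the second argument.

.. code-block:: bash

   # Hypothetical pre-flight check: report which of <prefix>.bed/.bim/.fam
   # are missing from a directory. The prefix defaults to PLCO_GSA.
   check_plink_fileset() {
       dir=$1
       prefix=${2:-PLCO_GSA}
       status=0
       for ext in bed bim fam; do
           if [ ! -f "$dir/$prefix.$ext" ]; then
               echo "missing: $dir/$prefix.$ext" >&2
               status=1
           fi
       done
       return "$status"
   }

For example, ``check_plink_fileset /path/to/chip_freeze`` returns a non-zero status and
names each missing file on standard error; the path shown is a placeholder.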

.. _`minimac4`: https://genome.sph.umich.edu/wiki/Minimac4