Tutorial

The core idea behind neuropycon_cli is to use the wildcards provided by the UNIX shell to handle datasets with a nested folder organization, so that the target files can be processed in parallel and the results can be stored in a folder structure mirroring the original one.

In this section I will explain how to combine neuropycon_cli commands to join data-processing nodes into pipelines, and how to select input files from dataset folders efficiently using wildcards.

Commands, options and arguments

In each call to the neuropycon command line interface we build a processing pipeline, or workflow: we specify which processing nodes to run on the data, how to connect them, and how the resulting pipeline should be executed.

For each processing node there is a corresponding command in the command line interface. To specify how a particular node should be invoked, we pass options to these commands.

To illustrate this, let's start with the command that always appears first: neuropycon. This command has several options defining the general behaviour of the pipeline. For instance, the -p or --plugin option defines whether the workflow should run in a parallel or serial fashion. Another option, -n or --ncpu, defines how many parallel threads should be used. Thus, if we want our workflow to run in parallel on 4 cores, we should start with

$ neuropycon -p MultiProc -n 4 <...>

Probably the most useful option to start with is --help, which prints a help message with a full list of available commands and options and exits. This option can be used on any command to see what it does and which options can or must be set. Try, for instance,

$ neuropycon --help

or

$ neuropycon ica --help

Options to neuropycon commands always start with a single or double hyphen, depending on whether the short or long option name is used (e.g. the -n and --ncpu options for the neuropycon command). Most of the time options need not be set explicitly, since most of them have default values. If an option has no default value and the user does not set it explicitly, a prompt asking for it will appear.
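
For example, judging from the option names introduced above, the following two invocations should be equivalent, the long names being just more readable:

$ neuropycon -p MultiProc -n 4 <...>
$ neuropycon --plugin MultiProc --ncpu 4 <...>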

In contrast to options, arguments in neuropycon_cli are mandatory and are used only with the input command, which (don't look surprised) specifies paths to the input files. The input command accepts an unlimited number of arguments; each argument is passed as is (without hyphens), and arguments are separated by whitespace.

For example, imagine we had two files named sample1.fif and sample2.fif in the current working directory. You can create two empty files for testing purposes by running

$ touch sample1.fif sample2.fif

The simplest possible pipeline looks like this:

$ neuropycon input sample1.fif sample2.fif

  _   _                      _____        _____                 _.-'-'--._
 | \ | |                    |  __ \      / ____|               ,', ~'` ( .'`.
 |  \| | ___ _   _ _ __ ___ | |__) |   _| |     ___  _ __     ( ~'_ , .'(  >-)
 | . ` |/ _ \ | | | '__/ _ \|  ___/ | | | |    / _ \| '_ \   ( .-' (  `__.-<  )
 | |\  |  __/ |_| | | | (_) | |   | |_| | |___| (_) | | | |   ( `-..--'_   .-')
 |_| \_|\___|\__,_|_|  \___/|_|    \__, |\_____\___/|_| |_|    `(_( (-' `-'.-)
                                    __/ |                          `-.__.-'=/
                                   |___/                              `._`='
                                                                        \\
 180108-16:24:35,776 workflow INFO:
          Workflow my_workflow settings: ['check', 'execution', 'logging']
 180108-16:24:35,780 workflow INFO:
          Running in parallel.
 180108-16:24:35,781 workflow INFO:
          Executing: path_node.a1 ID: 0
 180108-16:24:35,783 workflow INFO:
          Executing node path_node.a1 in dir: /home/dmalt/my_workflow/_keys_sample2-fif/path_node
 180108-16:24:35,790 workflow INFO:
          [Job finished] jobname: path_node.a1 jobid: 0
 180108-16:24:35,791 workflow INFO:
          Executing: path_node.a0 ID: 1
 180108-16:24:35,793 workflow INFO:
          Executing node path_node.a0 in dir: /home/dmalt/my_workflow/_keys_sample1-fif/path_node
 180108-16:24:35,796 workflow INFO:
          [Job finished] jobname: path_node.a0 jobid: 1

This pipeline does pretty much nothing, since we didn't include any processing nodes. The only result is that a folder named my_workflow appears in the current working directory, containing some logging information and the subfolders _keys_sample1-fif and _keys_sample2-fif, named so as to uniquely correspond to the files being processed. If we were actually doing some useful work, the results would appear inside these _keys_* folders.
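
For instance, judging from the log paths above, listing the output folder after this run should show one such subfolder per input file (plus some auxiliary files, described later in this tutorial):

$ ls my_workflow
_keys_sample1-fif  _keys_sample2-fif  ...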

Command order

Now let’s add more nodes to the pipeline.

To do so, two rules concerning the order of the invoked commands must be followed:

  1. The input command must always appear last. This limitation allows specifying the input without knowing in advance the exact number of files to be processed, e.g. when we take all files matching a wildcard (see below).

  2. All nodes corresponding to commands other than input are linked in their order of appearance: the output of the previous node becomes the input of the next.

In practice this means that i) the order of commands matters, ii) the input and output of adjacent nodes must be compatible, and iii) no matter what, the input specification must always be the last thing in the chain of commands.

For example, suppose we have a file resting_state.fif with resting-state MEG data and we want to cut it into 1-second epochs and save these epochs in the numpy .npy format. The sequence of nodes for this task should look as follows:

[Figure nodes_order.png: input -> epoching -> ep2npy]

With the rules outlined above in mind, we can write the corresponding sequence of commands:

$ neuropycon epoch -l 1 ep2npy input resting_state.fif

The -l 1 option sets the length of the resulting epochs to 1 second.
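
For reference, the epoch and ep2npy nodes do roughly what the following MNE-Python/numpy snippet would do by hand. This is only a conceptual sketch, not the actual implementation of the nodes:

import mne
import numpy as np

# Read the raw resting-state recording
raw = mne.io.read_raw_fif('resting_state.fif', preload=True)

# Cut the continuous data into consecutive 1-second epochs
events = mne.make_fixed_length_events(raw, duration=1.0)
epochs = mne.Epochs(raw, events, tmin=0.0, tmax=1.0, baseline=None, preload=True)

# Convert the epoched data to a numpy array and save it in .npy format
np.save('ts_epochs.npy', epochs.get_data())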

Input specification

A common organization for MEEG datasets is to store the recorded data for a number of subjects in subfolders, with each subfolder's name encoding information such as the subject ID, recording condition, applied preprocessing steps, etc.

A sample folder layout might look like this:

NeuroPyConData
├── K0015
│   ├── ecg_eog_info.pickle
│   ├── ec_ipynb_report.html
│   ├── eo_ipynb_report.html
│   ├── K0015_rest_raw_tsss_mc_trans_ec.fif
│   └── K0015_rest_raw_tsss_mc_trans_eo.fif
├── K0025
│   ├── ecg_eog_info.pickle
│   ├── ec_ipynb_report.html
│   ├── eo_ipynb_report.html
│   ├── K0025_rest_raw_tsss_mc_trans_ec.fif
│   └── K0025_rest_raw_tsss_mc_trans_eo.fif
├── K0034
│   ├── ecg_eog_info.pickle
│   ├── ec_ipynb_report.html
│   ├── eo_ipynb_report.html
│   ├── K0034_rest_raw_tsss_mc_trans_ec.fif
│   └── K0034_rest_raw_tsss_mc_trans_eo.fif
├── R0008
│   ├── ecg_eog_info.pickle
│   ├── ec_ipynb_report.html
│   ├── eo_ipynb_report.html
│   ├── R0008_rest_raw_tsss_mc_trans_ec.fif
│   └── R0008_rest_raw_tsss_mc_trans_eo.fif
└── R0023
    ├── ecg_eog_info.pickle
    ├── ec_ipynb_report.html
    ├── eo_ipynb_report.html
    ├── R0023_rest_raw_tsss_mc_trans_ec.fif
    └── R0023_rest_raw_tsss_mc_trans_eo.fif

5 directories, 25 files

Here we have 5 subjects with subject IDs K0015, K0025, K0034, R0008 and R0023. For each of them there are two .fif files with MEG recordings: one for the eyes-closed and one for the eyes-open condition.

One way to process these data would be to deal with each file manually, which isn't very handy for big datasets.

A better idea is to make use of the regular organization of the subfolders. Dataset processing normally means applying the same set of operations to similar files inside the folder tree, so it would be nice if we could address these subgroups of similar files with some kind of matching pattern and apply a given processing pipeline to all of them. The UNIX shell provides a perfect tool for that.

Consider the following example:

If we wanted to list all the .fif files in the subdirectories, we could use a shell command like this:

$ ls ./NeuroPyConData/*/*.fif

Here the * symbol is a wildcard that matches any sequence of characters, including an empty one.

Now, if we wanted to list only those files that contain recordings for the 'eyes closed' condition, we could do this:

$ ls ./NeuroPyConData/*/*ec.fif

Note the 'ec' suffix at the end of our matching pattern.
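
Similarly, to list only the eyes-open recordings we would match the 'eo' suffix instead:

$ ls ./NeuroPyConData/*/*eo.fif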

Similar syntax can be used if we want to process the matching files instead of just listing them.

Let's proceed with the example from the previous section, where we epoched a single file and converted it to the .npy format. To apply the same processing steps to all eyes-closed files in our dataset, all we have to do is use the wildcard pattern we've already created:

$ neuropycon epoch -l 1 ep2npy input ./NeuroPyConData/*/*ec.fif

  _   _                      _____        _____                 _.-'-'--._
 | \ | |                    |  __ \      / ____|               ,', ~'` ( .'`.
 |  \| | ___ _   _ _ __ ___ | |__) |   _| |     ___  _ __     ( ~'_ , .'(  >-)
 | . ` |/ _ \ | | | '__/ _ \|  ___/ | | | |    / _ \| '_ \   ( .-' (  `__.-<  )
 | |\  |  __/ |_| | | | (_) | |   | |_| | |___| (_) | | | |   ( `-..--'_   .-')
 |_| \_|\___|\__,_|_|  \___/|_|    \__, |\_____\___/|_| |_|    `(_( (-' `-'.-)
                                    __/ |                          `-.__.-'=/
                                   |___/                              `._`='
                                                                        \\
INPUT ---> EPOCHING ---> EP2NPY
180109-18:28:11,91 workflow INFO:
         Workflow my_workflow settings: ['check', 'execution', 'logging']
180109-18:28:11,99 workflow INFO:
         Running in parallel.
180109-18:28:11,101 workflow INFO:
         Executing: path_node.a4 ID: 0
180109-18:28:11,103 workflow INFO:
...
...
...

As we can see from the output just below the NeuroPyCon logo, the created pipeline is indeed input —> epoching —> ep2npy.

Generated output

After running the line above, a newly created folder named my_workflow will appear in the current working directory.

Hint

We can change the default name and location of the output folder (./my_workflow) by passing the --save-path (-s) and --workflow-name (-w) options to the neuropycon command, like this:

$ neuropycon -s ~/ -w npy_convert_workflow epoch -l 1 ep2npy input ./NeuroPyConData/*/*ec.fif

Let's explore the contents of this directory:

$ ls my_workflow
d3.js
graph1.json
graph.json
index.html
_keys_K0015__K0015_rest_raw_tsss_mc_trans_ec-fif
_keys_K0025__K0025_rest_raw_tsss_mc_trans_ec-fif
_keys_K0034__K0034_rest_raw_tsss_mc_trans_ec-fif
_keys_R0008__R0008_rest_raw_tsss_mc_trans_ec-fif
_keys_R0023__R0023_rest_raw_tsss_mc_trans_ec-fif

We see five folders whose names start with _keys_<SubjID>__, one for each of the files processed. Let's look inside one of them:

_keys_K0015__K0015_rest_raw_tsss_mc_trans_ec-fif
├── ep2npy
│   ├── _0x3ea00ae5c1c4233082317a7027820486.json
│   ├── _inputs.pklz
│   ├── _node.pklz
│   ├── _report
│   │   └── report.rst
│   ├── result_ep2npy.pklz
│   └── ts_epochs.npy
├── epoching
│   ├── _0x78142f3fd848accc85737102606876da.json
│   ├── _inputs.pklz
│   ├── K0015_rest_raw_tsss_mc_trans_ec-epo.fif
│   ├── _node.pklz
│   ├── _report
│   │   └── report.rst
│   └── result_epoching.pklz
└── path_node
    ├── _0x32a93e6b5a92f59c1ebfc1e1347577b4.json
    ├── _inputs.pklz
    ├── _node.pklz
    ├── _report
    │   └── report.rst
    └── result_path_node.pklz

Thus, each of the processing nodes created a folder in the output directory, except for the input node (instead there is a path_node folder, which is required for the inner machinery of the CLI to work). Inside the ep2npy folder we find the ts_epochs.npy file with the numpy-converted epoched data. The epoching folder contains K0015_rest_raw_tsss_mc_trans_ec-epo.fif, a file with the MNE-Python Epochs data structure.

These are the main outputs of the pipeline we just applied to our data. They can be used for further analyses, e.g. as input for another neuropycon chain of commands.
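
For example, the two outputs can be loaded for a quick look in Python. This is a minimal sketch; the paths are taken from the listing above, and the array shape is an assumption about how ep2npy stores the epochs:

import mne
import numpy as np

subj_dir = 'my_workflow/_keys_K0015__K0015_rest_raw_tsss_mc_trans_ec-fif'

# Epoched data converted to numpy, presumably shaped (n_epochs, n_channels, n_times)
epochs_array = np.load(subj_dir + '/ep2npy/ts_epochs.npy')

# MNE-Python Epochs object saved by the epoching node
epochs = mne.read_epochs(subj_dir + '/epoching/K0015_rest_raw_tsss_mc_trans_ec-epo.fif')

print(epochs_array.shape)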

The rest of the files in these folders contain caching and logging information for the executed nodes. You can read about them in the nipype documentation. A hint about caching and how it can be useful is given in the next section.
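
If you are curious, the result_*.pklz files can be opened with nipype's own helper and inspected. A sketch, assuming nipype is installed and the pickled object exposes its outputs in the usual nipype way:

from nipype.utils.filemanip import loadpkl

# Load the pickled result of the ep2npy node and print what it produced
result = loadpkl('my_workflow/_keys_K0015__K0015_rest_raw_tsss_mc_trans_ec-fif'
                 '/ep2npy/result_ep2npy.pklz')
print(result.outputs)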

Hotstart

The nipype framework, around which neuropycon_cli is built, is smart about re-running computations. As already pointed out, some caching information is stored inside the output directory. This means that if we rerun the pipeline with the same input, nipype uses this information together with the output files, understands that certain nodes have already been executed, and does not run them again.

Now imagine that after converting the data to numpy we realize that it would also be useful to look at some connectivity measure (e.g. the imaginary part of coherency) on these data. We could just take the ready-made .npy files from the ep2npy subfolders and supply them as input to another neuropycon ... chain, which would create yet another (or overwrite the existing) output folder with the connectivity matrices inside.

This is a bit messy, though. A cleaner way is to make use of the caching capabilities of nipype. We can simply augment our previous command chain with a new conn command, and it will automatically pick up the outputs of the precomputed nodes and use them for the connectivity computation. The result looks as follows:

$ neuropycon epoch -l 1 ep2npy conn -b 8 12 -s 1000 input NeuroPyConData/*/*ec.fif

Looking again inside the _keys_K0015__K0015_rest_raw_tsss_mc_trans_ec-fif folder, we see that a new subfolder with the connectivity matrix in .npy format has indeed appeared (judging by its name, the -b 8 12 option set the frequency band to 8-12 Hz; run neuropycon conn --help for the meaning of all options):

_keys_K0015__K0015_rest_raw_tsss_mc_trans_ec-fif
├── _con_method_imcoh_freq_band_8.0.12.0
│   └── sp_conn
│       ├── _0x289b2f9a3344b8cc2f8cb32978d83319.json
│       ├── conmat_0_imcoh.npy
│       ├── _inputs.pklz
│       ├── _node.pklz
│       ├── _report
│       │   └── report.rst
│       └── result_sp_conn.pklz
├── ep2npy
│   ├── _0xc93cf8c735df05d156d881d7026c399c.json
│   ├── _inputs.pklz
│   ├── _node.pklz
│   ├── _report
│   │   └── report.rst
│   ├── result_ep2npy.pklz
│   └── ts_epochs.npy
├── epoching
│   ├── _0x78142f3fd848accc85737102606876da.json
│   ├── _inputs.pklz
│   ├── K0015_rest_raw_tsss_mc_trans_ec-epo.fif
│   ├── _node.pklz
│   ├── _report
│   │   └── report.rst
│   └── result_epoching.pklz
└── path_node
    ├── _0x32a93e6b5a92f59c1ebfc1e1347577b4.json
    ├── _inputs.pklz
    ├── _node.pklz
    ├── _report
    │   └── report.rst
    └── result_path_node.pklz
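
As before, the resulting connectivity matrix can be loaded with numpy for further analysis. A minimal sketch using the path from the listing above:

import numpy as np

# Imaginary-coherence connectivity matrix computed for the 8-12 Hz band
conmat = np.load('my_workflow/_keys_K0015__K0015_rest_raw_tsss_mc_trans_ec-fif/'
                 '_con_method_imcoh_freq_band_8.0.12.0/sp_conn/conmat_0_imcoh.npy')
print(conmat.shape)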