The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

App::Pipeline::Simple - Simple workflow manager

VERSION

version 0.9_1

SYNOPSIS

  # called from a script

DESCRIPTION

Workflow management in computational (biological) sciences is a hard problem. This module is based on assumption that UNIX pipe and redirect system is closest to optimal solution with these improvements:

* Enforce the storing of all intermediary steps in a file.

  This is for clarity, accountability and to enable arbitrarily big
  data sets. Pipeline can contain independent steps that remove
  intermediate files if so required.

* Naming of each step.

  This is to make it possible to stop, restart, and restart at any
  intermediate step after adjusting pipeline parameters.

* detailed logging

  To keep track of all runs of the pipeline.

A pipeline is a collection of steps that are functionally equivalent to a pipeline. In other words, execution of a pipeline equals to execution of a each ordered step within the pipeline. From that derives that the pipeline object model needs only one object that can recursively represent the whole pipeline as well as individual steps.

METHODS

new

Constructor

verbose

Control logging output. Defaults to 0.

Setting verbose sets the logging level:

  verbose   =  -1    0     1
  log level =>  WARN INFO  DEBUG

config

Read in the named config file.

id

ID of the step

description

Verbose description of the step

name

Name of the program that will be executed

path

Path to the directory where the program resides. Can be used if the program is not on path. Will be prepended to the name.

next_id

ID of the next step in execution. It typically depends on the output of this step.

input

Value read in interactively from command line

itype

Type of input for the command line value

start

The ID of the step to start the execution

stop

The ID of the step to stop the execution

dir

Working directory where all files are stored.

step

Returns the step by its ID.

each_next

Return an array of steps after this one.

each_step

Return all steps.

run

Run this step and call the one(s).

debug

Run in debug mode and test the configuration file

logger

Reference to the internal Log::Logger4perl object

render

Transcribe the step into a UNIX command line string ready for display or execution.

stringify

Analyze the configuration without executing it.

graphviz

Create a GraphViz dot file from the config.

RUNNING

App::Pipeline::Simple comes with a wrapper spipe command line program. Do

   spipe -h

to see instructions on how to run it.

Example run:

  spipe -config t/data/string_manipulation.xml -d /tmp/test

reads instructions from the config file and writes all information to the project directory.

The debug option will parse the config file, print out the command line equivalents of all commands and print out warnings of problems encountered in the file:

  spipe -config t/data/string_manipulation.xml -d /tmp/test

An other tool integrated in the system is visualization of the execution graph. It is done with the help of GraphViz perl interface module that will need to be installed from CPAN.

The following command line creates a Graphviz dot file, converts it into an image file and opens it with the Imagemagic display program:

  spipe -config t/data/string_manipulation.xml -graph > \
    /tmp/p.dot; dot -Tpng /tmp/p.dot | display

CONFIGURATION

The default configuration is written in YAML, a simple and human readable language that can be parsed in many languages cleanly into data structures.

The YAML file contains four top level keys for the hash that the file will be read into: 1) name to give the pipeline a short name, 2) version to indicate the version number, 3) description to give a more verbose explanation what the pipeline does, and 4) steps listing pipeline steps.

  ---
  description: "Example of a pipeline"
  name: String Manipulation
  version: '0.4'
  steps:

Each step needs an id that is unique within the pipeline and a name that identifies an executable somewhere in the system path. Alternatively, you can give the path leading to the executable file with key path. The name will be added to the path, padded with a suitable separator character, if needed.

Arguments to the executable are given individually within arg tags. They are named with the key attribute. A single hyphen is added in front of the arguments when they are executed. If two hyphens are needed, just add one the file.

Arguments can exist without values, or they can be given with attribute value.

  s3:
    name: cat
    args:
      in:
        type: redir
        value: s1.txt
      "n": {}
      out:
        type: redir
        value: s3_mod.txt
    next:
      - s4

There are two special keys in and out that need to have a further type defined. The IO type can get two kind of values:

unnamed

that indicates that the argument is an unnamed argument to the executable.

redir

will be interpreted as UNIX redirection character '&lt' or '&gt' depending on the context.

The values file and dir are not needed by the pipeline but are useful to include to make the pipeline easier to read for humans. The interpretation of these arguments is done by the program executable called by the step.

Finally, the step tag can contain the next key that gives an array of IDs for the next steps in the execution. Typically, these steps depends on the previous step for input.

Practices that are completely bonkers, like spaces in file names, are not supported.

Advanced features

The pipeline does not have to be linear; it can contain branches. For example, the pipeline can have several start points with different kinds of input: file and string.

Sometimes it is useful to be run the same pipeline with different parameter. The starting point of execution can take a value from the command line. Leave the value for the given argument blank in the configuration file and give it from the command line. Matching of values is done by matching the type string.

  spipe -conf input_demo.yml --input=ABC --itype=str

  ---
  description: "Demonstrate input from command line"
  name: input.yml
  version: '0.1'
  steps:
    s1:
      name: echo
      args:
        in:
          type: unnamed
          value:
        out:
          type: redir
          value: s1_string.txt

The empty value will be filled in from the command line into the config.yml stored in the project directory. Also, the config file looks slightly different since the steps are written out as App::Pipeline::Simple objects. Functionally there is no difference.

AUTHOR

Heikki Lehvaslaiho, KAUST (King Abdullah University of Science and Technology).

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Heikki Lehvaslaiho.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.