Wapiti Manual

This page is an automatic conversion of the manual page that can be found in the doc/ directory of the package.

Synopsis

       wapiti mode [options] [input] [output]

Description

   Overview
       Wapiti is a program for training and using discriminative sequence
       labelling models with various algorithms using an elastic-net penalty.
       It currently implements maxent models, maximum entropy Markov models
       (MEMM), and linear-chain conditional random fields (CRF).

       It can work in different modes depending on the first argument you give
       to it: training a model, labeling new data, or dumping a model in
       readable form.

       The mode switch can be either "train", "label", or "dump". Only a
       prefix long enough to differentiate them is really needed.

   Options
       -h | --help
              Output a short help message.

       --version
              Output version and revision numbers.


   Train mode
       --me   Activate the pure maxent mode, see below for more details.

       -T | --type <string>
              Select the type of model to train. Can be either "maxent",
              "memm", or "crf", or "list" to get a list of supported model
              types. By default "crf" models are used.

       -a | --algo <string>
              Select the algorithm used to train the model, specify "list" for
              a list of available algorithms. The first algorithm in this list
              is used as default.

       -p | --pattern <file>
              Specify  the  file  containing  the patterns for extracting fea-
              tures. The format of the pattern file is detailed below.

       -m | --model <file>
              Specify a model file to load and to train again. This allows you
              either to continue an interrupted training or to use an old
              model as a starting point for a new training. Beware that no new
              labels can be inserted in the model. As the training parameters
              are not saved in the model file, you have to specify them again,
              or specify new ones if, for example, you want to continue
              training with another algorithm or a different penalty.

       -d | --devel <file>
              Specify the data file to load as a development set. At the end
              of each iteration the error rate is computed on this dataset
              and displayed in the progress line. If enabled, this value is
              used to check convergence and stop training. If no development
              set is specified, the training set is used instead, but beware
              that using the training set to choose the stopping point is bad
              practice.

       --rstate <file>
              Restore an optimizer state from the given file and restart
              optimization from this point. Only available for L-BFGS and
              R-PROP, but the saved states are compatible between MEMM and CRF
              models. This allows you to keep more information about the
              optimal point found while training an MEMM to bootstrap a CRF
              model, or to restart an optimization with adjusted parameters.

       --sstate <file>
              Save  the  full optimizer state at the end of optimization so it
              can be restored later with --rstate.

       -c | --compact
              Enable model compaction at the end of the  training.  This  will
              remove  all  inactive  observations from the model, leading to a
              much smaller model when an l1-penalty  is  used.  See  the  note
              below for more details.

       -t | --nthread <integer>
              Set the number of threads to use. This can drastically improve
              performance but is very algorithm dependent. The best value is
              usually the number of cores you have. Default is 1.

       -j | --jobsize <integer>
              Set the size of the job a thread will get each time it has
              nothing more to do. This is the number of sequences to process
              and defaults to 64. Increasing it reduces communication overhead
              but can lead to bad balancing between threads; reducing it
              increases communication overhead but can balance work better
              between threads in case of small datasets.
	      
       -s | --sparse
              Enable the computation of the forward/backward in sparse mode.

       -i | --maxiter <integer>
              Defines the maximum number of iterations done by the training
              algorithm. A value of 0 means unlimited and training will
              continue until another stopping criterion is reached. The
              default is unlimited, so the algorithm stops using the other
              criteria.

       -1 | --rho1 <float>
              Defines the L1-component of the elastic-net penalty. Increasing
              this value leads to smaller models and can improve training time
              but will probably reduce performance. Setting this value to 0
              results in a classical L2-only penalty. If the algorithm can
              optimize the L1-penalty, the default value is 0.5, else the
              default is 0.

       -2 | --rho2 <float>
              Specifies the L2-component of the elastic-net penalty. Setting
              this value to 0 leads to a simple L1 regularization. While
              allowed, this is discouraged as it can lead to numerical
              instability. The default value is 0.00001.

       -o | --objwin <integer>
              Set  the window size for the objective value stopping criterion,
              see below for more details. Default value is 5.

       -w | --stopwin <integer>
              Set the window size for the devel stopping criterion, see  below
              for more details. Default value is 5.

       -e | --stopeps <float>
              Set  the  size of the interval for stopping criterion, see below
              for more details. Default value is 0.02%.

       --clip Enables gradient clipping for L-BFGS. This sets to 0 the
              gradient components whose corresponding feature weights are 0,
              preventing the trainer from moving these weights away from 0.
              This is useful if you have a sparse model and want to refine it
              with an l2-only regularization without losing the sparsity.

       --histsz <integer>
              Specifies the size of the history to keep in L-BFGS to
              approximate the inverse of the Hessian. Increasing this value
              leads to a better approximation, so generally fewer iterations,
              but increases memory usage. The default is 5.

       --maxls <integer>
              Set the maximum number of line search steps to perform in
              L-BFGS before giving up.

       --eta0 <float>
              Set the learning rate for SGD trainer.

       --alpha <float>
              Set the alpha value of the exponential decay in SGD trainer.

       --kappa <float>
              Set the kappa parameter for the BCD trainer. Default is 1.5;
              increasing this value makes the algorithm more stable but also
              slower. Try increasing it if you have numerical instability.

       --stpmin <float>
       --stpmax <float>
              Set the minimum/maximum step size allowed for the RPROP trainer.
              Defaults are 1e-8 and 50.0; these seem to be good values to get
              numerical stability with double precision computations.

       --stpinc <float>
       --stpdec <float>
              Set the increment/decrement factor used to update the steps in
              the RPROP trainer. Default values are 1.2 and 0.5. Increment
              must be greater than 1.0 and decrement must be between 0 and
              1.0.

       --cutoff
              Select the alternate projection scheme for RPROP with
              l1-regularization. This can lead to a better model depending on
              your task.


   Label mode
       --me   Activate the pure maxent mode, see below for more details.

       -m | --model <file>
              Specifies a model file to load and to  use  for  labeling.  This
              switch is mandatory.

       -l | --label
              With this switch, Wapiti will only output the predicted labels.
              Without it, Wapiti outputs the full data with an additional
              column containing the predicted labels.

       -c | --check
              Assume the data to be labeled are already labeled, so that
              during the labeling process Wapiti can check its own results
              and display the error rates. This doesn't affect the labeling
              process and output data will remain exactly the same. However,
              progress will be more verbose and informative: at the end of the
              process, the precision, recall, and f-measure are displayed for
              each label. If you ask for N-best output, statistics are
              computed only on the best sequence.

       -s | --score
              Output a line with the score before the data. The line starts
              with a '#' symbol followed by the output number in the n-best
              list and the score of the sequence of labels. Also output a
              score for each label of the sequence. Beware that, if you use
              Viterbi labelling, this is a raw score that is not really
              meaningful: it is not normalized, so it cannot be interpreted as
              a probability. To get normalized scores, you must use posterior
              decoding.

       -p | --post
              Use posterior decoding instead of the classical Viterbi
              decoding. This generally produces better results at the cost of
              slower decoding. This also allows Wapiti to output normalized
              scores for sequences and labels.

       -n | --nbest <int>
              Output  the  N best sequences of labels instead of just the best
              one. The N sequences of labels are output in order in the output
              file.

       --force
              Enable   forced  decoding  for  labelling  data  already  partly
              labelled. See below for more details.


   Dump mode
       For the moment, there is no switch specific to this mode.


Usage

       Wapiti can work in different modes. The mode determines which switches
       are available (see above) and what Wapiti expects in the input and
       output files. In train mode, Wapiti expects a training dataset as input
       and outputs the trained model. In label mode, it expects data to label
       as input and will output the same data labeled by the model. Finally,
       in dump mode it expects a model as input and outputs it in a readable
       form.

       In train mode Wapiti will load a previous model if one is given, read
       the training dataset and the development set if there is one, and train
       the model. Progress information is output during all these steps.
       Training stops when the model is fully optimized, when one of the
       stopping criteria is reached, or when the user sends a TERM signal (see
       below).

       In label mode, progress is not very  informative  except  if  you  give
       already labeled data. In this case, error rates are displayed.
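
       As an illustration of the statistics mentioned for already labeled
       data, the sketch below computes a token error rate and per-label
       precision, recall, and f-measure from two label lists. It is not
       Wapiti's code; the function names and the flat list representation are
       assumptions made for the example.

```python
from collections import Counter


def token_error_rate(gold, pred):
    """Fraction of positions where the predicted label is wrong."""
    wrong = sum(1 for g, p in zip(gold, pred) if g != p)
    return wrong / len(gold) if gold else 0.0


def label_scores(gold, pred):
    """Return {label: (precision, recall, f_measure)}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is not the reference label
            fn[g] += 1  # g was expected but was missed
    scores = {}
    for lbl in sorted(set(gold) | set(pred)):
        n_pred = tp[lbl] + fp[lbl]
        n_gold = tp[lbl] + fn[lbl]
        prec = tp[lbl] / n_pred if n_pred else 0.0
        rec = tp[lbl] / n_gold if n_gold else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lbl] = (prec, rec, f1)
    return scores


gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
print(token_error_rate(gold, pred))
print(label_scores(gold, pred))
```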


Stopping criterion

       There are various ways for training to stop, depending on the command
       line switches provided.

       The simplest criterion is the iteration count. By default, the
       algorithm will iterate forever, but you can specify a maximum number of
       iterations with --maxiter.

       Finding the exact optimum is generally not needed to get the best
       model. There is an infinity of points around the optimum which lead to
       almost exactly the same model and are as good as the best one. The
       error window criterion checks for this by looking at the error rate of
       the model over the development set and stops training when it is stable
       enough. To do this, the error rates of the last few iterations are kept
       and when the difference between the extreme values falls below a given
       value, training is stopped. (If no devel set is given, the error rates
       are computed over the training data, but this is bad practice.)
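
       The error window criterion can be sketched as follows. This is a
       minimal illustration, not Wapiti's implementation: the class name is
       hypothetical, and it assumes that error rates are expressed in percent
       and that the window size and threshold play the roles of --stopwin and
       --stopeps.

```python
from collections import deque


class ErrorWindow:
    """Stop when the last `size` error rates differ by less than `eps`."""

    def __init__(self, size=5, eps=0.02):
        self.errors = deque(maxlen=size)  # old values drop out automatically
        self.eps = eps

    def update(self, error_rate):
        """Record one iteration's error rate; return True when stable."""
        self.errors.append(error_rate)
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough history yet
        return max(self.errors) - min(self.errors) < self.eps


window = ErrorWindow(size=5, eps=0.02)
for iteration, err in enumerate([10.0, 5.0, 4.0, 4.0, 4.01, 4.0, 4.005], 1):
    if window.update(err):
        print("stable after iteration", iteration)
        break
```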

       For algorithms which provide the objective function value at each
       iteration, training also stops when this value has not changed
       significantly over the past few iterations. This window size is
       controlled by the --objwin parameter.

       Each algorithm can also provide its own stopping system, like l-bfgs
       which stops when numerical precision prevents further progress.

       The last criterion is the user. By sending a TERM signal to Wapiti you
       instruct it to stop training as soon as possible, discarding the last
       computation, in order to finish training and save the model. If you
       don't care about the model, sending a second TERM signal will make the
       program exit immediately without saving anything. (On most systems,
       this signal can be sent with CTRL-C.)


Regularization

       Wapiti uses the elastic-net penalty of the form

       rho_1 * ||theta||_1 + rho_2 / 2.0 * ||theta||_2^2

       This means that you can choose to use the full elastic-net or the more
       classical L1 or L2 penalty. To fall back to one of these, you just have
       to set rho2 or rho1, respectively, to 0.0.

       Some algorithms work only with one or the other component. In this
       case, the value of the other is simply ignored. See the description of
       each algorithm for more details.
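
       A direct transcription of the penalty formula can make the role of the
       two parameters concrete. The function name and the plain list of
       weights are assumptions made for this sketch.

```python
def elastic_net_penalty(theta, rho1, rho2):
    """Compute rho1 * ||theta||_1 + rho2 / 2.0 * ||theta||_2^2."""
    l1_norm = sum(abs(w) for w in theta)
    l2_norm_squared = sum(w * w for w in theta)
    return rho1 * l1_norm + rho2 / 2.0 * l2_norm_squared


theta = [1.0, -2.0, 0.0]
print(elastic_net_penalty(theta, rho1=0.5, rho2=2.0))  # full elastic-net
print(elastic_net_penalty(theta, rho1=0.5, rho2=0.0))  # L1 penalty only
print(elastic_net_penalty(theta, rho1=0.0, rho2=2.0))  # L2 penalty only
```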


Algorithms

       l-bfgs This is the classical quasi-Newton optimization algorithm with
       limited memory. It works by approximating the inverse of the Hessian
       using a history of the previous values of the feature weights and
       gradient.

       This algorithm requires the gradient to  be  fully  computable  at  any
       point  so it cannot do L1 regularization. In this case the OWL-QN vari-
       ant is used instead which can handle the full elastic-net penalty.

       It needs to keep 5 + M * 2 vectors, where M is the history size (see
       --histsz), each as long as the number of features. Each component of
       these vectors is a double precision floating point value. So, for
       training a model with F features, you need 8 * F * (5 + M * 2) bytes of
       memory. If the OWL-QN variant is used, one additional vector is needed
       to keep the pseudo-gradient.
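
       The memory formula above is easy to evaluate for a given model size.
       The helper below is only a back-of-the-envelope sketch; its name and
       arguments are hypothetical, and it counts only the optimizer vectors
       (8 bytes per double), not the data or the model itself.

```python
def lbfgs_memory_bytes(n_features, history=5, owlqn=False):
    """Bytes needed for the 5 + 2 * M optimizer vectors of doubles."""
    vectors = 5 + 2 * history
    if owlqn:
        vectors += 1  # extra vector for the pseudo-gradient
    return 8 * n_features * vectors


for f in (1_000_000, 100_000_000):
    plain = lbfgs_memory_bytes(f)
    owl = lbfgs_memory_bytes(f, owlqn=True)
    print(f, "features:", plain / 1e9, "GB,", owl / 1e9, "GB with OWL-QN")
```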

       sgd-l1 This is the stochastic gradient descent for L1-regularized
       models. It works by computing the gradient only on a single sequence at
       a time and making a small step in this direction.

       The SGD algorithm will find very quickly an acceptable solution for the
       model,  but  will take a longer time to find the optimal one, and there
       is no guarantee it will find it.

       The memory requirements are lighter than for quasi-Newton methods, as
       it requires only 3 vectors of the size of the feature set.

       bcd  This is the blockwise coordinate descent with elastic-net penalty.
       This algorithm is best suited for very large label sets and sparse fea-
       ture  sets.  It  optimizes  the  model one observation at a time, going
       through all observations at each iteration.  It  usually  converges  in
       only a few dozen iterations (rarely more than 30).

       This is the most memory-economical algorithm, as it only needs to keep
       the feature weight vector in memory. In this algorithm, using complex
       bigram features comes almost for free.

       This  flexibility  has  a  price: don't use it if your features are not
       sparse, as it will be very slow in this case.

       NOTE: This algorithm is available only for training CRF models.

       rprop (rprop+ / rprop-) This algorithm uses the gradient only to find a
       good search direction, not for choosing the step to make in that
       direction. It can be very effective on some datasets.

       Compared to quasi-Newton methods, rprop reaches the neighborhood of the
       optimum more quickly, but the lack of second order information and the
       restricted use of the first order information make the fine tuning
       slower.

       Memory requirements are quite light as it requires 4 vectors of the
       size of the feature set.

       rprop- is a variant of rprop+ without backtracking. Its performance
       compared to rprop+ is task dependent, and it requires one less vector,
       so for very large models it can be better.
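
       To show how the gradient sign and the per-weight step interact, here is
       a sketch of the textbook Rprop- update (the variant without
       backtracking). It is an assumption-laden illustration rather than
       Wapiti's code: the function name is hypothetical, regularization is
       ignored, and the keyword arguments simply mirror the --stpinc,
       --stpdec, --stpmin and --stpmax defaults documented above.

```python
def rprop_minus_update(weights, grad, prev_grad, steps,
                       stpinc=1.2, stpdec=0.5, stpmin=1e-8, stpmax=50.0):
    """Return (new_weights, new_steps) after one textbook Rprop- update."""
    new_weights, new_steps = [], []
    for w, g, pg, s in zip(weights, grad, prev_grad, steps):
        if g * pg > 0:    # same sign as last iteration: grow the step
            s = min(s * stpinc, stpmax)
        elif g * pg < 0:  # sign flipped, the optimum was overshot: shrink
            s = max(s * stpdec, stpmin)
        # Only the sign of the gradient is used to choose the direction.
        if g > 0:
            w -= s
        elif g < 0:
            w += s
        new_weights.append(w)
        new_steps.append(s)
    return new_weights, new_steps


# Usage: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, steps, prev = [0.0], [0.1], [0.0]
for _ in range(50):
    grad = [2.0 * (w[0] - 3.0)]
    w, steps = rprop_minus_update(w, grad, prev, steps)
    prev = grad
print(w[0])
```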


Multi-threading

       Wapiti can efficiently use multiple threads to speed up the gradient
       computation for the l-bfgs and rprop algorithms. Using the --nthread
       parameter, you can specify the number of threads to use.

       Beware that if the atomic updates were disabled at compilation time,
       each thread after the first will cost you an extra vector of the size
       of the feature set. This implies that for large models, multiple
       threads can cost you a lot of memory. Atomic updates are supported at
       least with the GCC and Clang compilers. They may also work if your
       compiler supports the same intrinsic atomic operations or if you
       reimplement the atm_inc function in gradient.c for it.

       The multi-threading code can be disabled at compilation time if your
       platform is not supported. See wapiti.h for more details.


Datafiles

       Data files are plain text files containing sequences separated by empty
       lines. Each sequence is a set of non-empty lines where each line
       represents one position in the sequence.

       Each line is made of tokens separated either by spaces or by
       tabulations. All tokens are observations available for training or
       labeling, except the last one in training mode, which is assumed to be
       the label to predict.

       If no patterns are specified, each token is interpreted directly as an
       observation and is combined with the label in order to generate
       features. If patterns are specified, they are used in combination with
       the tokens to generate the features. The observations must be prefixed
       by either 'u', 'b' or '*' in order to specify whether they are unigram,
       bigram or both.
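
       The file layout described above can be read with a few lines of code.
       This is a minimal sketch assuming whitespace-separated tokens; the
       function names are hypothetical and it is not the loader used by
       Wapiti.

```python
def read_sequences(lines):
    """Group non-empty lines into sequences; empty lines are separators."""
    sequences, current = [], []
    for line in lines:
        tokens = line.split()  # splits on spaces and tabulations
        if tokens:
            current.append(tokens)
        elif current:
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences


def split_label(position):
    """In training data, the last token of a line is the label."""
    return position[:-1], position[-1]


text = "a1 b1 X\na2\tb2 Y\n\n\na3 b3 X\n"
for seq in read_sequences(text.splitlines()):
    print([split_label(pos) for pos in seq])
```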


Patterns

       Pattern files are almost compatible with CRF++ templates. Empty lines
       as well as all characters appearing after a '#' are discarded. The
       remaining lines are interpreted as patterns.

       The first char must be either 'u', 'b' or '*' (in upper or lower case).
       This indicates the type of features that must be generated from this
       pattern: respectively unigram, bigram, or both.

       The rest of the pattern is used to build an observation string. Each
       marker of the kind "%x[off,col]" is replaced by the token in the column
       "col" from the data file at the current position plus the offset "off".
       The "off" value can be prefixed with an "@" to make it an absolute
       position, counted from the start of the sequence if it is positive and
       from the end if it is negative. An offset of "@1" will refer to the
       first line of data and "@-1" to the last line.

       For example, if your data is
           a1    b1    c1
           a2    b2    c2
           a3    b3    c3
       The pattern "u:%x[-1,0]/%x[+1,2]" applied at position 2 in the sequence
       will produce the observation "u:a1/c3".

       The sequence is extended in front and back with special tokens like
       "_X-1" or "_X+2" in order to apply markers with any offset at all
       positions in the sequence.
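
       The %x substitution described above can be sketched as follows. This is
       an illustration only: the function name is hypothetical, positions are
       0-based in the code (so "position 2" in the example is index 1), and
       the exact numbering of the padding tokens is an assumption based on the
       "_X-1" and "_X+2" examples.

```python
import re

# Matches markers such as %x[-1,0], %x[+1,2] or %x[@-1,0].
MARKER = re.compile(r"%x\[(@?)([+-]?\d+),(\d+)\]")


def expand(pattern, seq, pos):
    """Build the observation string for `pattern` at 0-based index `pos`."""
    def token(match):
        absolute = match.group(1) == "@"
        off, col = int(match.group(2)), int(match.group(3))
        if absolute:
            # "@1" is the first line, "@-1" the last one.
            row = off - 1 if off > 0 else len(seq) + off
        else:
            row = pos + off
        if row < 0:  # before the start of the sequence
            return "_X%d" % row
        if row >= len(seq):  # after the end of the sequence
            return "_X+%d" % (row - len(seq) + 1)
        return seq[row][col]
    return MARKER.sub(token, pattern)


data = [["a1", "b1", "c1"],
        ["a2", "b2", "c2"],
        ["a3", "b3", "c3"]]
print(expand("u:%x[-1,0]/%x[+1,2]", data, 1))
```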

       Wapiti also supports a simple kind of matching, very useful, for
       example, in natural language processing. This is done using two other
       commands of the form %m[off,col,"regexp"] and %t[off,col,"regexp"].
       Both commands get the data the same way as the %x command, using the
       "col" and "off" values, but apply a regular expression to it before
       substituting it. The %t command replaces the data by "true" or "false"
       depending on whether the expression matches the data or not. The %m
       command replaces the data by the substring matched by the expression.

       The regular expression engine implements just a subset of the classical
       regular expressions found in Unix systems, but it is generally enough
       for most tasks. The recognized subset is quite simple. First, for
       matching characters:
            .  -> match any character
            \x -> match a character class (in uppercase, match the complement)
                    \d : digit       \a : alpha      \w : alpha + digit
                    \l : lowercase   \u : uppercase  \p : punctuation
                    \s : space
                  or escape a character
            x  -> any other character matches itself
       And the constructs:
            ^  -> at the beginning of the regexp, anchor it at start of string
            $  -> at the end of regexp, anchor it at end of string
            *  -> match any number of repetitions of the previous character
            ?  -> optionally match the previous character
       So, for example, the regexp "^.?.?.?.?" will match a prefix of at most
       four characters and "^\u*$" will match only on data composed solely of
       uppercase characters.

       For the commands, %x, %t, and %m, if  the  command  name  is  given  in
       uppercase,  the  case  is removed from the string before being added to
       the observation.
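
       One way to experiment with the regular expression subset used by %t
       and %m is to translate it to Python's re syntax. The sketch below does
       that under several assumptions that are not stated in this manual:
       classes are limited to ASCII, an unanchored expression may match
       anywhere in the data, and %m yields an empty string when nothing
       matches. All function names are hypothetical.

```python
import re
import string

# ASCII approximations of the character classes listed above.
CLASSES = {
    "d": "0-9", "a": "A-Za-z", "w": "A-Za-z0-9",
    "l": "a-z", "u": "A-Z", "s": r"\s",
    "p": re.escape(string.punctuation),
}


def translate(regexp):
    """Translate the subset described above into a Python re pattern."""
    out, i, n = [], 0, len(regexp)
    while i < n:
        ch = regexp[i]
        if ch == "\\" and i + 1 < n:
            nxt = regexp[i + 1]
            if nxt.lower() in CLASSES:
                negate = "^" if nxt.isupper() else ""  # uppercase: complement
                out.append("[%s%s]" % (negate, CLASSES[nxt.lower()]))
            else:
                out.append(re.escape(nxt))  # escaped literal character
            i += 2
            continue
        if ch == "^" and i == 0:
            out.append("^")
        elif ch == "$" and i == n - 1:
            out.append(r"\Z")  # strict end of string
        elif ch in ".*?":
            out.append(ch)
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def cmd_t(regexp, data):
    """Mimic %t: "true" if the expression matches, else "false"."""
    return "true" if re.search(translate(regexp), data) else "false"


def cmd_m(regexp, data):
    """Mimic %m: the matched substring (empty if no match, an assumption)."""
    match = re.search(translate(regexp), data)
    return match.group(0) if match else ""


print(cmd_m("^.?.?.?.?", "wapiti"))
print(cmd_t(r"^\u*$", "CRF"), cmd_t(r"^\u*$", "Crf"))
```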


Forced decoding

       The forced decoding switch allows decoding of already partly labelled
       data. If some labels are already known and only the unknown ones must
       be predicted, instead of doing a full prediction and correcting the
       Wapiti output as a post-processing step, you can enable forced
       decoding. This allows you to specify the already known labels and lets
       Wapiti use this information to improve the decoding.

       In order to do this you must provide the same data as usual with all
       the columns needed for your patterns, and you must add another column,
       like the one provided for the --check option, with the known labels.
       For each line where a prediction must be made by Wapiti, either leave
       this column blank or specify an invalid label.

       The Wapiti decoder will just fill in the blanks and use the information
       provided to improve its predictions.
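
       The idea behind forced decoding can be illustrated with a constrained
       Viterbi search, where positions with a known label only allow that
       label. This is a generic sketch, not Wapiti's decoder: the score
       tables, the use of None for unknown labels, and the function name are
       all assumptions made for the example.

```python
def forced_viterbi(emit, trans, forced):
    """Best label path given per-position constraints.

    emit[t][y]   : score of label y at position t
    trans[p][y]  : score of moving from label p to label y
    forced[t]    : known label index at position t, or None if unknown
    """
    n, k = len(emit), len(trans)
    neg_inf = float("-inf")

    def allowed(t, y):
        return forced[t] is None or forced[t] == y

    score = [emit[0][y] if allowed(0, y) else neg_inf for y in range(k)]
    back = []
    for t in range(1, n):
        new_score, pointers = [], []
        for y in range(k):
            if not allowed(t, y):
                new_score.append(neg_inf)
                pointers.append(0)
                continue
            best = max(range(k), key=lambda p: score[p] + trans[p][y])
            new_score.append(score[best] + trans[best][y] + emit[t][y])
            pointers.append(best)
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    y = max(range(k), key=lambda i: score[i])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return path[::-1]


emit = [[1, 0], [1, 0], [1, 0]]
trans = [[0, -5], [-5, 0]]
print(forced_viterbi(emit, trans, [None, None, None]))
print(forced_viterbi(emit, trans, [None, 1, None]))
```

       In the second call, forcing the middle label also changes the labels
       chosen for its neighbours, which is how the known labels can improve
       the rest of the decoding.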


Pure maxent mode

       If you do nothing special, Wapiti will automatically choose between the
       maxent codepath and the linear-chain codepath for each sequence. If a
       sequence has a length of one and no bigram features, it will switch to
       the maxent codepath.

       This implies that, if you want to train a maxent-only model, you still
       have to prefix all your features/patterns with 'u' to indicate unigram
       features, and separate all lines in your input file with empty lines to
       make sure all sequences have length one.

       The pure maxent mode, activated by the --me switch in train and label
       mode, takes care of these two problems. When activated, all lines in
       input files are used independently as single samples, and blank lines
       are ignored. Additionally, all features are automatically prefixed with
       'u', forcing them to be unigram features, so you don't have to add the
       prefix yourself.

       Be careful: you have to specify the pure maxent mode both during
       training and labelling.


Model compaction

       If you specify the --compact switch for training,  when  the  model  is
       optimized  all  the  observations which generate only inactive features
       are removed from the model. In case of l1-penalty this can dramatically
       reduce the model size.

       First,  this  is interesting to produce a smaller model so the labeling
       will require a lot less memory and will be faster.

       Second, this can allow you to train  bigger  models.  L-BFGS  generally
       produces  better  models  than  SGD  but requires a lot more memory for
       training. To reduce the memory needed during L-BFGS  optimization,  you
       can  train  a  very  big model with a few SGD-L1 iterations, which will
       give you a rough model but with a lot of inactive features; this  model
       can be compacted to a smaller model which can be easily trained with L-
       BFGS.

       There is a tricky thing here. Compaction only removes the observations
       from the model, not from the patterns. That is why, if you load the
       same data again, the compacted observations would be regenerated. To
       prevent this, loading a model before training disables the generation
       of new observations, keeping only those of the compacted model.

       But this conflicts with another feature,  the  incremental  model  con-
       struction,  which  allows  us  to load a model and add to it additional
       patterns in order to first train small models and  increase  them  pro-
       gressively.  So  if  you  specify  both a model and a pattern file, the
       observation construction will be re-enabled and so the compaction  will
       just have the effect of reducing the loading time.
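
       Conceptually, compaction keeps only the observations that still have
       at least one active (non-zero) feature weight. The sketch below shows
       this on a toy model stored as a dictionary; this representation and the
       function name are assumptions and do not reflect Wapiti's model format.

```python
def compact(model):
    """Drop observations whose feature weights are all zero."""
    return {obs: weights for obs, weights in model.items()
            if any(w != 0.0 for w in weights)}


# Toy model: observation string -> weights of the features it generates.
model = {
    "u:a1/c3": [0.0, 1.7, 0.0],
    "u:a2/c1": [0.0, 0.0, 0.0],  # only inactive features: removed
    "b:a1": [-0.3, 0.0, 0.0],
}
small = compact(model)
print(len(model), "->", len(small), "observations")
```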


Example

       For training a very sparse CRF model on data in file 'train.txt' with
       patterns in file 'pattern' and using the OWL-QN algorithm, run the
       command:
              wapiti train -p pattern -1 5 train.txt model
       This will generate a model file named 'model'. You can later use this
       model to tag the data in the file 'test.txt' with the command:
              wapiti label -m model test.txt result.txt
       The tagged data will be stored in file 'result.txt'.

Exit Status

       wapiti returns a zero exit status if all succeeded. In case of failure,
       a non-zero status is returned and an error message is printed on stderr.

Author

       Thomas Lavergne (thomas.lavergne (at) reveurs.org)

Copyright

       Copyright (c) 2009-2012  CNRS