java -jar MaltOptimizer.jar -p <phase number> -m <MaltParser jar path> -c <training corpus> [-v <validation method>]
Note: To use the version of MaltParser included in the MaltOptimizer-1.0.2 distribution, use -m maltparser-1.7.jar. To use another version of MaltParser, make sure that you specify the path correctly and that the version is 1.7 or later.
Phase 1: Data Characteristics
In the data analysis, MaltOptimizer gathers information about the following properties of the training set:
- Number of words/sentences
- Percentage of non-projective arcs/trees
- Existence of "covered roots" (arcs spanning tokens with HEAD = 0)
- Frequency of labels used for tokens with HEAD = 0
- Existence of non-empty feature values in the LEMMA and FEATS columns
- Whether the values in the CPOSTAG and POSTAG columns are identical
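All of these properties are read off the columns of the CoNLL-format training file. As an illustration (the tokens and values below are invented), a fragment of such a file looks like:

```
1   The      the     DT   DT    _        2   det
2   cat      cat     NN   NN    num=sg   3   nsubj
3   sleeps   sleep   VB   VBZ   num=sg   0   ROOT
```

The columns are ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, and DEPREL. In this fragment, the LEMMA and FEATS columns are non-empty, CPOSTAG and POSTAG differ for token 3, and token 3 is a root (HEAD = 0) labeled ROOT.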
java -jar MaltOptimizer.jar -p 1 -m <MaltParser jar path> -c <training corpus>
Phase 2: Parsing Algorithm
In the second phase, MaltOptimizer explores a subset of the parsing algorithms implemented in MaltParser, based on the results of the data analysis. In particular, if there are no non-projective dependencies in the training set, then only projective algorithms are explored, including the arc-eager and arc-standard versions of Nivre's algorithm and an implementation of Covington's projective parsing algorithm. By contrast, if the training set contains a non-negligible proportion of non-projective dependencies, then MaltOptimizer may also test Covington's non-projective algorithm and algorithms using pseudo-projective parsing or online reordering. After testing each of the algorithms with default settings, MaltOptimizer tunes the parameters of the best performing algorithms and creates a new option file for the best performing configuration so far. The user is given the opportunity to edit the option file (or stop the process) before optimization continues.
java -jar MaltOptimizer.jar -p 2 -m <MaltParser jar path> -c <training corpus> [-v <validation method>]
For phase 2, the usage includes an additional flag -v (for validation) with default value dev (for development set) and alternative value cv (for cross-validation).
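For example, assuming the bundled MaltParser jar and a training file named train.conll (a hypothetical name), a phase 2 run using cross-validation instead of a development set would be invoked as:

```shell
# Run phase 2 (parsing algorithm selection) with cross-validation
java -jar MaltOptimizer.jar -p 2 -m maltparser-1.7.jar -c train.conll -v cv
```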
Phase 3: Feature Models and Learning Algorithm
In the third phase, MaltOptimizer tries to optimize the feature model given the parameters chosen so far (in particular, the parsing algorithm). It first performs backward selection experiments to ensure that all features in the default model for the chosen parsing algorithm actually make a contribution. It then proceeds with forward selection experiments, trying potentially useful features one by one and in combination. An exhaustive search for the best possible feature model is practically impossible, so the optimization strategy is based on heuristics derived from practical experience; see the Quick Guide to MaltParser Optimization. The major steps of the forward selection experiments are the following:
- Tune the window of POSTAG (n-gram) features over the stack and buffer.
- Tune the window of (lexical) FORM features over the stack and buffer.
- Tune dependency tree features using DEPREL and POSTAG features.
- Add predecessor and successor features for salient tokens using POSTAG and FORM features.
- Add CPOSTAG, FEATS, and LEMMA features if available.
- Add conjunctions of POSTAG and FORM features.
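The feature models manipulated in these steps are MaltParser feature specification files in XML. As a rough sketch of what such a file can contain (this is not a file MaltOptimizer actually produces, and the model name is invented), a specification combining several of the feature types above might look like:

```xml
<featuremodels>
  <featuremodel name="sketch">
    <!-- POSTAG window over the stack and buffer -->
    <feature>InputColumn(POSTAG, Stack[0])</feature>
    <feature>InputColumn(POSTAG, Input[0])</feature>
    <feature>InputColumn(POSTAG, Input[1])</feature>
    <!-- lexical FORM features -->
    <feature>InputColumn(FORM, Stack[0])</feature>
    <feature>InputColumn(FORM, Input[0])</feature>
    <!-- dependency tree features: labels of leftmost/rightmost dependents -->
    <feature>OutputColumn(DEPREL, ldep(Stack[0]))</feature>
    <feature>OutputColumn(DEPREL, rdep(Stack[0]))</feature>
    <!-- conjunction of POSTAG and FORM on the stack top -->
    <feature>Merge(InputColumn(POSTAG, Stack[0]), InputColumn(FORM, Stack[0]))</feature>
  </featuremodel>
</featuremodels>
```

Phase 3 edits files of this kind automatically; the file it leaves behind can be edited by hand in the same notation during manual optimization.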
After the feature selection experiments are completed, MaltOptimizer tunes the parameters of the learning algorithm and creates a new option file and a new feature specification file. The user is given the opportunity to edit both of these files and continue with manual optimization.
Finally, the system stores a final MaltParser configuration file (finalOptionsFile.xml) in the MaltOptimizer installation directory. At the end of the optimization process, you may run MaltParser as follows:
java -jar <MaltParser jar path> -f finalOptionsFile.xml -F <path to the feature model suggested>
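Concretely, assuming the bundled jar and hypothetical file names, you would first learn a model with the optimized settings and could then parse new data with the resulting configuration using MaltParser's standard -m (mode), -i (input), and -o (output) options:

```shell
# Learn a model with the optimized option and feature files
# (featureModel.xml stands in for the feature model MaltOptimizer suggested)
java -jar maltparser-1.7.jar -f finalOptionsFile.xml -F featureModel.xml

# Parse new data with the learned configuration; -c names the configuration
java -jar maltparser-1.7.jar -c <configuration name> -i test.conll -o parsed.conll -m parse
```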
java -jar MaltOptimizer.jar -p 3 -m <MaltParser jar path> -c <training corpus> [-v <validation method>]
For phase 3, the usage includes an additional flag -v (for validation) with default value dev (for development set) and alternative value cv (for cross-validation).