kfold
kfold creates K-fold splits from data files and assists in training and testing (useful for cross-validation in supervised machine learning).
Command overview
    help   Display global or [command] help documentation.
    split  Split a data file into K partitions
    test   Apply trained models on a dataset previously split using kfold
    train  Train models on a dataset previously split using kfold
Example usage
10-fold cross-validation of the standard MaltParser on a treebank named shuffled.c32.conll may be done as follows:
    kfold split -f -i shuffled.c32.conll --fold -d '\n\n'
    kfold train -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -m learn
    kfold test -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -o %O -m parse
    eval07.pl -q -g shuffled.c32.conll -s shuffled.c32.conll.output
The MaltParser does not like to put its models in a subdirectory, so rather than using the standard model files suggested by kfold (%M), we construct custom non-nested model filenames using %B.model_%N.
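The difference between the two naming schemes can be sketched in a few lines of Python. This is only an illustration of the keyword substitution, not kfold's own code: `interpolate` is a hypothetical helper, and the expansion of %M is an assumption modelled on the 'brown.models/01' example in the help text.

```python
def interpolate(template, values):
    """Replace every %X keyword found in `template` (hypothetical helper)."""
    for keyword, value in values.items():
        template = template.replace(keyword, value)
    return template


base, number = "shuffled.c32.conll", "01"
values = {
    "%B": base,
    "%N": number,
    # Assumed layout, modelled on the 'brown.models/01' example in the help text.
    "%M": f"{base}.models/{number}",
}

nested = interpolate("%M", values)         # model placed inside a subdirectory
flat = interpolate("%B.model_%N", values)  # model placed next to the data file

print(nested)  # shuffled.c32.conll.models/01
print(flat)    # shuffled.c32.conll.model_01
```

The second pattern contains no path separator, which is why it suits a tool that refuses to write its model into a subdirectory.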
Command details
The following is simply the output of the built-in help commands.
Splitting data files
NAME:
    split

DESCRIPTION:
    Given the data file INPUT, the partitions are written to files named INPUT.parts/{01..K}

SYNOPSIS:
    kfold split -i INPUT [options]

EXAMPLES:
    # Split the file sample.txt into 4 parts
    kfold split -k4 sample.txt

    # Split the double-newline-delimited file sample.conll into 10 parts
    kfold split -d"\n\n" sample.conll
OPTIONS:
-i, --input FILE Data file to split
-k, --parts N The number of partitions desired
-d, --delimiter DELIM String used to separate individual entries (newline per default)
-g, --granularity N Ensure the number of entries in each partition is divisible by N (useful for block-structured data)
-f, --overwrite Remove existing parts prior to executing
--fold Additionally, create K folds of K-1 parts in another folder
--parts-name STRING Use the given name as suffix for the partitions folder created
--folds-name STRING Use the given name as suffix for the folds folder created
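As a rough illustration of what the --parts, --delimiter, --granularity and --fold options describe, the following Python sketch partitions delimited entries and then builds folds of K-1 parts. It is not kfold's implementation: the function names are hypothetical, and the contiguous assignment of entries to parts is an assumption, since the help text does not say how entries are distributed.

```python
def split_entries(text, k, delimiter="\n", granularity=1):
    """Partition delimited entries into k contiguous parts (illustrative only)."""
    entries = [entry for entry in text.split(delimiter) if entry]
    # Group entries into blocks of `granularity` so a block is never torn apart.
    blocks = [entries[i:i + granularity] for i in range(0, len(entries), granularity)]
    parts = [[] for _ in range(k)]
    for index, block in enumerate(blocks):
        # Assumption: blocks are assigned to parts in their original order.
        parts[index * k // len(blocks)].extend(block)
    return parts


def make_folds(parts):
    """Fold i holds every part except part i, i.e. K-1 parts."""
    return [
        [entry for j, part in enumerate(parts) if j != i for entry in part]
        for i in range(len(parts))
    ]


# Six double-newline-delimited entries, as in a CoNLL file with six sentences.
text = "s1\n\ns2\n\ns3\n\ns4\n\ns5\n\ns6"
parts = split_entries(text, k=3, delimiter="\n\n")
folds = make_folds(parts)

print(parts)     # [['s1', 's2'], ['s3', 's4'], ['s5', 's6']]
print(folds[0])  # ['s3', 's4', 's5', 's6']
```

Fold 01 omits part 01, so a model trained on that fold has never seen the entries it is later tested on.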
Training on the folds
NAME:
    train

DESCRIPTION:
    Given training data previously split in K parts and folds, train K models on the K folds

    Certain keywords in the training command and its arguments are interpolated at runtime:

    * %N - fold number, e.g. '01'
    * %F - fold filename, e.g. 'brown.train/01'
    * %I - alias for %F
    * %M - model filename, e.g. 'brown.models/01'
    * %B - basename (as specified on the command line), e.g. 'brown'

SYNOPSIS:
    kfold train --base NAME [options] -- CMD [--CMD-OPTIONS] [CMD-ARGS]

EXAMPLES:
    # Train MaltParser for cross-validation
    kfold train -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -m learn
OPTIONS:
-f, --overwrite Remove existing models prior to executing
--base NAME Default prefix of training folds and model files
--folds-name SUFFIX Look for folds {01..K} in the folder BASE.SUFFIX
--models-name SUFFIX Yield model names as BASE.SUFFIX/{01..K} as interpolation pattern %M
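The per-fold expansion described above amounts to running the same command once per fold with different keyword values. The Python sketch below only builds the argument lists and does not execute anything. It is an assumption-laden illustration: `expand` and `train_commands` are hypothetical names, `my-trainer` is a made-up program, and the default suffixes 'train' and 'models' are taken from the examples in the help text rather than from kfold's source.

```python
def expand(argument, values):
    """Substitute every keyword that occurs in a single argument."""
    for keyword, value in values.items():
        argument = argument.replace(keyword, value)
    return argument


def train_commands(base, k, command, folds_name="train", models_name="models"):
    """Return one expanded argument list per fold (illustrative only)."""
    commands = []
    for n in range(1, k + 1):
        number = f"{n:02d}"  # zero-padded, e.g. '01'
        fold_file = f"{base}.{folds_name}/{number}"
        values = {
            "%N": number,
            "%F": fold_file,
            "%I": fold_file,  # documented alias for %F
            "%M": f"{base}.{models_name}/{number}",
            "%B": base,
        }
        commands.append([expand(argument, values) for argument in command])
    return commands


# 'my-trainer' and its flags are placeholders for a real training program.
commands = train_commands("brown", 3, ["my-trainer", "--data", "%F", "--model", "%M"])
for argv in commands:
    print(" ".join(argv))
# my-trainer --data brown.train/01 --model brown.models/01
# my-trainer --data brown.train/02 --model brown.models/02
# my-trainer --data brown.train/03 --model brown.models/03
```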
Testing the models on their corresponding data file parts
NAME:
    test

DESCRIPTION:
    Process K parts of a split datafile using K previously trained models.

    Certain keywords in the testing command and its arguments are interpolated at runtime:

    * %N - part number, e.g. '01'
    * %T - part filename, e.g. 'brown.test/01'
    * %I - alias for %T
    * %O - output filename, e.g. 'brown.outputs/01'
    * %M - model filename, e.g. 'brown.models/01'
    * %B - basename (as specified on the command line), e.g. 'brown'

SYNOPSIS:
    kfold test --base NAME [options] -- CMD [--CMD-OPTIONS] [CMD-ARGS]

EXAMPLES:
    # Apply trained MaltParser models for cross-validation
    kfold test -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -o %O -m parse
OPTIONS:
-f, --overwrite Remove existing test output prior to executing
--base NAME Default prefix of model files and test outputs
--parts-name SUFFIX Look for parts {01..K} to be processed in the folder BASE.SUFFIX
--models-name SUFFIX Yield model names as BASE.SUFFIX/{01..K} as interpolation pattern %M
--outputs-name SUFFIX Yield output filenames as BASE.SUFFIX/{01..K} as interpolation pattern %O
--output-name SUFFIX Put the concatenated output of all models in BASE.SUFFIX
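Because each part is processed by the model that did not see it during training, joining the K outputs in part order gives predictions for the whole data file, which can then be scored against the original file in one pass. A minimal Python sketch of that concatenation step follows. It is an illustration, not kfold's code: `concatenate_outputs` is a hypothetical helper, the folder name 'brown.outputs' follows the help-text example, and the target name 'brown.output' is assumed by analogy with 'shuffled.c32.conll.output' in the usage example.

```python
import tempfile
from pathlib import Path


def concatenate_outputs(output_dir, destination):
    """Join the per-part outputs in part order (01, 02, ...) into one file."""
    # Zero-padded names sort correctly as plain strings.
    part_files = sorted(p for p in Path(output_dir).iterdir() if p.is_file())
    with open(destination, "wb") as combined:
        for part_file in part_files:
            combined.write(part_file.read_bytes())
    return [p.name for p in part_files]


with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp) / "brown.outputs"
    outputs.mkdir()
    # Written out of order on purpose to show that part order is restored.
    for number, text in [("02", "second\n"), ("01", "first\n"), ("03", "third\n")]:
        (outputs / number).write_text(text)

    combined_path = Path(tmp) / "brown.output"
    order = concatenate_outputs(outputs, combined_path)
    combined_text = combined_path.read_text()

print(order)                # ['01', '02', '03']
print(repr(combined_text))  # 'first\nsecond\nthird\n'
```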