HTS Parallel Training

This is a small modification of the HTS training process that allows HERest to run in parallel.


Requirement: the Parallel::ForkManager Perl module must be installed.


  • Create a file with the following content:
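#!/usr/bin/env python
import sys
import os

fname = sys.argv[1]            # file listing the training data (one entry per line)
num_chunk = int(sys.argv[2])   # number of chunks to create
out_folder = sys.argv[3]       # folder that receives the chunk files

with open(fname) as f:
    all_data = f.readlines()

# Integer division, so that the chunk size is a whole number of lines
chunk_size = len(all_data) // num_chunk

for i in range(num_chunk):
    start = i * chunk_size
    if i == num_chunk - 1:
        end = len(all_data)        # the last chunk takes the remaining lines
    else:
        end = start + chunk_size
    # The original listing is cut off after the bounds are computed.
    # Writing each chunk to out_folder is assumed here, and the file
    # name below is a placeholder (any name works, since the training
    # script reads every file in the folder).
    with open(os.path.join(out_folder, "chunk_%d.scp" % i), "w") as out:
        out.writelines(all_data[start:end])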

This script splits the list of training data into chunks, one per parallel HERest job.
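
As a rough illustration of where the chunk boundaries fall, the sketch below repeats the same arithmetic as a standalone function. The name chunk_bounds is hypothetical and not part of the script above; it assumes every chunk gets the integer quotient of lines and the last chunk also absorbs the remainder.

```python
def chunk_bounds(num_lines, num_chunk):
    """Return a list of (start, end) line ranges, one per chunk.

    Hypothetical helper for illustration only: each chunk holds
    num_lines // num_chunk lines, and the last one also takes the rest.
    """
    chunk_size = num_lines // num_chunk
    bounds = []
    for i in range(num_chunk):
        start = i * chunk_size
        end = num_lines if i == num_chunk - 1 else start + chunk_size
        bounds.append((start, end))
    return bounds


if __name__ == "__main__":
    # 10 training entries split into 3 chunks
    print(chunk_bounds(10, 3))  # [(0, 3), (3, 6), (6, 10)]
```

With this scheme, asking for more chunks than there are lines leaves every chunk but the last one empty, so it is probably safest to keep the number of jobs below the number of training entries.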

  • Make the script executable:
chmod +x
  • Add the following lines to
$parallel = 1; # 0 means off
$nj = 15; # number of parallel jobs (forked processes) to run

$split = "path to";
  • Add the following lines to
$HERest = "$HEREST    -A    -C $cfg{'trn'} -D -T 1 -m 1 -u tmvwdmv -w $wf -t $beam";
mkdir "$prjdir/tmp", 0755;

# split data into chunks for parallelization
if ($parallel) {
    system("rm -f $prjdir/tmp/*");
    system("$split $scp{'trn'} $nj $prjdir/tmp/");
    print "$split $scp{'trn'} $nj $prjdir/tmp/\n";
    opendir DIR, "$prjdir/tmp/" or die "cannot open dir $prjdir/tmp/: $!";
    @files = grep { $_ ne '.' && $_ ne '..' } readdir DIR;
    closedir DIR;
}

use File::Basename;          # provides dirname()
use Parallel::ForkManager;

sub herest_par {
    # Parameters:
    # $mlf         : label MLF file, e.g. full.mlf ($mlf{'ful'})
    # $list        : model list file, e.g. $lst{'ful'}
    # $in_cmp      : input model of cmp features to HERest
    # $in_dur      : input model of duration features to HERest
    # $out_cmp     : output model of cmp features
    # $out_dur     : output model of duration features
    # $param       : additional parameters for HERest, e.g. -k 4
    # $report_stat : extra options that make HERest write the stats file
    #                used for building trees (empty string otherwise)

    # Usage examples:
    # step: ERST0
    # herest_par($mlf{'mon'},$lst{'mon'},$monommf{'cmp'},$monommf{'dur'}, $model{'cmp'},$model{'dur'},"-k $k","");
    # step: ERST1
    # $opt = "-C $cfg{'nvf'} -s $stats{'cmp'} -w 0.0";
    # herest_par($mlf{'ful'},$lst{'ful'},$fullmmf{'cmp'},$fullmmf{'dur'}, $model{'cmp'},$model{'dur'},"",$opt);

    my ($mlf,$list,$in_cmp,$in_dur,$out_cmp,$out_dur,$param,$report_stat) = @_;

    if ($parallel) {
        # Run one HERest job per chunk; job $i dumps its accumulators (-p $i)
        my $pm = Parallel::ForkManager->new($nj);
        for (my $i = 1; $i < $nj + 1; $i++) {
            $pm->start and next; # do the fork
            shell("$HERest $param -S $prjdir/tmp/$files[$i-1] -I $mlf -H $in_cmp -N $in_dur -M $out_cmp -R $out_dur -p $i $list $list");
            $pm->finish; # end the child process
        }
        $pm->wait_all_children; # wait until every chunk has been processed

        # Collect the accumulator files written by the jobs
        my $directory = dirname("$in_cmp");
        opendir DIR, $directory or die "cannot open dir $directory: $!";
        my @acc_files = grep { index($_, 'hmm') != -1 } readdir DIR;
        closedir DIR;

        my $acc = "";
        foreach (@acc_files) {
            my $dur = $_;
            $dur =~ s/hmm/dur/g;
            $acc .= " $directory/".$_." $directory/$dur";
        }

        # Merge the accumulators and update the models (-p 0)
        if ($report_stat) {
            shell("$HERest -H $in_cmp -N $in_dur -M $out_cmp -R $out_dur -p 0 $report_stat $list $list $acc");
        } else {
            shell("$HERest -H $in_cmp -N $in_dur -M $out_cmp -R $out_dur -p 0 $list $list $acc");
        }
        system("rm $directory/*.acc");
    } else {
        # Serial fallback: a single HERest run over the whole training set
        if ($report_stat) {
            shell("$HERest -S $scp{'trn'} -I $mlf -H $in_cmp -N $in_dur -M $out_cmp -R $out_dur $report_stat $list $list");
        } else {
            shell("$HERest $param -S $scp{'trn'} -I $mlf -H $in_cmp -N $in_dur -M $out_cmp -R $out_dur $list $list");
        }
    }
}
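
The two phases of herest_par (one HERest run per chunk with -p 1 to -p N, followed by a single -p 0 run that merges the accumulators) can be sketched as below. This is only a Python illustration of how the command lines are assembled, not a replacement for the Perl code. The function names, the "HERest" placeholder string, all example paths, and the accumulator name HER1.hmm.acc are hypothetical; the only options used are the ones that already appear in the commands above.

```python
def chunk_commands(herest, chunks, mlf, in_cmp, in_dur, out_cmp, out_dur, lst, param=""):
    """Build one command per chunk; job numbers start at 1 (-p 1 .. -p N)."""
    cmds = []
    for i, chunk in enumerate(chunks, start=1):
        parts = [herest, param, "-S", chunk, "-I", mlf,
                 "-H", in_cmp, "-N", in_dur, "-M", out_cmp, "-R", out_dur,
                 "-p", str(i), lst, lst]
        # Drop empty strings (e.g. an empty param) before joining
        cmds.append(" ".join(p for p in parts if p))
    return cmds


def merge_command(herest, acc_dir, acc_names, in_cmp, in_dur, out_cmp, out_dur, lst,
                  report_stat=""):
    """Build the -p 0 command that combines the accumulators.

    Each name containing 'hmm' is paired with its 'dur' counterpart,
    mirroring the s/hmm/dur/ substitution in the Perl code.
    """
    accs = []
    for name in acc_names:
        accs.append(acc_dir + "/" + name)
        accs.append(acc_dir + "/" + name.replace("hmm", "dur"))
    parts = [herest, "-H", in_cmp, "-N", in_dur, "-M", out_cmp, "-R", out_dur,
             "-p", "0", report_stat, lst, lst] + accs
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    # All paths below are made-up placeholders
    for cmd in chunk_commands("HERest", ["tmp/a.scp", "tmp/b.scp"], "mono.mlf",
                              "cmp/in.mmf", "dur/in.mmf", "cmp", "dur",
                              "mono.list", "-k 4"):
        print(cmd)
    print(merge_command("HERest", "cmp", ["HER1.hmm.acc"],
                        "cmp/in.mmf", "dur/in.mmf", "cmp", "dur", "mono.list"))
```

The per-chunk commands are independent of each other, which is why they can be forked in parallel; only the merge step has to wait until all of them have finished.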
  • In the ERST0 section, replace the HERest command ($HERest{...} ...) with:
herest_par($mlf{'mon'},$lst{'mon'},$monommf{'cmp'},$monommf{'dur'}, $model{'cmp'},$model{'dur'},"-k $k","");
  • Modify the ERST1 section in the same way:
$opt = "-C $cfg{'nvf'} -s $stats{'cmp'} -w 0.0";  # options that make HERest write the stats file used to build the decision trees
herest_par($mlf{'ful'},$lst{'ful'},$fullmmf{'cmp'},$fullmmf{'dur'}, $model{'cmp'},$model{'dur'},"",$opt);
  • You can do the same for every other step that runs the HERest command.