.\" Process this file with
.\" groff -man -Tascii infomap-build.1
.TH INFOMAP-BUILD 1 "February 2004" "Infomap Project" "Infomap NLP Manual"
.SH NAME
.TP
infomap-build \- build an Infomap WordSpace model
.SH SYNOPSIS
.B infomap-build
.RB [ "-w " working_dir]
.RB [ "-p " param_file]
.RB [ "-D " "var_1=val_1 ... " "-D " "var_N=val_N]"
.RB ( "-s " "single_corpus_file | " "-m " multi_file_list)
<model_tag>
.PP
.B infomap-build
.BR -s \ <single_corpus_file>
<model_tag>
.PP
.B infomap-build
.BR -m \ <file_list_file>
<model_tag>
.SH DESCRIPTION
.B infomap-build
builds an Infomap WordSpace model from a properly formatted input
corpus.  It is the main driver program of the Infomap NLP software.
.B infomap-build
is a wrapper around
.BR make (1),
which in turn builds a model by invoking various other Infomap NLP
tools.
.PP
In its simplest form, shown in the last two lines in the above
synopsis,
.B infomap-build
is passed a corpus and a model tag.  The corpus is either a single
file (specified as an argument to the
.B -s
option), or is stored in multiple files, one file per corpus document.
For multi-file corpora, a file listing the names of all the files making up
the corpus is given as an argument to the
.B -m
option.  The model tag will be used to refer to the resulting model.
.B infomap-build
creates a directory whose name is the model tag.  The files generated
during model building are written to this directory, which is a
subdirectory of the default working directory.  The default working
directory is the value of the
.B INFOMAP_WORKING_DIR
environment variable if it is set; otherwise it is
.IR /tmp/$USERNAME/infomap_working_dir .
.SH OPTIONS
.TP
.BI -D \ var=val
This option defines a variable whose value will be passed through
to
.BR make .
It can be used to set parameters that control the building of the
model, such as the size of word vectors.
Values set using
.B -D
override both the defaults (from
.IR @pkgdatadir@/default-params )
and the values specified using
.BR -p ,
if any.
Useful variables that can be set using
.B -D
are described in the
.B MODEL PARAMETERS
section below.
.TP
.BI -m \ file_list_file
For multi-file corpora (hence the "m"), a file listing all of the
files that make up the corpus, one per line.  Each file must consist
of exactly one corpus document.  This option and the
.B -s
option are mutually exclusive.
.TP
.BI -p \ param_file
A file containing parameters to control the building of the model.
These parameters should be specified in variable=value format, one per line.
The values in this file override the defaults given in
.IR @pkgdatadir@/default-params .
Values passed to
.B -D
override the values in this file.
See the
.B MODEL PARAMETERS
section below.
.TP
.BI -s \ single_corpus_file
For single-file corpora (hence the "s"), the file containing the
corpus.  Within this file, documents should be marked by <DOC> and
</DOC> tags; within each document, the text that is actually to be
processed should be enclosed between a <TEXT> tag and a </TEXT> tag.
This option and the
.B -m
option are mutually exclusive.
.TP
.BI -w \ working_dir
The working directory in which to build the model.  Model files will
be written to a directory named
.I model_tag
that is a subdirectory of this directory.  This option overrides both
the
.B INFOMAP_WORKING_DIR
environment variable and the system default
.RI ( /tmp/$USERNAME/infomap_working_dir ).
.SH MODEL PARAMETERS
The following parameters control the building of models.  These parameters
can be specified by listing them in a file in
.B VAR=VALUE
form and passing that file as an argument to the
.B -p
option.  They can also be specified on the command line using the
.B -D
option.
Values given on the command line override those given in
a file.
.PP
While default values for these parameters are listed below for
convenience, the true defaults are obtained from the file
.I $pkgdatadir/default-params
at runtime, and should be trusted over the values given here
in case of conflict.
.PP
.B ROWS
.RS
The number of words for which to learn word vectors.  Called
.B ROWS
because it is the number of rows in the matrix of co-occurrence counts
produced by
.BR count_wordvec (1).
Default is 20,000.
.RE
.PP
.B COLUMNS
.RS
The number of content-bearing words to use as features in the process
of computing word vectors.  Called
.B COLUMNS
because it is the number of columns in the matrix of co-occurrence
counts produced by
.BR count_wordvec (1).
Each word vector is reduced from
.B COLUMNS
dimensions to
.B SINGVALS
dimensions by
.BR svdinterface (1).
Default is 1000.
.RE
.PP
.B SINGVALS
.RS
The number of dimensions that the word vectors ultimately produced
will have.  Called
.B SINGVALS
because the original co-occurrence vectors are reduced to this many
elements by Singular Value Decomposition (SVD) (see
.BR svdinterface (1)).
Default is 100.
.RE
.PP
.B SVD_ITER
.RS
The number of iterations to be used by the SVD algorithm.  Default is
100.  See
.BR svdinterface (1).
.RE
.PP
.B PRE_CONTEXT_SIZE
.RS
This parameter and
.B POST_CONTEXT_SIZE
control the size of the context window used by
.BR count_wordvec (1)
in computing its co-occurrence counts.
Any word occurring in the
.B PRE_CONTEXT_SIZE
words immediately preceding a target word
.B w
will be considered to have appeared in the context of
that occurrence of
.BR w .
(Note that context windows can also be truncated by
document boundaries.)
Default is 15.
.RE
.PP
.B POST_CONTEXT_SIZE
.RS
This parameter and
.B PRE_CONTEXT_SIZE
control the size of the context window used by
.BR count_wordvec (1)
in computing its co-occurrence counts.
Any word occurring in the
.B POST_CONTEXT_SIZE
words immediately following a target word
.B w
will be considered to have appeared in the context of
that occurrence of
.BR w .
(Note that context windows can also be truncated by document
boundaries.)
Default is 15.
.RE
.PP
.B WRITE_MATLAB_FORMAT
.RS
This parameter is a binary flag.  If it is set to 1,
.BR count_wordvec (1)
will write the co-occurrence matrix in MATLAB's input format,
as well as in the format used by
.BR svdinterface (1).
If it is set to 0, no such additional output will be
written.
Default is 0.
.RE
.PP
.B VALID_CHARS_FILE
.RS
The valid characters file lists the characters that may appear in
words.  All other characters are treated by the tokenizer as word
breaks and are skipped.  The characters in the valid characters file
are given as a continuous string without delimiters.
The default valid characters file is
.BR $pkgdatadir/valid_chars.en ,
which is for the English language and specifies [a-z][A-Z], '_' and '~'
as valid word characters.  If you want to use infomap for languages
using different character sets (say ISO-8859-2 for Central European)
or wish to use other breaking characters, you have to prepare your own
valid characters file.
Watch out for newlines: if there is one at the end of this file, it will
be treated as a legitimate part of words, which may not be what you
want.  See
.BR prepare_corpus (1)
for more on the hard-wired features of the tokenization method.
.RE
.PP
.B STOPLIST_FILE
.RS
The stoplist file should contain a list of words, one word
per line, that are to be treated as stopwords and ignored during
processing (i.e., they will not be selected as content-bearing words,
and word vectors will not be computed for them).  The default
is
.BR $pkgdatadir/stop.list ,
which is a reasonable choice for the English language.
If you want to use infomap for languages other than English, you
have to prepare your own list of stopwords, or at least disable the
English list by specifying an empty stoplist file.
.RE
.PP
.B COL_LABELS_FROM_FILE
.RS
If equal to 1, this Boolean variable indicates that the column labels
of the word-word co-occurrence matrix should be read from the file
named by
.BR COL_LABEL_FILE .
If set to 0,
.BR count_wordvec (1)
will choose column labels automatically.
Default is 0.
.RE
.PP
.B COL_LABEL_FILE
.RS
If
.B COL_LABELS_FROM_FILE
equals 1,
then this is the name of the file containing a set of user-specified
content-bearing words which
.BR count_wordvec (1)
will use as column labels of the co-occurrence matrix.
.RE
.\" .SH EXAMPLES
.SH FILES
.I @pkgdatadir@/Makefile.data
.RS
Describes dependencies between generated model files.
.B infomap-build
invokes
.BR make (1)
with this as the Makefile.
.RE
.PP
.I @pkgdatadir@/default-params
.RS
This file contains default values for model-building parameters, such
as the size of word vectors, the number of words for which to learn
vectors,
and the number of content-bearing words.
These values can be overridden by specifying a different parameter file
using the
.B -p
option and/or by setting individual
parameters using
.BR -D .
.RE
.SH ENVIRONMENT VARIABLES
.B INFOMAP_WORKING_DIR
.RS
The working directory in which to build the model; model files
will be created in a subdirectory named
.I model_tag
in this directory, which will be created if necessary.
This variable overrides the systemwide default
(/tmp/$USERNAME/infomap_working_dir), and can be overridden by the
.B -w
option.
.RE
.SH SEE ALSO
.BR associate (1),
.BR infomap_build (1),
.BR prepare_corpus (1),
.BR count_wordvec (1),
.BR svdinterface (1),
.BR encode_wordvec (1),
.BR count_artvec (1),
.BR write_text_params (1).
.SH DIAGNOSTICS
Returns 0 on success and a nonzero value on error.
.SH BUGS
Please report bugs to
.BR infomap-nlp-users@lists.sourceforge.net .
.SH CREDITS
The Infomap NLP software was written by Stefan Kaufmann, Hinrich
Schuetze, Dominic Widdows, Beate Dorow, and Scott Cederberg.  The
Infomap algorithm was originally developed by Hinrich Schuetze.
The
.B infomap-build
script was written by Scott Cederberg.
.SH AUTHOR
This manual page was written by Scott Cederberg.  Please direct
inquiries and bug reports to
.BR infomap-nlp-users@lists.sourceforge.net .
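.SH EXAMPLES
The following sketch is illustrative only and is not taken from the
Infomap distribution.  The file names
.IR corpus.txt ,
.IR my.params ,
and
.IR files.list ,
the model tag
.IR my_model ,
and the parameter values are all hypothetical.
.PP
A single-file corpus, as described under
.BR -s ,
might be laid out like this:
.PP
.RS
.nf
<DOC>
<TEXT>
Text of the first document.
</TEXT>
</DOC>
<DOC>
<TEXT>
Text of the second document.
</TEXT>
</DOC>
.fi
.RE
.PP
Assuming such a file exists, a model could be built with some parameters
taken from a parameter file and one overridden on the command line:
.PP
.RS
.nf
# Hypothetical parameter file in VAR=VALUE form, one setting per line.
cat > my.params <<EOF
ROWS=10000
COLUMNS=500
SINGVALS=50
EOF

# -D takes precedence over -p, so SINGVALS is 100 for this build.
infomap-build -p my.params -D SINGVALS=100 -s corpus.txt my_model

# Multi-file corpus: files.list names one document file per line.
infomap-build -m files.list my_model
.fi
.RE
.PP
With neither
.B -w
nor
.B INFOMAP_WORKING_DIR
set, both commands would write their output under
.IR /tmp/$USERNAME/infomap_working_dir/my_model .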