Thu Sep 25 17:35:01 1997  Andrew McCallum  <mccallum@jprc.com>

	* Makefile.in (LIBBOW_H_FILES): Added bow/kl.h.

Wed Sep 17 14:51:33 1997  Karl Kleinpaste  <karl@jprc.com>

	* rainbow-h.c (top): Add socket #includes.
	(rainbowh_options): Add --query-server and -n options. Also,
	remove #define of PRINT_TREE_SCORES, in favor of runtime -n.
	(rainbowh_arg_state): Add rainbowh_query_serving to what_doing,
	plus server_port_num, for --query-server, and print_tree_scores,
	for -n.
	(rainbowh_parse_opt): Add --query-server and -n detection.
	(_hier_barrel_set_node_scores): Properly conditionalize tree score
	printing.
	(hier_barrel_print_scores_recurse): Add FILE *out.
	(hier_barrel_print_scores): Add FILE *out.
	(rainbowh_query): New routine, stripped from the mainline `if'.
	(rainbowh_socket_init): Add for --query-server capability.
	(rainbowh_serve): Add for --query-server capability.
	(main): Init print_tree_scores; init lexer end pattern; insert
	conditional call to service --query-server; slice out mainline for

Wed Sep 17 14:22:18 1997  Andrew McCallum  <mccallum@jprc.com>

	* split.c (bow_test_split): Remove the assertion that we use at
	least 90% of the documents as training data.

Tue Sep 9 10:55:31 1997  Andrew McCallum  <mccallum@jprc.com>

	* rainbow-h.c: Change default method from prind to naivebayes.
	(_hier_barrel_cdoc_write): Handle bow_file_format_version 5.
	(_hier_barrel_cdoc_read): Likewise.
	(hier_barrel_set_local_class_model): Call vpc function with new
	argument.
	(hier_barrel_set_vpc_with_weights): Likewise.
	(hier_barrel_add_document): Set HBARREL->DOC_BARREL->METHOD to
	HIER_DEFAULT_METHOD.
	(hier_barrel_set_vpc_and_populate_lower_branches): Only
	recursively set vpc in children branches if there are documents
	there.
	(hier_barrel_prob_wi_in_ci): Assert that the CDOC->NORMALIZER has
	been set. Set M_EST_M according to CDOC->NORMALIZER, which is
	number of unique words.
	(_hier_barrel_local_score): Clean up a little.
	(main): Call hier_set_method() if BOW_ARGP_METHOD.

Sat Aug 30 19:03:17 1997  Andrew McCallum  <mccallum@jprc.com>

	* kl.c (bow_kl_score): Initialize scores to class prior divided by
	query document length, not just class prior. This way our
	classifications match Naive Bayes, as they should.

Fri Aug 29 09:12:05 1997  Andrew McCallum  <mccallum@jprc.com>

	* barrel.c (bow_barrel_add_from_text_dir): Add newline before
	warning about a file being skipped because it is not text.

	Before the following change we were overflowing DV->ENTRY[i].DI in
	document barrels when there were more than 32767 documents.
	Karl's Yahoo experiments were trying to build models with about
	60000 documents. We would get an error in vpc.c at the assertion
	that "ci < num_classes".

	* dv.c (bow_dv_add_di_count_weight): Warn if we overflow int, not
	short.
	(bow_dv_write_size): Adjust for change of COUNT and DI from short
	to int.
	(bow_dv_write): Likewise.
	(bow_dv_new_from_data_fp): Likewise.

	* bow/libbow.h (bow_cdoc): Change member CLASS from short to int.
	(bow_de): Change members DI and COUNT from short to int.

	* barrel.c (_bow_barrel_cdoc_write): If BOW_FILE_FORMAT_VERSION is
	5 or greater, change CDOC->CLASS from short to int.
	(_bow_barrel_cdoc_read): Likewise.

	* bow/libbow.h (BOW_DEFAULT_FILE_FORMAT_VERSION): Changed from 4
	to 5.

	* io.c: Add comment about bow_file_format_version history.

Thu Aug 28 22:36:55 1997  Andrew McCallum  <mccallum@jprc.com>

	* kl.c (bow_kl_score): Add class prior probabilities.

	* lex-simple.c (bow_lexer_simple_get_raw_word): When we find the
	NULL at the end of the document, and before we find the beginning
	of a word, back up DOCUMENT_POSITION (even though we will return 0
	this time already). Add some assertions about DOCUMENT_POSITION.

	* lex-html.c (bow_lexer_html_get_raw_word): When we find the NULL
	at the end of the document, back up DOCUMENT_POSITION so we will
	return 0 next time we are called.
	Add some assertions about DOCUMENT_LENGTH.

Wed Aug 27 11:23:34 1997  Andrew McCallum  <mccallum@jprc.com>

	* bow/libbow.h: Include <unistd.h>.

	* rainbow.c (rainbow_parse_opt): Fix typo.

	* rainbow.c (rainbow_parse_opt) [SERVER_KEY]: Set
	DOCUMENT_END_PATTERN to a single dot on a line.
	(main): Don't set DOCUMENT_END_PATTERN here for server mode.

	* lex-simple.c (bow_lexer_simple_open_text_fp): Explicitly seek
	the PRE_PIPE_FP to the end of the file! Otherwise, we can
	sometimes read the same file over and over again in the many
	`while(open_text_fp())' loops throughout the library.

	* rainbow.c (rainbow_print_weight_vector): Change the test for
	deciding when we need to multiply by CDOC->NORMALIZER before
	printing the weight. Instead of looking specifically for
	"naivebayes", look for a METHOD->NORMALIZE_WEIGHTS function
	pointer that is NULL. Now this works properly for the "kl" method
	too.

	* kl.c (bow_kl_set_weights): Calculate the total number of
	occurrences of each word; store this in DV->IDF. Set the DV
	weights to the weighted log odds ratio
	P(w|C)*log(P(w|C)/P(w|~C)).

	* rainbow.c (rainbow_lisp_setup): Update for new default
	arguments.
	(rainbow_lisp_query): Add LOO_CV argument to bow_barrel_score().

	* kl.c (bow_kl_score): Move declaration of SCORES_SUM.

Tue Aug 26 11:12:00 1997  Andrew McCallum  <mccallum@jprc.com>

	* vpc.c (bow_barrel_set_vpc_priors_by_counting): Add assertion
	about the PRIOR.

Mon Aug 25 14:03:58 1997  Andrew McCallum  <mccallum@jprc.com>

	* rainbow-ac.pl: As a diagnostic, print the number of predictions
	found in the file.

	* naivebayes.c (bow_naivebayes_set_weights): Set CDOC->NORMALIZER
	to the number of unique terms in each class. (This is now used by
	rainbow-h.)

	* kl.c (bow_kl_set_weights): Add assertion about CDOC->NORMALIZER.

	* foilgain.c (bow_foilgain_ci_per_wi_new): New function.

	* bow/libbow.h (bow_default_method_name): New macro.
	* barrel.c (bow_barrel_new): Use new macro
	`bow_default_method_name' instead of "naivebayes".

Tue Aug 19 09:50:16 1997  Andrew McCallum  <mccallum@jprc.com>

	* int4word.c (bow_words_set_map): Be sure to initialize the
	map/counts if they haven't been initialized yet. Otherwise,
	WORD_MAP_COUNTS will point nowhere and we can tromp on memory. I
	was getting malloc() errors before this was fixed.
	(bow_words_keep_top_by_infogain): Change so that word indices are
	ordered by information gain, even when NUM_WORDS_TO_KEEP is less
	than the number of words returned by bow_infogain_per_wi_new().

	* wi2dvf.c (bow_wi2dvf_entry_at_wi_di): New function.

	* dv.c (bow_dv_entry_at_di): New function.

	* bow/libbow.h: Declare new functions.

	* barrel.c (bow_barrel_add_from_text_dir): Add verbosity when a
	file is skipped because istext() fails.
	(bow_new_slow_barrel_printf): New function.

	* vpc.c (bow_barrel_new_vpc): New argument, NUM_CLASSES. Use it to
	initialize an array that is filled with counts of the number of
	documents per class. Initialize CDOC->NUM_WORDS to be the number
	of documents per class. This can then be used in "event=document"
	models.
	(bow_barrel_new_vpc_merge_then_weight): New argument, NUM_CLASSES.
	(bow_barrel_new_vpc_weight_then_merge): Likewise.

	* rainbow.c (rainbow_index): Use macro
	bow_barrel_new_vpc_with_weights(), with new `num_classes'
	argument.
	(rainbow_query): Likewise.
	(rainbow_test): Likewise.
	(main): Likewise.
	(rainbow_test_files): Likewise. If QUERY_WV is NULL, verbosify a
	warning.

	* bow/libbow.h (bow_method): Add NUM_CLASSES argument to
	VPC_WITH_WEIGHTS.
	(bow_barrel_new_vpc_with_weights): Add NUM_CLASSES argument.
	(bow_barrel_new_vpc): Likewise.
	(bow_barrel_new_vpc_merge_then_weight): Likewise.
	(bow_barrel_new_vpc_weight_then_merge): Likewise.

	* bow/naivebayes.h (bow_params_naivebayes): Remove
	SCORE_WITH_LOG_PROBABILITIES.

	* kl.c (bow_kl_score): Reformat error message.
	* naivebayes.c (bow_naivebayes_set_weights): Only set
	CDOC->WORD_COUNT if not doing BOW_BINARY_WORD_COUNTS, otherwise
	leave them as the "document counts" as they were initialized in
	vpc.c.

Thu Aug 14 11:46:46 1997  Andrew McCallum  <mccallum@jprc.com>

	* naivebayes.c: Remove all references and code for
	SCORE_WITH_LOG_PROBABILITIES. Use KL method instead.
	(bow_method_crossentropy): Removed, and all related structures and
	functions.

	* opts.c (bow_options): Remove "naivebayes-score-with-log-probs"
	option.
	(parse_bow_opt): Don't handle it anymore.

	* naivebayes.c: Add a naivebayes-specific command-line option by
	using "argp child".
	(naivebayes_argp_m_est_m): New static variable.
	(naivebayes_options): New argp structure. New command-line option
	"naivebayes-m-est-m".
	(naivebayes_parse_opt): New function.
	(naivebayes_argp): New structure.
	(naivebayes_argp_child): New structure.
	(_register_method_naivebayes): Add the argp child.
	(bow_naivebayes_score): Comment out assertion that
	(loo_class == -1) because it trips up rainbow-h.

	These changes were made a while ago.

	* rainbow-h.c (hier_recursive_set_rankings): Pass new LOO argument
	to bow_barrel_score.
	(classify_single_doc): Likewise.
	(hier_barrel_set_vpc_and_populate_lower_branches): Likewise.
	(hier_barrel_prob_wi_in_ci): Add two new pass-by-ref arguments
	that return certain counts. Pass new arguments.
	(check_prob_wi_in_ci): Pass new arguments.
	(_hier_barrel_local_score): Call above function with new
	arguments, and print them out.
	(main): Switch back to using POPULATE_BY_SCORING and HIER_NIECE
	options by default.

Wed Aug 13 16:44:07 1997  Andrew McCallum  <mccallum@jprc.com>

	* lex-simple.c (bow_lexer_simple_open_text_fp): Print error
	message if popen() call failed.

	* opts.c (bow_argp_add_child): Change assertion. Add call to
	memset(), which should be unnecessary.

	Before this code was added, some inlinks WebKB files were being
	declared as "nontext" and skipped because many lines had the same
	length.
	* istext.c (bow_fp_is_text): Pay attention to
	BOW_ISTEXT_AVOID_UUENCODE.

	* opts.c (bow_istext_avoid_uuencode): Declare new global variable.
	(bow_options): New option "istext-avoid-uuencode".
	(parse_bow_opt): Handle it.

	* bow/libbow.h (bow_istext_avoid_uuencode): New global variable
	set by command-line option.
	(bow_lex_pipe_command): Make it extern!

	* kl.c (bow_kl_score): Give more detailed error message for LOO
	negative probabilities.

	Before this code was added, some WebKB files were being skipped
	because the non-MIME-header part was already buffered in STDIO.

	* lex-simple.c (bow_lexer_simple_open_text_fp): When using
	BOW_LEX_PIPE_COMMAND, make sure that the file descriptor file
	position matches the stdio FP position, otherwise we can get a
	premature EOF because the stdio has already read much of the file
	for buffering.

Mon Aug 11 11:51:11 1997  Andrew McCallum  <mccallum@jprc.com>

	* info_gain.c (bow_infogain_per_wi_print): If NUM_TO_PRINT is 0,
	then print infogain of all words, not zero words.

	* bow/libbow.h (bow_model_next_wv): Declare new split function.

Mon Jul 14 11:09:04 1997  Andrew McCallum  <mccallum@jprc.com>

	* rainbow-stats.pl (overall_accuracy): Shorten the label before
	the numbers.

	* istext.c (bow_fp_is_text): Initialize
	MAX_LINE_LENGTH_HISTOGRAM_LENGTH to avoid warning.

	* istext.c (bow_fp_is_text): Re-enable the uuencode-block
	detection. Now, in order to reject the file, insist that the
	length of the lines with the most common length be greater than or
	equal to 50. Hopefully this will not falsely reject HTML files as
	it did before.

Tue Jul 1 08:39:25 1997  Andrew McCallum  <mccallum@jprc.com>

	* kl.c (bow_kl_score): Remove assertion that SCORE_INCREMENT be
	non-zero. It can be zero when PR_W_C == PR_W_D, then
	LOG(PR_W_C/PR_W_D) will be zero, and SCORE_INCREMENT will be zero.

Mon Jun 30 17:41:06 1997  Karl Kleinpaste  <karl@jprc.com>

	* rainbow.c (rainbow_serve): Added.
	(rainbow_socket_init): Added.
	(rainbow_parse_opt): Added SERVER_KEY case.
	(rainbow_query): Modified FILE * handling for use of other than
	stdin/stdout.
	(main): Added query-server handling.

Sat Jun 28 12:22:30 1997  Andrew McCallum  <mccallum@jprc.com>

	* rainbow.c (rainbow_test_files): Temporarily comment out code
	that removes some of the training documents from training until we
	add a scheme that really makes the default test percentage 0.
	(main): Put the call of rainbow_test_files after doing things
	necessary to update the class/word weights for the command-line
	options. Temporarily, ALWAYS rebuild the VPC model, even if none
	of the parameters change, because the weights read from disk were
	bad; find out why eventually!

	* prind.c (bow_prind_score): When BOW_PRINT_WORD_SCORES, also
	print PR_W_C.

	* prind.c (bow_prind_score): When all pre-normalized scores are
	zero, set normalized scores to -1.0/#classes, don't leave them as
	zero. [Perhaps we should set the scores to the class priors?
	Although this does not fall out of the PrInd derivation.]

	* kl.c (bow_kl_score): When all pre-normalized scores are zero,
	set normalized scores to -1.0/#classes, not -9999.

	* arrow.c (arrow_query): Pass LOO_CV argument to score.

Thu Jun 26 14:48:28 1997  Andrew McCallum  <mccallum@jprc.com>

	* lex-simple.c (bow_lexer_simple_open_text_fp): Attend to
	BOW_LEX_PIPE_COMMAND and implement it.

	* opts.c (bow_lex_pipe_command): New global variable.
	(bow_options): New command-line option "lex-pipe-command".
	(parse_bow_opt): Handle it.

	* bow/libbow.h: Declare new global variable.

	* istext.c (bow_fp_is_text): Move local variables to avoid GCC