<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>Project Notes</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><style type="text/css">body { margin: 50px; font-family: Arial, Helvetica, sans-serif; font-size: 10pt; line-height: 25px; }.titletext { font-family: Arial, Helvetica, sans-serif; font-size: 18pt;}.section { font-family: Arial, Helvetica, sans-serif; font-size: 12pt; font-weight: bold;}</style></head><body><div align="center" class="titletext"> <p>Weka K-Nearest-Neighbor Development</p></div><hr align="center" width="300" size="1"><p><br> <br> <font class="section">Plan</font> <br> <br> The original plan was a vague "implementing a k-nearest-neighbor classifier in the Weka environment." Since then, the plan for this project has taken more definite shape. The second plan was to create a k-nearest-neighbor classifier that could be used in Weka Explorer. However, complications explained later have made that goal impractical. Even so, I had intended to implement a GUI-based program that performs k-nearest-neighbor classification with plenty of options, and that can still be done. </p><p><br> <font class="section">Process</font> </p><p>I started by learning how to develop applications in Weka from the book Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) by Ian H. Witten and Eibe Frank, which I borrowed. I later discovered that a tutorial PDF file included in the Weka distribution contained a slightly older version of the relevant chapters on Weka development, which I used after I returned the book.<br> <br> Using the book as a guide, I wrote a KNN class compliant with Weka specifications. Most of the class was written using Vim, a Vi-like editor. Later, I discovered that it would be impossible for me to compile the code against the Weka library, because I lack the storage space on the FSU servers to store the weka.jar file. 
As a result, I coded the rest of the file using NetBeans IDE 4.1 on my own computer and compiled it successfully.<br></p><font class="section">Weka Software Development</font><p> All classifiers in Weka extend weka.classifiers.Classifier. There are some important methods to override when developing a new classifier. The relevant methods in my KNN implementation are as follows:</p><p><strong>buildClassifier()</strong></p><p>This method takes an Instances object (the training set), which is the Weka representation of a data set, and builds the classifier. The k-nearest-neighbor, in its simplest form, does not need to be built before execution, so this method simply checks for invalid data and stores the Instances as member data to be used when the k-nearest-neighbor search is performed on test samples.</p><p><strong>classifyInstance()</strong></p><p>This method classifies a single instance (data point), and it is where the bulk of the work takes place. The distance is calculated between the given instance and every instance in the training set (previously stored by buildClassifier()). A running list of nearest neighbors is kept during this process; a LinkedList is used because a new entry can be inserted easily. The list is capped at size k, so that only the k nearest neighbors are retained. A tricky part is then counting the occurrences of each class in a flexible way, unbounded by the number of classes and accounting for non-integer class labels, which are common in Weka. This was achieved with a Hashtable, where a Double object (the class label) acts as the key and an Integer (the count) as the value. The values in the Hashtable are incremented as the list of k nearest neighbors is read. The predicted class of the test sample is whichever label has the highest count.</p><p>Some snags found later sank my attempt to make the KNN classifier work with Weka Explorer or the other Weka GUIs. 
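</p>
<p>The distance-plus-voting scheme described above can be sketched in plain Java, stripped of the Weka types. This is an illustrative sketch, not the actual project code: feature vectors are plain double arrays, and class labels are doubles in the spirit of Weka's non-integer labels.</p>

```java
import java.util.Hashtable;
import java.util.LinkedList;
import java.util.Map;

public class KnnSketch {

    // Euclidean distance between two feature vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Classify one test point against a training set.
     * train[i] is a feature vector; labels[i] is its (possibly
     * non-integer) class label.
     */
    static double classify(double[][] train, double[] labels,
                           double[] test, int k) {
        // Running list of the nearest neighbors, kept sorted by distance;
        // a LinkedList makes inserting a new entry at any position easy.
        LinkedList<double[]> nearest = new LinkedList<>(); // {distance, label}
        for (int i = 0; i < train.length; i++) {
            double d = distance(train[i], test);
            int pos = 0;
            while (pos < nearest.size() && nearest.get(pos)[0] <= d) pos++;
            nearest.add(pos, new double[] {d, labels[i]});
            if (nearest.size() > k) nearest.removeLast(); // cap the list at k
        }
        // Count class occurrences with a Hashtable:
        // Double label as the key, Integer count as the value.
        Hashtable<Double, Integer> votes = new Hashtable<>();
        for (double[] n : nearest) votes.merge(n[1], 1, Integer::sum);
        // The predicted class is whichever label has the highest count.
        double best = Double.NaN;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

<p>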
First of all, adding the KNN classifier so that it can be used in a Weka GUI involves going into the weka.jar file and modifying GenericPropertiesCreator.props to include the KNN's package. As it turns out, the tools I have cannot rebuild a .jar file in a way that actually works afterwards. I could look for a better jarring tool that might get better results, but I realized it is not really worth it: even if I could make the KNN executable by Weka Explorer, it would only be able to classify using a default k, because as far as I can find, Weka Explorer provides no way for a user to specify a classifier-specific argument. And although one could specify such arguments in Weka's Simple CLI, a command-line interface, that is not very impressive.</p><p>So instead of fixing something that would get me meager results, I decided to devote the rest of my time to making my own GUI-based program to run the KNN with. </p><p><font class="section">Homemade GUI</font> </p><p>"Seemed like a good idea at the time"</p><p>I proceeded to create a GUI. Not remembering how to do that by hand, I figured out how to make NetBeans do it for me. Each button and field allows the user to input parameters for classification.</p><p>The interface is simple: two tabs, one for loading datasets and the other for classifier settings, plus a text box for the classifier's results and a status window at the bottom so the user knows how the program is doing.</p><p>I got the basic KNN classifier to work perfectly; the partial-distance one, not so much. I cannot find the problem.</p><p>Here is a problem I found later: I cannot actually execute the program outside of NetBeans. Usually I would have an executable jar file, but in this case the jar refuses to execute. 
It is quite a problem.</p><p><font class="section">Theoretical Improvements</font> </p><p>"that'll be the day"</p><p>Some things I wanted to do on this project that I did not get to, and that would have really made this application better:</p><ul> <li>Working partial-distance functionality.</li> <li>A search-tree method, which is complicated because there are many parameters one could have, such as the number of branches per node (perhaps a different number for each level), how many levels the tree should have, or how many data points belong in each leaf node.</li> <li>A multi-threaded k-nearest-neighbor, which could divide the classification of data points into subsets executed concurrently. Threads are too complicated for the time constraints I have.</li> <li>The attributes to omit in the partial-distance technique could be listed by attribute name and made selectable, as opposed to the text field currently in the program.</li> <li>...</li></ul></body></html>