Each student is expected to work on a class project, which carries 40% of the grade and should span about 6 weeks. The general rules for the project are as follows:
- Each project should be done in a team of two. It is possible to do a single-person project with prior approval.
- The clock for the projects starts on Nov 2, 2010.
- Each team should send a one-paragraph proposal on the selected project for approval. The proposal must clearly state the problem description, the intended outcome, and the key theoretical and/or experimental components.
- All projects must contain experimental results on at least 3 datasets. Prior approval is needed for any exceptions (please have a very good reason for an exception). Each dataset should contain at least 100K points.
- Evaluation will be based on three things: the project presentation, the final report, and the novelty of ideas, observations and results.
- Timeline:
  - Nov. 4 (11:59 pm): Submission of project proposal (one paragraph).
  - Dec. 5 (11:59 pm): Submission of presentation slides to Junfeng He. We will combine all presentations on a single machine to avoid computer swapping.
  - Dec. 7 (12:35-2:25 pm): Project presentation. Each presentation will be 10 minutes long, with 2 minutes for questions.
  - Dec. 14 (11:59 pm): Submission of final report. Each report is expected to be 4 pages long. A LaTeX template will be provided.
Students are encouraged to pick a problem that is relevant to their research/application. However, if you are looking for ideas, here is an optional list of projects:
Exploratory
- Kernel Logistic Regression vs SVM: Learn kernel logistic regression with subsampled data (with randomized clustering) and L1 penalty. Show comparisons with SVM in speed and accuracy.
- Develop an improved algorithm for Kernel Logistic Regression with L1 regularization and compare it with other online (first-order and quasi-second-order) methods. Keep in mind the non-differentiability of the L1 term. Also report how much hand tuning of parameters is necessary.
- Combine the underlying data distribution with randomized hashing, e.g., LSH. Assume that the underlying distribution is a finite mixture model (i.e., contains clusters).
- Develop a hashing technique for sequence data, e.g., time series. First define a similarity measure for time-series data and then try to preserve locality with respect to it.
- Develop a hashing method that generates binary codes, uses weighted Hamming distance to measure similarity, and improves over unweighted state-of-the-art methods.
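For teams considering the hashing projects above, the following is a minimal sketch of two ingredients they mention: binary codes from a randomized hash and a weighted Hamming distance between codes. It assumes sign-random-projection LSH (one random hyperplane per bit), which is only one possible choice of randomized hashing; the function names and the per-bit weights are hypothetical and chosen purely for illustration.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one Gaussian random hyperplane normal per output bit (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(x, hyperplanes):
    """Binary code: bit i is 1 if x lies on the non-negative side of hyperplane i."""
    return [1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) >= 0 else 0
            for w in hyperplanes]

def weighted_hamming(a, b, weights=None):
    """Sum the weights of the bit positions where codes a and b differ.

    With weights=None every bit counts 1.0, which gives the usual
    (unweighted) Hamming distance.
    """
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w for w, a_i, b_i in zip(weights, a, b) if a_i != b_i)

# Illustrative usage with made-up points and made-up weights.
planes = make_hyperplanes(num_bits=8, dim=3, seed=0)
code_p = hash_code([1.0, 2.0, -0.5], planes)
code_q = hash_code([0.9, 2.1, -0.4], planes)
bit_weights = [1.0, 1.0, 0.5, 0.5, 2.0, 2.0, 1.0, 1.0]  # hypothetical weights
print(code_p, code_q)
print(weighted_hamming(code_p, code_q), weighted_hamming(code_p, code_q, bit_weights))
```

How the per-bit weights are chosen or learned is left open here; that choice is exactly what a project on weighted Hamming distance would need to investigate.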
Experimental/Survey Style
For all the following projects, at least 5 datasets should be used.
- Large-scale study of randomized methods vs PCA on datasets with 1K-10K dimensions. Compare exact and inexact PCA methods (randomized as well as sampling-based decomposition).
- Large-scale clustering using randomized methods + Kmeans vs Kmeans in the original space. Observe sensitivity with respect to various parameters. Implement EM and compare it as well.
- Compare several state-of-the-art tree and hashing methods.
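As a starting point for the randomized-method comparisons above, here is a minimal sketch of Gaussian random projection, one common randomized dimensionality-reduction method. The project list does not say which randomized methods to use, so this choice, the function names, and the dimensions are all assumptions made for illustration. The sketch projects points to a lower dimension and checks how well a pairwise distance is preserved, which is the property that clustering in the projected space relies on.

```python
import math
import random

def random_projection_matrix(out_dim, in_dim, seed=0):
    """Gaussian matrix scaled by 1/sqrt(out_dim) so squared norms are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(x, matrix):
    """Map a point x (length in_dim) to the lower-dimensional space (length out_dim)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in matrix]

def euclidean(a, b):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

# Illustrative usage on two synthetic points (not a real dataset).
rng = random.Random(1)
in_dim, out_dim = 200, 64
x = [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
y = [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
R = random_projection_matrix(out_dim, in_dim, seed=2)
ratio = euclidean(project(x, R), project(y, R)) / euclidean(x, y)
print("projected / original distance ratio:", ratio)
```

A ratio close to 1 indicates that the distance survived the projection; a real study would measure this over many pairs and several target dimensions, alongside clustering quality and running time.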