Each student is expected to work on a class project, which carries 40% of the grade and is expected to span about 6 weeks. The general rules for the project are as follows:
  1. Each project should be done in a team of two. It is possible to do a single-person project with prior approval.
  2. The clock for the projects starts on Nov 2, 2010.
  3. Each team should send a one-paragraph proposal on the selected project for approval. The proposal must clearly state the problem, the intended outcome, and the key theoretical and/or experimental components.
  4. All projects must contain experimental results on at least 3 datasets. Prior approval is needed for any exceptions (please have a very good reason for an exception). Each dataset should contain at least 100K points.
  5. Evaluation will be based on three things: the project presentation, the final report, and the novelty of the ideas, observations, and results.
  6. Timeline:
    • Nov. 4 (11:59 pm): Submission of project proposal (one paragraph)
    • Dec. 5 (11:59 pm): Submission of presentation slides to Junfeng He. We will combine all presentations on a single machine to avoid computer swapping.
    • Dec. 7 (12:35-2:25 pm): Project presentation. Each presentation will be 10 minutes long, with 2 minutes for questions.
    • Dec. 14 (11:59 pm): Submission of final report. Each report is expected to be 4 pages long. A LaTeX template will be provided.
Students are encouraged to pick a problem that is relevant to their research/application. However, if you are looking for ideas, here is an optional list of projects:
Exploratory
  1. Kernel Logistic Regression vs SVM: Learn kernel logistic regression with subsampled data (with randomized clustering) and an L1 penalty. Compare against SVM in speed and accuracy (a starter sketch follows this list).
  2. Develop an improved algorithm for Kernel Logistic Regression with L1 regularization and compare it with other online (first-order and quasi-second-order) methods. Keep in mind the non-differentiability of the L1 term, and investigate how much hand-tuning of parameters is necessary (a baseline sketch follows this list).
  3. Combine the underlying data distribution with randomized hashing, e.g., LSH. Assume that the underlying distribution is a finite mixture model (i.e., it contains clusters); a baseline sketch follows this list.
  4. Develop a hashing technique for sequence data, e.g., time series. First define a similarity measure for time-series data, then design a hash that preserves locality under that measure (an illustrative sketch follows this list).
  5. Develop a hashing method that generates binary codes scored with a weighted Hamming distance, and show that it improves over unweighted state-of-the-art methods (the distance itself is sketched after this list).
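For exploratory project 1, one way to prototype the comparison is to approximate the kernel with a subsampled (Nystroem) feature map and fit an L1-penalized linear logistic regression on top, timing it against a kernel SVM. This is only a minimal sketch assuming scikit-learn; uniform subsampling stands in for the clustering-based landmark selection mentioned above, and the dataset and hyperparameters are placeholders.

```python
# Minimal sketch (assumes scikit-learn): kernel LR approximated by a
# subsampled Nystroem feature map + L1-penalized linear logistic
# regression, timed against a kernel SVM. Data/params are placeholders.
import time
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Uniformly subsampled landmarks stand in for the full kernel matrix;
# clustering-based landmark selection would replace this step.
feats = Nystroem(kernel="rbf", gamma=0.1, n_components=200, random_state=0)
Z_tr, Z_te = feats.fit_transform(X_tr), feats.transform(X_te)

t0 = time.time()
klr = LogisticRegression(penalty="l1", solver="liblinear").fit(Z_tr, y_tr)
t_klr = time.time() - t0

t0 = time.time()
svm = SVC(kernel="rbf", gamma=0.1).fit(X_tr, y_tr)
t_svm = time.time() - t0

print(f"kernel LR: acc={klr.score(Z_te, y_te):.3f}, {t_klr:.2f}s")
print(f"SVM:       acc={svm.score(X_te, y_te):.3f}, {t_svm:.2f}s")
```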
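For project 2, a standard baseline that handles the non-differentiable L1 term is proximal gradient descent (ISTA), where a soft-thresholding step follows each gradient step. The NumPy sketch below is only that baseline, with untuned step size and regularization strength; an improved algorithm would start from here.

```python
# Minimal NumPy sketch of proximal gradient descent (ISTA) for
# L1-regularized logistic regression; soft-thresholding handles the
# non-differentiable L1 term. Step size and lambda are untuned placeholders.
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_logreg(X, y, lam=0.01, step=0.1, n_iters=500):
    """X: (n, d) features; y: labels in {-1, +1}. Returns a sparse w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)
        # Gradient of the average logistic loss log(1 + exp(-margin))
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w
```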
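For project 3, a natural baseline is sign hashing with i.i.d. Gaussian hyperplanes (the random-hyperplane flavor of LSH); the project would then bias or learn these projections from the assumed mixture structure. A minimal NumPy sketch of the baseline:

```python
# Baseline sketch: random-hyperplane LSH (sign hashing) in NumPy. The
# project would bias or learn these projections from the assumed mixture
# (cluster) structure instead of drawing them i.i.d. Gaussian.
import numpy as np

def lsh_codes(X, n_bits=16, seed=0):
    """Map rows of X to n_bits-bit binary codes via random hyperplanes."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], n_bits))  # one hyperplane per bit
    return (X @ H > 0).astype(np.uint8)

# Points with identical codes fall into the same hash bucket:
X = np.random.default_rng(1).standard_normal((1000, 32))
buckets = {}
for i, code in enumerate(lsh_codes(X)):
    buckets.setdefault(code.tobytes(), []).append(i)
```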
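For project 4, one illustrative (not prescribed) choice of similarity is the correlation of z-normalized series; after normalization this reduces to cosine similarity, which random-hyperplane hashing approximately preserves. A minimal sketch under that assumption:

```python
# Illustrative sketch (one of many possible similarity choices):
# correlation of z-normalized series equals cosine similarity after
# normalization, so random-hyperplane hashing approximately preserves it.
import numpy as np

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def ts_codes(series, n_bits=16, seed=0):
    """series: (n, T) equal-length time series -> (n, n_bits) codes."""
    rng = np.random.default_rng(seed)
    Z = np.apply_along_axis(znorm, 1, np.asarray(series, dtype=float))
    H = rng.standard_normal((Z.shape[1], n_bits))
    return (Z @ H > 0).astype(np.uint8)
```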
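For project 5, the weighted Hamming distance itself is straightforward; the research content lies in learning the codes and the per-bit weights. A minimal sketch of the distance:

```python
# Minimal sketch of the weighted Hamming distance; the project's
# contribution would be learning both the codes and the per-bit weights.
import numpy as np

def weighted_hamming(a, b, w):
    """a, b: 0/1 code arrays; w: nonnegative per-bit weights."""
    return float(np.sum(w * (a != b)))

a = np.array([1, 0, 1, 1], dtype=np.uint8)
b = np.array([1, 1, 0, 1], dtype=np.uint8)
w = np.array([0.1, 0.5, 2.0, 0.4])    # placeholder learned weights
print(weighted_hamming(a, b, w))      # 2.5 (bits 1 and 2 differ)
```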
Experimental/Survey Style
For all the following projects, at least 5 datasets should be used.
  1. Large-scale study of randomized methods vs PCA on datasets with 1K-10K dimensions. Compare exact and inexact PCA methods (randomized as well as sampling-based decompositions); a starter sketch appears at the end of this section.
  2. Large-scale clustering using randomized methods + k-means vs k-means in the original space. Observe sensitivity with respect to various parameters. Implement EM and compare it as well (a starter sketch appears at the end of this section).
  3. Compare several state-of-the-art tree and hashing methods.
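For survey project 1, the sketch below pits randomized PCA (a Halko-style range finder) against exact PCA via full SVD in plain NumPy; the matrix sizes and target rank are placeholders for the 1K-10K dimensional datasets required above, and a real study would also time both methods.

```python
# Minimal NumPy sketch: randomized PCA (Halko-style range finder) vs
# exact PCA via full SVD. Sizes/rank are placeholders.
import numpy as np

def randomized_pca(X, k, oversample=10, seed=0):
    """Approximate top-k singular values / right vectors of centered X."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    G = rng.standard_normal((Xc.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Xc @ G)        # orthonormal basis for range(Xc G)
    _, s, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    return s[:k], Vt[:k]

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 40)) @ rng.standard_normal((40, 1000))
s_rand, _ = randomized_pca(X, k=10)
s_exact = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:10]
print("exact:     ", s_exact.round(1))
print("randomized:", s_rand.round(1))
```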
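For survey project 2, a minimal sketch (assuming scikit-learn) that compares k-means after a Gaussian random projection, one example of a randomized method, with k-means in the original space; the synthetic blobs and target dimension are placeholders:

```python
# Minimal sketch (assumes scikit-learn): k-means after a Gaussian random
# projection vs k-means in the original space. Data and the reduced
# dimension d_new are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=5000, n_features=500, centers=10, random_state=0)

d_new = 50                              # reduced dimension (placeholder)
rng = np.random.default_rng(0)
R = rng.standard_normal((X.shape[1], d_new)) / np.sqrt(d_new)

for name, data in [("original ", X), ("projected", X @ R)]:
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(data)
    print(name, "ARI:", round(adjusted_rand_score(y, labels), 3))
```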