2016-02-14

Statistical Methods for Data Mining


Course description: This course provides an accessible overview of the field of data mining and statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This course presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this course is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open-source statistical software platform.

 

Prerequisites: Probability and Mathematical Statistics, R programming skills

Class time & place: Mon 10:10-11:50, N301

                    Wed 10:10-11:50, D235

Course Text:

1.     Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. 

2.     James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning, Springer, New York, 2013.

Contents:

1.          Introduction

2.          Statistical Learning 

3.          Linear Regression

4.          Classification

5.          Resampling Methods

6.          Linear Model Selection and Regularization

7.          Moving Beyond Linearity

8.          Tree-Based Methods

9.          Support Vector Machines

10.       Unsupervised Learning


Syllabus: 

syllabus.pdf


Slides: 

ch1 introduction.pdf   

ch2 statistical_learning.pdf

ch3 linear_regression.pdf

ch4 classification.pdf

ch5 cv_boot.pdf

ch6 model_selection2.pdf

ch7 nonlinear2.pdf

ch8 trees.pdf

ch9 svm.pdf

ch10 unsupervised.pdf

ch11 AR.pdf

KNN.pdf




Code:

KNN.rar

ISL_ch3.R.zip

ISL_ch4.R.zip

ISL_ch5.R.zip

ISL_ch6.R.zip

ISL_ch7.rar

ISL_ch8.rar

ISL_ch 9_svm.rar

ISL_ch10.rar

ISL_ch11 Apriori.rar

Data:

credit.csv.zip

advertising.csv.zip

BASKETS1n.zip


Homework:

HW1:  3.7 Exercises 13 & 15

HW2:  4.7 Exercises 3 & 5 & 11

HW3:  5.4 Exercises 8 & 9

HW4:  6.8 Exercises 8 & 9

HW5:  7.9 Exercises 5 & 11

HW6:  8.4 Exercises 9 & 10

HW7:  9.7 Exercises 3 & 7


Please send electronic copies of homework to dataminingxmu@163.com


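Final exam:

  The final exam mainly assesses the project and has two parts: a presentation and the final project analysis report.

  Grading: the presentation (40%) and the final project analysis report (60%) together make up the final exam grade, which is then combined with attendance, quiz, and homework grades to give the final grade for the course.

  Notes: (1) Each group has 1-2 members. (2) The project topic is your own choice; it can be either a methodological innovation or an applied case study.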


1. Presentation:  

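   (1) Presentations start in week 14 and are expected to take 3 class sessions, with 8 groups per session and 10 minutes per group.

   (2) Present using PPT or LaTeX slides.

   (3) Methodological-innovation projects should report: research motivation, literature review, methodology, simulations (theoretical proofs are a plus), an application case, and conclusions.

   (4) Applied case studies should report: research motivation (especially the business significance), literature review, data description, a comparison of the different methods applied, and conclusions.

   (5) The presentation order is as follows: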


Week 14 Monday | Week 14 Wednesday | Week 15 Monday
范新妍、刘蒙阙
丁逸飞
王泽贤、黄俊锋赵雪
吴培堃、李森雷
陈子岚、陈梦莹林颖范菊逸
林静 吴越林洁然杨少颖
韦薇刘嘉璐 余婉露何琪琪
陈芸芸林双全赵梦峦、周峙利
张书张晓晨、汪清扬林文胤,陈鸣
李衍杨梦




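2. Project report:

    (1) Research report (either a Word document or a LaTeX .tex file)

    (2) Code files (so that the analysis can be reproduced, R is preferred; Python and other languages are also accepted)

    (3) Data description (the source of the data, its business significance, and the meaning of each variable)

    (4) Data

  Please send the project to dataminingxmu@163.com before June 17. If the data files are too large, use the large-attachment option or provide a download link.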









