AMS 598, Big Data Analysis
This course introduces the application of supercomputing to statistical data analysis, particularly for big data. Implementations of various statistical methodologies within a parallel computing framework are demonstrated throughout the lectures. The course covers (1) parallel computing basics, including interconnection network architecture, communication methodologies, algorithms, and performance measurement, and (2) their applications to modern data mining techniques, including modern variable selection/dimension reduction, linear/logistic regression, tree-based classification methods, kernel-based methods, non-linear statistical models, and model inference/resampling methods.
Prerequisites: AMS 507, AMS 580, and AMS 597
3 credits, ABCF grading
Text:
"Applied Parallel Computing", by Yuefan Deng; 2012; World Scientific Publishing Company; ISBN: 9789814307604 (recommended/optional)
"The Elements of Statistical Learning: Data Mining, Inference, and Prediction", by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; Second Edition; 2011; Springer Series in Statistics; ISBN: 9780387848570 (hardcover) (recommended/optional)
"Mining of Massive Datasets", by Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman; 2nd edition, 2014; Cambridge University Press; ISBN: 9781107077232 (recommended/optional)
Offered every Fall semester
Learning Outcomes:
Demonstrate knowledge of parallel computing basics:
- Node architecture, central processing units, and accelerators;
- Distributed and shared memory
Demonstrate skills with software architecture and R:
- Communication patterns and protocols;
- Process creation and management;
- MapReduce framework;
- Hadoop in R;
Demonstrate mastery of basic tools for big data analysis:
- Linear regression
- Logistic regression
- Dimension reduction
Demonstrate understanding of advanced methods for big data analysis:
- Classification and regression trees
- Random forest
- Gradient boosting
- Support vector machine
- Neural network
Demonstrate understanding of model selection and performance evaluation:
- Best subset; forward selection; backward selection
- Cross-validation
- Bootstrap