Data Analysis and Visualisation Courses

Introduction to Data Mining with R

Duration: 5 Days

Course Background

In general terms, data mining comprises techniques and algorithms for discovering interesting patterns in large datasets. Hundreds of algorithms, if not more, now exist for tasks such as frequent pattern mining, clustering and classification. Understanding how these algorithms work and how to use them effectively is a continuing challenge for data mining analysts, researchers and practitioners, in particular because an algorithm's behaviour, and the patterns it produces, may change significantly as a function of its parameters. In practice, most of the data mining literature is too abstract about the actual use of the algorithms, and parameter tuning is usually a frustrating task. On the other hand, a large number of implementations are available, such as those in the R project, but their documentation focuses mainly on implementation details and offers little discussion of the parameter-related trade-offs associated with each of them. This course aims to provide a mix of practice and theory, as well as "filling in knowledge gaps" and "dusting away cobwebs of topics not visited for several years".

Course Prerequisites and Target Audience

Attendees are expected to have a sound knowledge of R programming, statistics and relational databases.

Course Outline

Overview of R
Overview of PostgreSQL / MySQL
Overview of Data Mining and various aspects of using data mining techniques

Historical data vs. operational data
Data warehouses and data marts
Philosophy and concepts of data mining
Data granularity issues
Star Schemas
Data quality issues
Data complexity issues
Computational complexity issues

Data Mining Project Life Cycle

Problem definition - characterisation of problem and possible solutions / approaches
Data evaluation - accessibility, evaluation and data quality
Feature extraction and enhancement
Prototyping - prototype planning and model development
Model evaluation
Implementation
Iteration

Methodologies for mining classification and prediction patterns and available R packages

Regression models
Bayes classifiers
Decision trees
Multi-layer feedforward artificial neural networks
Support vector machines
Supervised clustering

Methodologies for mining clustering and association patterns - Theory and R Practice

Hierarchical clustering
Partitional clustering
Self-organising maps
Probability distribution estimation
Association rules
Bayesian networks

Methodologies for mining data reduction patterns - Theory and R Practice

Principal components analysis
Multi-dimensional scaling
Latent variable analysis

Methodologies for mining outlier and anomaly patterns - Theory and R Practice

Univariate control charts
Multivariate control charts

Methodologies for mining sequential and time series patterns - Theory and R Practice

Autocorrelation-based time series analysis
Hidden Markov models for sequential pattern mining
Wavelet analysis
Hilbert transform
Nonlinear time series analysis

Data Mining Genres - case studies