Understanding Data and Machine Learning
Content
- Classification vs. regression, parametric and non-parametric supervised learning, regularization to avoid overfitting, minimum description length
- Cluster analysis, market basket analysis, recommendations
- Statistical principles: samples, optimal estimators, distribution, density, and cumulative distribution functions; measurement scales: nominal, ordinal, interval, and ratio; hypothesis tests and confidence intervals
- Computational networks of differentiable parameterized elementary units, learning of network parameters with gradient descent, backpropagation, deep learning: embedding spaces and autoencoders, unsupervised learning
- Probabilistic foundations: probabilities, random variables, conditional probabilities, independence, distributions
- Bayesian networks for specifying distributions by factoring, plate notation, queries, query-answering algorithms, learning methods for complete data, regularization from a probabilistic perspective
- Inductive learning: version space, concept of entropy, decision trees, learning of rules
- Ensemble methods: Bagging (random forests), boosting (XGBoost, CatBoost)
- Clustering: k-means, DBSCAN, analysis of variance (ANOVA), t-test, linear discriminant analysis
- Forecasting from time series (ARIMA, autoregressive integrated moving average)
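The gradient-descent topic above can be illustrated with a minimal sketch: fitting a single weight to hypothetical noisy data by repeatedly stepping against the gradient of the mean squared error. The data, learning rate, and step count are illustrative assumptions, not part of the course material.

```python
import numpy as np

# Hypothetical data: y = 3x plus noise; we fit one weight w by
# minimizing the mean squared error L(w) = mean((w*x - y)^2).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)

w = 0.0    # initial parameter
lr = 0.1   # learning rate (assumed, not tuned)
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)  # dL/dw
    w -= lr * grad                       # one gradient-descent step
```

After training, `w` should be close to the true slope of 3.0; backpropagation generalizes exactly this idea to networks of many parameterized units by computing the gradients layer by layer.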
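Conditional probabilities and Bayes' rule, listed under the probabilistic foundations above, can be shown with a short worked example. All numbers here (base rate, sensitivity, false-positive rate) are hypothetical.

```python
# Bayes' rule with hypothetical numbers: a test for a condition
# affecting 1% of a population, with 95% sensitivity and a 10%
# false-positive rate.
p_c = 0.01                  # P(condition)
p_pos_given_c = 0.95        # P(positive | condition)
p_pos_given_not_c = 0.10    # P(positive | no condition)

# Law of total probability: P(positive).
p_pos = p_pos_given_c * p_c + p_pos_given_not_c * (1 - p_c)
# Bayes' rule: P(condition | positive).
posterior = p_pos_given_c * p_c / p_pos
```

Despite the high sensitivity, the posterior is below 9%, because the condition is rare; this base-rate effect is a standard motivation for reasoning with conditional probabilities.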
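The entropy concept used for decision-tree learning above can be sketched in a few lines: Shannon entropy measures the impurity of a set of class labels, and a split is chosen to reduce it. The label lists below are illustrative assumptions.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * log2(k / n) for k in counts.values())

# A 50/50 split is maximally uncertain: entropy is 1 bit.
mixed = entropy(["yes", "no", "yes", "no"])
# A pure node is fully certain: entropy is 0 bits.
pure = entropy(["yes", "yes", "yes"])
```

Decision-tree learners such as ID3 pick, at each node, the attribute whose split yields the largest drop in entropy (the information gain).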
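The k-means algorithm from the clustering bullet above can be sketched as an alternation of two steps: assign each point to its nearest center, then move each center to the mean of its points. The two synthetic blobs are assumed test data, not from the course.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: index of the nearest center per point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Two hypothetical blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 0.3, (50, 2))
blob_b = rng.normal(5.0, 0.3, (50, 2))
centers, labels = kmeans(np.vstack([blob_a, blob_b]), k=2)
```

With well-separated blobs the two returned centers land near the blob means; unlike this sketch, production implementations (e.g. scikit-learn's `KMeans`) add smarter initialization and convergence checks.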
Practical work
- Programming language Python with associated libraries from the field of data science (NumPy, SciPy, Pandas, matplotlib, NLTK) as well as the basics of databases
- Machine learning with Python (scikit-learn)
- Deep learning with Python (PyTorch)
- Tools for scientific work: Markup languages (LaTeX, Markdown), version management (git), development environments
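A minimal taste of the NumPy/Pandas workflow named above: tabular data in a Pandas `DataFrame`, a group-wise aggregate, and an element-wise NumPy-style computation. The small table is invented for illustration.

```python
import pandas as pd

# Hypothetical measurements in two groups.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 14.0],
})

# Group-wise mean with Pandas.
means = df.groupby("group")["value"].mean()

# Element-wise standardization (z-scores) over the whole column.
df["z"] = (df["value"] - df["value"].mean()) / df["value"].std(ddof=0)
```

`groupby` plus vectorized column arithmetic covers a large share of routine data-preparation work before any model from scikit-learn or PyTorch is applied.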