This post considers the very basics of the SVM problem in the context of hard margin classification and linearly separable data. There is minimal discussion of soft margins and no discussion of kernel tricks, the dual formulation, or more advanced solving techniques. Some prior knowledge of linear algebra, calculus, and machine learning objectives is necessary.

The first thing to understand about SVMs is what exactly a “support vector” is. To understand this, it helps to first understand the goal of SVM, as it differs slightly from that of logistic regression and other classification techniques.

SVM aims to draw a decision boundary through linearly separable classes such that the boundary is as robust as possible. This means that the position of the boundary is determined by those points which lie nearest to it. The decision boundary is a line or hyperplane that has as large a distance as possible from the nearest training instance of either class, as shown in the plot below. …
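The max-margin idea can be sketched in a few lines with scikit-learn. The data below is made up for illustration, and a very large `C` is used to approximate a hard margin (an assumption, since `SVC` is a soft-margin solver):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (illustrative data).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C penalizes margin violations heavily,
# approximating a hard margin on separable data.
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)

# The boundary's position is determined only by these nearest points,
# the support vectors.
print(clf.support_vectors_)
```

Only the support vectors matter: moving any other point (without crossing the margin) leaves the boundary unchanged.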

Economics and data science can look very similar at times. Many techniques that have been used by economists for decades are integral to data science and machine learning. The intersection comes not only from the close relationship of both fields to statistics, but also from the mathematics that drives their modeling processes.

In this post I will show the usefulness of applying economic methods to a data science-like problem. Although the technique used in this blog post is slightly more complicated than necessary for the problem, practicing on an easy task is a great way to learn a complicated procedure. …

This post discusses why logistic regression necessarily uses a different loss function than linear regression. First, a simple yet inefficient way to fit logistic regression will be presented; then a slightly less simple but much more efficient way will be explained and compared.
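As a preview, the loss in question is the log-loss (binary cross-entropy), which can be minimized with plain gradient descent. The sketch below uses made-up data and a hand-rolled update rule for illustration, not the specific methods compared later:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two well-separated 1-D clusters (illustrative data).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Gradient descent on the mean log-loss; its gradient has the
# simple form (p - y) * x, which is what makes this loss so convenient.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    p = sigmoid(X * w + b)
    w -= lr * np.mean((p - y) * X)
    b -= lr * np.mean(p - y)

preds = (sigmoid(X * w + b) > 0.5).astype(int)
print("training accuracy:", np.mean(preds == y))
```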

Linear regression is the predecessor of logistic regression for most people studying statistics or machine learning. …

I recently watched the 2020 Netflix docuseries entitled *Challenger: The Final Flight*, which tells the fascinating yet heartbreaking story of the tragedy of NASA’s Challenger space shuttle. After watching the series, I was inspired to explore the simple statistical modeling that describes the event. For those who have yet to watch the documentary or who are unfamiliar with Challenger’s story, I’ll start with a brief overview.

The 1980s marked a very exciting time for NASA and space exploration. NASA was experiencing success after success in their relatively new Space Shuttle program. Astronauts were being sent to and from orbit on the same vessel, edging NASA closer and closer to their goal of commercial space flight. …

Imbalanced class sizes can be a serious problem in classification if one’s results are not interpreted appropriately. Achieving high accuracy, the so-called “white whale” of most classification problems, becomes trivial when one class dominates: simply predicting the majority class yields a high score. Although it is often better to optimize metrics such as sensitivity and specificity, this can be difficult with many of the popular supervised learning models. For this reason, one might consider turning to unsupervised/semi-supervised methods instead.

One common application of unsupervised/semi-supervised learning is anomaly detection. In this specific context, unsupervised learning focuses on outlier detection, or identifying anomalies within the known data, while semi-supervised learning focuses on novelty detection, or looking for anomalies that come from new data. …
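The outlier-vs-novelty distinction maps directly onto scikit-learn's `LocalOutlierFactor`, whose `novelty` flag switches between the two modes. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
inliers = rng.normal(0, 1, (100, 2))
far_point = [[6.0, 6.0]]

# Outlier detection (unsupervised): find anomalies within the known data.
X = np.vstack([inliers, far_point])
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 marks outliers
print("flagged as outlier:", labels[-1] == -1)

# Novelty detection (semi-supervised): fit on clean data only,
# then score previously unseen points.
nov = LocalOutlierFactor(n_neighbors=20, novelty=True)
nov.fit(inliers)
print("new point is novel:", nov.predict(far_point)[0] == -1)
```

Note that with `novelty=True` the estimator must be fit on inliers only and scored on new data; calling `fit_predict` in that mode is not supported.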

Performing multiple regression analysis from a large set of independent variables can be a challenging task. Identifying the best subset of regressors for a model involves balancing concerns such as bias, multicollinearity, exogeneity/endogeneity, and threats to external validity. Such problems become difficult to understand and control in the presence of a large number of features. Professors will often tell you to “let theory be your guide” when going about feature selection, but that is not always so easy.

This post considers the issue of multicollinearity and suggests a method of avoiding it. Proposed here is not a “solution” to collinear variables, nor is it a perfect way of identifying them. …
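As background (and not necessarily the method this post proposes), a standard way to quantify collinearity is the variance inflation factor, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A minimal NumPy sketch on made-up data:

```python
import numpy as np

def vif(X, j):
    # Regress column j on the other columns (with an intercept),
    # then convert the resulting R^2 into a variance inflation factor.
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                  # independent of both
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])
```

A common rule of thumb treats VIF above 5 or 10 as a sign of problematic collinearity; here the first two columns blow up while the independent one stays near 1.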

Classification problems in supervised machine learning are often troubled by the issue of imbalanced class sizes. Given binary classified data, an imbalanced stratification of the two classes will bias the predictions of a model fit to it. A model trained on data made up of 1,000 samples labeled class “0” and 100 samples labeled class “1” could naively predict class “0” for every test instance and report roughly 91% accuracy (1,000/1,100). Such an accuracy score is deceptive, as the model is not actually “learning” any trends from the data. This can cause serious problems in deployment. …
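The accuracy trap is easy to demonstrate with the class counts above and a "model" that always predicts the majority class:

```python
import numpy as np

# 1,000 class-0 samples and 100 class-1 samples, as in the example above.
y_true = np.concatenate([np.zeros(1000), np.ones(100)])
y_pred = np.zeros_like(y_true)  # naive majority-class predictor

accuracy = np.mean(y_pred == y_true)
recall_class1 = np.mean(y_pred[y_true == 1] == 1)

print(f"accuracy: {accuracy:.3f}")             # looks impressive
print(f"class-1 recall: {recall_class1:.3f}")  # reveals nothing was learned
```

The headline number looks strong, yet the minority class is never detected, which is exactly why accuracy alone is a poor metric here.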