Cogs and Levers A blog full of technical stuff

Dependency Free Rust Binary

Introduction

In some situations, you may need to build yourself a bare-metal binary. Some embedded applications require this, as does systems programming where you are building for environments that don’t have libraries available to you.

In today’s post, we’ll go through building one of these binaries.

Getting Started

Let’s create a standard binary project to start with.

cargo new depfree

This will produce a project that will have the following structure:

.
├── Cargo.toml
└── src
    └── main.rs

Your application should have no dependencies:

[package]
name = "depfree"
version = "0.1.0"
edition = "2021"

[dependencies]

and, you shouldn’t have much in the way of code:

fn main() {
    println!("Hello, world!");
}

If we build and run this, we should see the very familiar message:

➜ cargo build
   Compiling depfree v0.1.0 (/home/michael/src/tmp/depfree)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.92s
➜ cargo run  
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/depfree`
Hello, world!

This is already a pretty minimal program. Now our job starts!

Standard Library

When you build an application, by default all Rust crates will link to the standard library.

We can get rid of this by using the no_std attribute like so:

#![no_std]
fn main() {
    println!("Hello, world!");
}

After a quick re-build, we quickly run into some issues.

error: cannot find macro `println` in this scope
 --> src/main.rs:3:5
  |
3 |     println!("Hello, world!");
  |     ^^^^^^^

error: `#[panic_handler]` function required, but not found

error: unwinding panics are not supported without std

Clearly, println is no longer available to us, so we’ll ditch that line.

#![no_std]
fn main() {
}

We also need to do some extra work around handling our own panics.

Handling Panics

Without the no_std attribute, Rust will set up a panic handler for you. When you have no_std specified, this implementation no longer exists. We can use the panic_handler attribute to nominate a function that will handle our panics.

#![no_std]

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop { }
}

fn main() {
}

Now we’ve defined a panic handler (called panic) that will do nothing more than just spin-loop forever. The return type of ! means that the function won’t ever return.

We’re also being told that unwinding panics are not supported when we’re not using the standard library. To simplify this, we can just force panics to abort. We can control this in our Cargo.toml:

[package]
name = "depfree"
version = "0.1.0"
edition = "2021"

[profile.release]
panic = "abort"

[profile.dev]
panic = "abort"

[dependencies]

We’ve just disabled unwinding panics in our programs.

If we give this another rebuild now, we get the following:

error: using `fn main` requires the standard library
  |
  = help: use `#![no_main]` to bypass the Rust generated entrypoint and declare a platform specific entrypoint yourself, usually with `#[no_mangle]`

This is progress, but it looks like we can’t hold onto our main function anymore.

Entry Point

We need to define a new entry point. By using the no_main attribute, we are free to no longer define a main function in our program:

#![no_std]
#![no_main]

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop { }
}

We really have no entry point now. Building this will give you a big horrible error and basically boils down to a linker error:

(.text+0x1b): undefined reference to `main'
/usr/bin/ld: (.text+0x21): undefined reference to `__libc_start_main'

Fair enough. By default, the linker pulls in the C runtime's startup code, which provides the process entry point (_start) and expects to hand control to a main function via __libc_start_main. We no longer have a main, so those references can't be resolved.

We want to own the entry point ourselves, so let's define our own _start function.

#![no_std]
#![no_main]

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop { }
}

#[no_mangle]
pub extern "C" fn _start() -> ! {
    loop { }
}

The no_mangle attribute makes sure that the _start function keeps its name; otherwise the compiler will use its own creativity and generate a name for you. When it does this, it mangles the name so badly that the linker can no longer find it.

The extern "C" is as you’d expect, giving this function C calling conventions.

The C Runtime

After defining our own _start entrypoint, we can give this another build.

You should see a horrific linker error.

The compiler and linker are trying to produce a program (for my system here at least) that uses the C runtime. As we’re trying to get dependency-free, we need to tell the build chain that we don’t want to use this.

In order to do that, we need to build our program for a bare metal target. It’s worth understanding what a “target triple” is and what it is made up of before you start using one. The Rust lang book has a great section on this.

These take the structure of cpu_family-vendor-operating_system. A target triple encodes information about the target of a compilation session.

You can see all of the targets available for you to install with the following:

rustc --print=target-list

You need to find one of those many targets that doesn’t have any underlying dependencies.

In this example, I’ve found x86_64-unknown-none: a 64-bit x86 target, from an unknown vendor, for no particular operating system (none). Install this target:

rustup target add x86_64-unknown-none

Let’s build!

➜ cargo build --target x86_64-unknown-none  
   Compiling depfree v0.1.0 (/home/michael/src/tmp/depfree)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.14s

We’ve got a build!

Output

Now we can inspect the binary that we’ve just produced. objdump tells us that we’ve at least made an elf64:

target/x86_64-unknown-none/debug/depfree:     file format elf64-x86-64

Taking a look at our _start entrypoint:

Disassembly of section .text:

0000000000001210 <_start>:
    1210:       eb 00                   jmp    1212 <_start+0x2>
    1212:       eb fe                   jmp    1212 <_start+0x2>

There’s our infinite loop.

Running, and more

Did you try running that thing?

As expected, the application just stares at you doing nothing. Excellent. It’s working.

Let’s add some stuff back in. With a little inline assembly, we can easily start to do some useful things.

We can import the asm macro from the core::arch module:

use core::arch::asm;

pub unsafe fn exit(code: i32) -> ! {
    let syscall_number: u64 = 60;

    asm!(
        "syscall",
        in("rax") syscall_number,
        in("rdi") code,
        options(noreturn)
    );
}

The syscall at 60 is sys_exit. In 64-bit style, we load it up in rax and put the exit code in rdi.

We can simplify our _start entry point now. It is marked unsafe so that it can call exit directly:

#[no_mangle]
pub unsafe fn _start() {
    exit(0);
}

We can now build this one:

➜ cargo build --target x86_64-unknown-none
   Compiling depfree v0.1.0 (/home/michael/src/tmp/depfree)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.24s

We can crack this one open now, and take a look at the underlying implementation.

Disassembly of section .text:

0000000000001210 <_ZN7depfree4exit17h5d41f4f9db19d099E>:
    1210:       48 83 ec 18             sub    $0x18,%rsp
    1214:       48 c7 44 24 08 3c 00    movq   $0x3c,0x8(%rsp)
    121b:       00 00 
    121d:       89 7c 24 14             mov    %edi,0x14(%rsp)
    1221:       b8 3c 00 00 00          mov    $0x3c,%eax
    1226:       0f 05                   syscall
    1228:       0f 0b                   ud2
    122a:       cc                      int3
    122b:       cc                      int3
    122c:       cc                      int3
    122d:       cc                      int3
    122e:       cc                      int3
    122f:       cc                      int3

0000000000001230 <_start>:
    1230:       50                      push   %rax
    1231:       31 ff                   xor    %edi,%edi
    1233:       e8 d8 ff ff ff          call   1210 <_ZN7depfree4exit17h5d41f4f9db19d099E>

Unsurprisingly, _start calls our exit implementation, whose name (as you’ll notice) has been mangled.

Let’s give it a run.

➜ ./depfree           
➜ echo $?
0

Conclusion

Success - we’ve made some very bare-bones software using Rust and are ready to move on to other embedded and/or operating system style applications.

Pixel Buffer Rendering in WASM with Rust

Introduction

In our previous post, we introduced writing WebAssembly (WASM) programs using Rust. This time, we’ll dive into pixel buffer rendering, a technique that allows direct manipulation of image data for dynamic graphics. This method, inspired by old-school demo effects, is perfect for understanding low-level rendering concepts and building your first custom graphics renderer.

By the end of this tutorial, you’ll have a working Rust-WASM project that renders graphics to a <canvas> element in a web browser.

Setting Up

Start by creating a new Rust project.

wasm-pack new randypix

Ensure that your Cargo.toml is configured for WASM development:

[package]
name = "randypix"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib", "rlib"]

[dependencies]
wasm-bindgen = "0.2"
web-sys = { version = "0.3", features = ["Window", "Document", "HtmlCanvasElement", "CanvasRenderingContext2d", "ImageData"] }
js-sys = "0.3"

[dev-dependencies]
wasm-bindgen-cli = "0.2"

Writing the Code

The heart of our implementation is the lib.rs file, which handles all interactions between Rust, WebAssembly, and the browser.

Here’s the complete code:

use wasm_bindgen::prelude::*;
use wasm_bindgen::Clamped;
use wasm_bindgen::JsCast;
use web_sys::{CanvasRenderingContext2d, HtmlCanvasElement, ImageData};

#[wasm_bindgen(start)]
pub fn start() -> Result<(), JsValue> {
    // Access the document and canvas
    let document = web_sys::window().unwrap().document().unwrap();
    let canvas = document
        .get_element_by_id("demo-canvas")
        .unwrap()
        .dyn_into::<HtmlCanvasElement>()
        .unwrap();

    let context = canvas
        .get_context("2d")?
        .unwrap()
        .dyn_into::<CanvasRenderingContext2d>()
        .unwrap();

    let width = canvas.width() as usize;
    let height = canvas.height() as usize;

    // Create a backbuffer with RGBA pixels
    let mut backbuffer = vec![0u8; width * height * 4];

    // Fill backbuffer with a simple effect (e.g., gradient)
    for y in 0..height {
        for x in 0..width {
            let offset = (y * width + x) * 4;
            backbuffer[offset] = (x % 256) as u8;        // Red
            backbuffer[offset + 1] = (y % 256) as u8;    // Green
            backbuffer[offset + 2] = 128;               // Blue
            backbuffer[offset + 3] = 255;               // Alpha
        }
    }

    // Create ImageData from the backbuffer
    let image_data = ImageData::new_with_u8_clamped_array_and_sh(
        Clamped(&backbuffer), // Wrap the slice with Clamped
        width as u32,
        height as u32,
    )?;

    // Draw the ImageData to the canvas
    context.put_image_data(&image_data, 0.0, 0.0)?;

    Ok(())
}

Explanation:

  1. Canvas Access:
    • The HtmlCanvasElement is retrieved from the DOM using web_sys.
    • The 2D rendering context (CanvasRenderingContext2d) is obtained for drawing.
  2. Backbuffer Initialization:
    • A Vec<u8> is used to represent the RGBA pixel buffer for the canvas.
  3. Filling the Buffer:
    • A simple nested loop calculates pixel colors to create a gradient effect.
  4. Drawing the Buffer:
    • The pixel data is wrapped with Clamped, converted to ImageData, and drawn onto the canvas with put_image_data.
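The buffer indexing is the part that tends to trip people up, so here is a small language-neutral sketch of it, written in Python purely for brevity. It mirrors the offset arithmetic from the Rust loop above; the function name fill_gradient is made up for this illustration and is not part of any library.

```python
def fill_gradient(width, height):
    """Build a flat RGBA buffer: four bytes per pixel, laid out row by row."""
    buf = bytearray(width * height * 4)
    for y in range(height):
        for x in range(width):
            offset = (y * width + x) * 4   # same formula as the Rust loop
            buf[offset] = x % 256          # red ramps across each row
            buf[offset + 1] = y % 256      # green ramps down each column
            buf[offset + 2] = 128          # constant blue
            buf[offset + 3] = 255          # fully opaque
    return buf


buf = fill_gradient(800, 600)

# Look up the pixel at x=300, y=2 using the same offset formula.
offset = (2 * 800 + 300) * 4
print(len(buf), list(buf[offset:offset + 4]))  # 1920000 [44, 2, 128, 255]
```

The red channel wraps every 256 pixels (300 % 256 is 44), which is why the gradient repeats across an 800 pixel wide canvas.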

Setting Up the Frontend

The frontend consists of a single index.html file, which hosts the canvas and loads the WASM module:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Rust WebAssembly Demo</title>
</head>
<body>
<canvas id="demo-canvas" width="800" height="600"></canvas>
<script type="module">
    import init from './pkg/randypix.js';
    init();
</script>
</body>
</html>

Building and Running the Project

Follow these steps to build and run your project:

  1. Build the WASM Module: Use wasm-pack to compile your Rust project into a WASM package:
   wasm-pack build --target web

  2. Serve the Project: Use a simple HTTP server to serve the index.html and the generated pkg folder:
   python -m http.server

  3. Open in Browser: Navigate to http://localhost:8000 in your browser. You should see a gradient rendered on the canvas.

Conclusion

In this tutorial, we demonstrated how to create and render a pixel buffer to a canvas using Rust and WebAssembly. By leveraging wasm-bindgen and web-sys, we seamlessly integrated Rust with web APIs, showcasing its potential for high-performance graphics programming in the browser.

This example serves as a foundation for more advanced rendering techniques, such as animations, interactive effects, or even game engines. Experiment with the backbuffer logic to create unique visuals or introduce dynamic updates for an animated experience!

WASM in Rust

Introduction

WebAssembly (WASM) is a binary instruction format designed for fast execution in web browsers and other environments. It enables developers to write code in languages like C, C++, or Rust, compile it to a highly efficient binary format, and execute it directly in the browser. This makes WASM an exciting technology for building high-performance applications that run alongside JavaScript.

Rust, with its emphasis on safety, performance, and WebAssembly support, has become a popular choice for developers working with WASM. In this tutorial, we’ll explore how to use Rust to produce and interact with WASM modules, showcasing its ease of integration with JavaScript.

Setup

To get started, we’ll use Rust’s nightly version, which provides access to experimental features. You can install it via rustup:

rustup install nightly

Next, install wasm-pack.

This tool seeks to be a one-stop shop for building and working with rust-generated WebAssembly that you would like to interop with JavaScript, in the browser or with Node.js.

cargo install wasm-pack

Now we’re ready to set up our project. Create a new WASM project using wasm-pack:

wasm-pack new hello-wasm

This will generate a new project in a folder named hello-wasm.

Project Structure

Once the project is created, you’ll see the following directory structure:

.
├── Cargo.toml
├── LICENSE_APACHE
├── LICENSE_MIT
├── README.md
├── src
│   ├── lib.rs
│   └── utils.rs
└── tests
    └── web.rs

3 directories, 7 files

To ensure the project uses the nightly version of Rust, set an override for the project directory:

rustup override set nightly

This tells Rust tools to use the nightly toolchain whenever you work within this directory.

The Code

Let’s take a look at the code generated in ./src/lib.rs:

mod utils;

use wasm_bindgen::prelude::*;

#[wasm_bindgen]
extern "C" {
    fn alert(s: &str);
}

#[wasm_bindgen]
pub fn greet() {
    alert("Hello, hello-wasm!");
}

This code introduces WebAssembly bindings using the wasm-bindgen crate. It defines an external JavaScript function, alert, and creates a public Rust function, greet, which calls this alert. This demonstrates how Rust code can interact seamlessly with JavaScript.

Building the WASM Module

To compile the project into a WASM module, run the following command:

wasm-pack build --target web

After a successful build, you’ll see a pkg folder containing the WASM file (hello_wasm_bg.wasm) and JavaScript bindings (hello_wasm.js).

Hosting and Running the Module

To test the WASM module in the browser, we need an HTML file to load and initialize it. Create a new index.html file in your project root:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>WASM Example</title>
</head>
<body>
    <script type="module">
        import init, { greet } from "./pkg/hello_wasm.js";

        // Initialize the WASM module and call the function
        (async () => {
            await init();
            greet();
        })();
    </script>
</body>
</html>

This script:

  1. Imports the init function and the greet function from the WASM module.
  2. Initializes the WASM module using init.
  3. Calls greet, which triggers the JavaScript alert.

To serve the project locally, start a simple HTTP server:

python -m http.server

Visit http://localhost:8000 in your browser. You should see a JavaScript alert box with the message "Hello, hello-wasm!".

Conclusion

WebAssembly, combined with Rust, opens up exciting possibilities for writing high-performance web applications. In this guide, we walked through the process of setting up a Rust project, writing a WASM module, and interacting with it in the browser. With tools like wasm-pack and wasm-bindgen, Rust provides a seamless developer experience for building cross-language applications.

Whether you’re adding computationally intensive features to your web app or exploring the power of WebAssembly, Rust is an excellent choice for the journey.

scikit-learn

Introduction

scikit-learn is one of the most popular Python libraries for machine learning, providing tools for supervised, unsupervised, and semi-supervised learning, as well as utilities for preprocessing, model selection, and more.

This guide explores key features of the library with practical examples.

Let’s get started by installing scikit-learn into our environment:

pip install scikit-learn

Supervised Learning

Supervised learning is a type of machine learning where the model learns to map input data to labeled outputs (targets) based on a given dataset. During training, the algorithm uses these labeled examples to understand the relationship between features and outcomes, enabling it to make accurate predictions or classifications on new, unseen data. This approach is commonly used for tasks like regression (predicting continuous values) and classification (categorizing data into discrete labels).

Regression

Regression models in supervised learning are used to predict continuous outcomes. These models establish relationships between input features and a target variable. Here’s a summary of the primary types of regression models available in scikit-learn:

  • Linear Regression: A simple and interpretable model that predicts outcomes based on a linear relationship between input features and the target variable. It’s ideal for tasks like predicting house prices based on square footage.

  • Ridge and Lasso Regression: These are regularized versions of linear regression that handle multicollinearity and high-dimensional data by adding penalties to large coefficients. Common applications include gene expression analysis and other domains with many correlated features.

  • Support Vector Regression (SVR): A kernel-based approach that captures non-linear relationships between inputs and outputs, making it effective for problems like stock price prediction.

  • Random Forest Regressor: An ensemble method that uses multiple decision trees to make robust predictions. It excels in tasks such as forecasting temperature or sales trends.

  • Gradient Boosting: This method iteratively improves predictions by focusing on poorly predicted samples. It’s commonly used for complex tasks like predicting customer lifetime value.

  • K-Neighbors Regressor: This algorithm predicts based on the average target value of the nearest neighbors in feature space, often used in property value estimation.

Regression models are essential for problems where understanding or predicting a continuous trend is the goal. scikit-learn’s implementations provide a range of options from simple to complex to handle varying levels of data complexity and feature interactions.

Linear Regression

Linear Regression predicts a target variable as a linear combination of input features.

Example: predicting house prices.

from sklearn.linear_model import LinearRegression

# Example data
X = [[1], [2], [3]]
y = [2, 4, 6]

model = LinearRegression()
model.fit(X, y)
print(model.predict([[4]]))  # Predict for new data

In this example, LinearRegression predicts the y value for an X value that it has not seen during training.

Predicting for [4] gives us:

[8.]
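To see where that number comes from, here is a minimal sketch of ordinary least squares for a single feature, using only the standard library. The helper name fit_line is made up for this illustration; it is not how scikit-learn is implemented internally, just the textbook slope and intercept formulas.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3], [2, 4, 6])
prediction = slope * 4 + intercept
print(slope, intercept, prediction)  # 2.0 0.0 8.0
```

The data lies exactly on the line y = 2x, so the fitted slope is 2, the intercept is 0, and the prediction for 4 is 8.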

Ridge and Lasso Regression

These methods add regularization to Linear Regression to handle high-dimensional data.

from sklearn.linear_model import Ridge

# Example data
X = [[1, 2], [2, 4], [3, 6]]
y = [1, 2, 3]

model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_)  # Regularized coefficients

This produces the following output:

[0.18181818 0.36363636]

The coef_ value here represents the coefficients (weights) of the features in the fitted model. These coefficients provide insight into the importance and contribution of each feature to the target variable.

These are important for:

  • Understanding feature importance: By examining the magnitude of the coefficients, you can gauge which features have the greatest impact on the target variable. When the features are on comparable scales, larger absolute values typically indicate more influential features.
  • Interpreting relationships: The sign of each coefficient indicates the direction of the relationship. A positive coefficient implies that an increase in the feature value increases the target value. A negative coefficient implies the opposite.
  • Feature selection: As a feature’s coefficient approaches zero, its contribution to the prediction diminishes, which can inform your decision on whether to keep it as a feature.
  • Predicting target changes: Each coefficient acts as a multiplier, telling you how much the predicted target changes for a one-unit change in that feature while the other features are held fixed.
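As a sanity check on where the Ridge coefficients come from, here is a minimal sketch that solves the ridge problem by hand for two features. It assumes the usual formulation: centre the data so that the intercept is not penalised, then solve (XᵀX + αI)w = Xᵀy. The function name ridge_two_features is made up for this illustration.

```python
def ridge_two_features(X, y, alpha):
    """Closed-form ridge coefficients for exactly two features."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(2)]
    mean_y = sum(y) / n
    # Centre features and target so the intercept drops out of the penalty.
    Xc = [[row[j] - means[j] for j in range(2)] for row in X]
    yc = [value - mean_y for value in y]
    # A = Xc^T Xc + alpha * I, written out as [[a, b], [b, d]].
    a = sum(r[0] * r[0] for r in Xc) + alpha
    b = sum(r[0] * r[1] for r in Xc)
    d = sum(r[1] * r[1] for r in Xc) + alpha
    # Right-hand side Xc^T yc.
    p = sum(r[0] * t for r, t in zip(Xc, yc))
    q = sum(r[1] * t for r, t in zip(Xc, yc))
    # Solve the 2x2 system with the explicit inverse.
    det = a * d - b * b
    return [(d * p - b * q) / det, (a * q - b * p) / det]


coef = ridge_two_features([[1, 2], [2, 4], [3, 6]], [1, 2, 3], alpha=1.0)
print(coef)  # approximately [0.1818, 0.3636], i.e. 2/11 and 4/11
```

The two features here are perfectly correlated (the second is twice the first). Plain least squares has no unique answer in that case, whereas the penalty term makes the system solvable.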

Support Vector Regression (SVR)

SVR captures non-linear relationships by using kernels.

from sklearn.svm import SVR

# Example data
X = [[1], [2], [3]]
y = [1.5, 3.7, 2.1]

model = SVR(kernel='rbf')
model.fit(X, y)
print(model.predict([[2.5]]))

The output here is:

[2.70105299]

The relationship between the features in X and the targets in y is now not linear. As a result, SVR predicts a value by finding a function that fits within a tolerance margin (called the epsilon-tube) around the data. The SVR constructor allows you to control the epsilon: passing a smaller value will aim for a more exact fit; a larger value will aim at better generalisation.
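The epsilon-tube idea is easier to see as a loss function. The sketch below is only an illustration of the epsilon-insensitive loss, not of how SVR is solved; the function name is made up here, and the default of 0.1 matches the default epsilon of scikit-learn's SVR.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors inside the tube cost nothing; outside it, cost grows linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# A prediction inside the tube is not penalised at all.
inside = epsilon_insensitive_loss(3.7, 3.65)
# A prediction outside the tube is penalised by the distance beyond it.
outside = epsilon_insensitive_loss(3.7, 3.2)
# Widening the tube makes the same miss free again.
wide_tube = epsilon_insensitive_loss(3.7, 3.2, epsilon=1.0)
print(inside, outside, wide_tube)
```

A small epsilon means almost every miss is penalised, pushing the model towards an exact fit. A large epsilon lets the model ignore small deviations.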

Random Forest Regressor

An ensemble method that averages predictions from multiple decision trees.

from sklearn.ensemble import RandomForestRegressor

# Example data
X = [[1], [2], [3]]
y = [10, 20, 30]

model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)
print(model.predict([[2.5]]))

The output (when I run it) of this:

[22.5]

Run this a few more times though, and you’ll see that you get different output values. As the name suggests, Random Forest Regressor uses randomness in building its model, which may lead to slightly different values each time the model is trained. This randomness helps improve the model’s generalisation ability. If you need repeatable results, pass a fixed random_state to the constructor.
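One source of that randomness is bootstrap sampling: each tree sees a sample drawn with replacement from the training rows. Here is a minimal sketch of the idea, assuming a hypothetical helper called bootstrap_sample and a made-up list of row ids. It shows why a fixed seed makes runs repeatable.

```python
import random


def bootstrap_sample(rows, seed):
    """Draw len(rows) items with replacement, using a seeded generator."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]


rows = list(range(10))  # stand-in for ten training rows

# The same seed always yields the same sample.
first = bootstrap_sample(rows, seed=42)
second = bootstrap_sample(rows, seed=42)
print(first == second)  # True

# Different seeds give each "tree" a different view of the data.
samples = {tuple(bootstrap_sample(rows, seed)) for seed in range(20)}
print(len(samples) > 1)
```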

Gradient Boosting

Boosting combines weak learners to achieve higher accuracy.

from sklearn.ensemble import GradientBoostingRegressor

# Example data
X = [[1], [2], [3]]
y = [10, 15, 25]

model = GradientBoostingRegressor()
model.fit(X, y)
print(model.predict([[2.5]]))

The output of this is:

[15.00004427]

Gradient Boosting builds models sequentially to improve predictions by focusing on errors that it observes from previous models. It works by doing the following:

  • Sequential model building: Each subsequent model attempts to correct the residual errors left by the models built before it.
  • Gradient descent optimisation: Rather than fitting the target variable directly, each new model is fitted to the negative gradient of the loss function, so every step moves the ensemble towards a lower loss.
  • Weighted contribution: Predictions from all models are combined, often using a weighted sum, to produce the final prediction.
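The steps above can be sketched in a few lines for the squared-error case, where the negative gradient is simply the residual. This is a toy version under stated assumptions: the weak learner is a depth-1 "stump", the names fit_stump and boost are made up, and none of this is scikit-learn's implementation.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor (a depth-1 tree) under squared error."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


model = boost([1, 2, 3], [10, 15, 25])
print(model(1), model(2), model(3), model(2.5))
```

Each round only nudges the predictions by a fraction (the learning rate) of what the stump suggests, which is why many rounds are needed before the training points are fitted closely.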

K-Neighbors Regressor

Predicts the target value based on the mean of the nearest neighbors.

from sklearn.neighbors import KNeighborsRegressor

# Example data
X = [[1], [2], [3]]
y = [1, 2, 3]

model = KNeighborsRegressor(n_neighbors=2)
model.fit(X, y)
print(model.predict([[2.5]]))

The output of this:

[2.5]

K-Neighbors Regressor is a non-parametric algorithm. It relies on the similarity between data points to predict the target value for a new input. It works in the following way:

  • Find neighbors: The algorithm identifies the K nearest data points (neighbors) in the feature space using a distance function (Euclidean/Manhattan/Minkowski).
  • Predict the value: The target value for the input is computed as the average of the target values of the K neighbors. By default every neighbor counts equally; with weights='distance', closer neighbors count for more.
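Those two steps fit in a handful of lines. The sketch below is a plain-Python version for a single feature with uniform weights; knn_regress is a made-up name, and absolute difference stands in for the distance function.

```python
def knn_regress(train_x, train_y, query, k):
    """Average the targets of the k training points closest to the query."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: abs(pair[0] - query),  # 1-D distance
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


result = knn_regress([1, 2, 3], [1, 2, 3], query=2.5, k=2)
print(result)  # 2.5
```

For the query 2.5, the two nearest points are 2 and 3, and the mean of their targets is 2.5.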

Summary

| Algorithm | Where It Excels | Where to Avoid | Additional Notes |
|---|---|---|---|
| Linear Regression | Simple, interpretable tasks; problems with few features and a linear relationship | Non-linear relationships; datasets with outliers; multicollinearity | Coefficients provide insights into feature importance. |
| Ridge Regression | Handling multicollinearity; high-dimensional datasets | Sparse datasets where some features are irrelevant | Adds L2 penalty (squared magnitude of coefficients). |
| Lasso Regression | Feature selection (shrinks irrelevant feature weights to zero); high-dimensional datasets | Scenarios needing all features for predictions; datasets with high noise levels | Adds L1 penalty (absolute value of coefficients). |
| ElasticNet Regression | Combines Ridge and Lasso strengths for datasets with multiple feature types | Small datasets where simpler methods like Linear Regression suffice | Balances L1 and L2 penalties via an l1_ratio parameter. |
| Support Vector Regression (SVR) | Capturing non-linear relationships; small to medium-sized datasets | Large datasets (slow training); poorly scaled features (sensitive to scaling) | Uses kernels (e.g., RBF) to model non-linear relationships. |
| Random Forest Regressor | Robust to outliers; non-linear relationships; feature importance estimation | High-dimensional sparse data; very large datasets (may require more memory) | Ensemble method combining multiple decision trees. |
| Gradient Boosting Regressor | Complex datasets; predictive tasks with high accuracy requirements; tabular data | Large datasets without sufficient computational resources; overfitting if not regularized | Iteratively improves predictions by focusing on poorly predicted samples. |
| K-Neighbors Regressor | Small datasets with local patterns; non-linear relationships without feature engineering | Large datasets (computationally expensive); high-dimensional feature spaces | Predictions are based on the mean of k nearest neighbors in the feature space. |

Classification

Classification is a supervised learning technique used to predict discrete labels (classes) for given input data. In scikit-learn, various classification models are available, each suited for different types of problems. Here’s a summary of some key classification models:

  • Logistic Regression: A simple yet powerful model that predicts probabilities of class membership using a logistic function. It works well for both binary (e.g., spam detection) and multi-class classification tasks. Logistic Regression is interpretable and often serves as a baseline model.
  • Decision Tree Classifier: A tree-based model that splits data based on feature values, creating interpretable decision rules. Decision trees excel in explaining predictions and handling non-linear relationships. They are prone to overfitting but can be controlled with pruning or parameter constraints.
  • Random Forest Classifier: An ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting. Random forests are robust and handle high-dimensional data well. They’re commonly used in applications like disease diagnosis and image classification.
  • Support Vector Machine (SVM): SVMs create a hyperplane that separates classes while maximizing the margin between them. They are effective for both linear and non-linear classification tasks and work well for problems like handwriting recognition. SVMs are sensitive to feature scaling and require tuning parameters like the kernel type.
  • Naive Bayes: A probabilistic model based on Bayes’ theorem, assuming independence between features. Naive Bayes is fast and efficient for high-dimensional data, making it ideal for text classification problems like spam filtering.
  • k-Nearest Neighbors (k-NN): A simple and intuitive algorithm that classifies based on the majority label of the k nearest neighbors in the feature space. It works well for recommendation systems and other tasks where the decision boundary is complex but local patterns are important.
  • Gradient Boosting Classifier: A powerful ensemble technique that iteratively improves predictions by correcting errors of previous models. Gradient Boosting achieves high accuracy on structured/tabular data and is often used in competitions and real-world applications like fraud detection.

Logistic Regression

A simple classifier for binary or multi-class problems.

from sklearn.linear_model import LogisticRegression

# Example data
X = [[1], [2], [3]]
y = [0, 1, 0]

model = LogisticRegression()
model.fit(X, y)
print(model.predict_proba([[2.5]]))

This will give you the probability of the input [2.5] belonging to each class (0 and 1). With a dataset this small, the output probabilities lead to some boundary decisions that don’t appear correct at first.

As the sample set grows and becomes more diverse, the classifications normalise.
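Under the hood, the probabilities come from squashing a linear score through the logistic (sigmoid) function. Here is a minimal sketch of just that step for the binary case. The weight and bias values are hypothetical, picked by hand for illustration rather than learned from the data above, and this predict_proba is a stand-alone function, not the scikit-learn method.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weight, bias):
    """Return [P(class 0), P(class 1)] for a single-feature input."""
    p1 = sigmoid(weight * x + bias)
    return [1.0 - p1, p1]


# Hypothetical parameters: the decision boundary sits where weight * x + bias == 0.
on_boundary = predict_proba(2.0, weight=1.0, bias=-2.0)
above_boundary = predict_proba(4.0, weight=1.0, bias=-2.0)
print(on_boundary)     # [0.5, 0.5]
print(above_boundary)
```

Inputs exactly on the boundary get a 50/50 split, and the probability of class 1 rises smoothly as the score increases.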

Decision Tree Classifier

A rule-based model for interpretable predictions.

from sklearn.tree import DecisionTreeClassifier

# Example data
X = [[1], [2], [3]]
y = [0, 1, 0]

model = DecisionTreeClassifier()
model.fit(X, y)
print(model.predict([[2.5]]))

The output of this is:

[1]

As the input value moves further away from 2, the output starts snapping to 0.

This algorithm starts at the root node and selects the feature and threshold that best divide the dataset into subsets with the most homogeneous class labels.

The process is repeated for each subset until a stopping criterion is met, such as:

  • Maximum tree depth.
  • Minimum samples in a leaf node.
  • All samples in a subset belong to the same class.
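To make "best divide" concrete, here is a minimal sketch of how a single split can be scored. It assumes Gini impurity as the measure of homogeneity (the default criterion in scikit-learn) and candidate thresholds at the midpoints between sorted feature values; the function names are made up for this illustration.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure subset, higher for mixed labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(xs, ys):
    """Try each midpoint threshold and keep the lowest weighted impurity."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


threshold, score = best_split([1, 2, 3], [0, 1, 0])
print(threshold, score)
```

On this tiny dataset both candidate thresholds (1.5 and 2.5) score the same, so neither split alone separates the classes. The tree needs a second level to isolate the sample at 2.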

Random Forest Classifier

An ensemble method that reduces overfitting.

from sklearn.ensemble import RandomForestClassifier

# Example data
X = [[1], [2], [3]]
y = [0, 1, 0]

model = RandomForestClassifier()
model.fit(X, y)
print(model.predict([[2.5]]))

The output of this is:

[1]

Having a look at predict_proba from the model object, we can see the probabilities of the value being classified:

[[0.32 0.68]]

According to the model, 2.5 has a 68% chance of belonging to class 1.

This algorithm works by:

  • Building multiple decision trees during training.
  • Training each tree on a random subset of the data (bagging) and considering a random subset of features at each split.
  • Combining the trees’ predictions. Conceptually this is a majority vote; scikit-learn averages each tree’s class probabilities and picks the most probable class.
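A probability like 0.68 is easy to picture as a vote share. The sketch below assumes every tree casts one hard vote, which is a simplification, and uses a made-up list of 100 votes; forest_proba is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def forest_proba(tree_votes, classes):
    """Turn one vote per tree into a probability per class."""
    counts = Counter(tree_votes)
    total = len(tree_votes)
    return [counts[c] / total for c in classes]


# Hypothetical forest: 68 of 100 trees vote for class 1.
votes = [1] * 68 + [0] * 32
proba = forest_proba(votes, classes=[0, 1])
print(proba)  # [0.32, 0.68]

# The predicted class is the one with the largest share.
predicted = max([0, 1], key=lambda c: proba[c])
print(predicted)  # 1
```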

Support Vector Machine (SVM)

Maximizes the margin between classes for classification.

from sklearn.svm import SVC

# Example data
X = [[1], [2], [3], [2.5]]
y = [0, 1, 0, 1]

model = SVC(kernel='linear')
model.fit(X, y)
print(model.predict([[2.]]))

The output of this is:

[1]

SVM is a supervised learning algorithm used for classification and regression tasks. SVM finds the hyperplane that best separates classes while maximizing the margin between them.
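For a linear kernel, the fitted model boils down to a weight vector and a bias. The sketch below shows how such a hyperplane classifies a point and how wide its margin is. The values of w and b are hypothetical, chosen by hand so the arithmetic is easy to follow, and the function names are made up for this illustration.

```python
import math


def decision_function(x, w, b):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Classify by which side of the hyperplane the point falls on."""
    return 1 if decision_function(x, w, b) >= 0 else 0


def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


w, b = [3.0, 4.0], -5.0  # hypothetical hyperplane 3x + 4y - 5 = 0
print(predict([3.0, 4.0], w, b))  # 1
print(predict([0.0, 0.0], w, b))  # 0
print(margin_width(w))            # 0.4
```

Maximizing the margin is the same as making the norm of w as small as possible while still classifying the training points correctly.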

Naive Bayes

A probabilistic model based on Bayes’ theorem.

from sklearn.naive_bayes import GaussianNB

# Example data
X = [[1], [2], [3]]
y = [0, 1, 0]

model = GaussianNB()
model.fit(X, y)
print(model.predict([[2.5]]))

The output of this is:

[0]

Naive Bayes (based on Bayes’ theorem) calculates the posterior probability of each class for a given input and assigns the class with the highest probability.

Types of Naive Bayes Models:

  • Gaussian Naive Bayes: Assumes features follow a normal distribution (continuous data).
  • Multinomial Naive Bayes: Suitable for discrete data, commonly used for text classification (e.g., word counts).
  • Bernoulli Naive Bayes: Handles binary/boolean features (e.g., word presence/absence).
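
To see why the Gaussian example above predicts 0 for 2.5, it helps to look at what the model fitted. This is a small sketch on the same toy data: class 1 has a single training sample, so its fitted distribution is extremely narrow around 2.

```python
from sklearn.naive_bayes import GaussianNB

X = [[1], [2], [3]]
y = [0, 1, 0]

model = GaussianNB().fit(X, y)

# Per-class feature means: both classes are centred on 2
print(model.theta_)

# Class 1's likelihood is sharply peaked at 2, so only inputs
# very close to 2 are assigned to it
predictions = model.predict([[2.0], [2.5]])
proba = model.predict_proba([[2.5]])
print(predictions, proba)
```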

k-Nearest Neighbors (k-NN)

Classifies based on the majority label of neighbors.

from sklearn.neighbors import KNeighborsClassifier

# Example data
X = [[1], [2], [3]]
y = [0, 1, 0]

model = KNeighborsClassifier(n_neighbors=2)
model.fit(X, y)
print(model.predict([[2.5]]))
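
As a small sketch of what happens under the hood, kneighbors returns the distances to, and the training-set indices of, the nearest neighbors. The query value 2.4 is an arbitrary choice for this illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [3]]
y = [0, 1, 0]

model = KNeighborsClassifier(n_neighbors=2).fit(X, y)

# The two nearest training points to 2.4 are [2] (index 1) and [3] (index 2)
distances, indices = model.kneighbors([[2.4]])
print(distances, indices)

# With an odd number of neighbors, a two-class vote cannot tie
odd_model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
odd_prediction = odd_model.predict([[2.4]])
print(odd_prediction)
```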

Gradient Boosting Classifier

Gradient Boosting Classifier is a supervised learning algorithm that builds an ensemble of weak learners (typically decision trees) sequentially, with each new tree correcting the errors of the previous ones.

from sklearn.ensemble import GradientBoostingClassifier

# Example data
X = [[1], [2], [3]]
y = [0, 1, 0]

model = GradientBoostingClassifier()
model.fit(X, y)
print(model.predict([[2.5]]))

How it works:

  1. Start with a Simple Model: The process begins with a weak learner (e.g., a small decision tree) that makes initial predictions.

  2. Compute Residuals: The errors (residuals) from the previous predictions are calculated.

  3. Fit the Next Model: A new weak learner is trained to predict the residuals.

  4. Combine Models: The predictions from all learners are combined (weighted sum) to form the final output.

  5. Gradient Descent: The algorithm minimizes the loss function (e.g., log loss for classification) by iteratively updating the predictions.
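
The steps above can be sketched by hand. To keep the arithmetic simple, this illustration uses squared-error regression rather than the log-loss classification that GradientBoostingClassifier performs, so it is a simplified stand-in and not scikit-learn's implementation. The data and learning rate are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression data
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

learning_rate = 0.5

# Step 1: start with a simple model (here, the mean of the targets)
prediction = np.full(y.shape, y.mean())
initial_error = np.mean((y - prediction) ** 2)

for _ in range(20):
    # Step 2: compute the residuals of the current predictions
    residuals = y - prediction
    # Step 3: fit a weak learner (a depth-1 tree) to the residuals
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    # Step 4: add its scaled prediction to the running total
    prediction += learning_rate * stump.predict(X)

final_error = np.mean((y - prediction) ** 2)
print(initial_error, final_error)
```

For squared error, the residuals are exactly the negative gradient of the loss with respect to the predictions, which is where the "gradient" in the name comes from.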

Summary

  • Logistic Regression:
    • Where it excels: Binary or multi-class classification; interpretable and simple problems; linearly separable data.
    • Where to avoid: Non-linear decision boundaries; complex datasets with many features.
    • Additional notes: Outputs probabilities for class membership. Often used as a baseline model.
  • Decision Tree Classifier:
    • Where it excels: Interpretable models; handling non-linear relationships; small to medium-sized datasets.
    • Where to avoid: Prone to overfitting on noisy data; large datasets without pruning or constraints.
    • Additional notes: Creates human-readable decision rules. Can be controlled using parameters like max_depth.
  • Random Forest Classifier:
    • Where it excels: Robust to overfitting; high-dimensional data; tasks requiring feature importance ranking.
    • Where to avoid: Sparse datasets; very large datasets (can require significant memory).
    • Additional notes: Ensemble method combining multiple decision trees. Uses bagging for improved performance.
  • Support Vector Machine (SVM):
    • Where it excels: Binary or multi-class problems with complex boundaries; high-dimensional feature spaces.
    • Where to avoid: Very large datasets (slow training); datasets requiring soft predictions (probabilities).
    • Additional notes: Effective for small to medium-sized datasets. Requires scaling of features for optimal performance.
  • Naive Bayes:
    • Where it excels: High-dimensional data; text classification (e.g., spam detection); multiclass problems.
    • Where to avoid: Strong feature dependencies; continuous numerical features without preprocessing.
    • Additional notes: Assumes feature independence. Fast and efficient for large-scale problems.
  • k-Nearest Neighbors (k-NN):
    • Where it excels: Small datasets with complex decision boundaries; non-parametric problems.
    • Where to avoid: Large datasets (computationally expensive); high-dimensional feature spaces.
    • Additional notes: Relies on distance metrics (e.g., Euclidean). Sensitive to feature scaling.
  • Gradient Boosting Classifier:
    • Where it excels: Tabular data; high accuracy requirements for structured datasets; imbalanced data with class weighting.
    • Where to avoid: Large datasets with limited resources; risk of overfitting if not regularized.
    • Additional notes: Ensemble of weak learners that iteratively improves predictions. Requires careful hyperparameter tuning.
  • Multilayer Perceptron (MLP):
    • Where it excels: Non-linear decision boundaries; complex datasets with many features.
    • Where to avoid: Large datasets without sufficient computational resources; requires careful tuning.
    • Additional notes: Neural network-based classifier. Requires scaling of features. Can model complex patterns.

Unsupervised Learning

Unsupervised learning is a type of machine learning where the model identifies patterns, structures, or relationships in data without labeled outputs. Instead of predicting specific outcomes, the algorithm organizes or simplifies the data based on inherent similarities or differences. Common applications include clustering (grouping similar data points) and dimensionality reduction (compressing high-dimensional data for visualization or analysis).

Clustering

Clustering is a technique in unsupervised learning that involves grouping data points into clusters based on their similarities or proximity in feature space. The goal is to organize data into meaningful structures, where points within the same cluster are more similar to each other than to those in other clusters. It is commonly used for tasks like customer segmentation, anomaly detection, and exploratory data analysis.

K-Means

K-Means is an unsupervised learning algorithm that partitions a dataset into \(K\) clusters based on feature similarity.

from sklearn.cluster import KMeans

# Example data
X = [[1], [2], [10], [11]]

model = KMeans(n_clusters=2)
model.fit(X)
print(model.labels_)

The output of this is:

[0 0 1 1]

This tells us that [1] and [2] are assigned the label of 0, with [10] and [11] being assigned 1.

How it works:

  1. Choose Initial Centroids: Randomly select \(K\) points as the initial cluster centroids.

  2. Assign Data Points to Clusters: Assign each data point to the nearest centroid based on a distance metric (e.g., Euclidean distance).

  3. Update Centroids: Recalculate the centroids as the mean of all points assigned to each cluster.

  4. Iterate: Repeat steps 2 and 3 until centroids stabilize or a maximum number of iterations is reached.
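
Once the iteration has converged, the final centroids are available on the fitted model. A minimal sketch on the same data (n_init and random_state are set only to make the run repeatable):

```python
from sklearn.cluster import KMeans

X = [[1], [2], [10], [11]]

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)

# Final centroids: the mean of the points in each cluster
print(model.cluster_centers_)

# New points are assigned to the nearest centroid
new_labels = model.predict([[0], [12]])
print(new_labels)
```

The label numbers themselves are arbitrary; what matters is which points share a label.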

DBSCAN

DBSCAN is an unsupervised clustering algorithm that groups data points based on density, identifying clusters of arbitrary shapes and detecting outliers.

from sklearn.cluster import DBSCAN

# Example data
X = [[1], [2], [10], [11], [50000], [50001]]

model = DBSCAN(eps=10, min_samples=2)
model.fit(X)
print(model.labels_)

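The values 50000 and 50001 should clearly be clustered together. With eps=10, the points 1, 2, 10 and 11 also chain into a single cluster, because each is within 10 of its neighbour. The output here: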

[0 0 0 0 1 1]

How it works:

  1. Core Points: A point is a core point if it has at least min_samples neighbors within a specified radius (eps).

  2. Reachable Points: A point is reachable if it lies within the eps radius of a core point.

  3. Noise Points: Points that are neither core points nor reachable are classified as noise (outliers).

  4. Cluster Formation: Clusters are formed by connecting core points and their reachable points.
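
The example above has no noise points, so here is a short sketch on hypothetical data with one isolated value. scikit-learn marks noise with the label -1 and exposes the core points through core_sample_indices_:

```python
from sklearn.cluster import DBSCAN

# Hypothetical data: three close points and one isolated point
X = [[1], [2], [3], [100]]

model = DBSCAN(eps=2, min_samples=2)
model.fit(X)

print(model.labels_)               # noise points get the label -1
print(model.core_sample_indices_)  # indices of the core points
```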

Agglomerative Clustering

Agglomerative Clustering is a hierarchical, bottom-up clustering algorithm that begins with each data point as its own cluster and merges clusters iteratively based on a linkage criterion until a stopping condition is met.

from sklearn.cluster import AgglomerativeClustering

# Example data
X = [[1], [2], [10], [11]]

model = AgglomerativeClustering(n_clusters=2)
model.fit(X)
print(model.labels_)

This outputs:

[1 1 0 0]

How it works:

  1. Start with Individual Clusters: Each data point is treated as its own cluster.

  2. Merge Clusters: Clusters are merged step-by-step based on a similarity metric and a linkage criterion.

  3. Stop Merging: The process continues until the desired number of clusters is reached, or all points are merged into a single cluster.

  4. Dendrogram: A tree-like diagram (dendrogram) shows the hierarchical relationship between clusters.
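
The merge history that a dendrogram visualizes is stored on the fitted model. A small sketch on the same data, with the linkage criterion spelled out explicitly ('ward' is also the default):

```python
from sklearn.cluster import AgglomerativeClustering

X = [[1], [2], [10], [11]]

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
model.fit(X)

print(model.labels_)

# Each row of children_ is one merge. Ids below len(X) are original
# samples; ids from len(X) upwards refer to clusters formed by earlier merges
print(model.children_)
```

Here the first two merges pair up the neighbouring points, and the last row joins those two pairs into a single root cluster.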

Summary

  • K-Means:
    • Where it excels: Partitioning data into well-defined, compact clusters; large datasets with distinct clusters.
    • Where to avoid: Non-spherical clusters; highly imbalanced cluster sizes; datasets with noise or outliers.
    • Additional notes: Relies on centroids; sensitive to initializations. Requires the number of clusters (k) to be specified beforehand.
  • DBSCAN:
    • Where it excels: Finding clusters of arbitrary shapes; detecting outliers; spatial data analysis.
    • Where to avoid: High-dimensional data; datasets with varying densities; requires careful tuning of eps and min_samples.
    • Additional notes: Density-based approach. Does not require the number of clusters to be predefined. Can identify noise points as outliers.
  • Agglomerative Clustering:
    • Where it excels: Hierarchical relationships between clusters; small to medium-sized datasets.
    • Where to avoid: Large datasets (computationally expensive); very high-dimensional data.
    • Additional notes: Hierarchical clustering. Outputs a dendrogram for visualizing cluster merges.

Dimensionality Reduction

PCA

PCA is an unsupervised dimensionality reduction technique that transforms data into a lower-dimensional space while preserving as much variance as possible.

from sklearn.decomposition import PCA

# Example data
X = [[1, 2], [3, 4], [5, 6]]

model = PCA(n_components=1)
transformed = model.fit_transform(X)
print(transformed)

This outputs the following:

[[-2.82842712]
 [ 0.        ]
 [ 2.82842712]]
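
To check how much variance survives the reduction, the fitted model exposes explained_variance_ratio_. In this sketch the three points lie exactly on a straight line, so a single component should capture essentially all of the variance:

```python
from sklearn.decomposition import PCA

X = [[1, 2], [3, 4], [5, 6]]

model = PCA(n_components=1)
model.fit(X)

# Fraction of the total variance captured by each kept component
print(model.explained_variance_ratio_)

# Direction of the component in the original feature space
print(model.components_)
```

The sign of a component is arbitrary, so the transformed values may come out negated on a different setup without changing their meaning.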

t-SNE

Visualizes high-dimensional data.

import numpy as np
from sklearn.manifold import TSNE

# Example data
X = np.array([[1, 2], [3, 4], [5, 6]])

model = TSNE(n_components=2, perplexity=2, random_state=10)
transformed = model.fit_transform(X)
print(transformed)

The output from this is:

[[-200.7746     0.     ]
 [ 139.41475    0.     ]
 [ 479.60336    0.     ]]

NMF

Non-negative matrix factorization for feature extraction.

from sklearn.decomposition import NMF

# Example data
X = [[1, 2], [3, 4], [5, 6]]

model = NMF(n_components=2)
transformed = model.fit_transform(X)
print(transformed)

The output of this is:

[[1.19684855 0.        ]
 [0.75282266 0.72121572]
 [0.1905593  1.48561981]]

3. Semi-Supervised Learning

Semi-supervised learning bridges the gap between supervised and unsupervised learning by utilizing a small amount of labeled data alongside a large amount of unlabeled data.

Label Propagation

Label Propagation spreads label information from labeled to unlabeled data.

from sklearn.semi_supervised import LabelPropagation

# Example data
X = [[1], [2], [3], [4], [5]]
y = [0, 1, -1, -1, -1]  # -1 indicates unlabeled data

model = LabelPropagation()
model.fit(X, y)
print(model.transduction_)  # Labels assigned to every training point, including the unlabeled ones

The output of this shows the remainder of the data getting labelled:

[0 1 1 1 1]

Self-Training

Self-Training generates pseudo-labels for unlabeled data using a supervised model.

from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier

# Example data
X = [[1], [2], [3], [4], [5]]
y = [0, 1, -1, -1, -1]  # -1 indicates unlabeled data

base_model = RandomForestClassifier()
model = SelfTrainingClassifier(base_model)
model.fit(X, y)
print(model.predict([[3.5]]))  # Predict for a new, unseen value

The model trained on the pseudo-labelled data classifies the unseen value [3.5] as:

[1]

4. Model Selection

Model selection is the process of identifying the best machine learning model and its optimal configuration for a given dataset and problem. It involves comparing different models, evaluating their performance using metrics (e.g., accuracy, F1-score, or RMSE), and tuning their hyperparameters to maximize predictive accuracy or minimize error.

Cross-Validation

Cross-validation is a model evaluation technique that assesses a model’s performance by dividing the dataset into multiple subsets (folds) for training and testing.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Example data
X = [[1], [2], [3], [4]]
y = [0, 1, 0, 1]

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=2)
print(scores)  # Accuracy scores for each fold

The output here is:

[0.5 0.5]

How it works:

  • Split the Data: The dataset is split into \(k\) folds (subsets).
  • Train and Test: The model is trained on \(k - 1\) folds and tested on the remaining fold. This process repeats \(k\) times, with each fold used as the test set once.
  • Aggregate Results: Performance metrics (e.g., accuracy, F1-score) from all folds are averaged to provide an overall evaluation.
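
The splitting step can be looked at on its own. This sketch uses plain KFold to keep things simple; when given an integer cv and a classifier, cross_val_score uses a stratified variant that also balances the class labels across folds.

```python
from sklearn.model_selection import KFold

X = [[1], [2], [3], [4]]

kfold = KFold(n_splits=2)
splits = list(kfold.split(X))

# Each split is a pair of index arrays: (training rows, test rows)
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```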

GridSearchCV

GridSearchCV is a tool in scikit-learn for hyperparameter tuning that systematically searches for the best combination of hyperparameters by evaluating all possible parameter combinations in a given grid.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Example data
X = [[1], [2], [3], [4]]
y = [0, 1, 0, 1]

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
model = GridSearchCV(SVC(), param_grid, cv=2)
model.fit(X, y)
print(model.best_params_)  # Best parameters

The output of this is:

{'C': 0.1, 'kernel': 'linear'}

How it works:

  • Define a Parameter Grid: Specify the hyperparameters and their possible values (e.g., {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']} for an SVM).
  • Cross-Validation: For each combination of parameters, the model is evaluated using cross-validation to estimate its performance.
  • Select the Best Model: The combination of hyperparameters that produces the best cross-validation score is chosen.
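
To get a feel for how quickly a grid grows, ParameterGrid enumerates the combinations that would be evaluated. A minimal sketch with the same grid:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# 3 values of C times 2 kernels gives 6 candidates
combinations = list(ParameterGrid(param_grid))
print(len(combinations))
for params in combinations:
    print(params)
```

Each candidate is fitted once per fold, so with cv=2 this grid costs 12 model fits (plus one final refit on the full data by default).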

RandomizedSearchCV

RandomizedSearchCV is a hyperparameter tuning tool in scikit-learn that randomly samples a fixed number of parameter combinations from a specified grid and evaluates them using cross-validation.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Example data
X = [[1], [2], [3], [4]]
y = [0, 1, 0, 1]

param_distributions = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
model = RandomizedSearchCV(RandomForestClassifier(), param_distributions, n_iter=5, cv=2)
model.fit(X, y)
print(model.best_params_)  # Best parameters

The output of this is:

{'n_estimators': 100, 'max_depth': None}

How it works:

  • Define a Parameter Distribution: Specify the hyperparameters and their possible ranges (distributions or lists) to sample from.
  • Random Sampling: A fixed number of parameter combinations is randomly selected and evaluated.
  • Cross-Validation: For each sampled combination, the model is evaluated using cross-validation.
  • Select the Best Model: The parameter combination that yields the best performance is chosen.
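
The sampling step can be previewed with ParameterSampler. In this sketch the full grid has 9 combinations and only 5 are drawn; random_state is fixed only so that the draw is repeatable.

```python
from sklearn.model_selection import ParameterSampler

param_distributions = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}

# Draw 5 of the 9 possible combinations
sampled = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for params in sampled:
    print(params)
```

When every parameter is given as a list, as here, the combinations are drawn without replacement, so no candidate is evaluated twice.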

5. Feature Selection

Feature selection is the process of identifying the most relevant features in a dataset for improving a machine learning model’s performance. By reducing the number of features, it helps eliminate redundant, irrelevant, or noisy data, leading to simpler, faster, and more interpretable models.

SelectKBest

SelectKBest is a feature selection method in scikit-learn that selects the top \(k\) features from the dataset based on univariate statistical tests.

from sklearn.feature_selection import SelectKBest, f_classif

# Example data
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0, 1, 0]

model = SelectKBest(f_classif, k=2)
X_new = model.fit_transform(X, y)
print(X_new)  # Selected features

The output of this is:

[[2 3]
 [5 6]
 [8 9]]

How it works:

  • Choose a Scoring Function: Select a statistical test (e.g., ANOVA, chi-square, mutual information) to evaluate feature relevance.
  • Compute Scores: Each feature is scored based on its relationship with the target variable.
  • Select Top \(k\) Features: The \(k\) highest-scoring features are retained for the model.
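
The scores themselves are available after fitting. In the example above every column is a shifted copy of the others, so the scores tie; the sketch below uses hypothetical data in which only the first column separates the classes.

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: only the first column separates the two classes
X = [[0.0, 1.0, 7.0],
     [0.1, 2.0, 3.0],
     [5.0, 1.5, 6.0],
     [5.2, 1.8, 4.0]]
y = [0, 0, 1, 1]

selector = SelectKBest(f_classif, k=1)
selector.fit(X, y)

# One ANOVA F-score per feature; higher means more relevant
print(selector.scores_)

# Boolean mask of the features that were kept
print(selector.get_support())
```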

Recursive Feature Elimination (RFE)

RFE is a feature selection method that recursively removes the least important features, as ranked by a fitted model’s coefficients or feature importances, until the desired number of features is reached.

from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Example data
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0, 1, 0]

model = RFE(SVC(kernel='linear'), n_features_to_select=2)
X_new = model.fit_transform(X, y)
print(X_new)  # Selected features

The output of this is:

[[1 3]
 [4 6]
 [7 9]]

How it works:

  • Fit a Model: Train a machine learning model (e.g., Logistic Regression, SVM) on the dataset.
  • Rank Features: The model assigns importance scores to the features (e.g., weights or coefficients).
  • Remove Features: Eliminate the least important features (based on the scores) and refit the model.
  • Repeat: Continue the process until the specified number of features is retained.
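
After fitting, the outcome of the elimination rounds can be inspected directly. A short sketch on the same data:

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0, 1, 0]

selector = RFE(SVC(kernel='linear'), n_features_to_select=2)
selector.fit(X, y)

# Boolean mask of the retained features
print(selector.support_)

# Rank 1 marks a retained feature; larger ranks were eliminated earlier
print(selector.ranking_)
```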

VarianceThreshold

VarianceThreshold is a simple feature selection method in scikit-learn that removes features with low variance, assuming that low-variance features do not carry much information.

from sklearn.feature_selection import VarianceThreshold

# Example data
X = [[0, 2, 0], [0, 3, 0], [0, 4, 0]]

model = VarianceThreshold(threshold=0.5)
X_new = model.fit_transform(X)
print(X_new)  # Features with variance above threshold

The output of this is:

[[2]
 [3]
 [4]]

The first and third columns are always zero, so their variance is 0 and they’re stripped from the result.

How it works:

  • Compute Variance: For each feature, calculate the variance across all samples.
  • Apply Threshold: Remove features whose variance falls below a specified threshold.
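
Both steps can be observed on the fitted selector. A minimal sketch on the same data:

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0], [0, 3, 0], [0, 4, 0]]

selector = VarianceThreshold(threshold=0.5).fit(X)

# Variance of each column; only the middle one exceeds 0.5
print(selector.variances_)

# Boolean mask of the columns that are kept
print(selector.get_support())
```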

6. Preprocessing

Preprocessing transforms raw data to make it suitable for machine learning algorithms.

StandardScaler

StandardScaler is a preprocessing technique in scikit-learn that standardizes features by removing the mean and scaling them to unit variance (z-score normalization).

from sklearn.preprocessing import StandardScaler

# Example data
X = [[1], [2], [3]]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)

This outputs the following:

[[-1.22474487]
 [ 0.        ]
 [ 1.22474487]]

How it works:

  • Compute Mean and Standard Deviation:
    For each feature, calculate its mean \((\mu)\) and standard deviation \((\sigma)\).

  • Transform Features:
    Scale each feature \(( x )\) using the formula:
    \(z = \frac{x - \mu}{\sigma}\)

This results in features with a mean of 0 and a standard deviation of 1.
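
As a quick check of the formula, the same result can be computed by hand with NumPy. Note that the standard deviation used is the population one (ddof=0), which is NumPy's default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# z = (x - mu) / sigma, computed manually
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population standard deviation (ddof=0)
manual = (X - mu) / sigma

scaled = StandardScaler().fit_transform(X)
print(manual.ravel(), scaled.ravel())
```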

MinMaxScaler

MinMaxScaler is a preprocessing technique in scikit-learn that scales features to a fixed range, typically [0, 1]. It preserves the relationships between data points while ensuring all features are within the specified range.

from sklearn.preprocessing import MinMaxScaler

# Example data
X = [[1], [2], [3]]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)

This outputs the following:

[[0. ]
 [0.5]
 [1. ]]

How it works:

  • Compute Minimum and Maximum Values: For each feature, calculate its minimum \(( \text{min} )\) and maximum \(( \text{max} )\) values.
  • Transform Features: Scale each feature \(x\) using the formula:
    \(x' = \frac{x - \text{min}}{\text{max} - \text{min}} \times (\text{max}_{\text{scale}} - \text{min}_{\text{scale}}) + \text{min}_{\text{scale}}\)

    By default, \(\text{min}_{\text{scale}} = 0 \) and \( \text{max}_{\text{scale}} = 1\).
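
The target range is set with the feature_range parameter. As a small sketch, scaling the same data into [-1, 1] instead of the default:

```python
from sklearn.preprocessing import MinMaxScaler

X = [[1], [2], [3]]

# Scale into [-1, 1] rather than the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```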

PolynomialFeatures

PolynomialFeatures is a preprocessing technique in scikit-learn that generates new features by adding polynomial combinations of existing features up to a specified degree.

from sklearn.preprocessing import PolynomialFeatures

# Example data
X = [[1], [2], [3]]

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly)

This outputs the following:

[[1. 1. 1.]
 [1. 2. 4.]
 [1. 3. 9.]]

How it works:

  • Generate Polynomial Features:
    Creates polynomial terms (e.g., \(x_1^2\), \(x_1 \cdot x_2\), \(x_2^2\)) for the input features up to the specified degree.

  • Include Interaction Terms:
    Optionally includes interaction terms (e.g., \(x_1 \cdot x_2\)) to capture feature interactions.

  • Expand the Feature Space:
    Transforms the input dataset into a higher-dimensional space to model non-linear relationships.

LabelEncoder

LabelEncoder is a preprocessing technique in scikit-learn that encodes categorical labels as integers, making them suitable for machine learning algorithms that require numerical input.

from sklearn.preprocessing import LabelEncoder

# Example data
y = ['cat', 'dog', 'cat']

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(y_encoded)

The output of this is:

[0 1 0]

How it works:

  • Fit to Labels:
    Maps each unique label in the dataset to an integer.
    Example: ['cat', 'dog', 'mouse'] → [0, 1, 2].

  • Transform Labels:
    Converts the original labels into their corresponding integer representation.

  • Inverse Transform:
    Converts encoded integers back into their original labels.
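
The learned mapping and the inverse step can be sketched on the same labels:

```python
from sklearn.preprocessing import LabelEncoder

y = ['cat', 'dog', 'cat']

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# The position of each label in classes_ is its integer code
print(encoder.classes_)

# Map the integers back to the original labels
y_restored = encoder.inverse_transform(y_encoded)
print(y_restored)
```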

OneHotEncoder

OneHotEncoder is a preprocessing technique in scikit-learn that converts categorical data into a binary matrix (one-hot encoding), where each category is represented by a unique binary vector.

from sklearn.preprocessing import OneHotEncoder

# Example data
X = [['cat'], ['dog'], ['cat']]

encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X).toarray()
print(X_encoded)

The output of this is:

[[1. 0.]
 [0. 1.]
 [1. 0.]]

How it works:

  • Fit to Categories:
    Identifies the unique categories in each feature.

  • Transform Features:
    Converts each category into a binary vector, with a 1 indicating the presence of the category and 0 elsewhere.

  • Sparse Representation:
    By default, the output is a sparse matrix to save memory for large datasets with many categories.
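
A short sketch of those points on the same data. The handle_unknown='ignore' option is an addition for this illustration: it encodes categories that were not seen during fitting as all zeros instead of raising an error.

```python
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

X = [['cat'], ['dog'], ['cat']]

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(X)

# The default output is a SciPy sparse matrix
print(issparse(encoded))

# Categories discovered for each input column
print(encoder.categories_)

# A category not seen during fit becomes a row of zeros
unknown = encoder.transform([['bird']]).toarray()
print(unknown)
```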

Imputer

An imputer fills in missing values in a dataset. scikit-learn provides SimpleImputer for this.

from sklearn.impute import SimpleImputer

# Example data
X = [[1, 2], [None, 3], [7, 6]]

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)

In the output, you can see that the None has been filled in with the mean of its column:

[[1. 2.]
 [4. 3.]
 [7. 6.]]

How it works:

  • Identify Missing Values:
    Detects missing values in the dataset (default: np.nan).

  • Compute Replacement Values:
    Based on the chosen strategy (e.g., mean or median), calculates replacement values for each feature.

  • Fill Missing Values:
    Replaces missing values in the dataset with the computed replacements.
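
The computed replacement values are stored in statistics_. This sketch compares two strategies on the same data, using np.nan as the missing-value marker:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = [[1, 2], [np.nan, 3], [7, 6]]

mean_imputer = SimpleImputer(strategy='mean').fit(X)
median_imputer = SimpleImputer(strategy='median').fit(X)

# One replacement value per column, computed from the non-missing entries
print(mean_imputer.statistics_)
print(median_imputer.statistics_)
```

The median is less affected by extreme values, which can matter when a column contains outliers.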

7. Pipelines and Utilities

Pipelines streamline workflows by chaining preprocessing steps with modeling.

Pipeline

A Pipeline in scikit-learn is a sequential workflow that chains multiple preprocessing steps and a final estimator into a single object, simplifying and automating machine learning workflows.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Example data
X = [[1], [2], [3]]
y = [0, 1, 0]

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])
pipeline.fit(X, y)
print(pipeline.predict([[2.5]]))
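
Calling predict on the pipeline runs each step in order. As a sketch of what that means, the fitted steps can be pulled out through named_steps and applied by hand:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = [[1], [2], [3]]
y = [0, 1, 0]

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])
pipeline.fit(X, y)

# Access the fitted steps by the names given above
scaler = pipeline.named_steps['scaler']
svc = pipeline.named_steps['svc']
print(scaler.mean_)

# Scaling and then predicting by hand matches pipeline.predict
manual = svc.predict(scaler.transform([[2.5]]))
combined = pipeline.predict([[2.5]])
print(manual, combined)
```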

ColumnTransformer

ColumnTransformer in scikit-learn is a tool that applies different preprocessing steps to specific columns of a dataset, enabling flexible and efficient handling of mixed data types.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Example data
X = [[1, 'cat'], [2, 'dog'], [3, 'cat']]

transformer = ColumnTransformer([
    ('num', StandardScaler(), [0]),
    ('cat', OneHotEncoder(), [1])
])
X_transformed = transformer.fit_transform(X)
print(X_transformed)

FunctionTransformer

FunctionTransformer in scikit-learn allows you to apply custom or predefined functions to transform data as part of a machine learning pipeline.

from sklearn.preprocessing import FunctionTransformer
import numpy as np

# Example data
X = [[1], [2], [3]]

log_transformer = FunctionTransformer(np.log1p)
X_transformed = log_transformer.fit_transform(X)
print(X_transformed)
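
If the transformation needs to be undone later, an inverse function can be supplied as well. A minimal sketch pairing np.log1p with its inverse, np.expm1:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [2.0], [3.0]])

log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_transformed = log_transformer.fit_transform(X)

# Round trip: expm1(log1p(x)) recovers the original values
X_restored = log_transformer.inverse_transform(X_transformed)
print(X_restored)
```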

8. Neural Network Integration

Though scikit-learn is not primarily designed for deep learning, it includes simple neural network models.

MLPClassifier/MLPRegressor

MLPClassifier (for classification) and MLPRegressor (for regression) are multi-layer perceptron models in scikit-learn that implement neural networks with backpropagation. They are part of the feedforward neural network family.

from sklearn.neural_network import MLPClassifier

# Example data
X = [[1], [2], [3]]
y = [0, 1, 0]

model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500)
model.fit(X, y)
print(model.predict([[2.5]]))

How it works:

  • Layers:
    Composed of an input layer, one or more hidden layers, and an output layer. Hidden layers process weighted inputs using activation functions.

  • Activation Functions:
    • Hidden Layers: Use non-linear activation functions like ReLU ('relu') or tanh ('tanh').
    • Output Layer:
      • MLPClassifier: Uses softmax for multi-class classification or logistic sigmoid for binary classification.
      • MLPRegressor: Uses linear activation.
  • Optimization:
    Parameters are learned through backpropagation using stochastic gradient descent (SGD) or adaptive optimizers like Adam.

Key Parameters:

  • hidden_layer_sizes:
    Tuple specifying the number of neurons in each hidden layer.
    Example: hidden_layer_sizes=(100, 50) creates two hidden layers with 100 and 50 neurons respectively.

  • activation:
    Activation function for hidden layers:
    • 'relu' (default): Rectified Linear Unit.
    • 'tanh': Hyperbolic tangent.
    • 'logistic': Sigmoid function.
  • solver:
    Optimization algorithm:
    • 'adam' (default): Adaptive Moment Estimation (fast and robust).
    • 'sgd': Stochastic Gradient Descent.
    • 'lbfgs': Quasi-Newton optimization (good for small datasets).
  • alpha:
    Regularization term to prevent overfitting (default: 0.0001).

  • learning_rate:
    Determines how weights are updated:
    • 'constant': Fixed learning rate.
    • 'adaptive': Adjusts learning rate based on performance.
  • max_iter:
    Maximum number of iterations for training (default: 200).
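
To tie the parameters together, here is a small sketch that sets several of them explicitly and then inspects the resulting network. The layer sizes and random_state are arbitrary choices for this illustration, and on such a tiny dataset the solver may emit a convergence warning.

```python
from sklearn.neural_network import MLPClassifier

X = [[1], [2], [3]]
y = [0, 1, 0]

model = MLPClassifier(
    hidden_layer_sizes=(8, 4),  # two hidden layers: 8 then 4 neurons
    activation='relu',
    solver='lbfgs',             # suited to small datasets
    alpha=0.0001,
    max_iter=200,
    random_state=0,
)
model.fit(X, y)

# One weight matrix per connection between consecutive layers
weight_shapes = [w.shape for w in model.coefs_]
print(weight_shapes)

# Input layer + hidden layers + output layer
print(model.n_layers_)
```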

Conclusion

scikit-learn is a versatile library that offers robust tools for a wide range of machine learning tasks, from regression and classification to clustering, dimensionality reduction, and preprocessing. Its simplicity and efficiency make it a great choice for both beginners and advanced practitioners. Explore its extensive documentation to dive deeper into its capabilities!

Passwordless sudo using a YubiKey

Introduction

YubiKeys are excellent multi-factor authentication (MFA) devices that can enhance your online security while simplifying your daily workflows on Linux.

In this article, we’ll walk through the process of configuring a YubiKey for secure authentication including:

  • Setting up passwordless sudo or enabling two-factor authentication (2FA) for elevated privileges
  • Setting up 2FA on your Desktop Environment’s login
  • Setting up 2FA on your system’s TTY login
  • Setting up passwordless graphical prompts for elevated privileges

Setup

Prerequisites

First, ensure you have the libpam-u2f package (or its equivalent for your Linux distribution) installed. On Debian-based systems, use the following command:

sudo apt-get install libpam-u2f

U2F (Universal 2nd Factor) is an open standard for hardware MFA keys, and integration with Linux is made possible through Yubico’s pam-u2f module.

Adding Your YubiKey

To link your YubiKey with your system, follow these steps:

  • Connect your YubiKey: Insert the device into your computer.

  • Create the configuration directory: If it doesn’t already exist, create the directory ~/.config/Yubico:

mkdir -p ~/.config/Yubico
  • Register your YubiKey: Add the key to the list of accepted devices by running:
pamu2fcfg > ~/.config/Yubico/u2f_keys

If you’ve set a PIN for your YubiKey, you may be prompted to enter it.

  • Add additional keys (optional): If you have other YubiKeys, you can add them as follows:
pamu2fcfg -n >> ~/.config/Yubico/u2f_keys

Ensure there are no extra newlines between entries in the ~/.config/Yubico/u2f_keys file.
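
If you’d like to check for stray blank lines programmatically, a rough sketch is below. The find_blank_lines helper is hypothetical (it is not part of pam-u2f), and the sample contents are placeholders rather than real key data.

```python
def find_blank_lines(text):
    """Return the 1-based line numbers of blank lines in the given text."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.strip()
    ]

# Placeholder contents with a stray blank line in the middle
sample = "user:<key 1 data>:<key 2 data>\n\nother:<key data>\n"
blank_lines = find_blank_lines(sample)
print(blank_lines)
```

To run it against the real file, read ~/.config/Yubico/u2f_keys into a string and pass that in; an empty list means there is nothing to clean up.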

Before you start!

Before you start re-configuring things, it’s worth opening another terminal that is running as root. This way if you do make any mistakes, you can still use that root terminal to back-out any changes that haven’t gone to plan.

Open a new terminal, and issue the following:

sudo -i

Now leave that terminal running in the background.

Configuring sudo

After setting up your key(s), you can configure sudo to use them for authentication.

Enabling Passwordless sudo

To make sudo passwordless:

  • Edit your /etc/sudoers file: Add a line like this:
%wheel      ALL = (ALL) NOPASSWD: ALL

Ensure your user is part of the wheel group.

  • Modify /etc/pam.d/sudo: Add the following line before @include common-auth:
auth        sufficient      pam_u2f.so cue [cue_prompt=Tap your key]

This configuration makes YubiKey authentication sufficient for sudo, bypassing the need for a password.

Enabling 2FA for sudo

To enable 2FA, where both your password and YubiKey are required:

  • Edit /etc/pam.d/sudo: Add the following line after @include common-auth:
auth        required        pam_u2f.so cue [cue_prompt=Tap your key]

This ensures the usual password authentication is followed by YubiKey verification.

Configuring 2FA for your Display Manager

I’m running KDE on this particular machine.

  • Edit /etc/pam.d/kde: Add the pam_u2f.so reference:
#%PAM-1.0

auth       include                     system-local-login
auth       required                    pam_u2f.so cue [cue_prompt=Tap your key]

account    include                     system-local-login

password   include                     system-local-login

session    include                     system-local-login

You should be able to do the same with GDM, etc.

Configuring 2FA for TTY

When you change virtual TTY and go to login, we can also require a 2FA token at this point.

  • Edit /etc/pam.d/login: Add the pam_u2f.so reference:
#%PAM-1.0

auth       requisite    pam_nologin.so
auth       include      system-local-login
auth       required     pam_u2f.so cue [cue_prompt=Tap your key]
account    include      system-local-login
session    include      system-local-login
password   include      system-local-login

Configuring Passwordless polkit

The graphical prompts that you see throughout your desktop environment session are controlled using polkit.

Like me, you may need to install the polkit dependencies if you’re using KDE:

sudo apt install policykit-1 polkit-kde-agent-1

Much like the passwordless configuration for sudo above, we can control polkit in the same way.

  • Edit /etc/pam.d/polkit-1: Add the pam_u2f.so reference:
#%PAM-1.0

auth            sufficient      pam_u2f.so cue [cue_prompt=Tap your key]

auth            required        pam_env.so
auth            required        pam_deny.so

auth            include         system-auth
account         include         system-auth
password        include         system-auth
session         include         system-auth

Troubleshooting

Always keep in mind that you have that root terminal open in the background. It can get you out of all sorts of trouble, letting you roll back any changes that might have broken authentication on your system.

Enable Debugging

If something isn’t working, add debug to the auth line in /etc/pam.d/sudo to enable detailed logging during authentication:

auth        sufficient      pam_u2f.so debug

The additional logs can help identify configuration issues.

Conclusion

Adding a YubiKey to your Linux authentication setup enhances security and can simplify your workflow by reducing the need to frequently enter passwords. Whether you choose passwordless authentication or 2FA, YubiKeys are a valuable tool for improving your overall security posture.