How to Make Machine Learning Models Faster and Lighter

Pruning

= delete (or zero-out) parts of a model that contribute little → fewer FLOPs, fewer parameters, faster inference.

Quantization

= store & compute with fewer bits (e.g., 8-bit instead of 32-bit floats) → smaller memory, higher cache hits, faster CPU ops.

Both aim to fit tight edge budgets (CPU-only, small RAM) while keeping accuracy good enough for real-time control.
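To make "fewer bits" concrete, here is a minimal NumPy sketch of affine int8 quantization (illustrative only, not any framework's API):

# quant_sketch.py -- illustrative affine int8 quantization (not a framework API)
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)               # example FP32 weights
qmin, qmax = -128, 127                                      # int8 range
scale = (w.max() - w.min()) / (qmax - qmin)                 # real-value step per integer
zero_point = int(round(qmin - w.min() / scale))             # integer that maps to 0.0
w_int8 = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
w_back = (w_int8.astype(np.float32) - zero_point) * scale   # dequantize
print("max abs rounding error:", np.abs(w - w_back).max())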

Where they act in a network

Pruning acts inside the layers themselves (dropping individual weights, or whole channels/neurons); quantization changes how those layers' weights and activations are stored and computed (e.g., int8 instead of FP32).

Two flavors of pruning

Unstructured (zero out individual weights) and structured (remove whole channels/heads); both are demonstrated in the PyTorch section below.

Quantization pipeline (typical PTQ)

Train in FP32 → insert observers → calibrate on representative data → convert to int8.

PTQ (Post-Training Quantization): No training required. Fastest path to int8.

QAT (Quantization-Aware Training): Train with fake-quant modules → better accuracy at int8, especially for CNNs.

Why robots care

Latency (ms-level) & consistency (low jitter) matter for control loops.

Size matters (RAM/flash budgets).

Robustness to noise: simpler models + calibrated quantization + structured pruning → fewer surprises.

Metrics to watch

Accuracy (task metric)

Latency (avg & p95/p99; see the measurement sketch after this list)

Model size (MB)

Sparsity (% zeros) & MACs/FLOPs

Energy (optional but relevant on battery)
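A minimal sketch of how those latency numbers can be collected (the latency_stats helper and the predict_fn callable are illustrative, not from any particular library):

# latency_stats_sketch.py -- avg / p95 / p99 latency for any single-sample predict callable
import time
import numpy as np

def latency_stats(predict_fn, x, n_warm=20, n_runs=200):
    for _ in range(n_warm):                                 # warm up caches / lazy init
        predict_fn(x)
    times_ms = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        predict_fn(x)
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return np.mean(times_ms), np.percentile(times_ms, 95), np.percentile(times_ms, 99)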

ML EXAMPLES (scikit-learn)

We’ll show:

Decision Tree pruning (cost-complexity)

L1 “pruning” of linear/logistic models (drives coefficients to zero)

Note: Classic scikit-learn doesn't do int8 quantization of models end-to-end; for edge deployment you typically export to ONNX and let the runtime quantize (sketched below), or you choose small models + pruning/L1.
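For reference, a hedged sketch of that ONNX route; it assumes the skl2onnx and onnxruntime packages are installed, and the file names are illustrative:

# sklearn_to_onnx_quant.py -- sketch only; assumes skl2onnx + onnxruntime are installed
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from onnxruntime.quantization import quantize_dynamic, QuantType

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=500).fit(X, y)

# Export the fitted model to ONNX (iris has 4 input features)
onx = convert_sklearn(clf, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model_fp32.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Let the runtime tooling do weight-only int8 quantization
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)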

1) Decision Tree – cost-complexity pruning

# ml_pruning_tree.py
import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree  # optional (for visualization)
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline (unpruned)
base = DecisionTreeClassifier(random_state=42)
base.fit(Xtr, ytr)

# Cost complexity pruning path gives candidate ccp_alphas
path = base.cost_complexity_pruning_path(Xtr, ytr)
ccp_alphas = path.ccp_alphas

best = None
best_stats = None

for ccp in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp)
    clf.fit(Xtr, ytr)
    # measure latency per sample (simple timing)
    t0 = time.time()
    ypred = clf.predict(Xte)
    t1 = time.time()
    acc = accuracy_score(yte, ypred)
    latency_ms = (t1 - t0) / len(Xte) * 1000.0
    n_nodes = clf.tree_.node_count
    stats = (acc, latency_ms, n_nodes, ccp)
    if best is None or (acc > best_stats[0]) or (acc == best_stats[0] and latency_ms < best_stats[1]):
        best, best_stats = clf, stats

print(f"Best pruned tree:")
print(f"  accuracy = {best_stats[0]:.3f}")
print(f"  latency  = {best_stats[1]:.3f} ms/sample")
print(f"  nodes    = {best_stats[2]} (ccp_alpha={best_stats[3]:.6f})")


What this gives: a smaller tree → fewer branches → faster, more stable inference on CPU.
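To see the shrinkage next to the unpruned baseline, a small addition to the script above:

# Append to ml_pruning_tree.py: compare against the unpruned baseline
print(f"Baseline tree: {base.tree_.node_count} nodes, depth {base.get_depth()}")
print(f"Pruned tree  : {best.tree_.node_count} nodes, depth {best.get_depth()}")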

2) L1 “pruning” (sparse coefficients)

# ml_pruning_l1.py
import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

# L1 penalty drives many weights to zero (like pruning)
clf = LogisticRegression(penalty='l1', solver='saga', C=0.5, max_iter=500)  # multi_class is deprecated in recent scikit-learn; the default already handles multiclass
clf.fit(Xtr, ytr)

t0 = time.time()
yp = clf.predict(Xte)
t1 = time.time()

acc = accuracy_score(yte, yp)
latency_ms = (t1 - t0) / len(Xte) * 1000.0
sparsity = np.mean(clf.coef_ == 0.0)

print(f"Accuracy      : {acc:.3f}")
print(f"Latency       : {latency_ms:.3f} ms/sample")
print(f"Weight zeros  : {sparsity*100:.1f}%")



Takeaway: Smaller effective feature set → cache-friendly, consistent latency.
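A follow-on sketch (appended to ml_pruning_l1.py) that uses SelectFromModel to drop the zeroed features, so the input pipeline itself shrinks; the 1e-8 threshold is just a way of saying "keep anything non-zero":

# Append to ml_pruning_l1.py: turn zeroed coefficients into a smaller input pipeline
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(clf, prefit=True, threshold=1e-8)  # keep non-zero features
Xtr_small = selector.transform(Xtr)
Xte_small = selector.transform(Xte)
print("Features kept :", Xtr_small.shape[1], "of", Xtr.shape[1])

clf_small = LogisticRegression(penalty='l1', solver='saga', C=0.5, max_iter=500)
clf_small.fit(Xtr_small, ytr)
print("Accuracy (reduced features):", accuracy_score(yte, clf_small.predict(Xte_small)))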

DL EXAMPLES (PyTorch)

We’ll show:

Unstructured & structured pruning via torch.nn.utils.prune

Dynamic PTQ (int8) for Linear/LSTM

Static PTQ (FX graph mode) for a tiny CNN (with calibration)

QAT sketch for best accuracy at int8

These run on CPU and illustrate what to change. Replace synthetic data with your sensor features.
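If you want a concrete starting point, a minimal sketch for wrapping your own sensor arrays into a DataLoader (the random arrays here are placeholders for your real features and labels):

# sensor_loader_sketch.py -- placeholder arrays stand in for your real sensor features
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

features = np.random.randn(1000, 64).astype(np.float32)    # e.g., 64-dim sensor vectors
labels = np.random.randint(0, 10, size=1000).astype(np.int64)

ds = TensorDataset(torch.from_numpy(features), torch.from_numpy(labels))
loader = DataLoader(ds, batch_size=32, shuffle=True)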

Helper: latency + size + sparsity utilities
# utils_perf.py
import time
import torch
import os

def measure_latency_ms(model, inp, n_warm=10, n_runs=50):
    model.eval()
    with torch.no_grad():
        for _ in range(n_warm):
            _ = model(inp)
        t0 = time.time()
        for _ in range(n_runs):
            _ = model(inp)
        t1 = time.time()
    return (t1 - t0) / n_runs * 1000.0

def count_nonzero_params(model):
    nz = 0
    tot = 0
    for p in model.parameters():
        tot += p.numel()
        nz += (p != 0).sum().item()
    return nz, tot, 1.0 - nz / tot

def save_size_mb(model, path="temp.pth"):
    torch.save(model.state_dict(), path)
    sz = os.path.getsize(path) / (1024*1024)
    os.remove(path)
    return sz

A. PRUNING (PyTorch)
A1) Unstructured pruning (magnitude)

# dl_prune_unstructured.py
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from utils_perf import measure_latency_ms, count_nonzero_params, save_size_mb

# Tiny MLP for demonstration
class MLP(nn.Module):
    def __init__(self, d_in=64, d_hidden=128, d_out=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out)
        )
    def forward(self, x):
        return self.net(x)

model = MLP()
inp = torch.randn(1, 64)

lat0 = measure_latency_ms(model, inp)
nz0, tot0, sparsity0 = count_nonzero_params(model)
size0 = save_size_mb(model)

# Apply 80% unstructured pruning on Linear weights
for m in model.modules():
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.8)

# Remove reparametrization (make pruning permanent)
for m in model.modules():
    if isinstance(m, nn.Linear) and hasattr(m, "weight_orig"):
        prune.remove(m, "weight")

lat1 = measure_latency_ms(model, inp)
nz1, tot1, sparsity1 = count_nonzero_params(model)
size1 = save_size_mb(model)

print(f"Latency (ms)   before/after: {lat0:.3f} / {lat1:.3f}")
print(f"Sparsity       before/after: {sparsity0*100:.1f}% / {sparsity1*100:.1f}%")
print(f"Model size (MB)before/after: {size0:.3f} / {size1:.3f}")


Note: Unstructured zeros may not speed up on vanilla kernels; you get memory benefits and speedups only if using sparse backends. For deterministic speed on CPU, prefer structured pruning.
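As an illustration of the memory side, a sketch that stores a pruned weight in sparse CSR form; this assumes a PyTorch build with sparse-CSR matmul support, and any speedup depends on the backend:

# sparse_csr_sketch.py -- memory scales with non-zeros, not with the full matrix
import torch
import torch.nn as nn

lin = nn.Linear(128, 128)
with torch.no_grad():
    lin.weight[torch.rand_like(lin.weight) < 0.8] = 0.0    # fake ~80% sparsity

w_csr = lin.weight.detach().to_sparse_csr()                # compressed row storage
x = torch.randn(128, 1)
y_dense = lin.weight.detach() @ x
y_sparse = torch.sparse.mm(w_csr, x)                       # sparse x dense matmul
print("outputs match:", torch.allclose(y_dense, y_sparse, atol=1e-5))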

A2) Structured channel pruning (conv channels)

# dl_prune_structured.py
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from utils_perf import measure_latency_ms, count_nonzero_params, save_size_mb

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, 10)
    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = TinyCNN()
inp = torch.randn(1, 3, 64, 64)

lat0 = measure_latency_ms(model, inp)
nz0, tot0, s0 = count_nonzero_params(model)
size0 = save_size_mb(model)

# Zero out 50% of conv2's output channels (L2-norm criterion); dim=0 selects output channels
# Note: this zeros the channels but does not physically remove them (see the sketch below)
prune.ln_structured(model.conv2, name="weight", amount=0.5, n=2, dim=0)
prune.remove(model.conv2, "weight")

lat1 = measure_latency_ms(model, inp)
nz1, tot1, s1 = count_nonzero_params(model)
size1 = save_size_mb(model)

print(f"Latency (ms)   before/after: {lat0:.3f} / {lat1:.3f}")
print(f"Sparsity       before/after: {s0*100:.1f}% / {s1*100:.1f}%")
print(f"Model size (MB)before/after: {size0:.3f} / {size1:.3f}")


Why structured helps: removing whole channels shrinks subsequent ops → actual FLOP reduction and stable CPU speedups. Note that ln_structured only zeros the channels, so the latency and size printed above will barely move until the layers are physically rebuilt with fewer channels.
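A minimal sketch of that surgery for this particular TinyCNN, appended after prune.remove (a dedicated pruning library would automate this):

# Append to dl_prune_structured.py: physically drop conv2's zeroed output channels
with torch.no_grad():
    keep = model.conv2.weight.abs().sum(dim=(1, 2, 3)) > 0   # surviving output channels
    idx = keep.nonzero(as_tuple=True)[0]

    slim_conv2 = nn.Conv2d(16, len(idx), 3, padding=1)
    slim_conv2.weight.copy_(model.conv2.weight[idx])
    slim_conv2.bias.copy_(model.conv2.bias[idx])

    slim_fc = nn.Linear(len(idx), 10)
    slim_fc.weight.copy_(model.fc.weight[:, idx])             # keep the matching fc columns
    slim_fc.bias.copy_(model.fc.bias)

    model.conv2, model.fc = slim_conv2, slim_fc

print("conv2 out channels:", model.conv2.out_channels)        # e.g., 16 instead of 32
print(f"Latency after surgery (ms): {measure_latency_ms(model, inp):.3f}")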

B. QUANTIZATION (PyTorch)
B1) Dynamic quantization (fastest path; great for Linear/LSTM on CPU)

# dl_quant_dynamic.py
import torch
import torch.nn as nn
from utils_perf import measure_latency_ms, save_size_mb

class TinyRNN(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, d_out=6):
        super().__init__()
        self.rnn = nn.LSTM(d_in, d_hidden, num_layers=1, batch_first=True)
        self.fc = nn.Linear(d_hidden, d_out)
    def forward(self, x):
        # x: [B, T, d_in]
        y, _ = self.rnn(x)
        return self.fc(y[:, -1, :])

model_fp32 = TinyRNN().eval()
inp = torch.randn(1, 20, 32)

lat0 = measure_latency_ms(model_fp32, inp)
size0 = save_size_mb(model_fp32)

# Dynamic quantize only supported layer types (Linear, LSTM)
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear, nn.LSTM}, dtype=torch.qint8
).eval()

lat1 = measure_latency_ms(model_int8, inp)
size1 = save_size_mb(model_int8)

print(f"Latency (ms)   FP32 / INT8(dynamic): {lat0:.3f} / {lat1:.3f}")
print(f"Model size (MB)FP32 / INT8(dynamic): {size0:.3f} / {size1:.3f}")



Use when: CPU edge device, MLP/RNN control heads, quick wins without retraining.
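Two settings worth pinning when you benchmark on the target hardware (a sketch; pick the engine that matches your CPU):

# Pin threading and the quantized engine before benchmarking on the target CPU
import torch

torch.set_num_threads(1)                        # match the cores the control loop will actually get
torch.backends.quantized.engine = "qnnpack"     # ARM; use "fbgemm" on x86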

B2) Static PTQ (FX graph mode) for a tiny CNN

# dl_quant_static_ptq.py
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from utils_perf import measure_latency_ms, save_size_mb

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, 10)
        )
    def forward(self, x):
        return self.seq(x)

model = TinyCNN().eval()
example = torch.randn(1, 3, 64, 64)

lat0 = measure_latency_ms(model, example)
size0 = save_size_mb(model)

# 1) Choose backend/qconfig
backend = "qnnpack"  # good for ARM/Android; "fbgemm" for x86
torch.backends.quantized.engine = backend
qconfig_mapping = get_default_qconfig_mapping(backend)

# 2) Prepare FX graph
prepared = prepare_fx(model, qconfig_mapping, example_inputs=(example,))  # FX tracing needs example inputs

# 3) Calibrate with a small representative set
prepared(torch.randn(1,3,64,64))
prepared(torch.randn(1,3,64,64))
prepared(torch.randn(1,3,64,64))

# 4) Convert to int8
int8_model = convert_fx(prepared).eval()

lat1 = measure_latency_ms(int8_model, example)
size1 = save_size_mb(int8_model)

print(f"Latency (ms)   FP32 / INT8(static): {lat0:.3f} / {lat1:.3f}")
print(f"Model size (MB)FP32 / INT8(static): {size0:.3f} / {size1:.3f}")

Calibrate carefully: use a few hundred real sensor samples for best accuracy.
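A sketch of what step 3 looks like with a representative loader instead of the three random tensors above; calib_loader is a hypothetical DataLoader yielding real sensor frames (labels are not needed):

# Replace step 3 of dl_quant_static_ptq.py with representative calibration data
with torch.inference_mode():
    for i, xb in enumerate(calib_loader):   # calib_loader: hypothetical DataLoader of real frames
        prepared(xb)                         # observers record activation ranges
        if i >= 300:                         # a few hundred batches is usually plenty
            break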

B3) QAT sketch (best accuracy @ int8 for CNNs)

# dl_qat_skeleton.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16,16,3,padding=1),   nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)
        )
    def forward(self, x): return self.seq(x)

model = TinyCNN().train()
backend = "qnnpack"
torch.backends.quantized.engine = backend
qconfig_mapping = get_default_qat_qconfig_mapping(backend)

# Insert fake-quant observers
model_qat = prepare_qat_fx(model, qconfig_mapping, example_inputs=(torch.randn(1, 3, 64, 64),)).train()

opt = optim.Adam(model_qat.parameters(), lr=1e-3)

# Train as usual (fake-quant active)
for step in range(200):  # demo loop
    x = torch.randn(16,3,64,64)
    y = torch.randint(0,10,(16,))
    logits = model_qat(x)
    loss = nn.CrossEntropyLoss()(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()

# Convert to real int8
model_int8 = convert_fx(model_qat.eval()).eval()


Putting it together: which stages benefit most?

Perception & control networks (Conv/MLP/RNN heads) → Quantization (int8) + Structured pruning (channels) for real CPU gains.

Feature encoders (heavy backbones) → QAT (keep accuracy) + careful structured pruning of later blocks.

Classical ML pieces (trees, linear heads) → prune (cost-complexity / L1) and/or replace with smaller models; quantization, if needed, is usually handled by the deployment runtime.

Practical workflow (ASCII diagram)

Train FP32 ---> Profile (lat/size/acc) ---> Choose targets
                                               |
                                               +--> Structured prune (channels/heads) -> Fine-tune
                                               +--> PTQ (dynamic/static) or QAT
                                               +--> Re-profile (p50/p95 latency, acc, size)
                                               +--> Iterate until SLA met

Which Stages Benefit Most from Quantization or Pruning?

Best Stage to Apply Quantization/Pruning

The model / inference stage: this is where the matrix multiplications and decision logic happen.

❌ Not Useful to Quantize/Prune

Data loading, feature pre-/post-processing, and control/glue logic: they contain no large weight matrices, so there is nothing meaningful to shrink.

So the parts that benefit the most are:

Neural network layers (Linear, Conv, GRU/LSTM)

Decision trees / Random Forests (by pruning depth or removing weak nodes)

Large linear models (by pruning small-magnitude weights)

Part 1: Classical ML Example

We will:

Train Logistic Regression

Prune small weights

Quantize model weights to float16 to reduce memory + improve speed

# ===== classical_prune_quant.py =====
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Build pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500, multi_class='multinomial'))
])

pipe.fit(X, y)
clf = pipe.named_steps["clf"]

print("Original weights shape:", clf.coef_.shape)

# ----- PRUNING: remove small weights -----
threshold = np.percentile(np.abs(clf.coef_), 20)   # prune 20% smallest weights
mask = np.abs(clf.coef_) > threshold
clf.coef_ = clf.coef_ * mask

print("Pruned weights mean magnitude:", np.mean(np.abs(clf.coef_)))

# ----- QUANTIZATION: convert to float16 -----
clf.coef_  = clf.coef_.astype(np.float16)
clf.intercept_ = clf.intercept_.astype(np.float16)

print("After quantization dtype:", clf.coef_.dtype)

# Check accuracy (on the training data here; the point is that it barely changes)
y_pred = pipe.predict(X)
print("Accuracy after prune + quant:", accuracy_score(y, y_pred))

Result:

Model is lighter, smaller, and usually faster on edge CPUs.

Accuracy remains similar (because small weights typically don’t matter).
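If you want to quantify "smaller", the serialized footprint is a reasonable proxy; run this before and after the prune + float16 step (on a model this tiny the difference is negligible, on large linear models it is not):

# Append to classical_prune_quant.py: serialized size as a rough footprint measure
import pickle

print("Serialized pipeline size (bytes):", len(pickle.dumps(pipe)))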

Part 2: Deep Learning Example (PyTorch)

We will:

Train a tiny MLP

Prune weights with torch.nn.utils.prune

Quantize using PyTorch dynamic quantization

Compare latency

# ===== dl_prune_quant.py =====
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
import time
import numpy as np

# Create dummy dataset
X = torch.randn(300, 16)
y = torch.randint(0, 3, (300,))

# Tiny MLP model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 3)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = MLP()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Train briefly
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

# -------- PRUNING --------
prune.l1_unstructured(model.fc1, name="weight", amount=0.3)  # prune 30% of weights
prune.remove(model.fc1, 'weight')  # finalize mask

# -------- QUANTIZATION --------
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# -------- LATENCY TEST --------
def bench(model):
    model.eval()
    times = []
    with torch.no_grad():                       # no autograd bookkeeping during timing
        for _ in range(500):
            inp = torch.randn(1, 16)
            t0 = time.perf_counter()            # higher-resolution timer than time.time()
            _ = model(inp)
            times.append((time.perf_counter() - t0) * 1000)
    return np.mean(times), np.percentile(times, 95)

fp32_mean, fp32_p95 = bench(model)
int8_mean, int8_p95 = bench(quantized_model)

print("FP32 latency avg:", fp32_mean, "ms  p95:", fp32_p95)
print("INT8 latency avg:", int8_mean, "ms  p95:", int8_p95)
