AI & ML Comprehensive Knowledge Hub

Complete reference: ML algorithms, evaluation metrics, clustering, AI agents, and real-world DoD applications

🎉 New: Interactive ML AI Knowledge Hub App

This comprehensive guide is now available as a standalone web app at mlaithing.vercel.app — featuring all 8+ algorithms with production code, clustering techniques, AI agents framework, and real-world DoD use cases in an interactive format.

Visit ML AI Hub App

Machine Learning Overview

Supervised Learning

Learn from labeled data

  • Classification: Categories
  • Regression: Continuous values

Unsupervised Learning

Find patterns in data

  • Clustering: Grouping
  • Dimensionality reduction

Reinforcement Learning

Learn through rewards

  • Agent takes actions
  • Receives feedback
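The action-reward loop above can be sketched with a minimal tabular Q-learning example. The corridor environment, reward placement, and hyperparameters below are purely illustrative:

```python
import numpy as np

# Tiny 1-D corridor: states 0..4, reward only at terminal state 4
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(42)

for episode in range(200):
    state = 0
    while state != 4:
        # Epsilon-greedy: mostly exploit, occasionally explore
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update: feedback nudges the action-value estimate
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state

print(np.argmax(Q[:4], axis=1))  # learned policy for non-terminal states
```

Each update moves the estimate toward the received reward plus the discounted value of the best next action; after enough episodes the greedy policy is "move right" in every non-terminal state.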

Evaluation Metrics Guide

Classification

Accuracy: (TP+TN)/Total

Use only with balanced classes

Precision: TP/(TP+FP)

Of predicted positives, how many correct?

Recall: TP/(TP+FN)

Of actual positives, how many caught?

F1 Score: Harmonic mean of P&R

Balance precision and recall

Regression

MAE: Mean Absolute Error

Average absolute difference

RMSE: Root Mean Squared Error

Penalizes large errors more

R²: Coefficient of determination

Proportion of variance explained (1 = perfect; can be negative for poor fits)
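As a quick illustration, all three regression metrics can be computed with scikit-learn; the true and predicted values below are made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt of MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.3f}")   # average absolute miss
print(f"RMSE: {rmse:.3f}")  # larger errors weigh more
print(f"R2:   {r2:.3f}")    # share of variance explained
```

Note that RMSE ≥ MAE always, with the gap growing as the error distribution gets more skewed toward a few large misses.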

Key Insight: With imbalanced data (e.g., 99% negative class), accuracy is misleading. Always use Precision, Recall, F1, and AUC-ROC for imbalanced datasets.
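A quick sketch of that failure mode, using synthetic labels: a classifier that predicts the majority class every time scores 99% accuracy while catching zero positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 990 negatives, 10 positives; model predicts "negative" for everything
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")  # 0.990 -- looks great
print(f"Recall:   {recall_score(y_true, y_pred):.3f}")    # 0.000 -- misses every positive
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.3f}")  # 0.000
```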

Classification Algorithms

Supervised learning for categorical predictions

🎯

Logistic Regression

Binary classifier using sigmoid function to map outputs to [0,1] probability range. Linear model suitable for linearly separable problems.

DoD/Federal Use Cases

  • FIAR audit pass/fail prediction
  • Medical diagnosis classification
  • Email spam/ham filtering
  • Credit approval decisions
  • Fraud detection in transactions

Advantages

  • ✓ Provides calibrated probability scores
  • ✓ Interpretable coefficients (feature importance)
  • ✓ Fast training and prediction
  • ✓ Low computational requirements
  • ✓ Works well with high-dimensional data

Disadvantages

  • ✗ Assumes linear decision boundary
  • ✗ Requires feature scaling for optimal performance
  • ✗ Cannot capture complex feature interactions
  • ✗ Sensitive to multicollinearity
python - FIAR Audit Risk Prediction
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

# Load audit data
X = df[['control_deficiencies', 'prior_findings', 'complexity_score']]
y = df['audit_failure']  # 1=failed, 0=passed

# Hold out a test set before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# CRITICAL: Feature scaling for logistic regression
# (fit the scaler on training data only to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train with L2 regularization
model = LogisticRegression(
    C=0.1,              # Regularization strength (inverse)
    penalty='l2',       # L2 regularization
    max_iter=1000,
    solver='lbfgs',
    random_state=42
)

# Cross-validation for robust evaluation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='f1')
print(f"CV F1 Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Train final model
model.fit(X_train_scaled, y_train)

# Predictions with probability
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = model.predict(X_test_scaled)

# Comprehensive evaluation
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")

Support Vector Machine (SVM)

Finds optimal hyperplane maximizing margin between classes. Kernel trick enables non-linear classification. Effective in high-dimensional spaces.

Use Cases

  • Text classification (high-dimensional)
  • Image recognition and computer vision
  • Bioinformatics (gene classification)
  • Handwriting recognition
  • Face detection

Advantages

  • ✓ Effective in high-dimensional spaces
  • ✓ Memory efficient (uses support vectors only)
  • ✓ Versatile (multiple kernel functions)
  • ✓ Strong with clear margin of separation
  • ✓ Robust to overfitting in high dimensions

Disadvantages

  • ✗ Slow on large datasets (O(n²) to O(n³))
  • ✗ Sensitive to feature scaling
  • ✗ No direct probability estimates
  • ✗ Difficult to interpret (black box)
  • ✗ Choice of kernel requires domain knowledge
python - SVM with RBF Kernel
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Feature scaling is CRITICAL for SVM
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# RBF kernel for non-linear classification
svm = SVC(
    kernel='rbf',           # Radial Basis Function
    C=1.0,                  # Regularization parameter
    gamma='scale',          # Kernel coefficient (auto-tuned)
    probability=True,       # Enable probability estimates
    random_state=42
)

# Train
svm.fit(X_train_scaled, y_train)

# Predict
y_pred = svm.predict(X_test_scaled)
y_proba = svm.predict_proba(X_test_scaled)

# Model info
print(f"Support vectors: {svm.n_support_}")
print(f"Classes: {svm.classes_}")

# Evaluate
from sklearn.metrics import accuracy_score, confusion_matrix
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(confusion_matrix(y_test, y_pred))
🌲

Random Forest

Ensemble method combining multiple decision trees through bootstrap aggregating (bagging). Each tree votes; majority wins. Reduces overfitting through randomization.

Use Cases

  • Federal contract award classification
  • Credit risk assessment
  • Customer churn prediction
  • Disease diagnosis (medical)
  • Stock market prediction

Advantages

  • ✓ Handles non-linear relationships naturally
  • ✓ Robust to outliers and noise
  • ✓ Provides feature importance rankings
  • ✓ No feature scaling required
  • ✓ Works with mixed data types
  • ✓ Parallel training possible

Disadvantages

  • ✗ Slower than single decision tree
  • ✗ Large memory footprint (stores all trees)
  • ✗ Less interpretable than single tree
  • ✗ Can overfit on noisy data
python - Federal Contract Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
import pandas as pd

# Prepare features (categorical encoding)
X = pd.get_dummies(df[['agency', 'naics_code', 'amount', 'location']])
y = df['award_type']  # 0=competitive, 1=sole-source

# Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,       # Number of trees
    max_depth=10,           # Prevent overfitting
    min_samples_split=20,   # Minimum samples to split
    min_samples_leaf=10,    # Minimum samples per leaf
    max_features='sqrt',    # Features per split
    random_state=42,
    n_jobs=-1               # Use all CPU cores
)

# Cross-validation on the training set
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Train final model
rf.fit(X_train, y_train)

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 Most Important Features:")
print(feature_importance.head(10))

# Predictions
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)

# Evaluate
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
⚙️

SGD Classifier

Stochastic Gradient Descent for large-scale learning. Updates model incrementally using one sample at a time. Efficient for massive datasets and online learning.

Use Cases

  • Large-scale text classification
  • Online learning (streaming data)
  • Real-time prediction systems
  • Datasets too large for memory

Loss Functions:

  • "hinge" → SVM (max margin)
  • "log_loss" → Logistic Regression
  • "perceptron" → Perceptron
  • "squared_error" → Linear Regression

Advantages

  • ✓ Extremely efficient on large datasets
  • ✓ Supports online/incremental learning
  • ✓ Memory efficient (processes one sample)
  • ✓ Multiple loss functions available
  • ✓ Easy to implement and understand

Disadvantages

  • ✗ Requires feature scaling
  • ✗ Sensitive to learning rate
  • ✗ May not converge without tuning
  • ✗ Requires many hyperparameters
python - Binary Classification with SGD
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Binary target (e.g., digit 5 detection)
digits = load_digits()
y_binary = (digits.target == 5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, y_binary, test_size=0.2, stratify=y_binary, random_state=42
)

# MUST scale features for SGD
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train SGD with SVM loss
sgd = SGDClassifier(
    loss='hinge',            # SVM with hinge loss
    penalty='l2',            # L2 regularization
    alpha=0.0001,            # Regularization strength
    max_iter=1000,           # Maximum iterations
    tol=1e-3,                # Stopping criterion
    learning_rate='optimal', # Adaptive learning rate
    random_state=42
)

# Train
sgd.fit(X_train_scaled, y_train)

# Cross-validation for robust evaluation
cv_scores = cross_val_score(sgd, X_train_scaled, y_train, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Predict
y_pred = sgd.predict(X_test_scaled)

# For online learning (incremental):
# sgd.partial_fit(X_new_batch, y_new_batch, classes=[0, 1])
📍

K-Nearest Neighbors (KNN)

Instance-based learning: classifies based on K closest training examples. Simple, intuitive, no training phase. Distance-based algorithm.

Use Cases

  • Recommendation systems
  • Pattern recognition
  • Small-scale classification tasks
  • Anomaly detection

Advantages

  • ✓ Simple to understand and implement
  • ✓ No training phase required
  • ✓ Adapts to new data easily
  • ✓ Non-parametric (no assumptions)

Disadvantages

  • ✗ Slow prediction (O(n) per query)
  • ✗ Sensitive to feature scaling
  • ✗ Curse of dimensionality
  • ✗ Requires storing all training data
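A minimal KNN sketch using scikit-learn's built-in iris data; k=5 and the 80/20 split are illustrative choices, and the pipeline bundles the feature scaling that distance-based methods need:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling matters: KNN classifies by distance to neighbors
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)   # "training" just stores the data

print(f"Test accuracy: {knn.score(X_test, y_test):.3f}")
```

Note there is no model to inspect after fitting: every prediction scans the stored training set, which is why inference is O(n) per query.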
📊

Naive Bayes

Probabilistic classifier based on Bayes' theorem with independence assumption between features. Fast, simple, works well with high-dimensional data.

Use Cases

  • Text classification (spam filtering)
  • Sentiment analysis
  • Document categorization
  • Real-time prediction

Advantages

  • ✓ Very fast training and prediction
  • ✓ Works well with small datasets
  • ✓ Handles high-dimensional data well
  • ✓ Simple and interpretable

Disadvantages

  • ✗ Independence assumption often violated
  • ✗ "Zero frequency" problem
  • ✗ Poor probability estimates
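A minimal spam-filtering sketch with multinomial Naive Bayes; the toy corpus below is invented for illustration. Setting alpha=1.0 applies Laplace smoothing, which addresses the zero-frequency problem noted above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative)
texts = [
    "win free prize now", "free money claim prize",
    "meeting agenda attached", "project status meeting notes",
    "claim your free reward", "quarterly budget review meeting",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham

# alpha=1.0 -> Laplace smoothing: unseen words get a small nonzero count
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)

print(clf.predict(["win free money now"]))      # spam-leaning words
print(clf.predict(["project meeting agenda"]))  # ham-leaning words
```

Despite the word-independence assumption being obviously false for natural language, the per-word likelihoods are usually strong enough that the argmax class is still right, which is why Naive Bayes remains a solid text-classification baseline.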

Clustering Algorithms

Unsupervised learning for pattern discovery

🔵

K-Means Clustering

Partitions data into K clusters by minimizing within-cluster variance. Iteratively assigns points to nearest centroid and updates centroids.

Use Cases

  • Customer segmentation by behavior
  • Federal agency spending pattern grouping
  • Document clustering
  • Image compression
  • Anomaly detection (outliers from clusters)

Advantages

  • ✓ Fast and scalable (roughly linear in n per iteration)
  • ✓ Simple to implement
  • ✓ Works well with spherical clusters
  • ✓ Guaranteed convergence (to a local optimum)

Disadvantages

  • ✗ Must specify K in advance
  • ✗ Sensitive to initial centroid placement
  • ✗ Assumes spherical, similar-size clusters
  • ✗ Sensitive to outliers
python - Agency Spending Clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Prepare features
X = df[['total_budget', 'execution_rate', 'variance_score']]

# Scale features (important for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow method to find optimal K
inertias = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

# Train with optimal K (e.g., K=4)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels to dataframe
df['cluster'] = clusters

# Analyze clusters
cluster_summary = df.groupby('cluster').agg({
    'total_budget': 'mean',
    'execution_rate': 'mean',
    'variance_score': 'mean'
}).round(2)

print(cluster_summary)
🎲

Gaussian Mixture Models (GMM)

Probabilistic model assuming data is generated from mixture of Gaussian distributions. Soft clustering with probability assignments. Uses EM algorithm.

Use Cases

  • • Anomaly detection
  • • Density estimation
  • • Soft clustering (probability-based)
  • • Background subtraction in images

Advantages

  • ✓ Soft clustering (probabilities)
  • ✓ Flexible cluster shapes (ellipsoidal)
  • ✓ Can model complex distributions
  • ✓ Provides density estimates

Disadvantages

  • ✗ Slower than K-Means
  • ✗ Sensitive to initialization
  • ✗ Can get stuck in local optima
  • ✗ Struggles with non-ellipsoidal clusters
python - GMM for Anomaly Detection
from sklearn.mixture import GaussianMixture
import numpy as np

# Train GMM
gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)

# Model parameters
print(f"Converged: {gm.converged_}")
print(f"Iterations: {gm.n_iter_}")
print(f"Weights: {np.round(gm.weights_, 2)}")

# Soft clustering (probabilities)
probabilities = gm.predict_proba(X)
print(f"Sample belongs to cluster 0 with prob: {probabilities[0, 0]:.3f}")

# Hard clustering (assign to most probable cluster)
clusters = gm.predict(X)

# Anomaly detection using density
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4)  # Bottom 4%
anomalies = X[densities < density_threshold]

print(f"Detected {len(anomalies)} anomalies")

# Use BIC/AIC to select optimal number of components
bics = []
for k in range(1, 10):
    gm = GaussianMixture(n_components=k, n_init=10, random_state=42)
    gm.fit(X)
    bics.append(gm.bic(X))

optimal_k = np.argmin(bics) + 1
print(f"Optimal number of components: {optimal_k}")

AI Agents Framework

Building production-ready agentic systems

Agent Architecture Overview

AI agents are autonomous systems that reason, plan, and use tools to accomplish tasks. They combine LLM reasoning with function calling to interact with external systems, databases, and APIs, following the ReAct (Reasoning + Acting) pattern.

1️⃣

Define Tools

Create functions the agent can call: search databases, fetch data, send emails, make API calls. Each tool has name, description, and parameter schema.

2️⃣

System Prompt

Define agent's identity, knowledge domain, capabilities, and behavior. Specify when to use tools and how to format responses.

3️⃣

Agentic Loop

Agent decides autonomously: analyze request → call tool if needed → process result → reason about next step → repeat until task complete.

typescript - Production Agent Implementation
import { GoogleGenAI } from '@google/genai';

// 1. Define Tools (Function Declarations)
const tools = [
  {
    name: "search_database",
    description: "Search DoD budget database for obligation data",
    parameters: {
      type: "object",
      properties: {
        account: { type: "string", description: "Appropriation account code" },
        fiscal_year: { type: "number", description: "Fiscal year (2020-2026)" }
      },
      required: ["account"]
    }
  },
  {
    name: "get_audit_status",
    description: "Get FIAR audit status for a DoD component",
    parameters: {
      type: "object",
      properties: {
        component: { type: "string", description: "DoD component name" }
      },
      required: ["component"]
    }
  }
];

// 2. System Prompt
const systemPrompt = `You are a DoD financial analyst assistant with expertise in:
- Budget execution and obligation tracking
- FIAR audit readiness
- OMB circulars (A-11, A-123)
- DoD Financial Management Regulation

Use your tools to retrieve accurate, current data. Always cite your sources.
Provide clear, professional analysis suitable for senior leadership.`;

// 3. Agentic Loop Implementation
async function runAgent(userMessage: string) {
  const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

  const chat = ai.chats.create({
    model: 'gemini-2.5-flash',
    config: {
      systemInstruction: systemPrompt,
      tools: [{ functionDeclarations: tools }]
    }
  });

  let response = await chat.sendMessage({ message: userMessage });

  // Iterative loop: agent decides when to use tools
  let iterations = 0;
  const MAX_ITERATIONS = 5;

  while (response.functionCalls?.length && iterations < MAX_ITERATIONS) {
    const toolResults = [];
    
    // Execute each tool call
    for (const functionCall of response.functionCalls) {
      console.log(`Agent calling: ${functionCall.name}`);
      const result = await executeFunction(functionCall.name, functionCall.args);
      toolResults.push({
        functionResponse: {
          name: functionCall.name,
          response: result
        }
      });
    }
    
    // Send results back to agent
    response = await chat.sendMessage({ message: toolResults });
    iterations++;
  }

  return {
    text: response.text,
    iterations,
    toolCallsMade: iterations > 0
  };
}

// Tool Executor
async function executeFunction(name: string, args: any) {
  switch (name) {
    case 'search_database':
      return await searchBudgetDB(args.account, args.fiscal_year);
    case 'get_audit_status':
      return await getAuditStatus(args.component);
    default:
      return { error: "Unknown function" };
  }
}

// Example usage
const result = await runAgent("What is the obligation rate for account 97X4930 in FY2025?");
console.log(result.text);

My Production Agent: MyThing Platform

Live multi-agent system powering this site. 4 specialized agents with 5 tools, intelligent routing, and 500+ daily interactions.

Agents

💼 Portfolio Agent: Background, skills, certifications
🚀 Tech Trends Agent: Latest AI/ML from 22 sources
🏛️ DoD Policy Agent: Budget, audit, IT policy expert
📝 Notes Agent: Capture & analyze thoughts

Tools

search_tech_articles: Query 500+ articles
get_platform_stats: Real-time metrics
save_note: Store with AI analysis
get_recent_notes: Retrieve insights

Key Features: Automatic routing based on keywords, iterative function calling (up to 3 rounds), model fallback chain (Gemini 2.5 Flash → Flash Lite), conversation history management, structured outputs.

Ask My AI Agent

Use the chat widget to ask questions about ML algorithms, AI agents, evaluation metrics, or how I apply these techniques to DoD financial management.
