AI & ML Comprehensive Knowledge Hub
Complete reference: ML algorithms, evaluation metrics, clustering, AI agents, and real-world DoD applications
🎉 New: Interactive ML AI Knowledge Hub App
This comprehensive guide is now available as a standalone web app at mlaithing.vercel.app — featuring 8+ algorithms with production code, clustering techniques, an AI agents framework, and real-world DoD use cases in an interactive format.
Machine Learning Overview
Supervised Learning
Learn from labeled data
- • Classification: Categories
- • Regression: Continuous values
Unsupervised Learning
Find patterns in data
- • Clustering: Grouping
- • Dimensionality reduction
Reinforcement Learning
Learn through rewards
- • Agent takes actions
- • Receives feedback
Evaluation Metrics Guide
Classification
- • Accuracy: Use only with balanced classes
- • Precision: Of predicted positives, how many were correct?
- • Recall: Of actual positives, how many were caught?
- • F1 Score: Harmonic mean balancing precision and recall
Regression
- • MAE: Average absolute difference
- • MSE/RMSE: Penalizes large errors more
- • R²: Variance explained (0 to 1)
Key Insight: With imbalanced data (e.g., 99% negative class), accuracy is misleading. Always use Precision, Recall, F1, and AUC-ROC for imbalanced datasets.
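To see why, here is a minimal sketch with synthetic labels (the numbers are purely illustrative): a classifier that always predicts the majority class scores 99% accuracy while catching zero actual positives.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Synthetic imbalanced labels: 990 negatives, 10 positives
y_true = np.array([0] * 990 + [1] * 10)
# A "classifier" that always predicts the majority class
y_pred = np.zeros_like(y_true)
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.99 -- looks great
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.00 -- catches no positives
print(f"F1:        {f1_score(y_true, y_pred, zero_division=0):.2f}")          # 0.00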
Classification Algorithms
Supervised learning for categorical predictions
Logistic Regression
Binary classifier using sigmoid function to map outputs to [0,1] probability range. Linear model suitable for linearly separable problems.
DoD/Federal Use Cases
- • FIAR audit pass/fail prediction
- • Medical diagnosis classification
- • Email spam/ham filtering
- • Credit approval decisions
- • Fraud detection in transactions
Advantages
- ✓ Provides calibrated probability scores
- ✓ Interpretable coefficients (feature importance)
- ✓ Fast training and prediction
- ✓ Low computational requirements
- ✓ Works well with high-dimensional data
Disadvantages
- ✗ Assumes linear decision boundary
- ✗ Requires feature scaling for optimal performance
- ✗ Cannot capture complex feature interactions
- ✗ Sensitive to multicollinearity
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
# Load audit data
X = df[['control_deficiencies', 'prior_findings', 'complexity_score']]
y = df['audit_failure']  # 1=failed, 0=passed
# Hold out a test set for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# CRITICAL: Feature scaling for logistic regression (fit on training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train with L2 regularization
model = LogisticRegression(
    C=0.1,           # Regularization strength (inverse)
    penalty='l2',    # L2 regularization
    max_iter=1000,
    solver='lbfgs',
    random_state=42
)
# Cross-validation for robust evaluation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='f1')
print(f"CV F1 Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
# Train final model
model.fit(X_train_scaled, y_train)
# Predictions with probability
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = model.predict(X_test_scaled)
# Comprehensive evaluation
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")
Support Vector Machine (SVM)
Finds optimal hyperplane maximizing margin between classes. Kernel trick enables non-linear classification. Effective in high-dimensional spaces.
Use Cases
- • Text classification (high-dimensional)
- • Image recognition and computer vision
- • Bioinformatics (gene classification)
- • Handwriting recognition
- • Face detection
Advantages
- ✓ Effective in high-dimensional spaces
- ✓ Memory efficient (uses support vectors only)
- ✓ Versatile (multiple kernel functions)
- ✓ Strong with clear margin of separation
- ✓ Robust to overfitting in high dimensions
Disadvantages
- ✗ Slow on large datasets (O(n²) to O(n³))
- ✗ Sensitive to feature scaling
- ✗ No direct probability estimates
- ✗ Difficult to interpret (black box)
- ✗ Choice of kernel requires domain knowledge
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Feature scaling is CRITICAL for SVM
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# RBF kernel for non-linear classification
svm = SVC(
    kernel='rbf',      # Radial Basis Function
    C=1.0,             # Regularization parameter
    gamma='scale',     # Kernel coefficient scaled to the data
    probability=True,  # Probability estimates via Platt scaling (slower training)
    random_state=42
)
# Train
svm.fit(X_train_scaled, y_train)
# Predict
y_pred = svm.predict(X_test_scaled)
y_proba = svm.predict_proba(X_test_scaled)
# Model info
print(f"Support vectors: {svm.n_support_}")
print(f"Classes: {svm.classes_}")
# Evaluate
from sklearn.metrics import accuracy_score, confusion_matrix
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(confusion_matrix(y_test, y_pred))
Random Forest
Ensemble method combining multiple decision trees through bootstrap aggregating (bagging). Each tree votes; majority wins. Reduces overfitting through randomization.
Use Cases
- • Federal contract award classification
- • Credit risk assessment
- • Customer churn prediction
- • Disease diagnosis (medical)
- • Stock market prediction
Advantages
- ✓ Handles non-linear relationships naturally
- ✓ Robust to outliers and noise
- ✓ Provides feature importance rankings
- ✓ No feature scaling required
- ✓ Works with mixed data types
- ✓ Parallel training possible
Disadvantages
- ✗ Slower than single decision tree
- ✗ Large memory footprint (stores all trees)
- ✗ Less interpretable than single tree
- ✗ Can overfit on noisy data
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import pandas as pd
# Prepare features (categorical encoding)
X = pd.get_dummies(df[['agency', 'naics_code', 'amount', 'location']])
y = df['award_type']  # 0=competitive, 1=sole-source
# Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=10,          # Prevent overfitting
    min_samples_split=20,  # Minimum samples to split
    min_samples_leaf=10,   # Minimum samples per leaf
    max_features='sqrt',   # Features per split
    random_state=42,
    n_jobs=-1              # Use all CPU cores
)
# Cross-validation
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
# Train final model
rf.fit(X_train, y_train)
# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("Top 10 Most Important Features:")
print(feature_importance.head(10))
# Predictions
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)
# Evaluate
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
SGD Classifier
Stochastic Gradient Descent for large-scale learning. Updates model incrementally using one sample at a time. Efficient for massive datasets and online learning.
Use Cases
- • Large-scale text classification
- • Online learning (streaming data)
- • Real-time prediction systems
- • Datasets too large for memory
Loss Functions:
- • "hinge" → SVM (max margin)
- • "log_loss" → Logistic Regression
- • "perceptron" → Perceptron
- • "squared_error" → Linear Regression
Advantages
- ✓ Extremely efficient on large datasets
- ✓ Supports online/incremental learning
- ✓ Memory efficient (processes one sample)
- ✓ Multiple loss functions available
- ✓ Easy to implement and understand
Disadvantages
- ✗ Requires feature scaling
- ✗ Sensitive to learning rate
- ✗ May not converge without tuning
- ✗ Requires many hyperparameters
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Binary target (e.g., digit 5 detection)
digits = load_digits()
y_binary = (digits.target == 5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, y_binary, test_size=0.2, random_state=42)
# MUST scale features for SGD
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train SGD with SVM loss
sgd = SGDClassifier(
    loss='hinge',             # SVM with hinge loss
    penalty='l2',             # L2 regularization
    alpha=0.0001,             # Regularization strength
    max_iter=1000,            # Maximum iterations
    tol=1e-3,                 # Stopping criterion
    learning_rate='optimal',  # Adaptive learning rate
    random_state=42
)
# Train
sgd.fit(X_train_scaled, y_train)
# Cross-validation for robust evaluation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(sgd, X_train_scaled, y_train, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
# Predict
y_pred = sgd.predict(X_test_scaled)
# For online learning (incremental):
# sgd.partial_fit(X_new_batch, y_new_batch, classes=[0, 1])
K-Nearest Neighbors (KNN)
Instance-based learning: classifies based on K closest training examples. Simple, intuitive, no training phase. Distance-based algorithm.
Use Cases
- • Recommendation systems
- • Pattern recognition
- • Small-scale classification tasks
- • Anomaly detection
Advantages
- ✓ Simple to understand and implement
- ✓ No training phase required
- ✓ Adapts to new data easily
- ✓ Non-parametric (no assumptions)
Disadvantages
- ✗ Slow prediction (O(n) per query)
- ✗ Sensitive to feature scaling
- ✗ Curse of dimensionality
- ✗ Requires storing all training data
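Since this card has no snippet, here is a minimal sklearn sketch (assuming an X_train/X_test/y_train split as in the other examples); the pipeline guards against the scaling sensitivity noted above:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Scaling is essential: KNN is distance-based
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(
        n_neighbors=5,      # K: odd values avoid ties in binary tasks
        weights='distance'  # Closer neighbors count more
    )
)
knn.fit(X_train, y_train)     # "Training" just stores the data
y_pred = knn.predict(X_test)  # Each query searches for the 5 nearest neighbors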
Naive Bayes
Probabilistic classifier based on Bayes' theorem with independence assumption between features. Fast, simple, works well with high-dimensional data.
Use Cases
- • Text classification (spam filtering)
- • Sentiment analysis
- • Document categorization
- • Real-time prediction
Advantages
- ✓ Very fast training and prediction
- ✓ Works well with small datasets
- ✓ Handles high-dimensional data well
- ✓ Simple and interpretable
Disadvantages
- ✗ Independence assumption often violated
- ✗ "Zero frequency" problem
- ✗ Poor probability estimates
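For completeness, a minimal spam-filter sketch with MultinomialNB (the toy messages are invented for illustration); alpha=1.0 applies Laplace smoothing, which addresses the zero-frequency problem noted above:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Toy corpus (illustrative only)
texts = ["win free cash now", "meeting at noon",
         "free prize claim now", "budget review at noon"]
labels = [1, 0, 1, 0]  # 1=spam, 0=ham
vec = CountVectorizer()
X_counts = vec.fit_transform(texts)  # Bag-of-words counts
# alpha=1.0 (Laplace smoothing) avoids zero probability for unseen words
nb = MultinomialNB(alpha=1.0)
nb.fit(X_counts, labels)
print(nb.predict(vec.transform(["claim your free cash"])))  # -> [1] (spam)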
Clustering Algorithms
Unsupervised learning for pattern discovery
K-Means Clustering
Partitions data into K clusters by minimizing within-cluster variance. Iteratively assigns points to nearest centroid and updates centroids.
Use Cases
- • Customer segmentation by behavior
- • Federal agency spending pattern grouping
- • Document clustering
- • Image compression
- • Anomaly detection (outliers from clusters)
Advantages
- ✓ Fast and scalable (linear in the number of samples)
- ✓ Simple to implement
- ✓ Works well with spherical clusters
- ✓ Guaranteed to converge (to a local optimum)
Disadvantages
- ✗ Must specify K in advance
- ✗ Sensitive to initial centroid placement
- ✗ Assumes spherical, similar-size clusters
- ✗ Sensitive to outliers
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Prepare features
X = df[['total_budget', 'execution_rate', 'variance_score']]
# Scale features (important for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Elbow method to find optimal K
inertias = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal K')
plt.show()
# Train with optimal K (e.g., K=4)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)
# Add cluster labels to dataframe
df['cluster'] = clusters
# Analyze clusters
cluster_summary = df.groupby('cluster').agg({
    'total_budget': 'mean',
    'execution_rate': 'mean',
    'variance_score': 'mean'
}).round(2)
print(cluster_summary)
Gaussian Mixture Models (GMM)
Probabilistic model assuming data is generated from mixture of Gaussian distributions. Soft clustering with probability assignments. Uses EM algorithm.
Use Cases
- • Anomaly detection
- • Density estimation
- • Soft clustering (probability-based)
- • Background subtraction in images
Advantages
- ✓ Soft clustering (probabilities)
- ✓ Flexible cluster shapes (ellipsoidal)
- ✓ Can model complex distributions
- ✓ Provides density estimates
Disadvantages
- ✗ Slower than K-Means
- ✗ Sensitive to initialization
- ✗ Can get stuck in local optima
- ✗ Struggles with non-ellipsoidal clusters
from sklearn.mixture import GaussianMixture
import numpy as np
# Train GMM
gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)
# Model parameters
print(f"Converged: {gm.converged_}")
print(f"Iterations: {gm.n_iter_}")
print(f"Weights: {np.round(gm.weights_, 2)}")
# Soft clustering (probabilities)
probabilities = gm.predict_proba(X)
print(f"Sample belongs to cluster 0 with prob: {probabilities[0, 0]:.3f}")
# Hard clustering (assign to most probable cluster)
clusters = gm.predict(X)
# Anomaly detection using density
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4)  # Bottom 4%
anomalies = X[densities < density_threshold]
print(f"Detected {len(anomalies)} anomalies")
# Use BIC/AIC to select optimal number of components
bics = []
for k in range(1, 10):
    gm = GaussianMixture(n_components=k, n_init=10)
    gm.fit(X)
    bics.append(gm.bic(X))
optimal_k = np.argmin(bics) + 1
print(f"Optimal number of components: {optimal_k}")
AI Agents Framework
Building production-ready agentic systems
Agent Architecture Overview
AI Agents are autonomous systems that can reason, plan, and use tools to accomplish tasks. They combine LLM intelligence with function calling to interact with external systems, databases, and APIs. Based on ReAct (Reasoning + Acting) pattern.
Define Tools
Create functions the agent can call: search databases, fetch data, send emails, make API calls. Each tool has name, description, and parameter schema.
System Prompt
Define agent's identity, knowledge domain, capabilities, and behavior. Specify when to use tools and how to format responses.
Agentic Loop
Agent decides autonomously: analyze request → call tool if needed → process result → reason about next step → repeat until task complete.
import { GoogleGenAI, Type } from '@google/genai';
// 1. Define Tools (Function Declarations)
const tools = [
  {
    name: 'search_database',
    description: 'Search DoD budget database for obligation data',
    parameters: {
      type: Type.OBJECT,
      properties: {
        account: { type: Type.STRING, description: 'Appropriation account code' },
        fiscal_year: { type: Type.NUMBER, description: 'Fiscal year (2020-2026)' }
      },
      required: ['account']
    }
  },
  {
    name: 'get_audit_status',
    description: 'Get FIAR audit status for a DoD component',
    parameters: {
      type: Type.OBJECT,
      properties: {
        component: { type: Type.STRING, description: 'DoD component name' }
      },
      required: ['component']
    }
  }
];
// 2. System Prompt
const systemPrompt = `You are a DoD financial analyst assistant with expertise in:
- Budget execution and obligation tracking
- FIAR audit readiness
- OMB circulars (A-11, A-123)
- DoD Financial Management Regulation
Use your tools to retrieve accurate, current data. Always cite your sources.
Provide clear, professional analysis suitable for senior leadership.`;
// 3. Agentic Loop Implementation
async function runAgent(userMessage: string) {
  const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
  const chat = ai.chats.create({
    model: 'gemini-2.5-flash',
    config: {
      systemInstruction: systemPrompt,
      tools: [{ functionDeclarations: tools }]
    }
  });
  let response = await chat.sendMessage({ message: userMessage });
  // Iterative loop: agent decides when to use tools
  let iterations = 0;
  const MAX_ITERATIONS = 5;
  while (response.functionCalls?.length && iterations < MAX_ITERATIONS) {
    const toolResults = [];
    // Execute each tool call
    for (const functionCall of response.functionCalls) {
      console.log(`Agent calling: ${functionCall.name}`);
      const result = await executeFunction(functionCall.name, functionCall.args);
      toolResults.push({
        functionResponse: {
          name: functionCall.name,
          response: result
        }
      });
    }
    // Send results back to the agent
    response = await chat.sendMessage({ message: toolResults });
    iterations++;
  }
  return {
    text: response.text,
    iterations,
    toolCallsMade: iterations > 0
  };
}
// Tool Executor
async function executeFunction(name: string, args: any) {
  switch (name) {
    case 'search_database':
      return await searchBudgetDB(args.account, args.fiscal_year);
    case 'get_audit_status':
      return await getAuditStatus(args.component);
    default:
      return { error: 'Unknown function' };
  }
}
// Example usage
const result = await runAgent('What is the obligation rate for account 97X4930 in FY2025?');
console.log(result.text);
My Production Agent: MyThing Platform
Live multi-agent system powering this site. 4 specialized agents with 5 tools, intelligent routing, and 500+ daily interactions.
Key Features: Automatic routing based on keywords, iterative function calling (up to 3 rounds), model fallback chain (Gemini 2.5 Flash → Flash Lite), conversation history management, structured outputs.
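As a rough illustration of the routing and fallback ideas, here is a hypothetical Python sketch; the real platform runs TypeScript, and the keyword table, agent names, and call_model helper are invented for this example:
# Hypothetical sketch: keyword routing plus a model fallback chain.
AGENT_KEYWORDS = {
    "portfolio": ["project", "portfolio", "resume"],
    "tech": ["python", "algorithm", "agent"],
    "dod_policy": ["fiar", "omb", "obligation", "appropriation"],
}
MODEL_CHAIN = ["gemini-2.5-flash", "gemini-2.5-flash-lite"]  # fallback order

def route(message: str) -> str:
    """Pick the first agent whose keywords appear in the message."""
    lowered = message.lower()
    for agent, keywords in AGENT_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return agent
    return "notes"  # default agent

def call_model(model: str, prompt: str) -> str:
    """Stand-in for the real SDK call; replace with an actual client."""
    raise NotImplementedError

def call_with_fallback(prompt: str) -> str:
    """Try each model in order; fall through to the next on failure."""
    for model in MODEL_CHAIN:
        try:
            return call_model(model, prompt)
        except Exception:
            continue  # try the next model in the chain
    raise RuntimeError("All models in the fallback chain failed")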
Kaggle AI Agents Resources
Introduction to Agents
Foundational concepts, ReAct pattern, agent loops
Tools & MCP
Function calling, Model Context Protocol, interoperability
Context & Memory
Managing context windows, session handling, state persistence
Agent Quality
Testing strategies, evaluation metrics, quality assurance
Production Deployment
Scaling, monitoring, best practices for production systems
Ask My AI Agent
Use the chat widget to ask questions about ML algorithms, AI agents, evaluation metrics, or how I apply these techniques to DoD financial management.