Run Full Benchmark Suite
Overview
Flow ID: run-full-benchmark
Category: Performance Benchmarking
Estimated Duration: 15-60 minutes (depends on models and hardware)
User Role: All Users (though technical users benefit most)
Complexity: Moderate
Purpose: This flow enables users to run comprehensive performance tests on their AI models to understand how well they perform on their specific hardware. Benchmarking measures speed, throughput, latency, and resource usage across various test scenarios, helping users make informed decisions about model selection and configuration.
Trigger
What initiates this flow:
- User manually initiates
Specific trigger: User wants to test model performance, typically because:
- They just uploaded a new model and want to know how it performs
- They want to compare different models to choose the best one
- They’re experiencing performance issues and want diagnostic data
- They want to optimize settings (context window, memory allocation)
- They’re documenting system capabilities for deployment planning
Prerequisites
Before starting, users must have:
- Application installed and running
- At least one chat model uploaded
- At least one embedding model uploaded (for full suite)
- Optional: Dataset uploaded for embedding tests
- Time to let tests run uninterrupted (15-60 minutes)
- Application window should remain open and in focus (recommended)
User Intent Analysis
Primary Intent
Measure and document how well AI models perform on the current hardware to make informed decisions about model selection, configuration, and deployment.
Secondary Intents
- Compare multiple models to identify the best one
- Identify performance bottlenecks
- Validate hardware capabilities before production use
- Generate performance reports for documentation
- Optimize settings based on measured data
- Troubleshoot slow performance issues
Subintents
- Understand tokens per second (speed) for different scenarios
- Measure memory usage and resource consumption
- Test context window handling
- Validate stability under load
- Collect data for future reference
Step-by-Step Flow
Main Path (Happy Path)
Step 1: Navigate to Benchmarking Section
- User Action: Click “Settings” in navigation, then click “Benchmarking” tab
- System Response: Benchmarking page loads
- UI Elements Visible:
- “Benchmarking” tab highlighted
- Page divided into two main areas:
- Left panel: System information and status
- Right panel: Benchmark results tables
- “Run Benchmark” button (prominent, likely blue)
- System hardware information displayed
- Any previous benchmark results shown in tables
- Visual Cues:
- Clean dashboard layout
- Clear call-to-action button
Step 2: Review System Information
- User Action: Review system specs displayed in left panel
- System Response: Hardware information shown
- UI Elements Visible:
- System info card showing:
- Operating System
- CPU model and core count
- RAM amount
- GPU information (model, VRAM)
- NPU availability (Yes/No)
- Current date/time
- Previous benchmark date (if any)
- Visual Cues:
- Clean card layout
- Icons for each hardware component
- Clear labels
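The following is a minimal, hypothetical sketch of how the hardware details in this panel could be gathered in a Node.js/Electron context. The function and field names are illustrative assumptions; GPU and NPU detection require platform-specific APIs and are shown only as placeholders.

```javascript
// Hypothetical sketch: gathering the system info shown in the left panel.
// Assumes a Node.js/Electron context. GPU and NPU detection are
// platform-specific and represented here as placeholders.
const os = require('os');

function getSystemInfo() {
  return {
    operatingSystem: `${os.type()} ${os.release()}`,
    cpuModel: os.cpus()[0].model,
    cpuCores: os.cpus().length,
    totalRamGb: Math.round(os.totalmem() / 1024 ** 3),
    gpu: 'requires a platform-specific API (placeholder)',
    npuAvailable: false, // placeholder; real detection is hardware-specific
    timestamp: new Date().toISOString(),
  };
}

console.log(getSystemInfo());
```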
Step 3: Check Models Configured
- User Action: Verify which models will be tested (shown in results table or header)
- System Response: Displays models that will be benchmarked
- UI Elements Visible:
- Model sections in results table:
- Chat model section (e.g., “Llama-3-8B”)
- Embedding model section (e.g., “Jina Embeddings”)
- Each model has expandable test categories
- Status indicators for each test (Pending, Completed, etc.)
- Visual Cues:
- Model names in section headers
- Expandable arrows or indicators
Step 4: Initiate Benchmark
- User Action: Click “Run Benchmark” button
- System Response:
- Button changes to “Running…” or becomes disabled
- Loading model phase begins
- System info panel updates to show “Running” status
- Progress information appears
- Stop button becomes available
- UI Elements Visible:
- “Stop Benchmark” button replaces “Run Benchmark”
- Status text: “Running Benchmark…”
- Progress section showing:
- Start time
- Overall progress (X of Y tests completed)
- Progress bar
- Test status cells begin updating
- Visual Cues:
- Blue or animated indicators for active state
- Progress bar begins filling
- Status text updates
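As a rough illustration of the progress state described in this step, the sketch below tracks completed tests, elapsed time, and a percentage suitable for driving a progress bar. The class and property names are assumptions, not the application's actual API.

```javascript
// Hypothetical sketch of the progress state the UI could track during a run.
class BenchmarkProgress {
  constructor(totalTests) {
    this.totalTests = totalTests;
    this.completedCount = 0;
    this.startTime = Date.now();
  }

  markTestComplete() {
    this.completedCount += 1;
  }

  get percentComplete() {
    return Math.round((this.completedCount / this.totalTests) * 100);
  }

  get elapsedSeconds() {
    return Math.round((Date.now() - this.startTime) / 1000);
  }

  get statusText() {
    return `${this.completedCount} of ${this.totalTests} tests completed (${this.percentComplete}%)`;
  }
}

// Example: after 15 of 30 tests, statusText reads "15 of 30 tests completed (50%)"
```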
Step 5: Model Loading Phase
- User Action: Wait while model loads (30-90 seconds)
- System Response:
- Status shows “Loading model…”
- Progress updates
- UI Elements Visible:
- Loading status text
- May show percentage
- Name of the model being loaded
- Visual Cues:
- Animated loading indicators
- Progress bar advances slightly
Step 6: LLM Tests Execute
- User Action: Watch as chat model tests run automatically
- System Response:
- Series of tests execute sequentially:
- Context Size tests (different sizes: 2048, 4096, 8192 tokens)
- Non-Streaming tests
- Chat History tests (multi-turn simulation)
- Memory Override tests (different VRAM allocations)
- Each test updates table cell with results
- Progress bar advances
- Real-time metrics displayed
- UI Elements Visible:
- Results table with test categories expanded
- Each test row showing:
- Test name (e.g., “Context Size 4096, 25% tokens”)
- Status (changes from Pending → Loading → In Progress → Completed)
- Load time (seconds)
- First token time
- Generation time
- Input/output tokens
- Tokens per second
- CPU usage percentage
- Tests complete with green “Completed” badges
- Failed tests show red “Error” badges
- Current test highlighted or animated
- Visual Cues:
- Color-coded status badges
- Animated current test
- Numbers populate in real-time
- Progress percentage increases
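To illustrate how the per-test metrics in this table relate to each other, here is a minimal sketch that derives them from raw timings. Field names are illustrative; the application's actual result schema may differ.

```javascript
// Hypothetical sketch of deriving per-test LLM metrics from raw timings.
function computeLlmMetrics({ loadMs, firstTokenMs, generationMs, outputTokens }) {
  return {
    loadTimeSec: (loadMs / 1000).toFixed(2),
    firstTokenTimeSec: (firstTokenMs / 1000).toFixed(2),
    generationTimeSec: (generationMs / 1000).toFixed(2),
    // Tokens per second = output tokens / generation time in seconds
    tokensPerSecond: (outputTokens / (generationMs / 1000)).toFixed(1),
  };
}

// Example: 512 output tokens generated in 25.6 seconds ≈ 20 tokens/sec
console.log(computeLlmMetrics({
  loadMs: 42000,
  firstTokenMs: 850,
  generationMs: 25600,
  outputTokens: 512,
}));
```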
Step 7: Monitor LLM Test Progress
- User Action: Observe tests completing; users can switch to other applications and work on other tasks
- System Response: Tests continue automatically
- UI Elements Visible:
- Tests completing one by one
- Results populating table
- Progress: “15 of 30 tests completed” or similar
- Elapsed time counter
- Overall progress bar (may be at 30-40% after LLM tests)
- Visual Cues:
- Smooth progression through tests
- Rows change color as they complete
- Progress bar incrementally advances
Step 8: Embedding Tests Begin
- User Action: Continue waiting/monitoring
- System Response:
- After LLM tests complete, embedding tests start
- Embedding model loads
- Embedding test categories execute:
- Single Embedding tests (various text sizes)
- Bulk Embedding tests (different batch sizes and worker counts)
- Table updates with embedding results
- UI Elements Visible:
- Embedding model section becomes active
- Test rows for embedding benchmarks
- Results showing:
- Test name (e.g., “500 characters, NPU”)
- Status
- Load time
- Embedding time
- Workers used (NPU/CPU counts)
- Embeddings per second
- CPU load
- Progress continues: “20 of 30 tests completed”
- Visual Cues:
- Focus shifts to embedding section
- Similar visual updates as LLM tests
- Progress bar advances further (60-80% range)
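The embedding throughput figure works the same way as tokens per second. The sketch below, using assumed field and worker names, shows the calculation.

```javascript
// Hypothetical sketch of the throughput calculation for a bulk embedding test.
function computeEmbeddingMetrics({ textsEmbedded, embeddingMs, npuWorkers, cpuWorkers }) {
  return {
    embeddingTimeSec: (embeddingMs / 1000).toFixed(2),
    workers: { npu: npuWorkers, cpu: cpuWorkers },
    // Embeddings per second = texts embedded / embedding time in seconds
    embeddingsPerSecond: (textsEmbedded / (embeddingMs / 1000)).toFixed(1),
  };
}

// Example: 200 texts embedded in 12.5 seconds ≈ 16 embeddings/sec
console.log(computeEmbeddingMetrics({
  textsEmbedded: 200,
  embeddingMs: 12500,
  npuWorkers: 1,
  cpuWorkers: 4,
}));
```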
Step 9: Monitor System Performance
- User Action: Optionally watch system stats in left panel during tests
- System Response: Real-time system usage displayed
- UI Elements Visible:
- System panel showing current:
- CPU usage (percentage)
- Memory usage
- GPU usage (if applicable)
- Temperature (possibly)
- May show real-time graphs or charts
- Updates every few seconds
- Visual Cues:
- Live-updating numbers
- Graphs or meters showing resource usage
- Color coding for high usage (may turn red if excessive)
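A live panel like this is typically driven by periodic sampling. The sketch below is an assumed approach using Node's os module; GPU usage and temperature require platform-specific APIs and are omitted.

```javascript
// Hypothetical sketch of periodic resource sampling behind the live system panel.
const os = require('os');

function sampleSystemUsage() {
  const total = os.totalmem();
  const free = os.freemem();
  return {
    memoryUsedPercent: Math.round(((total - free) / total) * 100),
    loadAverage1m: os.loadavg()[0], // rough CPU pressure indicator (always 0 on Windows)
    sampledAt: new Date().toISOString(),
  };
}

// Poll every few seconds while the benchmark runs
const monitor = setInterval(() => console.log(sampleSystemUsage()), 5000);
// Call clearInterval(monitor) once the benchmark completes
```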
Step 10: All Tests Complete
- User Action: No action required
- System Response:
- Final test completes
- Progress reaches 100%
- Status changes to “Completed”
- “Stop Benchmark” button becomes “Run Benchmark” again
- Completion time and duration displayed
- UI Elements Visible:
- All test rows showing “Completed” status (or some may show “Error” if failed)
- Full results table populated
- Summary statistics:
- Total tests: 30 (or actual count)
- Completed: X
- Failed: Y
- Duration: “25 minutes 43 seconds”
- Export button available
- Reset button available
- Visual Cues:
- Green completion indicators
- Progress bar at 100%
- No more animations
- Results table fully populated
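The summary statistics shown at completion can be derived directly from the per-test results. The sketch below assumes status strings matching those shown in the UI; the real schema may differ.

```javascript
// Hypothetical sketch of deriving the completion summary from a results array.
function summarizeResults(results, startedAtMs, finishedAtMs) {
  const completed = results.filter((r) => r.status === 'Completed').length;
  const failed = results.filter((r) => r.status === 'Error').length;
  const durationSec = Math.round((finishedAtMs - startedAtMs) / 1000);
  return {
    totalTests: results.length,
    completed,
    failed,
    duration: `${Math.floor(durationSec / 60)} minutes ${durationSec % 60} seconds`,
  };
}
```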
Step 11: Review Results
- User Action: Scroll through results tables to examine performance data
- System Response: Results displayed with all metrics
- UI Elements Visible:
- Expandable test categories (click to expand/collapse)
- Detailed metrics for each test
- Color-coded performance indicators
- Comparison data between tests
- Visual Cues:
- Data presented in organized table format
- Numbers aligned for easy comparison
- Categories clearly separated
Step 12: Optional - Export Results
- User Action: Click “Export Results” or “Download” button
- System Response:
- Results exported as TSV (tab-separated values) file
- File downloads to computer
- UI Elements Visible:
- Export button
- Download dialog or browser notification
- Visual Cues: Download progress indicator
- Note: See export-benchmark-data.md for detailed export flow
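As a rough illustration of what a TSV export involves (see export-benchmark-data.md for the actual flow), the sketch below joins assumed column names with tab characters and triggers a browser-style download. Column and function names are hypothetical.

```javascript
// Hypothetical sketch of a TSV export; column names are illustrative only.
function resultsToTsv(results) {
  const columns = ['testName', 'status', 'loadTimeSec', 'tokensPerSecond', 'cpuLoadPercent'];
  const header = columns.join('\t');
  const rows = results.map((r) => columns.map((c) => r[c] ?? '').join('\t'));
  return [header, ...rows].join('\n');
}

// In a browser/Electron renderer, trigger a download of the TSV string
function downloadTsv(tsv, filename = 'benchmark-results.tsv') {
  const blob = new Blob([tsv], { type: 'text/tab-separated-values' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = filename;
  link.click();
  URL.revokeObjectURL(link.href);
}
```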
Final Step: Benchmark Complete with Results
- Success Indicator:
- All tests completed (or failed tests documented)
- Full results table populated with data
- Can export and reference results
- Performance characteristics understood
- System State Change:
- Benchmark results saved in database
- Performance data available for system optimization
- Results used to estimate future processing times
- Can compare with future benchmarks
- Next Possible Actions:
- Export results for documentation
- Compare models based on benchmark data
- Adjust settings based on findings (e.g., optimal context window size)
- Re-run specific tests to verify results
- Reset and run fresh benchmark with different models
- Use benchmark data to configure chat settings optimally
Alternative Paths & Strategies
Strategy A: Continue Partial Benchmark
When to use: Benchmark was interrupted previously
Steps:
- Navigate to Benchmarking tab
- See incomplete test results with some tests marked “Cancelled” or not started
- Click “Continue Benchmark” button (if available)
- System resumes from where it stopped
- Only incomplete tests re-run
- Total time reduced compared to full re-run
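Conceptually, resuming only needs to filter out tests that already completed. The sketch below is an assumed illustration; status values and field names mirror those described in this flow but are not the application's actual API.

```javascript
// Hypothetical sketch: select only the tests that still need to run on resume.
function getRemainingTests(allTests, previousResults) {
  const finished = new Set(
    previousResults
      .filter((r) => r.status === 'Completed')
      .map((r) => r.testName)
  );
  return allTests.filter((t) => !finished.has(t.testName));
}
```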
Strategy B: Quick Single-Test Benchmark
When to use: Just want to test one specific scenario
Steps:
- Navigate to Benchmarking
- Expand test category
- Click “Re-test” on specific test row
- Single test runs
- Results update for just that test
- Much faster than full suite (30 seconds - 2 minutes)
Strategy C: Model-Specific Benchmark
When to use: Only care about one model (LLM or embedding), not both
Steps:
- Start benchmark normally
- When first model completes, click “Stop Benchmark”
- Accept partial results
- Results for the first model are complete
- Tests for the second model remain “Not Started”
- Saves time if only testing one model
Strategy D: Background Benchmark
When to use: Want to work on other tasks while benchmark runs
Steps:
- Start benchmark
- Navigate to other pages or minimize application
- Benchmark continues in background
- Periodically check progress via status badge or returning to benchmarking page
- Return when complete to view results
QA Note: Background operation may be slower or less reliable; keeping the application in focus is recommended.
Error States & Recovery
Error 1: Model Load Failure
Cause: Insufficient memory, corrupted model, or system error
User Experience:
- Error during model loading phase
- Message: “Failed to load model” or “Model error”
- Benchmark stops
- Test marked as “Error”
Recovery Steps:
- Close other applications to free memory
- Try running benchmark again
- Try different model if specific model consistently fails
- Check model file integrity
- Restart application if needed
Error 2: Individual Test Timeout
Cause: Test takes longer than maximum allowed time (4 minutes)
User Experience:
- Test running indicator for long time
- Eventually shows “Error” or “Timeout”
- Message: “Test exceeded maximum time”
- Benchmark continues to next test
Recovery Steps:
- Note which test timed out
- Allow benchmark to continue (other tests may succeed)
- Test may be too demanding for hardware
- Consider not using that context size or configuration
- Can manually re-test later to verify
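A per-test timeout like the 4-minute limit described above is commonly implemented by racing the test against a timer. The sketch below is a generic illustration, not the application's actual implementation.

```javascript
// Hypothetical sketch of a per-test timeout guard (4 minutes, as described above).
const TEST_TIMEOUT_MS = 4 * 60 * 1000;

async function runWithTimeout(testFn, timeoutMs = TEST_TIMEOUT_MS) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Test exceeded maximum time')), timeoutMs);
  });
  try {
    // Whichever settles first wins: the test result or the timeout rejection
    return await Promise.race([testFn(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```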
Error 3: Benchmark Stops Unexpectedly
Cause: Application crash, system sleep, or user accidentally stopped it
User Experience:
- Progress stops mid-benchmark
- Some tests marked “Cancelled”
- Incomplete results
Recovery Steps:
- Check application status (did it crash?)
- Restart application if needed
- Return to Benchmarking tab
- Use “Continue Benchmark” to resume (if available)
- Or reset and run full benchmark again
Error 4: No Test Data File Available
Cause: Missing benchmark test data file used for consistent testing
User Experience:
- Error message at top of page: “Test data not loaded” or similar
- “Run Benchmark” button may be disabled
- Warning that benchmarks cannot run
Recovery Steps:
- Check internet connection (if test data needs download)
- Refresh page to retry loading test data
- Verify application installation is complete
- Reinstall application if file consistently missing
- Contact support if issue persists
Error 5: Out of Memory During Tests
Cause: System runs out of RAM during intensive testing
User Experience:
- Application becomes unresponsive
- May crash or freeze
- Benchmark stops with error
- Possible system warnings about low memory
Recovery Steps:
- Close other applications before running benchmark
- Restart application
- Try benchmarking smaller model
- Skip memory-intensive tests (large context sizes)
- Upgrade system RAM if needed for desired models
Pain Points & Friction
Identified Issues:
Long Duration Without Clear Time Estimate
- Impact: Users don’t know if benchmark will take 15 minutes or 2 hours
- Frequency: Every benchmark run
- Potential Improvement:
- Show estimated total time before starting
- Display time remaining during execution
- Warn about duration for large model/context combinations
- Allow selecting subset of tests for faster results
Cannot Pause Mid-Benchmark
- Impact: If users need their computer, they must stop the benchmark and lose progress
- Frequency: Long benchmarks that need interruption
- Potential Improvement:
- Add pause/resume functionality
- Save progress to resume later
- Allow partial completion with option to continue
Results Table Overwhelming
- Impact: Many rows and columns of numbers can be confusing
- Frequency: Viewing results, especially first time
- Potential Improvement:
- Add summary view with key highlights
- Visual indicators for good/bad performance
- Comparison to typical values
- Graphs/charts alongside tables
Unclear What “Good” Results Look Like
- Impact: Users see numbers but don’t know if performance is acceptable
- Frequency: All users without technical background
- Potential Improvement:
- Add benchmarks or baselines for comparison
- Color code results (green=good, yellow=ok, red=poor)
- Provide interpretation: “This is excellent/good/poor for this hardware”
- Show comparisons to similar systems
System Becomes Sluggish During Tests
- Impact: Difficult to use computer for other tasks while benchmark runs
- Frequency: On lower-spec systems or with large models
- Potential Improvement:
- Warn users before starting that system will be under load
- Provide “low-impact” mode that runs slower but allows other work
- Recommend running overnight or during breaks
- Better resource throttling
No Automatic Re-Test for Failed Tests
- Impact: Failed tests require manual retry
- Frequency: When occasional failures occur
- Potential Improvement:
- Auto-retry failed tests once
- Option to “Retry Failed Tests” in batch
- Identify and explain failure patterns
Design Considerations
Following Contextual Design Principles:
Automation Opportunities:
- Auto-run lightweight benchmark when new model uploaded
- Auto-retry failed tests once
- Auto-export results when complete
- Auto-recommend optimal settings based on results
Simplification Opportunities:
- Provide “Quick Benchmark” with essential tests only
- Hide advanced test categories by default
- Summary view before detailed results
- One-click benchmark with smart defaults
Transition Smoothness:
- Clear progress indication throughout
- Smooth test transitions
- Results populate incrementally
- Easy to stop and resume
User Trust:
- Accurate progress estimation
- Reliable test execution
- Verifiable results
- Clear success/failure for each test
Cognitive Load:
- Don’t require users to understand technical metrics
- Provide interpretations of results
- Visual progress indicators
- Clear start/stop controls
Related Flows
- View Benchmark Results - Analyze completed benchmark data
- Export Benchmark Results - Save results to file
- Cancel Running Benchmark - Stop benchmark mid-run
- Resume Incomplete Benchmark - Continue interrupted benchmark
- Upload Large Language Model - Add models to test
- Upload Embedding Model - Add embedding models
- Configure Context Window Size - Optimize based on benchmark
Technical References
Knowledge Base Sections:
- src/components/benchmarking/index.js - Benchmark orchestrator
- src/components/benchmarking/benchmark-results.js - Results display
- src/benchmarking/start-benchmark.js - Test execution coordinator
- src/benchmarking/benchmark-llm.js - LLM test runner
- src/benchmarking/benchmark-embeddings.js - Embedding test runner
- src/components/benchmarking/system-stats-card.js - System info display
Key Components:
- Benchmark test orchestration system
- Real-time results tables
- Progress tracking
- System resource monitoring
- Multi-model testing capability
Version History
| Date | Version | Author | Changes |
|---|---|---|---|
| 2025-10-04 | 1.1 | Iternal Technologies | Initial comprehensive documentation |
Notes
Important Considerations:
- Benchmark runs CPU and GPU at high utilization; computer may become hot or loud
- Results are saved and used by system for optimizing future performance
- First benchmark provides baseline; can re-run to track performance over time
- Different models will have vastly different results (expected)
- Tests are designed to stress different aspects of model performance
- Keep application in focus and avoid interruptions for best results
Test Categories Explained:
LLM Tests:
- Context Size: Tests handling different amounts of conversation history
- Non-Streaming: Tests batch processing mode (whole response at once)
- Chat History: Tests multi-turn conversation performance
- VRAM Override: Tests different memory allocations to find optimal
Embedding Tests:
- Embedding Sizes: Tests processing different text lengths
- Bulk Embedding: Tests parallel processing with multiple workers (NPU and CPU)
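To make the relationship between categories and individual tests concrete, the sketch below assembles a small, assumed test matrix. Only the 2048/4096/8192 context sizes and the naming patterns come from this document; the other sizes and structure are illustrative assumptions.

```javascript
// Hypothetical sketch of assembling a test matrix from the categories above.
const contextSizes = [2048, 4096, 8192];       // from the Context Size tests
const embeddingTextSizes = [500, 2000, 8000];  // characters (assumed values)

const llmTests = contextSizes.map((size) => ({
  category: 'Context Size',
  testName: `Context Size ${size}, 25% tokens`,
  contextSize: size,
}));

const embeddingTests = embeddingTextSizes.map((chars) => ({
  category: 'Single Embedding',
  testName: `${chars} characters, NPU`,
  characters: chars,
}));

const fullSuite = [...llmTests, ...embeddingTests];
console.log(`${fullSuite.length} tests defined`);
```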
Metrics Explained:
- Tokens per second: How fast the model generates text (higher is better)
- First token time: Latency before response starts (lower is better)
- Load time: How long model takes to initialize (lower is better)
- CPU load: Processor usage during test (lower is better, provided performance remains adequate)
- Embeddings per second: Throughput for embedding generation (higher is better)
Best Practices:
- Run benchmarks when not using computer for other tasks
- Close other applications before benchmarking for accurate results
- Run benchmark after uploading new models to understand their performance
- Keep benchmark results for documentation and comparison
- Re-benchmark periodically to track performance changes
- Use results to guide context window, VRAM, and worker configuration
Common User Questions:
- “How long does benchmark take?” - Typically 15-60 minutes depending on models and hardware
- “Can I stop and resume?” - Yes, use “Continue Benchmark” after stopping
- “Do I need to benchmark every model?” - Recommended, but not required; helps optimize settings
- “What’s a good tokens-per-second rate?” - Varies by hardware; 10-50 tokens/sec is typical for consumer hardware
- “Why did some tests fail?” - Tests may exceed hardware capabilities or timeout; this helps identify limits
- “Can I use the application while benchmarking?” - Technically yes, but will slow benchmark and may affect results