Run Full Benchmark Suite
Overview
Flow ID: run-full-benchmark
Category: Performance Benchmarking
Estimated Duration: 15-60 minutes (depends on models and hardware)
User Role: All Users (though technical users benefit most)
Complexity: Moderate
Purpose: This flow enables users to run comprehensive performance tests on their AI models to understand how well they perform on their specific hardware. Benchmarking measures speed, throughput, latency, and resource usage across various test scenarios, helping users make informed decisions about model selection and configuration.
Trigger
What initiates this flow:
- User manually initiates
Specific trigger: User wants to test model performance, typically because:
- They just uploaded a new model and want to know how it performs
- They want to compare different models to choose the best one
- They’re experiencing performance issues and want diagnostic data
- They want to optimize settings (context window, memory allocation)
- They’re documenting system capabilities for deployment planning
Prerequisites
Before starting, users must have:
- Application installed and running
- At least one chat model uploaded
- At least one embedding model uploaded (for full suite)
- Optional: Dataset uploaded for embedding tests
- Time to let tests run uninterrupted (15-60 minutes)
- Application window should remain open and in focus (recommended)
User Intent Analysis
Primary Intent
Measure and document how well AI models perform on the current hardware to make informed decisions about model selection, configuration, and deployment.
Secondary Intents
- Compare multiple models to identify the best one
- Identify performance bottlenecks
- Validate hardware capabilities before production use
- Generate performance reports for documentation
- Optimize settings based on measured data
- Troubleshoot slow performance issues
Subintents
- Understand tokens per second (speed) for different scenarios
- Measure memory usage and resource consumption
- Test context window handling
- Validate stability under load
- Collect data for future reference
Step-by-Step Flow
Main Path (Happy Path)
Step 1: Navigate to Benchmarking Section
- User Action: Click “Settings” in navigation, then click “Benchmarking” tab
- System Response: Benchmarking page loads
- UI Elements Visible:
- “Benchmarking” tab highlighted
- Page divided into two main areas:
- Left panel: System information and status
- Right panel: Benchmark results tables
- “Run Benchmark” button (prominent, likely blue)
- System hardware information displayed
- Any previous benchmark results shown in tables
- Visual Cues:
- Clean dashboard layout
- Clear call-to-action button
Step 2: Review System Information
- User Action: Review system specs displayed in left panel
- System Response: Hardware information shown
- UI Elements Visible:
- System info card showing:
- Operating System
- CPU model and core count
- RAM amount
- GPU information (model, VRAM)
- NPU availability (Yes/No)
- Current date/time
- Previous benchmark date (if any)
- Visual Cues:
- Clean card layout
- Icons for each hardware component
- Clear labels
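The following is a minimal, hypothetical sketch of how the hardware details in this panel could be gathered in a Node.js/Electron context. The function and field names are illustrative assumptions; GPU and NPU detection require platform-specific APIs and are shown only as placeholders.

```javascript
// Hypothetical sketch: gathering the system info shown in the left panel.
// Assumes a Node.js/Electron context. GPU and NPU detection are
// platform-specific and represented here as placeholders.
const os = require('os');

function getSystemInfo() {
  return {
    operatingSystem: `${os.type()} ${os.release()}`,
    cpuModel: os.cpus()[0].model,
    cpuCores: os.cpus().length,
    totalRamGb: Math.round(os.totalmem() / 1024 ** 3),
    gpu: 'requires a platform-specific API (placeholder)',
    npuAvailable: false, // placeholder; real detection is hardware-specific
    timestamp: new Date().toISOString(),
  };
}

console.log(getSystemInfo());
```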
Step 3: Check Models Configured
- User Action: Verify which models will be tested (shown in results table or header)
- System Response: Displays models that will be benchmarked
- UI Elements Visible:
- Model sections in results table:
- Chat model section (e.g., “Llama-3-8B”)
- Embedding model section (e.g., “Jina Embeddings”)
- Each model has expandable test categories
- Status indicators for each test (Pending, Completed, etc.)
- Visual Cues:
- Model names in section headers
- Expandable arrows or indicators
Step 4: Initiate Benchmark
- User Action: Click “Run Benchmark” button
- System Response:
- Button changes to “Running…” or becomes disabled
- Loading model phase begins
- System info panel updates to show “Running” status
- Progress information appears
- Stop button becomes available
- UI Elements Visible:
- “Stop Benchmark” button replaces “Run Benchmark”
- Status text: “Running Benchmark…”
- Progress section showing:
- Start time
- Overall progress (X of Y tests completed)
- Progress bar
- Test status cells begin updating
- Visual Cues:
- Blue or animated indicators for active state
- Progress bar begins filling
- Status text updates
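As a rough illustration of the progress state described in this step, the sketch below tracks completed tests, elapsed time, and a percentage suitable for driving a progress bar. The class and property names are assumptions, not the application's actual API.

```javascript
// Hypothetical sketch of the progress state the UI could track during a run.
class BenchmarkProgress {
  constructor(totalTests) {
    this.totalTests = totalTests;
    this.completedCount = 0;
    this.startTime = Date.now();
  }

  markTestComplete() {
    this.completedCount += 1;
  }

  get percentComplete() {
    return Math.round((this.completedCount / this.totalTests) * 100);
  }

  get elapsedSeconds() {
    return Math.round((Date.now() - this.startTime) / 1000);
  }

  get statusText() {
    return `${this.completedCount} of ${this.totalTests} tests completed (${this.percentComplete}%)`;
  }
}

// Example: after 15 of 30 tests, statusText reads "15 of 30 tests completed (50%)"
```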
Step 5: Model Loading Phase
- User Action: Wait while model loads (30-90 seconds)
- System Response:
- Status shows “Loading model…”
- Progress updates
- UI Elements Visible:
- Loading status text
- May show percentage
- Name of the model being loaded
- Visual Cues:
- Animated loading indicators
- Progress bar advances slightly
Step 6: LLM Tests Execute
- User Action: Watch as chat model tests run automatically
- System Response:
- Series of tests execute sequentially:
- Context Size tests (different sizes: 2048, 4096, 8192 tokens)
- Non-Streaming tests
- Chat History tests (multi-turn simulation)
- Memory Override tests (different VRAM allocations)
- Each test updates table cell with results
- Progress bar advances
- Real-time metrics displayed
- UI Elements Visible:
- Results table with test categories expanded
- Each test row showing:
- Test name (e.g., “Context Size 4096, 25% tokens”)
- Status (changes from Pending → Loading → In Progress → Completed)
- Load time (seconds)
- First token time
- Generation time
- Input/output tokens
- Tokens per second
- CPU usage percentage
- Tests complete with green “Completed” badges
- Failed tests show red “Error” badges
- Current test highlighted or animated
- Visual Cues:
- Color-coded status badges
- Animated current test
- Numbers populate in real-time
- Progress percentage increases
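To illustrate how the per-test metrics in this table relate to each other, here is a minimal sketch that derives them from raw timings. Field names are illustrative; the application's actual result schema may differ.

```javascript
// Hypothetical sketch of deriving per-test LLM metrics from raw timings.
function computeLlmMetrics({ loadMs, firstTokenMs, generationMs, outputTokens }) {
  return {
    loadTimeSec: (loadMs / 1000).toFixed(2),
    firstTokenTimeSec: (firstTokenMs / 1000).toFixed(2),
    generationTimeSec: (generationMs / 1000).toFixed(2),
    // Tokens per second = output tokens / generation time in seconds
    tokensPerSecond: (outputTokens / (generationMs / 1000)).toFixed(1),
  };
}

// Example: 512 output tokens generated in 25.6 seconds ≈ 20 tokens/sec
console.log(computeLlmMetrics({
  loadMs: 42000,
  firstTokenMs: 850,
  generationMs: 25600,
  outputTokens: 512,
}));
```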
Step 7: Monitor LLM Test Progress
- User Action: Observe tests completing; users can switch to other applications and work on other tasks
- System Response: Tests continue automatically
- UI Elements Visible:
- Tests completing one by one
- Results populating table
- Progress: “15 of 30 tests completed” or similar
- Elapsed time counter
- Overall progress bar (may be at 30-40% after LLM tests)
- Visual Cues:
- Smooth progression through tests
- Rows change color as they complete
- Progress bar incrementally advances
Step 8: Embedding Tests Begin
- User Action: Continue waiting/monitoring
- System Response:
- After LLM tests complete, embedding tests start
- Embedding model loads
- Embedding test categories execute:
- Single Embedding tests (various text sizes)
- Bulk Embedding tests (different batch sizes and worker counts)
- Table updates with embedding results
- UI Elements Visible:
- Embedding model section becomes active
- Test rows for embedding benchmarks
- Results showing:
- Test name (e.g., “500 characters, NPU”)
- Status
- Load time
- Embedding time
- Workers used (NPU/CPU counts)
- Embeddings per second
- CPU load
- Progress continues: “20 of 30 tests completed”
- Visual Cues:
- Focus shifts to embedding section
- Similar visual updates as LLM tests
- Progress bar advances further (60-80% range)
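The embedding throughput figure works the same way as tokens per second. The sketch below, using assumed field and worker names, shows the calculation.

```javascript
// Hypothetical sketch of the throughput calculation for a bulk embedding test.
function computeEmbeddingMetrics({ textsEmbedded, embeddingMs, npuWorkers, cpuWorkers }) {
  return {
    embeddingTimeSec: (embeddingMs / 1000).toFixed(2),
    workers: { npu: npuWorkers, cpu: cpuWorkers },
    // Embeddings per second = texts embedded / embedding time in seconds
    embeddingsPerSecond: (textsEmbedded / (embeddingMs / 1000)).toFixed(1),
  };
}

// Example: 200 texts embedded in 12.5 seconds ≈ 16 embeddings/sec
console.log(computeEmbeddingMetrics({
  textsEmbedded: 200,
  embeddingMs: 12500,
  npuWorkers: 1,
  cpuWorkers: 4,
}));
```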
Step 9: Monitor System Performance
- User Action: Optionally watch system stats in left panel during tests
- System Response: Real-time system usage displayed
- UI Elements Visible:
- System panel showing current:
- CPU usage (percentage)
- Memory usage
- GPU usage (if applicable)
- Temperature (possibly)
- May show real-time graphs or charts
- Updates every few seconds
- Visual Cues:
- Live-updating numbers
- Graphs or meters showing resource usage
- Color coding for high usage (may turn red if excessive)
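A live panel like this is typically driven by periodic sampling. The sketch below is an assumed approach using Node's os module; GPU usage and temperature require platform-specific APIs and are omitted.

```javascript
// Hypothetical sketch of periodic resource sampling behind the live system panel.
const os = require('os');

function sampleSystemUsage() {
  const total = os.totalmem();
  const free = os.freemem();
  return {
    memoryUsedPercent: Math.round(((total - free) / total) * 100),
    loadAverage1m: os.loadavg()[0], // rough CPU pressure indicator (always 0 on Windows)
    sampledAt: new Date().toISOString(),
  };
}

// Poll every few seconds while the benchmark runs
const monitor = setInterval(() => console.log(sampleSystemUsage()), 5000);
// Call clearInterval(monitor) once the benchmark completes
```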
Step 10: All Tests Complete
- User Action: No action required
- System Response:
- Final test completes
- Progress reaches 100%
- Status changes to “Completed”
- “Stop Benchmark” button becomes “Run Benchmark” again
- Completion time and duration displayed
- UI Elements Visible:
- All test rows showing “Completed” status (or some may show “Error” if failed)
- Full results table populated
- Summary statistics:
- Total tests: 30 (or actual count)
- Completed: X
- Failed: Y
- Duration: “25 minutes 43 seconds”
- Export button available
- Reset button available
- Visual Cues:
- Green completion indicators
- Progress bar at 100%
- No more animations
- Results table fully populated
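The summary statistics shown at completion can be derived directly from the per-test results. The sketch below assumes status strings matching those shown in the UI; the real schema may differ.

```javascript
// Hypothetical sketch of deriving the completion summary from a results array.
function summarizeResults(results, startedAtMs, finishedAtMs) {
  const completed = results.filter((r) => r.status === 'Completed').length;
  const failed = results.filter((r) => r.status === 'Error').length;
  const durationSec = Math.round((finishedAtMs - startedAtMs) / 1000);
  return {
    totalTests: results.length,
    completed,
    failed,
    duration: `${Math.floor(durationSec / 60)} minutes ${durationSec % 60} seconds`,
  };
}
```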
Step 11: Review Results
- User Action: Scroll through results tables to examine performance data
- System Response: Results displayed with all metrics
- UI Elements Visible:
- Expandable test categories (click to expand/collapse)
- Detailed metrics for each test
- Color-coded performance indicators
- Comparison data between tests
- Visual Cues:
- Data presented in organized table format
- Numbers aligned for easy comparison
- Categories clearly separated
Step 12: Optional - Export Results
- User Action: Click “Export Results” or “Download” button
- System Response:
- Results exported as TSV (tab-separated values) file
- File downloads to computer
- UI Elements Visible:
- Export button
- Download dialog or browser notification
- Visual Cues: Download progress indicator
- Note: See export-benchmark-data.md for detailed export flow
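As a rough illustration of what a TSV export involves (see export-benchmark-data.md for the actual flow), the sketch below joins assumed column names with tab characters and triggers a browser-style download. Column and function names are hypothetical.

```javascript
// Hypothetical sketch of a TSV export; column names are illustrative only.
function resultsToTsv(results) {
  const columns = ['testName', 'status', 'loadTimeSec', 'tokensPerSecond', 'cpuLoadPercent'];
  const header = columns.join('\t');
  const rows = results.map((r) => columns.map((c) => r[c] ?? '').join('\t'));
  return [header, ...rows].join('\n');
}

// In a browser/Electron renderer, trigger a download of the TSV string
function downloadTsv(tsv, filename = 'benchmark-results.tsv') {
  const blob = new Blob([tsv], { type: 'text/tab-separated-values' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = filename;
  link.click();
  URL.revokeObjectURL(link.href);
}
```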
Final Step: Benchmark Complete with Results
- Success Indicator:
- All tests completed (or failed tests documented)
- Full results table populated with data
- Can export and reference results
- Performance characteristics understood
- System State Change:
- Benchmark results saved in database
- Performance data available for system optimization
- Results used to estimate future processing times
- Can compare with future benchmarks
- Next Possible Actions:
- Export results for documentation
- Compare models based on benchmark data
- Adjust settings based on findings (e.g., optimal context window size)
- Re-run specific tests to verify results
- Reset and run fresh benchmark with different models
- Use benchmark data to configure chat settings optimally
Alternative Paths & Strategies
Strategy A: Continue Partial Benchmark
When to use: Benchmark was interrupted previously
Steps:
- Navigate to Benchmarking tab
- See incomplete test results with some tests marked “Cancelled” or not started
- Click “Continue Benchmark” button (if available)
- System resumes from where it stopped
- Only incomplete tests re-run
- Total time reduced compared to full re-run
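Conceptually, resuming only needs to filter out tests that already completed. The sketch below is an assumed illustration; status values and field names mirror those described in this flow but are not the application's actual API.

```javascript
// Hypothetical sketch: select only the tests that still need to run on resume.
function getRemainingTests(allTests, previousResults) {
  const finished = new Set(
    previousResults
      .filter((r) => r.status === 'Completed')
      .map((r) => r.testName)
  );
  return allTests.filter((t) => !finished.has(t.testName));
}
```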
Strategy B: Quick Single-Test Benchmark
When to use: Just want to test one specific scenario
Steps:
- Navigate to Benchmarking
- Expand test category
- Click “Re-test” on specific test row
- Single test runs
- Results update for just that test
- Much faster than full suite (30 seconds - 2 minutes)
Strategy C: Model-Specific Benchmark
When to use: Only care about one model (LLM or embedding), not both
Steps:
- Start benchmark normally
- When first model completes, click “Stop Benchmark”
- Accept partial results
- Results for the first model are complete
- Tests for the second model remain “Not Started”
- Saves time if only testing one model
Strategy D: Background Benchmark
When to use: Want to work on other tasks while benchmark runs
Steps:
- Start benchmark
- Navigate to other pages or minimize application
- Benchmark continues in background
- Periodically check progress via status badge or returning to benchmarking page
- Return when complete to view results
QA Note: Background operation may be slower or less reliable; keeping the application in focus is recommended.
Error States & Recovery
Error 1: Model Load Failure
Cause: Insufficient memory, corrupted model, or system error
User Experience:
- Error during model loading phase
- Message: “Failed to load model” or “Model error”
- Benchmark stops
- Test marked as “Error”
Recovery Steps:
- Close other applications to free memory
- Try running benchmark again
- Try different model if specific model consistently fails
- Check model file integrity
- Restart application if needed
Error 2: Individual Test Timeout
Cause: Test takes longer than maximum allowed time (4 minutes)
User Experience:
- Test running indicator for long time
- Eventually shows “Error” or “Timeout”
- Message: “Test exceeded maximum time”
- Benchmark continues to next test
Recovery Steps:
- Note which test timed out
- Allow benchmark to continue (other tests may succeed)
- Test may be too demanding for hardware
- Consider not using that context size or configuration
- Can manually re-test later to verify
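A per-test timeout like the 4-minute limit described above is commonly implemented by racing the test against a timer. The sketch below is a generic illustration, not the application's actual implementation.

```javascript
// Hypothetical sketch of a per-test timeout guard (4 minutes, as described above).
const TEST_TIMEOUT_MS = 4 * 60 * 1000;

async function runWithTimeout(testFn, timeoutMs = TEST_TIMEOUT_MS) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Test exceeded maximum time')), timeoutMs);
  });
  try {
    // Whichever settles first wins: the test result or the timeout rejection
    return await Promise.race([testFn(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```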
Error 3: Benchmark Stops Unexpectedly
Cause: Application crash, system sleep, or user accidentally stopped it
User Experience:
- Progress stops mid-benchmark
- Some tests marked “Cancelled”
- Incomplete results
Recovery Steps:
- Check application status (did it crash?)
- Restart application if needed
- Return to Benchmarking tab
- Use “Continue Benchmark” to resume (if available)
- Or reset and run full benchmark again
Error 4: No Test Data File Available
Cause: Missing benchmark test data file used for consistent testing
User Experience:
- Error message at top of page: “Test data not loaded” or similar
- “Run Benchmark” button may be disabled
- Warning that benchmarks cannot run
Recovery Steps:
- Check internet connection (if test data needs download)
- Refresh page to retry loading test data
- Verify application installation is complete
- Reinstall application if file consistently missing
- Contact support if issue persists
Error 5: Out of Memory During Tests
Cause: System runs out of RAM during intensive testing
User Experience:
- Application becomes unresponsive
- May crash or freeze
- Benchmark stops with error
- Possible system warnings about low memory
Recovery Steps:
- Close other applications before running benchmark
- Restart application
- Try benchmarking smaller model
- Skip memory-intensive tests (large context sizes)
- Upgrade system RAM if needed for desired models
Pain Points & Friction
Identified Issues:
Long Duration Without Clear Time Estimate
- Impact: Users don’t know if benchmark will take 15 minutes or 2 hours
- Frequency: Every benchmark run
- Potential Improvement:
- Show estimated total time before starting
- Display time remaining during execution
- Warn about duration for large model/context combinations
- Allow selecting subset of tests for faster results
Cannot Pause Mid-Benchmark
- Impact: If users need their computer, they must stop the benchmark and lose progress
- Frequency: Long benchmarks that need interruption
- Potential Improvement:
- Add pause/resume functionality
- Save progress to resume later
- Allow partial completion with option to continue
Results Table Overwhelming
- Impact: Many rows and columns of numbers can be confusing
- Frequency: Viewing results, especially first time
- Potential Improvement:
- Add summary view with key highlights
- Visual indicators for good/bad performance
- Comparison to typical values
- Graphs/charts alongside tables
Unclear What “Good” Results Look Like
- Impact: Users see numbers but don’t know if performance is acceptable
- Frequency: All users without technical background
- Potential Improvement:
- Add benchmarks or baselines for comparison
- Color code results (green=good, yellow=ok, red=poor)
- Provide interpretation: “This is excellent/good/poor for this hardware”
- Show comparisons to similar systems
System Becomes Sluggish During Tests
- Impact: Difficult to use computer for other tasks while benchmark runs
- Frequency: On lower-spec systems or with large models
- Potential Improvement:
- Warn users before starting that system will be under load
- Provide “low-impact” mode that runs slower but allows other work
- Recommend running overnight or during breaks
- Better resource throttling
No Automatic Re-Test for Failed Tests
- Impact: Failed tests require manual retry
- Frequency: When occasional failures occur
- Potential Improvement:
- Auto-retry failed tests once
- Option to “Retry Failed Tests” in batch
- Identify and explain failure patterns
Design Considerations
Following Contextual Design Principles:
Automation Opportunities:
- Auto-run lightweight benchmark when new model uploaded
- Auto-retry failed tests once
- Auto-export results when complete
- Auto-recommend optimal settings based on results
Simplification Opportunities:
- Provide “Quick Benchmark” with essential tests only
- Hide advanced test categories by default
- Summary view before detailed results
- One-click benchmark with smart defaults
Transition Smoothness:
- Clear progress indication throughout
- Smooth test transitions
- Results populate incrementally
- Easy to stop and resume
User Trust:
- Accurate progress estimation
- Reliable test execution
- Verifiable results
- Clear success/failure for each test
Cognitive Load:
- Don’t require users to understand technical metrics
- Provide interpretations of results
- Visual progress indicators
- Clear start/stop controls
Related Flows
- View Benchmark Results - Analyze completed benchmark data
- Export Benchmark Results - Save results to file
- Cancel Running Benchmark - Stop benchmark mid-run
- Resume Incomplete Benchmark - Continue interrupted benchmark
- Upload Large Language Model - Add models to test
- Upload Embedding Model - Add embedding models
- Configure Context Window Size - Optimize based on benchmark
Technical References
Knowledge Base Sections:
- src/components/benchmarking/index.js - Benchmark orchestrator
- src/components/benchmarking/benchmark-results.js - Results display
- src/benchmarking/start-benchmark.js - Test execution coordinator
- src/benchmarking/benchmark-llm.js - LLM test runner
- src/benchmarking/benchmark-embeddings.js - Embedding test runner
- src/components/benchmarking/system-stats-card.js - System info display
Key Components:
- Benchmark test orchestration system
- Real-time results tables
- Progress tracking
- System resource monitoring
- Multi-model testing capability
Version History
| Date | Version | Author | Changes |
|---|---|---|---|
| 2025-10-04 | 1.1 | Iternal Technologies | Initial comprehensive documentation |
Notes
Important Considerations:
- Benchmark runs CPU and GPU at high utilization; computer may become hot or loud
- Results are saved and used by system for optimizing future performance
- First benchmark provides baseline; can re-run to track performance over time
- Different models will have vastly different results (expected)
- Tests are designed to stress different aspects of model performance
- Keep application in focus and avoid interruptions for best results
Test Categories Explained:
LLM Tests:
- Context Size: Tests handling different amounts of conversation history
- Non-Streaming: Tests batch processing mode (whole response at once)
- Chat History: Tests multi-turn conversation performance
- VRAM Override: Tests different memory allocations to find optimal
Embedding Tests:
- Embedding Sizes: Tests processing different text lengths
- Bulk Embedding: Tests parallel processing with multiple workers (NPU and CPU)
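To make the relationship between categories and individual tests concrete, the sketch below assembles a small, assumed test matrix. Only the 2048/4096/8192 context sizes and the naming patterns come from this document; the other sizes and structure are illustrative assumptions.

```javascript
// Hypothetical sketch of assembling a test matrix from the categories above.
const contextSizes = [2048, 4096, 8192];       // from the Context Size tests
const embeddingTextSizes = [500, 2000, 8000];  // characters (assumed values)

const llmTests = contextSizes.map((size) => ({
  category: 'Context Size',
  testName: `Context Size ${size}, 25% tokens`,
  contextSize: size,
}));

const embeddingTests = embeddingTextSizes.map((chars) => ({
  category: 'Single Embedding',
  testName: `${chars} characters, NPU`,
  characters: chars,
}));

const fullSuite = [...llmTests, ...embeddingTests];
console.log(`${fullSuite.length} tests defined`);
```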
Metrics Explained:
- Tokens per second: How fast the model generates text (higher is better)
- First token time: Latency before response starts (lower is better)
- Load time: How long model takes to initialize (lower is better)
- CPU load: Processor usage during test (lower is better, provided performance remains adequate)
- Embeddings per second: Throughput for embedding generation (higher is better)
Best Practices:
- Run benchmarks when not using computer for other tasks
- Close other applications before benchmarking for accurate results
- Run benchmark after uploading new models to understand their performance
- Keep benchmark results for documentation and comparison
- Re-benchmark periodically to track performance changes
- Use results to guide context window, VRAM, and worker configuration
Common User Questions:
- “How long does benchmark take?” - Typically 15-60 minutes depending on models and hardware
- “Can I stop and resume?” - Yes, use “Continue Benchmark” after stopping
- “Do I need to benchmark every model?” - Recommended, but not required; helps optimize settings
- “What’s a good tokens-per-second rate?” - Varies by hardware; 10-50 tokens/sec is typical for consumer hardware
- “Why did some tests fail?” - Tests may exceed hardware capabilities or timeout; this helps identify limits
- “Can I use the application while benchmarking?” - Technically yes, but will slow benchmark and may affect results