Benchmark Methodology

Our goal is to produce fair, reproducible, and meaningful benchmark results. This page documents our methodology in detail so anyone can understand, critique, and reproduce our tests.

Principles

  1. Reproducibility First - Every benchmark can be reproduced using our open-source tooling and documented configurations. We publish all scripts, configurations, and raw data.
  2. Two Configuration Modes - Each database is tested with default configurations first, and optimized configurations second, so users can see both out-of-box and tuned performance.
  3. Apples-to-Apples Comparison - We test on identical hardware configurations and measure the same metrics across all systems.
  4. Statistical Rigor - We run multiple iterations, report percentiles (p50, p95, p99), and document variance to ensure results are statistically meaningful.
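As a sketch of the statistical rigor principle above, run-to-run variance across iterations can be summarized with a mean and coefficient of variation (a minimal illustration; the function name is ours, not part of the benchmark tooling):

```python
import statistics

def summarize_iterations(tps_per_run):
    """Summarize throughput across repeated benchmark iterations.

    Returns the mean TPS and the coefficient of variation (sample
    standard deviation divided by the mean), a simple measure of
    run-to-run variance.
    """
    mean = statistics.mean(tps_per_run)
    cv = statistics.stdev(tps_per_run) / mean
    return mean, cv

# Three iterations with throughput of 100, 102, and 98 TPS:
mean, cv = summarize_iterations([100, 102, 98])
# mean is 100.0; cv is 0.02 (2% run-to-run variation)
```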

Configuration Modes

Defaults

The "Defaults" configuration tests databases with minimal changes from their out-of-box settings:

  • Install the database using the standard package manager or an official Docker image
  • Only change settings required for the benchmark to run (connection limits, authentication)
  • No performance tuning applied
  • Durability settings left at defaults

This answers: "What do I get without any tuning?"

Optimized (Coming Soon)

Optimized configurations are coming soon. We're working with database communities to develop fair, well-tuned configurations.

The "Optimized" configuration will test databases with tuned settings:

  • Memory allocation tuned for hardware (buffer pools, caches)
  • I/O settings optimized for NVMe storage
  • Connection handling tuned for benchmark concurrency
  • Configurations reviewed by database maintainers or community experts

This will answer: "What's the performance ceiling with proper tuning?"


Database Benchmarks

Database benchmarks test self-hosted database engines, measuring the raw performance of the engine itself.

Current: GitHub-Hosted Runners

Benchmarks currently run on GitHub-hosted runners (ubuntu-latest):

| Component | Specification |
| --- | --- |
| Machine | GitHub-hosted runner |
| CPU | 4 cores |
| Memory | 16 GB |
| Storage | SSD |
💡 GitHub-hosted runners have limited resources. Results are useful for relative comparisons but not indicative of production performance.

Coming Soon: Dedicated Infrastructure

We're setting up a self-hosted GitHub Actions runner on dedicated AWS infrastructure in us-east-1 for more realistic benchmarks.

Future benchmarks will run on:

| Component | Specification |
| --- | --- |
| Machine | AWS EC2 (dedicated instance) |
| CPU | 16 cores |
| Memory | 128 GB |
| Storage | 1 TB NVMe SSD (direct-attached) |
| Region | us-east-1 |

Test Procedure

1. Environment Setup

# Fresh environment via GitHub Actions
# Install database with documented version
# Apply configuration (defaults or optimized)
# Verify configuration is loaded correctly

2. Data Loading

  • Load the TPC-C dataset with configured warehouse count
  • Wait for any background processes to complete (compaction, vacuuming)
  • Verify data integrity
  • Record load time

3. Warm-up Phase

  • Run workload for 5 minutes to warm caches
  • Discard warm-up results
  • Verify system is in steady state
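One way to implement the steady-state check above (a hypothetical helper, not part of the published tooling) is to compare throughput across consecutive intervals and require the spread to stay within a small tolerance of the mean:

```python
def is_steady_state(interval_tps, tolerance=0.05):
    """Return True if throughput over consecutive intervals is stable.

    The system is considered steady when the spread between the best
    and worst interval stays within `tolerance` (5% by default) of the
    mean throughput.
    """
    mean = sum(interval_tps) / len(interval_tps)
    return (max(interval_tps) - min(interval_tps)) <= tolerance * mean

# Stable: all intervals within a few TPS of each other
print(is_steady_state([1000, 1010, 995]))   # True
# Unstable: throughput still climbing after warm-up
print(is_steady_state([600, 800, 1000]))    # False
```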

4. Measurement Phase

  • Run workload for 30 minutes
  • Record all latency samples
  • Monitor system resources (CPU, memory, I/O)

5. Cool-down and Verification

  • Allow system to quiesce
  • Verify data integrity
  • Export raw results
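The warm-up/measurement split in the procedure above can be sketched as a small harness (illustrative only; `workload` stands in for a real transaction driver):

```python
import time

def run_phases(workload, warmup_iters, measure_iters):
    """Run the workload for a warm-up phase (results discarded),
    then a measurement phase (latency samples recorded)."""
    for _ in range(warmup_iters):
        workload()                      # warm caches; nothing recorded
    samples = []
    for _ in range(measure_iters):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)  # latency in seconds
    return samples

samples = run_phases(lambda: None, warmup_iters=3, measure_iters=5)
# Only the 5 measurement iterations produce samples
```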

Current Workload: TPC-C

We currently run TPC-C style benchmarks using BenchBase (formerly OLTPBench).

Transaction Mix

| Transaction | Percentage | Description |
| --- | --- | --- |
| New Order | 45% | Create new customer orders |
| Payment | 43% | Process customer payments |
| Order Status | 4% | Query order status |
| Delivery | 4% | Process batch deliveries |
| Stock Level | 4% | Check warehouse inventory |
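A TPC-C driver picks each transaction by weighted random choice according to this mix. A minimal sketch of that selection (the names are ours, not BenchBase's API):

```python
import random

# TPC-C style transaction mix: weights match the percentages above
MIX = {
    "new_order":    45,
    "payment":      43,
    "order_status":  4,
    "delivery":      4,
    "stock_level":   4,
}

def next_transaction(rng=random):
    """Pick the next transaction type by weighted random choice."""
    names = list(MIX)
    return rng.choices(names, weights=MIX.values(), k=1)[0]

rng = random.Random(42)
draws = [next_transaction(rng) for _ in range(10_000)]
new_order_share = draws.count("new_order") / len(draws)
# new_order_share lands near 0.45
```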

Scale Factor

The dataset size scales with warehouse count:

| Warehouses | Approximate Data Size |
| --- | --- |
| 10 | ~1 GB |
| 100 | ~10 GB |
| 1000 | ~100 GB |

Default benchmarks use 100 warehouses (~10 GB) to ensure the working set exceeds typical cache sizes while keeping run times reasonable.
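From the table above, data size works out to roughly 100 MB per warehouse. A trivial sizing helper (our own back-of-the-envelope estimate, not an official formula):

```python
def estimated_size_gb(warehouses, mb_per_warehouse=100):
    """Rough TPC-C data size estimate: ~100 MB per warehouse."""
    return warehouses * mb_per_warehouse / 1000

# 100 warehouses -> ~10 GB, matching the default benchmark size
print(estimated_size_gb(100))   # 10.0
```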

Concurrency Levels

We test at multiple concurrency levels to understand scaling behavior:

  • 16 concurrent connections
  • 64 concurrent connections
  • 256 concurrent connections
  • 512 concurrent connections

Metrics Collected

Primary Metrics

| Metric | Description |
| --- | --- |
| Throughput | Transactions per second (TPS) |
| Latency p50 | Median response time |
| Latency p95 | 95th percentile response time |
| Latency p99 | 99th percentile response time |
| Latency p99.9 | 99.9th percentile response time (tail latency) |
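The percentile metrics above can be computed from raw latency samples with a nearest-rank percentile (a simple method for illustration; production tooling may interpolate differently):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based rank
    return ordered[rank - 1]

# 100 latency samples of 1..100 ms
latencies = list(range(1, 101))
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
# p50=50, p95=95, p99=99
```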

Secondary Metrics

  • CPU utilization (user, system, iowait)
  • Memory usage (used, cached, buffers)
  • Disk I/O (IOPS, throughput, latency)

Configuration Transparency

Every tested database has a published configuration file in our repository.

Defaults Example (PostgreSQL)

# Only essential changes from defaults
max_connections = 600
listen_addresses = '*'
# All other settings at PostgreSQL defaults

Optimized Example (Coming Soon)

# Tuned for 128 GB RAM, NVMe storage
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 256MB
maintenance_work_mem = 2GB
max_connections = 600
max_parallel_workers = 8
wal_level = replica
synchronous_commit = on
⚠️ We always test with durability enabled (synchronous_commit, fsync, etc.). Benchmarks with durability disabled would be noted separately.


Automation

Benchmarks are run automatically via GitHub Actions. This ensures:

  • Consistency: Same environment for every run
  • Reproducibility: Anyone can trigger the same workflow
  • Transparency: All logs and artifacts are public

Benchmark Schedule

  • Triggered manually for new database versions
  • Scheduled weekly for regression detection
  • All results automatically published to this site

Reproducing Results

All benchmarks can be reproduced using our open-source tooling:

# Clone the benchmark repository
git clone https://github.com/supabase/oltp-benchmark
 
# Run TPC-C benchmark for PostgreSQL with defaults
./run-benchmark.sh postgresql --config defaults --warehouses 100 --duration 1800
 
# Run with optimized config (when available)
./run-benchmark.sh postgresql --config optimized --warehouses 100 --duration 1800

View Source Code on GitHub


Service Provider Benchmarks (Coming Soon)

Service provider benchmarks will have a separate methodology document when released.

Service provider benchmarks will differ from database benchmarks:

| Aspect | Database Benchmark | Service Provider Benchmark |
| --- | --- | --- |
| Infrastructure | Standardized local hardware | Provider's infrastructure |
| Configuration | Defaults + Optimized | Limited to provider options |
| What's measured | Raw database performance | Complete service performance |
| Network | Local, minimal latency | Cloud networking included |

Future Benchmarks

We're considering adding:

  • YCSB: Key-value style workloads (read-heavy, write-heavy, mixed)
  • Point Lookups: Pure primary key read performance
  • Secondary Index: Query performance on non-primary indexes
  • High Concurrency: 1000+ connection scaling tests

Limitations and Caveats

⚠️ Benchmarks are not a substitute for testing your specific workload. Results may vary significantly based on your query patterns, data distribution and access patterns, network topology, and operational requirements (backups, replication, etc.).

What We Don't Test

  • Multi-region replication latency
  • Failover and recovery time
  • Operational complexity
  • Cost per transaction
  • Specific SQL feature performance