arangodb
diff --git a/site/content/3.12/deploy/production-checklist.md b/site/content/3.12/deploy/production-checklist.md
Lines changed: 115 additions & 40 deletions
@@ -10,34 +10,89 @@ have been performed on your production system before you go live.
 
 ## Operating System
 
-- Executed the OS optimization scripts if you run ArangoDB on Linux.
+- Executed the operating system (OS) optimization scripts if you run ArangoDB on Linux.
   See [Installing ArangoDB on Linux](../operations/installation/linux/_index.md) and its sub pages
   [Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and
   [Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details.
 
-- OS monitoring is in place
-  (most common metrics, e.g. disk, CPU, RAM utilization).
+- Ensure your OS is compatible with your ArangoDB version
+  and keep it up to date at all times for security and stability.
+
+- OS monitoring is in place with specific alerting thresholds:
+  - **Disk usage**: Alert when reaching 60% (red line threshold).
+  - **CPU usage**: Alert when reaching 90% (red line threshold).
+  - **Memory usage**: Alert when reaching 85% (red line threshold).
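As a rough illustration, the three red-line thresholds above could be encoded as a small check. This is only a sketch: the `THRESHOLDS` table and the `breached` helper are hypothetical names, and a real deployment would normally express these rules in its monitoring system instead.

```python
# Red-line thresholds from the checklist, in percent of capacity.
THRESHOLDS = {"disk": 60.0, "cpu": 90.0, "memory": 85.0}


def breached(usage: dict[str, float]) -> list[str]:
    """Return the names of all resources at or above their red line."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if usage.get(name, 0.0) >= limit
    )


if __name__ == "__main__":
    # Hypothetical utilization readings, in percent.
    print(breached({"disk": 72.5, "cpu": 40.0, "memory": 85.0}))
```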
 
 - Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations.
 
 ## ArangoDB
 
-- The user _root_ is not used to run any ArangoDB processes
+- **Use the latest versions**: Deploy the latest version series
+  of ArangoDB to benefit from performance improvements and security fixes.
+
+- **Testing environments**: Use QA environments and UAT (User Acceptance Testing)
+  to test all changes, in particular queries, before going live with production deployments.
+
+### Security
+
+- Create a dedicated system user and group (e.g., "arango")
+  to run ArangoDB processes. Never use the _root_ user to run any ArangoDB processes
   (if you run ArangoDB on Linux).
 
+- **Access control**: Restrict access to the deployment to authorized personnel only.
+  Implement proper authentication and authorization mechanisms.
+
+- **JWT authentication**: Enable JWT authentication
+  for production deployments. See [JWT authentication](../develop/http-api/authentication.md#jwt-user-tokens) for more details.
+
+- **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md)
+  for sensitive data. Make sure to safely store any secret keys you create for this.
+
+### Logging and Monitoring
+
 - The _arangod_ (server) process and the _arangodb_ (_Starter_) process
   (if in use) have some form of logging enabled and logs can easily be
   located and inspected.
-  
-- *Memory considerations*
-  - If you run multiple processes (e.g. DB-Server and Coordinator) on a single
-    machine, adjust the [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md)
-    environment variable accordingly.
-  - For versions prior to 3.8, make sure to change the
-    [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit)
-    query option according to the node size and workload.
-  - Disable swap space to avoid slowdown which can result in servers being incorrectly 
-    detected as failed.
+
+- **Third-party monitoring**: Configure third-party metrics monitoring tools like
+  Grafana with Prometheus to monitor ArangoDB metrics comprehensively.
+
+- **Configure metrics collection**: Enable the ArangoDB metrics API for production monitoring:
+  - Set [`--server.export-metrics-api`](../components/arangodb-server/options.md#--serverexport-metrics-api) to `true` to enable the metrics endpoints.
+  - Enable [`--server.export-read-write-metrics`](../components/arangodb-server/options.md#--serverexport-read-write-metrics) for additional document read/write metrics.
+  - Consider enabling [`--server.export-shard-usage-metrics`](../components/arangodb-server/options.md#--serverexport-shard-usage-metrics) for detailed shard usage tracking.
+  - Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint.
+  - See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information.
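For a quick manual check that the metrics endpoint is reachable before wiring up Prometheus, something like the following sketch may help. It assumes the endpoint returns the Prometheus text format without trailing timestamps and that a JWT is passed as a bearer token; `parse_metrics`, `fetch_metrics`, and the URL in the example are illustrative names, not part of ArangoDB.

```python
import urllib.request


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text format into {sample name with labels: value}.

    Assumes each sample line ends with its value (no trailing timestamp).
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def fetch_metrics(base_url: str, token: str) -> dict[str, float]:
    """Fetch and parse /_admin/metrics/v2 from one server."""
    request = urllib.request.Request(
        f"{base_url}/_admin/metrics/v2",
        headers={"Authorization": f"bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Example (hypothetical host and token):
# metrics = fetch_metrics("http://localhost:8529", "<jwt>")
```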
+
+- **Enable RocksDB statistics**: Consider setting [`--rocksdb.enable-statistics`](../components/arangodb-server/options.md#--rocksdbenable-statistics) to `true` for detailed RocksDB performance metrics.
+
+- Monitor the metrics that ArangoDB provides and set up alerting based on the following threshold guidelines:
+  - Disk usage: 60% (red line)
+  - CPU usage: 90% (red line)
+  - Memory usage: 85% (red line)
+
+### Memory
+
+- For DB-Servers and Coordinators, set the
+  [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md)
+  environment variable using this rule of thumb:
+  - Multiply the available memory by 0.9 to leave headroom for the OS/Kubernetes, client connections, etc.
+  - Use 3/4 of that value for DB-Servers.
+  - Use 1/4 of that value for Coordinators.
+  - Agents typically don't need much memory and can use the remaining 10% headroom.
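The rule of thumb above can be worked through in a few lines. This is a sketch only: `memory_overrides` is a hypothetical helper, the 64 GiB machine is an example, and it assumes one DB-Server and one Coordinator share the machine and that the variable is set to a plain byte count.

```python
def memory_overrides(total_bytes: int) -> dict[str, int]:
    """Split a machine's memory between one DB-Server and one Coordinator."""
    usable = total_bytes * 9 // 10  # keep 10% headroom for OS, Agents, etc.
    return {
        "dbserver": usable * 3 // 4,  # 3/4 of the usable memory
        "coordinator": usable // 4,   # 1/4 of the usable memory
    }


GIB = 1024 ** 3

if __name__ == "__main__":
    # Example: a machine with 64 GiB of RAM.
    for role, value in memory_overrides(64 * GIB).items():
        print(f"{role}: ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY={value}")
```

Integer arithmetic is used so the results are exact byte counts that can be passed to each process's environment as-is.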
+
+- Note that if ArangoDB "sees" x GB of memory in a pod,
+  it will try to use those x GB. Memory accounting has been vastly improved in 3.12,
+  but overshooting in certain cases may still occur.
+
+- Disable swap space to avoid slowdown which can result in servers being incorrectly 
+  detected as failed.
+
+- **Query memory limits**: Configure appropriate memory limits for AQL queries:
+  - Set [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) to limit memory usage per individual query.
+  - Consider setting [`--query.global-memory-limit`](../components/arangodb-server/options.md#--queryglobal-memory-limit) to limit total memory used by all concurrent queries.
+
+### Service Management
 
 - Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically
   you would use the Kubernetes operator or use systemd to launch the _Starter_.
@@ -50,36 +105,56 @@ have been performed on your production system before you go live.
   update-rc.d -f arangodb3 remove
   ```
 
-- If you have deployed a Cluster, the _replication factor_  and 
-  _minimal_replication_factor_ of your collections
-  are set to a value equal or higher than 2, otherwise you run the risk of
-  losing data in case of a node failure. See
-  [cluster startup options](../components/arangodb-server/options.md#cluster).
-
-- *Disk Performance considerations*
-  - Verify that your **storage performance** is at least 100 IOPS for each
-    volume in production mode. This is the bare minimum and it's recommended to
-    provide more for performance. It is probably only a concern if you use a
-    cloud infrastructure. Note that IOPS might be allotted based on a volume size,
-    so make sure to check your storage provider for details. Furthermore, you should
-    be careful with burst mode guarantees as ArangoDB requires a sustainable
-    high IOPS rate. 
-
-  - The considerations should be given to an IO bandwidth (especially considering 
-    RocksDB write-amplification which can easily be 10x or more).
-
-- Whenever possible use **block storage**. Database data is based on append
-  operations, so filesystem which support this should be used for best
-  performance. We would not recommend to use NFS for performance reasons,
+### Cluster Configuration
+
+- **Replication configuration**: For production clusters, configure collections with:
+  - _replication factor_ of 3 for optimal data availability and fault tolerance.
+  - _minimal_replication_factor_ set to a value equal to or higher than 2.
+  - _writeConcern_ of 2.
+  See [cluster startup options](../components/arangodb-server/options.md#cluster).
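To audit existing collections against these settings, a check along the following lines could be run over the collection properties returned by the server. It is a sketch under assumptions: `replication_problems` is a hypothetical helper, and the property names `replicationFactor` and `writeConcern` are taken to be the keys of the collection properties object.

```python
def replication_problems(props: dict) -> list[str]:
    """List deviations from the recommended replication settings."""
    problems: list[str] = []
    replication_factor = props.get("replicationFactor", 1)
    if replication_factor == "satellite":
        # SatelliteCollections are replicated to every DB-Server.
        return problems
    if replication_factor < 3:
        problems.append(f"replicationFactor is {replication_factor}, expected 3")
    write_concern = props.get("writeConcern", 1)
    if write_concern < 2:
        problems.append(f"writeConcern is {write_concern}, expected 2")
    return problems


if __name__ == "__main__":
    # Hypothetical collection properties.
    for issue in replication_problems({"name": "orders", "replicationFactor": 2}):
        print(issue)
```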
+
+- **Shard limits**: Keep the total number of shards below 10,000 across your cluster
+  to maintain optimal performance and avoid resource exhaustion.
+
+### Disk Performance
+
+- **Storage performance**: Verify that your storage performance is at least 100 IOPS for each
+  volume in production mode. This is the bare minimum and it's recommended to
+  provide more for performance. It is probably only a concern if you use a
+  cloud infrastructure. Note that IOPS might be allotted based on volume size,
+  so make sure to check your storage provider for details. Furthermore, you should
+  be careful with burst mode guarantees as ArangoDB requires a sustainable
+  high IOPS rate.
+
+- **DB-Server storage limit**: Keep the storage of each individual DB-Server below 2 TB to maintain optimal performance.
+
+- **I/O bandwidth**: Take I/O bandwidth into account, especially given
+  RocksDB write amplification, which can easily be 10x or more.
+
+- **Block storage**: Whenever possible use block storage. Database data is based on append
+  operations, so filesystems which support this should be used for best
+  performance. ArangoDB does not recommend using NFS for performance reasons,
   furthermore we experienced some issues with hard links required for
   Hot Backup.
 
-- Verify your **Backup** and restore procedures are working.
+### Backup and Recovery
+
+- **Test restore procedures**: Verify that your backup and restore procedures are working.
+  Test your restore procedure regularly to ensure you can recover from failures.
+
+- **Hot Backup frequency**: Take Hot Backups with a frequency that matches your
+  RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements.
+
+- **arangodump backups**: Take backups with arangodump periodically as an
+  additional backup strategy alongside Hot Backups.
 
-- Consider enabling [Encryption at Rest](../operations/security/encryption-at-rest.md).
-  Make sure to safely store any secret keys you create for this.
+- **Secure backup storage**: Store backups in a secure, separate location from your
+  production systems. Use encrypted storage and ensure backups are geographically
+  distributed to protect against regional disasters. Implement proper access controls
+  for backup storage locations.
 
-- Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana).
+- **Retry mechanisms**: Implement retries with exponential backoff and jitter in your applications
+  when connecting to ArangoDB to handle temporary network issues and failovers gracefully.
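A minimal sketch of such a retry loop is shown below. It is not tied to any ArangoDB driver: `retry_with_backoff` and its parameters are hypothetical, the default delays are arbitrary, and it uses the "full jitter" variant in which each wait is drawn uniformly between zero and the current exponential cap.

```python
import random
import time


def retry_with_backoff(
    operation,
    attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 10.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call operation() until it succeeds or the attempts are used up."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential cap (base, 2x, 4x, ...) limited by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the cap.
            sleep(random.uniform(0, cap))
```

A caller would wrap a single request in a function or lambda and pass it as `operation`. Only errors that are safe to retry should be listed in `retry_on`, and retried write operations should be idempotent.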
 
 ## Kubernetes Operator (kube-arangodb)
 
@@ -89,4 +164,4 @@ have been performed on your production system before you go live.
 - The [**ReclaimPolicy**](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming)
  of your persistent volumes should be set to `Retain` to prevent volumes from premature deletion.
 
-- Use native networking whenever possible to reduce delays.
+- Use native networking whenever possible to reduce delays.