.github/pull_request_template.md
@@ -32,6 +32,7 @@ Please add a detailed description of how to review this PR.
#### API
- [ ] Verify if API specs need to be regenerated.
- [ ] Check if version updates are required (e.g., specs, Poetry, etc.).
- [ ] Query performance validated with `EXPLAIN ANALYZE` for new/modified endpoints. See [Query Performance Guide](https://github.com/prowler-cloud/prowler/blob/master/api/docs/query-performance-guide.md).
- [ ] Ensure new entries are added to [CHANGELOG.md](https://github.com/prowler-cloud/prowler/blob/master/api/CHANGELOG.md), if applicable.

### License
api/docs/query-performance-guide.md (new file, 457 lines)
@@ -0,0 +1,457 @@
# Query Performance Guide

## Overview

This guide explains how to validate query performance when developing new endpoints or modifying existing ones. **This is part of the development process**, not a separate task, just like writing unit tests.

The goal is simple: ensure PostgreSQL uses indexes correctly for the queries your code generates.

## When to Validate

You **must** validate query performance when:

- Creating a new endpoint that queries the database
- Modifying an existing query (adding filters, joins, or sorting)
- Adding new indexes
- Working on performance-critical endpoints (overviews, findings, resources)

## Profiling with Django Silk (Recommended)

[Django Silk](https://github.com/jazzband/django-silk) is the recommended way to profile queries because it captures the actual SQL generated by your code during real HTTP requests. This gives you the most accurate picture of what happens in production.

### Enabling Silk

Silk is installed as a dev dependency but disabled by default. To enable it temporarily for profiling:

#### 1. Add Silk to your local settings

In `api/src/backend/config/django/devel.py`, add at the end of the file:

```python
# Silk profiler (temporary - remove after profiling)
INSTALLED_APPS += ["silk"]  # noqa: F405
MIDDLEWARE += ["silk.middleware.SilkyMiddleware"]  # noqa: F405
```

#### 2. Add Silk URLs

In `api/src/backend/api/v1/urls.py`, add at the end:

```python
from django.conf import settings
from django.urls import include, path  # usually already imported in urls.py

if settings.DEBUG:
    urlpatterns += [path("silk/", include("silk.urls", namespace="silk"))]
```

#### 3. Run Silk migrations

```bash
cd api/src/backend
poetry run python manage.py migrate --database admin
```

#### 4. Access Silk

Start the development server and navigate to `http://localhost:8000/api/v1/silk/`.
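
If the server isn't running yet, the same entry point used elsewhere in this guide works (from `api/src/backend`):

```bash
poetry run python manage.py runserver
```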
### Using Silk

1. Make requests to the endpoint you want to profile
2. Open the Silk UI and find your request
3. Click on the request to see all SQL queries executed
4. For each query, you can see:
   - Execution time
   - Number of similar queries (N+1 detection)
   - The actual SQL with parameters
   - **EXPLAIN output** (click on a query to see it)

### Disabling Silk

After profiling, **remove the changes** you made to `devel.py` and `urls.py`. Don't commit Silk configuration to the repository.
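
One way to drop the temporary edits, assuming you have no other local changes to those files that you want to keep:

```bash
git checkout -- api/src/backend/config/django/devel.py api/src/backend/api/v1/urls.py
```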
## Manual Query Analysis with EXPLAIN ANALYZE

For quick checks, or when you need more control, you can run `EXPLAIN ANALYZE` directly.

### 1. Get Your Query

#### Option A: Using Django Shell with RLS

This approach mirrors how queries run in production with Row Level Security enabled:

```bash
cd api/src/backend
poetry run python manage.py shell
```

```python
from django.db import connection
from api.db_utils import rls_transaction
from api.models import Finding

tenant_id = "your-tenant-uuid"

with rls_transaction(tenant_id):
    # Build your queryset
    qs = Finding.objects.filter(status="FAIL").order_by("-inserted_at")[:25]

    # Force evaluation
    list(qs)

    # Get the SQL (connection.queries is only populated when DEBUG=True)
    print(connection.queries[-1]["sql"])
```
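
Alternatively, Django can produce the plan directly from a queryset via `QuerySet.explain()`, which saves copying SQL into psql. Note that `analyze=True` actually executes the query. A minimal sketch inside the same RLS context:

```python
with rls_transaction(tenant_id):
    qs = Finding.objects.filter(status="FAIL").order_by("-inserted_at")[:25]
    # Runs EXPLAIN ANALYZE through the ORM and returns the plan as text
    print(qs.explain(analyze=True))
```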
#### Option B: Print Query Without Execution

```python
from api.models import Finding

queryset = Finding.objects.filter(status="FAIL")
print(queryset.query)
```

> **Note:** This won't include RLS filters, so the actual production query will differ.

#### Option C: Enable SQL Logging

Set `DJANGO_LOGGING_LEVEL=DEBUG` in your environment:

```bash
DJANGO_LOGGING_LEVEL=DEBUG poetry run python manage.py runserver
```

### 2. Run EXPLAIN ANALYZE

Connect to PostgreSQL and run:

```sql
EXPLAIN ANALYZE <your_query>;
```

Or, with more detail:

```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) <your_query>;
```

#### Running EXPLAIN with RLS Context

To test with RLS enabled (as it runs in production), set the tenant context first. The `TRUE` argument makes the setting transaction-local, so run both statements inside one transaction:

```sql
BEGIN;

-- Set tenant context (transaction-local)
SELECT set_config('api.tenant_id', 'your-tenant-uuid', TRUE);

-- Then run your EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT * FROM findings WHERE status = 'FAIL' LIMIT 25;

COMMIT;
```
### 3. Interpret the Results

#### Good Signs (Index is being used)

```
Index Scan using findings_tenant_status_idx on findings
  Index Cond: ((tenant_id = '...'::uuid) AND (status = 'FAIL'))
  Rows Removed by Filter: 0
  Actual Rows: 150
Planning Time: 0.5 ms
Execution Time: 2.3 ms
```

#### Bad Signs (Sequential scan, no index)

```
Seq Scan on findings
  Filter: ((tenant_id = '...'::uuid) AND (status = 'FAIL'))
  Rows Removed by Filter: 999850
  Actual Rows: 150
Planning Time: 0.3 ms
Execution Time: 450.2 ms
```

## Quick Reference: What to Look For

| What You See | Meaning | Action |
|--------------|---------|--------|
| `Index Scan` | Index is being used | Good, no action needed |
| `Index Only Scan` | Even better: data comes from the index alone | Good, no action needed |
| `Bitmap Index Scan` | Index used, results combined | Usually fine |
| `Seq Scan` on large tables | Full table scan, no index | **Needs investigation** |
| `Rows Removed by Filter: <high number>` | Fetching too many rows | **Query or index issue** |
| High `Execution Time` | Query is slow | **Needs optimization** |
## Common Issues and Fixes

### 1. Missing Index

**Problem:** `Seq Scan` on a filtered column.

```sql
-- Bad: No index on status
EXPLAIN ANALYZE SELECT * FROM findings WHERE status = 'FAIL';
-- Shows: Seq Scan on findings
```

**Fix:** Add an index.

```python
# In your model
class Meta:
    indexes = [
        models.Index(fields=['status'], name='findings_status_idx'),
    ]
```
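
The index only exists in the database once a migration creates and applies it; the usual flow (mirroring the Silk migration step above, and assuming the app label is `api`) would be:

```bash
cd api/src/backend
poetry run python manage.py makemigrations api
poetry run python manage.py migrate --database admin
```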
### 2. Index Not Used Due to Type Mismatch

**Problem:** The index exists but PostgreSQL doesn't use it.

```sql
-- If tenant_id is UUID but you're passing a string without a cast
WHERE tenant_id = 'some-uuid-string'
```

**Fix:** Ensure proper type casting in your queries.
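
A minimal sketch of the fix in raw SQL (in the ORM, passing a `uuid.UUID` object instead of a plain string achieves the same):

```sql
-- Cast the literal so the comparison matches the uuid column type
WHERE tenant_id = 'some-uuid-string'::uuid
```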
### 3. Index Not Used Due to Function Call

**Problem:** Wrapping a column in a function prevents index usage.

```sql
-- Bad: Index on inserted_at won't be used
WHERE DATE(inserted_at) = '2024-01-01'

-- Good: Use a range instead
WHERE inserted_at >= '2024-01-01' AND inserted_at < '2024-01-02'
```
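
The same pitfall appears in the ORM: Django's `__date` lookup wraps the column in a date cast, while a half-open range filters on the raw column. A sketch using the `Finding` model from earlier:

```python
from datetime import date, timedelta

from api.models import Finding

day = date(2024, 1, 1)

# May trigger the function-call pitfall: __date casts inserted_at to a date
Finding.objects.filter(inserted_at__date=day)

# Index-friendly: half-open range on the raw timestamp column
Finding.objects.filter(inserted_at__gte=day, inserted_at__lt=day + timedelta(days=1))
```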
### 4. Wrong Index for Sorting

**Problem:** The query is sorted, but the index doesn't match the sort order.

```sql
-- If you have ORDER BY inserted_at DESC,
-- you need an index with DESC or PostgreSQL will sort in memory
```

**Fix:** Create an index with a matching sort order.

```python
class Meta:
    indexes = [
        models.Index(fields=['-inserted_at'], name='findings_inserted_desc_idx'),
    ]
```

### 5. Composite Index Column Order

**Problem:** The index exists but its columns are in the wrong order.

```sql
-- Index on (tenant_id, scan_id)
-- This query WON'T use the index efficiently:
WHERE scan_id = '...'

-- This query WILL use the index:
WHERE tenant_id = '...' AND scan_id = '...'
```

**Rule:** The leftmost columns of a composite index must appear in your WHERE clause.
## RLS (Row Level Security) Considerations

Prowler uses Row Level Security via PostgreSQL's `set_config`. When analyzing queries, remember:

1. RLS policies add an implicit `WHERE tenant_id = current_tenant()` filter to queries
2. Always test with RLS enabled (how it runs in production)
3. Ensure `tenant_id` is the **first column** in composite indexes

### Using rls_transaction in Code

The `rls_transaction` context manager from `api.db_utils` sets the tenant context for all queries within its scope:

```python
from api.db_utils import rls_transaction
from api.models import Finding

with rls_transaction(tenant_id):
    # All queries here will have RLS applied
    qs = Finding.objects.filter(status="FAIL")
    list(qs)  # Execute
```

### Using RLS in Raw SQL (psql)

As above, run both statements in one transaction, because the setting is transaction-local:

```sql
BEGIN;

-- Set tenant context for the transaction
SELECT set_config('api.tenant_id', 'your-tenant-uuid', TRUE);

-- Now RLS policies will filter by this tenant
EXPLAIN ANALYZE SELECT * FROM findings WHERE status = 'FAIL';

COMMIT;
```

### Index Design for RLS

Since every query includes `tenant_id` via RLS, your composite indexes should **always start with `tenant_id`**:

```python
class Meta:
    indexes = [
        # Good: tenant_id first
        models.Index(fields=['tenant_id', 'status', 'severity']),

        # Bad: tenant_id not first - RLS queries won't use this efficiently
        models.Index(fields=['status', 'tenant_id']),
    ]
```
## Test Data Requirements

The amount of test data you need depends on what you're testing. PostgreSQL's query planner considers table statistics, index definitions, and data distribution when choosing execution plans.

### Important Considerations

1. **Small datasets may not use indexes**: PostgreSQL may choose a sequential scan over an index scan if the table is small enough that scanning it directly is faster. This is expected behavior.
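
   To verify that an index *could* be used even while the table is still small, one testing-only trick in psql is to discourage sequential scans before running `EXPLAIN ANALYZE` (`enable_seqscan` is a standard, session-local PostgreSQL planner setting):

   ```sql
   -- Testing only; never disable sequential scans in production
   SET enable_seqscan = off;
   EXPLAIN ANALYZE SELECT * FROM findings WHERE status = 'FAIL';
   RESET enable_seqscan;
   ```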
2. **Data must exist in the tables you're querying**: If your endpoint queries `findings`, `resources`, `scans`, or other tables, ensure those tables have data. Use the `findings` management command to generate test data:

   ```bash
   cd api/src/backend
   poetry run python manage.py findings \
       --tenant <TENANT_ID> \
       --findings 1000 \
       --resources 500 \
       --batch 500
   ```

3. **Update table statistics**: After inserting test data, run `ANALYZE` to update PostgreSQL's statistics:

   ```sql
   ANALYZE findings;
   ANALYZE resources;
   ANALYZE scans;
   -- Add other tables as needed
   ```

4. **Test with realistic data distribution**: If your query filters by a specific value (e.g., `status='FAIL'`), ensure your test data includes a realistic mix of values.
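
   A quick way to sanity-check the distribution you generated:

   ```sql
   SELECT status, COUNT(*) AS n
   FROM findings
   GROUP BY status
   ORDER BY n DESC;
   ```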
### When Index Usage Matters Most

Focus on validating index usage when:

- The table will have thousands or millions of rows in production
- The query is called frequently (list endpoints, dashboards)
- The query has multiple filters or joins

For small lookup tables or infrequently called endpoints, sequential scans may be acceptable.

## Performance Checklist for PRs

Before submitting a PR that adds or modifies database queries:

- [ ] Profiled queries with Silk or `EXPLAIN ANALYZE`
- [ ] Verified indexes are being used (no unexpected `Seq Scan` on large tables)
- [ ] Checked that `Rows Removed by Filter` is reasonable
- [ ] Tested with RLS enabled
- [ ] For critical endpoints: documented the query plan in the PR
## Useful Commands

### Update Table Statistics

```sql
ANALYZE findings;
ANALYZE resources;
```

### See Existing Indexes

```sql
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'findings';
```

### See Index Usage Stats

Note that `pg_stat_user_indexes` exposes the table and index names as `relname` and `indexrelname`:

```sql
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'findings'
ORDER BY idx_scan DESC;
```

### Check Table Size

```sql
SELECT
    relname AS table_name,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
WHERE relname IN ('findings', 'resources', 'scans')
ORDER BY pg_total_relation_size(relid) DESC;
```
## Working with Partitioned Tables

The `findings` and `resource_finding_mappings` tables are partitioned. When adding indexes, use the helper functions from `api.db_utils`:

### Adding Indexes to Partitions

```python
# In a migration file
from functools import partial

from django.db import migrations

from api.db_utils import create_index_on_partitions, drop_index_on_partitions


class Migration(migrations.Migration):
    atomic = False  # Required for CONCURRENTLY

    dependencies = [
        ("api", "XXXX_previous_migration"),
    ]

    operations = [
        migrations.RunPython(
            partial(
                create_index_on_partitions,
                parent_table="findings",
                index_name="my_new_idx",
                columns="tenant_id, status, severity",
                all_partitions=False,  # Only current/future partitions
            ),
            reverse_code=partial(
                drop_index_on_partitions,
                parent_table="findings",
                index_name="my_new_idx",
            ),
        ),
    ]
```
### Parameters

- `all_partitions=False` (default): Only creates indexes on current and future partitions. Use this for new indexes to avoid maintenance overhead on old data.
- `all_partitions=True`: Creates indexes on all partitions. Use when migrating critical existing indexes.
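
To confirm the index actually landed on the partitions, `pg_indexes` can be filtered by name; this sketch assumes partitions are named with the parent table as a prefix:

```sql
SELECT tablename, indexname
FROM pg_indexes
WHERE tablename LIKE 'findings%'
  AND indexname LIKE '%my_new_idx%';
```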
See the [Partitions Documentation](./partitions.md) for more details on the partitioning strategy.

## Further Reading

- [Django Silk Documentation](https://github.com/jazzband/django-silk)
- [PostgreSQL EXPLAIN Documentation](https://www.postgresql.org/docs/current/sql-explain.html)
- [Using EXPLAIN](https://www.postgresql.org/docs/current/using-explain.html)
- [Index Types in PostgreSQL](https://www.postgresql.org/docs/current/indexes-types.html)
- [Prowler Partitions Documentation](./partitions.md)