docs(attack-paths): add module docstring to scan orchestrator (#10277)

Josema Camacho
2026-03-12 08:49:48 +01:00
committed by GitHub
parent b08cb8ffb3
commit 628a076118


@@ -1,3 +1,58 @@
"""
Attack Paths scan orchestrator.
Runs the full scan lifecycle for a single provider, called from a Celery task.

The idea is simple: ingest everything into a throwaway Neo4j database, enrich
it with Prowler-specific data, then swap it into the tenant's long-lived
database so queries never see a half-built graph.

Two databases are involved:
- Temporary (db-tmp-scan-<attack_paths_scan_id>): short-lived, single-provider, dropped after sync.
- Tenant (db-tenant-<tenant_uuid>): long-lived, multi-provider, what the API queries against.
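The naming scheme above can be made concrete with two small helpers. These are hypothetical, shown only for illustration; the module may build the names inline rather than through functions like these:

```python
def temp_db_name(attack_paths_scan_id: str) -> str:
    # Short-lived, single-provider database; dropped after the sync step.
    # (Hypothetical helper; the real module may construct this inline.)
    return f"db-tmp-scan-{attack_paths_scan_id}"


def tenant_db_name(tenant_uuid: str) -> str:
    # Long-lived, multi-provider database that the API queries against.
    return f"db-tenant-{tenant_uuid}"
```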

Pipeline steps:
1. Resolve the Prowler provider and SDK credentials from the scan ID.
Retrieve or create the AttackPathsScan row. Exit early if the provider
type has no ingestion function (only AWS is supported today).
2. Create a fresh temporary Neo4j database and set up Cartography indexes
plus ProwlerFinding indexes before writing any data.
3. Run the provider-specific Cartography ingestion (e.g. aws.start_aws_ingestion).
This iterates over cloud services and writes the standard Cartography nodes
(AWSAccount, EC2Instance, IAMRole, etc.) and relationships (RESOURCE,
POLICY, STATEMENT, TRUSTS_AWS_PRINCIPAL, ...) into the temp database.
Wrapped in call_within_event_loop because some Cartography modules use async.
4. Run Cartography post-processing: the ontology pass for label propagation
and the analysis pass for derived relationships.
5. Create an Internet singleton node and add CAN_ACCESS relationships to
internet-exposed resources (EC2Instance, LoadBalancer, LoadBalancerV2).
6. Stream Prowler findings from Postgres in batches. Each finding becomes a
ProwlerFinding node linked to its cloud-resource node via HAS_FINDING.
Before that, an _AWSResource label (provider-specific) is added to all
nodes connected to the AWSAccount so finding lookups can use an index.
Stale findings from previous scans are cleaned up.
7. Sync the temp database into the tenant database:
- Drop the old provider subgraph (matched by the _provider_id property).
graph_data_ready is set to False for all scans of this provider while
the swap happens so the API doesn't serve partial data.
- Copy nodes and relationships in batches. Every synced node gets a
_ProviderResource label and _provider_id / _provider_element_id
properties for multi-provider isolation.
- Set graph_data_ready back to True.
8. Drop the temporary database, mark the AttackPathsScan as COMPLETED.

On failure, the temp database is dropped, the scan is marked FAILED, and the
exception propagates to Celery.
"""
import logging
import time
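Step 7's "copy nodes and relationships in batches" can be illustrated with a generic batching helper. This is a sketch under the assumption that the sync iterates over exported records; the real implementation issues Cypher against both databases, which is elided here:

```python
def batched(items, size):
    # Yield fixed-size chunks so nodes/relationships can be written to the
    # tenant database in bounded transactions rather than one huge commit.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch
```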