perf(api): optimize scan-compliance-overviews task#11591
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthrough
ChangesStreaming Compliance Aggregation and Persistence
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
✅ All necessary |
|
✅ Conflict Markers Resolved All conflict markers have been successfully resolved in this pull request. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@api/src/backend/tasks/jobs/scan.py`:
- Around line 1688-1694: The current implementation deletes existing compliance
requirement rows before all new rows are inserted, which can leave partial data
visible if a failure occurs mid-stream. Refactor the logic to use a staging
table or run marker approach: instead of immediately deleting the old rows from
ComplianceRequirementOverview, write all new rows to a staging area or temporary
marker, and only after _persist_compliance_requirement_rows completes
successfully and all summaries are finalized, atomically swap or commit the new
data while removing the old data in a single transaction. This ensures that the
requirements endpoint never sees incomplete override rows.
In `@api/src/backend/tasks/tests/test_scan.py`:
- Around line 3688-3689: Add explicit call assertions to verify the streaming
query contract in the test setup at multiple locations in
api/src/backend/tasks/tests/test_scan.py. For each affected site (lines
3688-3689 (anchor), 3729-3730, 3756-3757, 3804-3805, 3846-3847, 3885-3886,
3924-3925, and 3953-3954), after the mock configuration of
mock_queryset.values_list and mock_queryset.iterator, add assertion calls to
verify that these methods were actually invoked during the test execution. Use
assert_called_once() or assert_called() on mock_queryset.values_list to confirm
the values_list() call was made with the expected arguments, and similarly
assert that mock_queryset.iterator was called to ensure the streaming query
shape is maintained.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: f3ed1c92-f501-453b-8d32-2c8057eec4d2
📒 Files selected for processing (2)
api/src/backend/tasks/jobs/scan.pyapi/src/backend/tasks/tests/test_scan.py
🔒 Container Security ScanImage: 📊 Vulnerability Summary
9 package(s) affected
|
| Reads only the consumed columns as tuples via ``values_list`` and streams | ||
| them with ``.iterator()``, using the denormalized ``resource_regions`` array | ||
| instead of ``prefetch_related("resources")``. The finding↔resource relation | ||
| is 1:1, so the denormalized array yields the same per-region tally. |
There was a problem hiding this comment.
The docstring says finding <-> resource is 1:1, but the model actually allows many. Could we reword it so it doesn't read like a guarantee?
99c3095
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/src/backend/tasks/jobs/scan.py (1)
1625-1687: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueGenerator side-effects are safe given the control flow.
The nested generator updates
requirement_statusesas rows are yielded. This pattern is correct because:
- If
_persist_compliance_requirement_rowsfails mid-stream, the exception propagates and_create_compliance_summariesis never reached- If all batches succeed (via COPY or fallback), the generator is fully consumed and
requirement_statusesis completeThe pattern effectively combines row generation with summary tallying in a single memory-efficient pass.
Consider adding a brief inline comment (e.g., at line 1663) noting that this side-effect is intentional and tied to the generator's consumption by the persist function. This aids future maintainers who may not expect mutation within a generator.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/src/backend/tasks/jobs/scan.py` around lines 1625 - 1687, The nested generator function `_iter_compliance_requirement_rows` mutates the `requirement_statuses` dictionary as it yields rows, but this intentional side-effect lacks documentation. Add a brief inline comment near where `requirement_statuses` is updated (around the lines incrementing total_count, fail_count, and pass_count) explaining that this side-effect is intentional and safe because the generator is fully consumed by the persist function before the summarization step, ensuring all tallies are complete before use.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@api/src/backend/tasks/jobs/scan.py`:
- Around line 1625-1687: The nested generator function
`_iter_compliance_requirement_rows` mutates the `requirement_statuses`
dictionary as it yields rows, but this intentional side-effect lacks
documentation. Add a brief inline comment near where `requirement_statuses` is
updated (around the lines incrementing total_count, fail_count, and pass_count)
explaining that this side-effect is intentional and safe because the generator
is fully consumed by the persist function before the summarization step,
ensuring all tallies are complete before use.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 45de73db-0442-4c33-b716-0f1514dccc21
📒 Files selected for processing (1)
api/src/backend/tasks/jobs/scan.py
…view-task-performance-api
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/CHANGELOG.md (1)
5-19:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winKeep unreleased versions in descending order.
1.32.0is newer than1.31.2, so placing it below the new entry makes the changelog harder to scan and misstates release order. Move the1.32.0block above the1.31.2section.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/CHANGELOG.md` around lines 5 - 19, The changelog entries are not in descending order by version number. The 1.32.0 section (which is newer) appears after the 1.31.2 section (which is older), making it harder to scan and misstating the release order. Reorder the sections so that the 1.32.0 (Security) block appears first, followed by the 1.31.2 (Changed) block, to maintain descending version order.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@api/CHANGELOG.md`:
- Around line 5-19: The changelog entries are not in descending order by version
number. The 1.32.0 section (which is newer) appears after the 1.31.2 section
(which is older), making it harder to scan and misstating the release order.
Reorder the sections so that the 1.32.0 (Security) block appears first, followed
by the 1.31.2 (Changed) block, to maintain descending version order.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 1b25e70d-86d1-47d7-8faf-30f7b2ee600a
📒 Files selected for processing (1)
api/CHANGELOG.md
Context
This optimizes the
scan-compliance-overviewstask (create_compliance_requirements), which materializes the per-requirement compliance rows for a scan. On large scans it was both slow and a memory hog: a single scan produces one row perregion × framework × requirement, so a tenant with many regions and the full set of frameworks ends up building hundreds of thousands of row dictionaries in memory before writing any of them. That peak is what was pushing Celery workers into OOM (SIGKILL) territory.There are three changes, all behind the same task and with no change to the data that ends up in the database.
Impact
Measured locally on an AWS scan (46 frameworks) with a synthetic dataset:
Query count is constant across scan sizes and drops from 19 to 18 (the removed resource prefetch). The important part is the memory: it no longer scales with
regions × frameworks × requirements, which is what was risking the worker OOM.Steps to review
Please add a detailed description of how to review this PR.
Checklist
Community Checklist
SDK/CLI
UI
API
License
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Summary by CodeRabbit
Release Notes
Refactor
scan-compliance-overviewsperformance by streaming region aggregation and persisting compliance requirement rows in batched COPY operations, reducing peak memory usage and avoiding OOM risk.Bug Fixes
Tests
Documentation