Skip to content

perf(api): optimize scan-compliance-overviews task#11591

Open
pedrooot wants to merge 6 commits into
masterfrom
PROWLER-1989-optimize-scan-compliance-overview-task-performance-api
Open

perf(api): optimize scan-compliance-overviews task#11591
pedrooot wants to merge 6 commits into
masterfrom
PROWLER-1989-optimize-scan-compliance-overview-task-performance-api

Conversation

@pedrooot

@pedrooot pedrooot commented Jun 15, 2026

Copy link
Copy Markdown
Member

Context

This optimizes the scan-compliance-overviews task (create_compliance_requirements), which materializes the per-requirement compliance rows for a scan. On large scans it was both slow and a memory hog: a single scan produces one row per region × framework × requirement, so a tenant with many regions and the full set of frameworks ends up building hundreds of thousands of row dictionaries in memory before writing any of them. That peak is what was pushing Celery workers into OOM (SIGKILL) territory.

There are three changes, all behind the same task and with no change to the data that ends up in the database.

Impact

Measured locally on an AWS scan (46 frameworks) with a synthetic dataset:

Scan size Time (before → after) Peak memory (before → after)
100 findings / 3 regions 4.9s → 3.2s 35 MB → 32 MB
1k findings / 8 regions 13.5s → 8.2s 53 MB → 35 MB
10k findings / 20 regions (108k rows) 36.5s → 21.3s 99 MB → 44 MB

Query count is constant across scan sizes and drops from 19 to 18 (the removed resource prefetch). The important part is the memory: it no longer scales with regions × frameworks × requirements, which is what was risking the worker OOM.

Steps to review

Please add a detailed description of how to review this PR.

Checklist

Community Checklist
  • This feature/issue is listed in here or roadmap.prowler.com
  • Is it assigned to me, if not, request it via the issue/feature in here or Prowler Community Slack

SDK/CLI

  • Are there new checks included in this PR? Yes / No
    • If so, do we need to update permissions for the provider? Please review this carefully.

UI

  • All issue/task requirements work as expected on the UI
  • If this PR adds or updates npm dependencies, include package-health evidence (maintenance, popularity, known vulnerabilities, license, release age) and explain why existing/native alternatives are insufficient.
  • Screenshots/Video of the functionality flow (if applicable) - Mobile (X < 640px)
  • Screenshots/Video of the functionality flow (if applicable) - Table (640px > X < 1024px)
  • Screenshots/Video of the functionality flow (if applicable) - Desktop (X > 1024px)
  • Ensure new entries are added to CHANGELOG.md, if applicable.

API

  • All issue/task requirements work as expected on the API
  • Endpoint response output (if applicable)
  • EXPLAIN ANALYZE output for new/modified queries or indexes (if applicable)
  • Performance test results (if applicable)
  • Any other relevant evidence of the implementation (if applicable)
  • Verify if API specs need to be regenerated.
  • Check if version updates are required (e.g., specs, uv, etc.).
  • Ensure new entries are added to CHANGELOG.md, if applicable.

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Summary by CodeRabbit

Release Notes

  • Refactor

    • Enhanced scan-compliance-overviews performance by streaming region aggregation and persisting compliance requirement rows in batched COPY operations, reducing peak memory usage and avoiding OOM risk.
    • Improved compliance overview generation to avoid full in-memory row materialization while keeping the output consistent.
  • Bug Fixes

    • Corrected ThreatScore compliance aggregation for multi-region findings and ensured empty or non-mapped regions are safely ignored.
  • Tests

    • Updated aggregation tests for iterator-based query behavior and added coverage for multi-region and skipped-empty-region scenarios.
  • Documentation

    • Added an unreleased changelog entry for the performance improvements.

@pedrooot pedrooot requested a review from a team as a code owner June 15, 2026 10:56
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

_persist_compliance_requirement_rows is changed to accept an Iterable, stream rows in batches via COPY with a per-batch bulk_create fallback, and return the total persisted count. _aggregate_findings_by_region is rewritten to use values_list(...).iterator() over denormalized resource_regions instead of prefetching related resources. create_compliance_requirements now uses a lazy generator feeding the streaming persist function. Tests are updated to match the iterator-based mock shape and add two new coverage scenarios. A changelog entry documents the performance improvements.

Changes

Streaming Compliance Aggregation and Persistence

Layer / File(s) Summary
Batched COPY persistence with ORM fallback
api/src/backend/tasks/jobs/scan.py
Adds Iterable import and rewrites _persist_compliance_requirement_rows to consume rows lazily via batched(), execute COPY per batch, return total persisted row count as int, and fall back to bulk_create only for the batch that failed on COPY error.
Streaming _aggregate_findings_by_region rewrite
api/src/backend/tasks/jobs/scan.py, api/src/backend/tasks/tests/test_scan.py
Replaces the docstring and implementation to use values_list(check_id, status, resource_regions, compliance).iterator(), tallying FAIL-dominant per-region status and ThreatScore counts from the denormalized resource_regions array. All six existing tests are updated to mock mock_queryset.iterator.return_value with tuple rows; two new tests cover a finding spanning multiple regions and findings with empty/None region lists.
Lazy generator in create_compliance_requirements
api/src/backend/tasks/jobs/scan.py
Adjusts initialization of regions and requirements_created, builds a per-framework compliance plan once, defines a lazy generator yielding rows per region/framework/requirement, deletes prior ComplianceRequirementOverview rows for idempotency, and assigns requirements_created from the streaming persist return value.
Changelog entry
api/CHANGELOG.md
Documents version 1.31.2 unreleased changes describing the shift to streaming aggregation and batched COPY-based writes via denormalized resource_regions for reduced memory usage on large scans.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • danibarranqueroo
  • AdriiiPRodri
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'perf(api): optimize scan-compliance-overviews task' directly describes the main change in the PR: optimizing the scan-compliance-overviews task for performance.
Description check ✅ Passed The description covers context, impact with measured data, but lacks detailed review steps and most checklist items are unchecked, though key changelog entry is confirmed added.
Docstring Coverage ✅ Passed Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch PROWLER-1989-optimize-scan-compliance-overview-task-performance-api

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

✅ All necessary CHANGELOG.md files have been updated.

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Conflict Markers Resolved

All conflict markers have been successfully resolved in this pull request.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/src/backend/tasks/jobs/scan.py`:
- Around line 1688-1694: The current implementation deletes existing compliance
requirement rows before all new rows are inserted, which can leave partial data
visible if a failure occurs mid-stream. Refactor the logic to use a staging
table or run marker approach: instead of immediately deleting the old rows from
ComplianceRequirementOverview, write all new rows to a staging area or temporary
marker, and only after _persist_compliance_requirement_rows completes
successfully and all summaries are finalized, atomically swap or commit the new
data while removing the old data in a single transaction. This ensures that the
requirements endpoint never sees incomplete override rows.

In `@api/src/backend/tasks/tests/test_scan.py`:
- Around line 3688-3689: Add explicit call assertions to verify the streaming
query contract in the test setup at multiple locations in
api/src/backend/tasks/tests/test_scan.py. For each affected site (lines
3688-3689 (anchor), 3729-3730, 3756-3757, 3804-3805, 3846-3847, 3885-3886,
3924-3925, and 3953-3954), after the mock configuration of
mock_queryset.values_list and mock_queryset.iterator, add assertion calls to
verify that these methods were actually invoked during the test execution. Use
assert_called_once() or assert_called() on mock_queryset.values_list to confirm
the values_list() call was made with the expected arguments, and similarly
assert that mock_queryset.iterator was called to ensure the streaming query
shape is maintained.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: f3ed1c92-f501-453b-8d32-2c8057eec4d2

📥 Commits

Reviewing files that changed from the base of the PR and between 6db8ce6 and 6a6b0f6.

📒 Files selected for processing (2)
  • api/src/backend/tasks/jobs/scan.py
  • api/src/backend/tasks/tests/test_scan.py

Comment thread api/src/backend/tasks/jobs/scan.py
Comment thread api/src/backend/tasks/tests/test_scan.py
@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

🔒 Container Security Scan

Image: prowler-api:d1f7ee1
Last scan: 2026-06-15 15:47:54 UTC

📊 Vulnerability Summary

Severity Count
🔴 Critical 10
Total 10

9 package(s) affected

⚠️ Action Required

Critical severity vulnerabilities detected. These should be addressed before merging:

  • Review the detailed scan results
  • Update affected packages to patched versions
  • Consider using a different base image if updates are unavailable

📋 Resources:

coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 15, 2026
josema-xyz
josema-xyz previously approved these changes Jun 15, 2026
Comment thread api/src/backend/tasks/jobs/scan.py Outdated
Reads only the consumed columns as tuples via ``values_list`` and streams
them with ``.iterator()``, using the denormalized ``resource_regions`` array
instead of ``prefetch_related("resources")``. The finding↔resource relation
is 1:1, so the denormalized array yields the same per-region tally.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says finding <-> resource is 1:1, but the model actually allows many. Could we reword it so it doesn't read like a guarantee?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks

@pedrooot pedrooot dismissed stale reviews from josema-xyz and coderabbitai[bot] via 99c3095 June 15, 2026 14:55

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/src/backend/tasks/jobs/scan.py (1)

1625-1687: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Generator side-effects are safe given the control flow.

The nested generator updates requirement_statuses as rows are yielded. This pattern is correct because:

  1. If _persist_compliance_requirement_rows fails mid-stream, the exception propagates and _create_compliance_summaries is never reached
  2. If all batches succeed (via COPY or fallback), the generator is fully consumed and requirement_statuses is complete

The pattern effectively combines row generation with summary tallying in a single memory-efficient pass.

Consider adding a brief inline comment (e.g., at line 1663) noting that this side-effect is intentional and tied to the generator's consumption by the persist function. This aids future maintainers who may not expect mutation within a generator.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/src/backend/tasks/jobs/scan.py` around lines 1625 - 1687, The nested
generator function `_iter_compliance_requirement_rows` mutates the
`requirement_statuses` dictionary as it yields rows, but this intentional
side-effect lacks documentation. Add a brief inline comment near where
`requirement_statuses` is updated (around the lines incrementing total_count,
fail_count, and pass_count) explaining that this side-effect is intentional and
safe because the generator is fully consumed by the persist function before the
summarization step, ensuring all tallies are complete before use.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@api/src/backend/tasks/jobs/scan.py`:
- Around line 1625-1687: The nested generator function
`_iter_compliance_requirement_rows` mutates the `requirement_statuses`
dictionary as it yields rows, but this intentional side-effect lacks
documentation. Add a brief inline comment near where `requirement_statuses` is
updated (around the lines incrementing total_count, fail_count, and pass_count)
explaining that this side-effect is intentional and safe because the generator
is fully consumed by the persist function before the summarization step,
ensuring all tallies are complete before use.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 45de73db-0442-4c33-b716-0f1514dccc21

📥 Commits

Reviewing files that changed from the base of the PR and between fc87a32 and 99c3095.

📒 Files selected for processing (1)
  • api/src/backend/tasks/jobs/scan.py

coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/CHANGELOG.md (1)

5-19: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep unreleased versions in descending order.

1.32.0 is newer than 1.31.2, so placing it below the new entry makes the changelog harder to scan and misstates release order. Move the 1.32.0 block above the 1.31.2 section.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/CHANGELOG.md` around lines 5 - 19, The changelog entries are not in
descending order by version number. The 1.32.0 section (which is newer) appears
after the 1.31.2 section (which is older), making it harder to scan and
misstating the release order. Reorder the sections so that the 1.32.0 (Security)
block appears first, followed by the 1.31.2 (Changed) block, to maintain
descending version order.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@api/CHANGELOG.md`:
- Around line 5-19: The changelog entries are not in descending order by version
number. The 1.32.0 section (which is newer) appears after the 1.31.2 section
(which is older), making it harder to scan and misstating the release order.
Reorder the sections so that the 1.32.0 (Security) block appears first, followed
by the 1.31.2 (Changed) block, to maintain descending version order.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1b25e70d-86d1-47d7-8faf-30f7b2ee600a

📥 Commits

Reviewing files that changed from the base of the PR and between 99c3095 and 2eb0a39.

📒 Files selected for processing (1)
  • api/CHANGELOG.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants