
How do we enable Sourcegraph Deep Search safely for private code (zero retention / no training) and exclude specific repos from AI context?
Most teams I work with want Deep Search’s code understanding for private code, but they need hard guarantees: zero data retention, no model training on their IP, and precise control over which repos can ever be sent to AI. You can get all three with Sourcegraph, but you have to wire it in deliberately.
Below is a practical, step‑by‑step way to enable Sourcegraph Deep Search safely for private code, configure zero‑retention / no‑training behavior, and exclude specific repositories from AI context so neither humans nor agents accidentally leak data.
1. Understand how Deep Search handles private code
Deep Search is part of the Sourcegraph code understanding platform, not a standalone “AI black box.” That matters for safety:
- It runs on top of Sourcegraph’s universal code search.
- It uses Sourcegraph Search as a primary context provider instead of third‑party embedding APIs.
- It’s designed for enterprise environments spanning GitHub, GitLab, Bitbucket, Gerrit, Perforce, and more.
From a privacy and governance standpoint:
- Zero data retention for inference. Sourcegraph’s AI features (including Deep Search) do not train models on your code or user prompts. Inference data is not retained beyond what’s required to serve the request.
- No model training on your code. Models are not trained with user data. You retain ownership of all inputs and outputs.
- SOC 2 Type II and ISO 27001 posture. The platform is built for regulated orgs that already rely on SAML/OIDC SSO, SCIM, and RBAC.
Think of Deep Search as “Agentic AI Search” that is constrained by the same identity and access controls you already enforce for humans.
2. Deployment choices that keep Deep Search safe
Before you flip Deep Search on for private code, make sure your deployment and identity model are aligned with your security expectations.
2.1 Choose your Sourcegraph deployment model
Common patterns:
- Self‑hosted Sourcegraph in your own VPC
- Best for strict compliance / air‑gapped or highly regulated environments.
- You control network boundaries, logging, and integration with internal GitHub Enterprise / GitLab / Bitbucket / Gerrit / Perforce.
- Deep Search and other AI capabilities operate within those boundaries, with zero training on your data.
- Sourcegraph Cloud with private code
- Still honors zero data retention and no training.
- Backed by SOC 2 Type II controls.
- Works well if you’re comfortable with a SaaS boundary but need strong guarantees and auditability.
If you’re unsure which model fits your risk posture, default to self‑hosted with private connectivity to your code hosts; you can still use Deep Search on private repos with zero training.
2.2 Integrate identity: SSO, SCIM, RBAC
Deep Search should never see more than a user (or agent) is allowed to see. That’s enforced through:
- Single Sign‑On (SSO): Use SAML, OpenID Connect, or OAuth to federate identity from your IdP (Okta, Azure AD, etc.).
- SCIM user provisioning: Keep accounts and group membership in sync automatically.
- Role‑based Access Control (RBAC): Map groups from your IdP to roles and permissions in Sourcegraph:
- Restrict visibility of certain repositories to specific roles.
- Limit who can administer AI settings, including context filters.
Result: Deep Search inherits effective permissions from the user’s or agent’s identity, so there’s no “super‑user AI” that can see everything by default.
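To make the inheritance model concrete, here is a minimal Python sketch of how effective permissions compose. The group names and repo lists are hypothetical, and real enforcement happens inside Sourcegraph (driven by your IdP and code‑host permission sync); this only illustrates the shape of the model:

```python
# Hypothetical sketch: effective repo visibility derived from IdP groups.
# Sourcegraph enforces this server-side; this only illustrates the model.

IDP_GROUP_TO_REPOS = {
    "eng-platform": {"infra/deploy", "infra/terraform"},
    "eng-payments": {"payments/api", "payments/ledger"},
}

AI_ADMIN_GROUPS = {"security-admins"}  # who may change AI/context settings


def visible_repos(groups):
    """An identity's effective visibility is the union of its groups' repos."""
    repos = set()
    for g in groups:
        repos |= IDP_GROUP_TO_REPOS.get(g, set())
    return repos


def can_administer_ai(groups):
    """Only designated groups may touch AI settings such as context filters."""
    return bool(set(groups) & AI_ADMIN_GROUPS)


# A Deep Search request on behalf of this identity can only draw context
# from repos the identity could already see:
user_groups = ["eng-payments"]
print(sorted(visible_repos(user_groups)))
print(can_administer_ai(user_groups))
```

The key property to preserve in any real setup: AI requests never widen visibility beyond what the same identity gets through ordinary Code Search.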
3. Enabling Deep Search for private code safely
Once your deployment and identity are in place, you can turn Deep Search on and ensure it behaves within your governance boundaries.
3.1 Connect your code hosts
Deep Search becomes useful when Sourcegraph has full visibility across your estate:
- Connect GitHub, GitLab, Bitbucket, Gerrit, Perforce, and any other supported hosts.
- Configure repository mirroring and sync schedules.
- Verify that repo‑level permissions are synced correctly from each host.
This gives Deep Search a truly universal code understanding layer, whether your estate spans 100 repositories or a million, without changing how you host code.
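One way to spot‑check that a repo is synced and visible to a given identity is Sourcegraph’s GraphQL API. The sketch below builds a standard `repository(name:)` query and posts it to the `/.api/graphql` endpoint with a token minted for the identity you want to audit; the repo name is a placeholder, and you should verify the query shape against your instance’s API docs:

```python
# Hedged sketch: spot-check repo visibility via the Sourcegraph GraphQL API.
# The endpoint path and "Authorization: token ..." header are standard;
# confirm details against your Sourcegraph version's API documentation.
import json
import os
import urllib.request


def build_repo_query(repo_name: str) -> dict:
    """GraphQL payload asking whether the caller's identity can see the repo."""
    return {
        "query": "query($name: String!) { repository(name: $name) { name } }",
        "variables": {"name": repo_name},
    }


def check_repo_visibility(endpoint: str, token: str, repo_name: str):
    req = urllib.request.Request(
        f"{endpoint}/.api/graphql",
        data=json.dumps(build_repo_query(repo_name)).encode(),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Run with a token for the user (or agent service account) being audited.
    result = check_repo_visibility(
        os.environ["SRC_ENDPOINT"],
        os.environ["SRC_ACCESS_TOKEN"],
        "github.com/your-org/your-repo",
    )
    print(result)
```

A `null` repository in the response for a repo that exists on the code host is the signal to investigate permission sync for that identity.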
3.2 Enable Deep Search and AI features
At the instance level (self‑hosted or cloud admin):
- Enable AI and Deep Search in the global settings.
- Confirm:
- AI requests are routed through Sourcegraph’s supported models with zero data retention.
- No configuration is enabled that trains on your data.
- If needed, restrict Deep Search access:
- Allow only specific roles or groups to use AI features initially.
- Roll out to a pilot group of senior engineers or platform/infra teams.
This gives you a controlled rollout, with the ability to audit usage and refine context policies before org‑wide exposure.
4. Using Context Filters to protect sensitive code
Deep Search relies on context—files, repositories, symbols, and patterns—to answer queries for humans and agents. You need fine‑grained control over which parts of the codebase can be sent to AI models at all.
This is where Context Filters come in.
4.1 What Context Filters do
Context Filters let you:
- Exclude specific repositories from ever being used as AI context.
- Keep selected code from ever being sent to AI models, even if users can still search and browse it via standard Code Search.
- Maintain a split between:
- “Searchable by humans only.”
- “Searchable and usable as AI context.”
You can use this to keep crown‑jewel repos, regulated components, or experimental IP out of any AI request surface.
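The two‑surface split is easiest to reason about as two separate checks. This is an illustrative Python sketch (the repo names and exclusion set are hypothetical; the real policy lives in Sourcegraph’s context‑filter configuration):

```python
# Illustrative sketch of the two-surface model: a repo can be searchable
# by an authorized human while excluded from every AI request.

AI_CONTEXT_EXCLUDED = {"acme/crown-jewels", "acme/m-and-a"}


def can_search(user_visible_repos, repo):
    # Standard Code Search: governed only by the user's permissions.
    return repo in user_visible_repos


def can_use_as_ai_context(user_visible_repos, repo):
    # AI context: user permissions AND the context-filter policy must both allow it.
    return repo in user_visible_repos and repo not in AI_CONTEXT_EXCLUDED


visible = {"acme/crown-jewels", "acme/webapp"}
print(can_search(visible, "acme/crown-jewels"))             # human can search it
print(can_use_as_ai_context(visible, "acme/crown-jewels"))  # but AI never sees it
```

Note the asymmetry: exclusion from AI context never loosens search permissions, and search permissions never override an AI exclusion.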
4.2 Designing your context policy
From a governance perspective, I recommend a tiered model:
- Tier 0 (No AI context):
- Repos with the highest sensitivity: regulated workloads, proprietary algorithms, secrets‑adjacent infrastructure, M&A artifacts.
- Policy: Fully searchable via Code Search for authorized humans, but explicitly excluded from Deep Search context.
- Tier 1 (Restricted AI context):
- Core application repos with customer‑impacting logic.
- Policy: Allowed for AI context, but only for specific groups (e.g., platform team) and with guardrails like code ownership and monitors.
- Tier 2 (Open AI context):
- Shared libraries, utilities, demos, and public‑mirror equivalents.
- Policy: Fully available as AI context for Deep Search and agents.
Document this in your internal runbook so security, platform, and app teams have a shared language when proposing repo moves between tiers.
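The tier model also lends itself to a machine‑readable form that security and platform teams can review in a PR. Below is a hypothetical Python encoding: the tier assignments and group names are examples, and the actual enforcement lives in Sourcegraph’s context‑filter configuration, not in this table:

```python
# Hypothetical runbook encoding of the tier model. Default-deny:
# repos not yet classified are treated as Tier 0 (no AI context).

TIER_POLICY = {
    0: {"ai_context": False, "restricted_to": None},   # humans-only search
    1: {"ai_context": True, "restricted_to": {"platform-team"}},
    2: {"ai_context": True, "restricted_to": None},    # open AI context
}

REPO_TIERS = {
    "security/vault-config": 0,
    "payments/core": 1,
    "libs/utils": 2,
}


def ai_context_allowed(repo, groups):
    """True if this repo may be AI context for an identity in these groups."""
    policy = TIER_POLICY[REPO_TIERS.get(repo, 0)]  # unknown repos act as Tier 0
    if not policy["ai_context"]:
        return False
    restricted = policy["restricted_to"]
    return restricted is None or bool(set(groups) & restricted)
```

Keeping the table default‑deny means a newly created repo is never AI‑visible until someone deliberately classifies it.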
4.3 Implementing repo‑level exclusions
In Sourcegraph’s admin configuration for Context Filters:
- List repositories that should be excluded from AI context.
- Use patterns (e.g., `^security/`, `^acquisitions/`, `^infra/secrets/`) to capture entire organizational slices.
- Apply changes and verify behavior:
- Deep Search queries should never reference excluded repos.
- Code Search still returns results for those repos for users who have permission.
This gives you confidence that sensitive repos can be searched by humans but never leave your boundary as AI context.
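Before applying exclusion patterns in the admin config, it’s worth sanity‑checking them against known repo paths. A quick sketch using the example patterns above (the repo paths are illustrative):

```python
# Sanity-check exclusion patterns against sample repo paths before
# applying them in the admin configuration.
import re

EXCLUDE_PATTERNS = [r"^security/", r"^acquisitions/", r"^infra/secrets/"]


def excluded_from_ai_context(repo_path: str) -> bool:
    """True if any configured pattern matches this repo path."""
    return any(re.search(p, repo_path) for p in EXCLUDE_PATTERNS)


for repo in ["security/scanner", "infra/secrets/rotation", "webapp/frontend"]:
    status = "EXCLUDED" if excluded_from_ai_context(repo) else "allowed"
    print(f"{repo} -> {status}")
```

Anchored patterns (`^…`) matter here: an unanchored `secrets/` would also exclude, say, `docs/secrets/handling-guide`, which may or may not be what you intend.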
5. Guardrails for public and open‑source code
If you use Deep Search on a mix of private and public/OSS code, you need to enforce licensing and compliance guardrails.
Sourcegraph provides public code guardrails that:
- Help prevent code that violates open‑source licensing from being suggested or resurfaced inappropriately.
- Reduce OSS license risk when Deep Search or other AI features use public code as context alongside your private repos.
Combine that with Context Filters and code ownership to keep your AI outputs auditable and compliant.
6. Putting Deep Search behind enterprise controls
Deep Search should feel like a first‑class citizen in your existing governance stack. A few patterns I’ve seen work well:
6.1 Scope Deep Search access by role
Use RBAC to:
- Grant Deep Search access first to:
- Developer productivity/platform teams.
- Security engineers who need cross‑repo visibility.
- Senior engineers leading migrations or refactors.
- Expand access once:
- Context Filters are tuned.
- Excluded repos are locked down.
- Security signs off on zero‑retention / no‑training posture.
6.2 Treat agents as users with permissions
If you expose Deep Search via Sourcegraph MCP or other agent integrations:
- Give each agent its own identity (service account) in your IdP.
- Apply RBAC and Context Filters to that identity:
- Agents never see more than a real user with the same role.
- Repos excluded from AI context remain off‑limits to the agent.
Agents are only as safe as their ability to respect your access model. Treat them as real users with constrained permissions.
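In practice this means an agent flows through exactly the same checks as a human. A small sketch, where `deep-search-agent` is a hypothetical service account provisioned in the IdP and the repo names are illustrative:

```python
# Sketch: an agent is just another identity flowing through the same checks.
# "acme/crown-jewels" stands in for a repo excluded from AI context.

AI_CONTEXT_EXCLUDED = {"acme/crown-jewels"}


def ai_context_for(identity_visible_repos, candidate_repos):
    """Repos an identity (human or agent) may supply to a Deep Search request."""
    return [
        r for r in candidate_repos
        if r in identity_visible_repos and r not in AI_CONTEXT_EXCLUDED
    ]


# The "deep-search-agent" service account was granted only these repos via RBAC:
agent_visible = {"acme/webapp", "acme/crown-jewels"}
print(ai_context_for(agent_visible,
                     ["acme/webapp", "acme/crown-jewels", "acme/billing"]))
```

Both filters apply: the agent can’t reach `acme/billing` (outside its RBAC grant) and can’t reach `acme/crown-jewels` (excluded from AI context) even though it could search it.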
6.3 Monitor and audit usage
Pair Deep Search with monitoring and insights:
- Use Monitors to:
- Detect risky patterns or undesirable changes in code (e.g., secrets, forbidden dependencies).
- Trigger notifications or actions when patterns appear.
- Use Insights to:
- Track how Deep Search and AI‑driven workflows change code over time.
- Support migrations and standardization efforts across many repos.
This closes the loop from understanding to controlled, auditable change.
7. Example rollout plan for Deep Search with sensitive repos
To make this concrete, here’s a phased path I’ve used in a regulated enterprise:
- Foundation
- Deploy Sourcegraph self‑hosted in a secured environment.
- Integrate GitHub + Perforce + other hosts; confirm permission sync.
- Enable SAML/OIDC SSO, SCIM provisioning, and RBAC.
- Context policy
- Classify repos into Tier 0 / 1 / 2.
- Configure Context Filters:
- Add Tier 0 repos and sensitive patterns to the exclusion list.
- Enable public code guardrails.
- Pilot Deep Search
- Enable Deep Search for a small, trusted group.
- Validate:
- Zero data retention / no training posture with security.
- Excluded repos never appear in Deep Search answers.
- Agents (if used) respect RBAC and Context Filters.
- Scale out
- Expand Deep Search access to broader engineering teams.
- Use Batch Changes to run controlled, multi‑repo refactors informed by Deep Search.
- Configure Monitors and Insights to track the impact.
- Continuous refinement
- Adjust Context Filters as new sensitive repos appear.
- Revisit role mappings and agent permissions regularly.
- Incorporate security review of AI settings into your standard tooling governance process.
Final takeaway
You don’t have to choose between Deep Search and control. With Sourcegraph, you can:
- Run Deep Search on private code with zero data retention and no training on your IP.
- Use Context Filters to exclude specific repositories and patterns from AI context entirely.
- Enforce SSO, SCIM, and RBAC so Deep Search and agents are bound by the same access model as your developers.
- Add public code guardrails, Monitors, and Insights to keep AI‑assisted change both fast and governed.
If you want help designing the right context policy and rollout plan for your environment, my recommendation is to walk through your repo tiers and identity model with the Sourcegraph team and validate them against your compliance requirements.