
Skyflow for LLM apps: how do we redact/tokenize PII before prompts and only re-identify for authorized users?
Building LLM applications on sensitive customer data demands more than generic masking. You need a way to redact or tokenize PII before anything reaches the model, then selectively re-identify that data only for authorized users and use cases. Skyflow is designed for exactly this: protecting sensitive data across the entire LLM lifecycle while still enabling rich, personalized AI experiences.
This guide walks through how to use Skyflow with LLM apps to:
- Detect and redact or tokenize PII before prompts
- Keep sensitive data out of LLM training and inference
- Re-identify data only for authorized users
- Support patterns like RAG, chatbots, copilots, and analytics safely
Why you need redaction/tokenization for LLM apps
LLMs are powerful, but they introduce real privacy risks:
- Prompts often contain PII (names, emails, card numbers, medical details).
- Uploaded files and knowledge bases for RAG can be full of sensitive customer data.
- Model outputs can inadvertently leak PII from training or previous prompts.
- Logs and traces of prompts/responses can turn into shadow databases of sensitive data.
Skyflow’s approach is to keep PII out of the model to begin with, and only reintroduce it at the edges for authorized users or downstream systems.
Core Skyflow capabilities for LLM privacy
Skyflow provides a set of building blocks that fit naturally into LLM architectures:
- Tokenization – Replace PII with secure, format-preserving tokens so the model sees non-sensitive stand-ins instead of real data.
- Masking / redaction – Hide or partially obfuscate sensitive values (for example, john.doe@example.com → j***@example.com) before prompt construction.
- Sensitive data dictionary – Define which terms and fields are sensitive and must never be sent to LLMs.
- De-identification at ingestion – Store sensitive data in Skyflow from the start, so operational systems and LLM apps only interact with tokens.
- Controlled re-identification – Only reveal original values to authorized users or services, under policy, at the edges of your application.
- Privacy-safe model training – Exclude sensitive fields from training datasets so models never learn on raw PII.
Together, these capabilities let you design LLM apps where the model never directly processes PII, but users still get personalized and context-aware responses.
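To make the difference between masking and tokenization concrete, here is a minimal Python sketch. It is not the Skyflow SDK: `InMemoryVault`, `mask_email`, and the `tok_` prefix are hypothetical names used only for illustration, and the tokens here are random strings rather than format-preserving ones.

```python
import secrets


def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"  # not a well-formed address, so hide it entirely
    return f"{local[0]}***@{domain}"


class InMemoryVault:
    """Hypothetical stand-in for a vault; NOT the Skyflow SDK."""

    def __init__(self) -> None:
        self._by_token: dict[str, str] = {}
        self._by_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the token for a repeated value so references stay consistent.
        if value not in self._by_value:
            token = f"tok_{secrets.token_hex(8)}"
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def detokenize(self, token: str) -> str:
        return self._by_token[token]
```

Masking is one-way (the original value cannot be recovered from `j***@example.com`), while a token can be exchanged for the original value by whoever is allowed to call `detokenize`.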
High-level architecture: Skyflow + LLM apps
At a high level, a safe LLM architecture with Skyflow looks like this:
-
Data ingestion
- Sensitive data (user profiles, transactions, health records, support tickets) flows into Skyflow.
- Skyflow de-identifies this data via tokenization or masking. Applications reference tokens instead of raw PII.
-
Sensitive data dictionary enforcement
- You define which fields/terms are sensitive (emails, phone numbers, card data, government IDs, etc.).
- These definitions guide how data is tokenized, masked, and kept out of prompts and training data.
-
Prompt construction layer
- Your app builds prompts using tokenized or masked data.
- PII is already replaced by tokens or redacted strings before anything reaches the LLM.
-
Model inference
- The LLM receives only de-identified inputs (tokens, masked values, context without PII).
- Outputs are generated without direct access to underlying PII.
-
Re-identification (when allowed)
- For authorized users or flows, your app calls Skyflow to re-identify specific tokens in the model response.
- Policies enforce who can see what, when, and for which purpose.
-
Logging and monitoring
- Logs and traces contain tokens or masked values, not raw PII, which substantially reduces exposure risk.
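As a defensive complement to the logging step, an application can also scrub log messages in case raw PII slips through. The sketch below uses Python's standard `logging.Filter`; the `RedactEmails` class, the `[REDACTED_EMAIL]` placeholder, and the email-only pattern are assumptions for illustration, not part of Skyflow.

```python
import logging
import re

# Illustrative pattern only; real detection needs much broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactEmails(logging.Filter):
    """Safety net: mask any email address that slips into a log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first, then scrub the formatted text.
        record.msg = EMAIL.sub("[REDACTED_EMAIL]", record.getMessage())
        record.args = None  # the message is already fully formatted
        return True  # keep the record, just with scrubbed content
```

Attaching it with `logger.addFilter(RedactEmails())` scrubs records that pass through that logger. This is a last line of defense; the primary control remains de-identifying data before it reaches the prompt layer.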
Redacting/tokenizing PII before prompts
There are two main integration patterns for handling PII before a prompt is sent: “Skyflow-first” data modeling and “on-the-fly” de-identification.
Pattern 1: Skyflow-first data model
Ideal for new systems or when you can refactor workflows around Skyflow.
-
Store PII in Skyflow
- User signs up / updates their profile.
- Your backend writes sensitive fields (email, phone, address, card data) to Skyflow.
- Skyflow returns tokens for each field.
-
Use tokens in application data
- Your operational database stores only tokens (for example, email_token, customer_id_token).
- Non-sensitive fields remain in your own systems.
-
Prompt construction with tokens
- When building prompts, you naturally reference tokens already in your records.
- Example snippet in a prompt:
- “Conversation history with customer {{customer_token}}” instead of “Conversation history with Jane Doe”.
-
Sensitive data dictionary guardrails
- The dictionary ensures that configured sensitive fields are always handled as tokens or masked values before reaching the model.
Benefits
- PII never enters your main databases or LLM logs.
- Simple, consistent pattern across all services.
- Easy to prove data minimization for compliance.
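The Skyflow-first flow can be sketched as follows. This is a simplified illustration under stated assumptions: the `vault` dictionary and `store_in_vault` helper are hypothetical stand-ins for a vault write, not Skyflow API calls, and `plan` is an invented non-sensitive field.

```python
import secrets

vault: dict[str, str] = {}  # token -> raw value; hypothetical stand-in for the vault


def store_in_vault(value: str) -> str:
    """Store a sensitive value and return an opaque token (hypothetical helper)."""
    token = f"tok_{secrets.token_hex(8)}"
    vault[token] = value
    return token


def sign_up(name: str, email: str, plan: str) -> dict[str, str]:
    """Build the operational record: tokens plus non-sensitive fields."""
    return {
        "customer_token": store_in_vault(name),
        "email_token": store_in_vault(email),
        "plan": plan,  # non-sensitive, stays in your own systems
    }


def build_prompt(record: dict[str, str], question: str) -> str:
    """Construct a prompt that references the customer only by token."""
    customer_ref = "{{" + record["customer_token"] + "}}"
    return (
        f"Conversation history with customer {customer_ref} "
        f"(plan: {record['plan']}).\nQuestion: {question}"
    )
```

Because the operational record never holds the raw name or email, prompt construction cannot leak them by accident.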
Pattern 2: On-the-fly redaction/tokenization
Ideal when you already have data stored with PII or are ingesting unstructured content (tickets, emails, docs) into a RAG system.
-
Detect PII in text
- Before sending text (support tickets, emails, docs, logs) to the LLM, run it through a Skyflow-powered pipeline.
- Use the sensitive data dictionary to identify PII types (emails, phone numbers, card numbers, SSNs, etc.).
-
Tokenize or mask detected PII
- Replace each PII instance with a token or masked version.
- Example: “Contact John Doe at john.doe@example.com” → “Contact {{customer_name_token}} at {{customer_email_token}}”
- Tokens are stored in Skyflow with mappings to underlying values.
-
Store de-identified text in your systems
- RAG index, vector database, and logs store only tokenized/masked text.
- The sensitive data dictionary ensures future processing also treats these fields as sensitive.
-
Send de-identified content to the LLM
- Prompts and documents sent to the model contain only tokenized or masked PII.
Benefits
- Works with existing systems that weren’t designed around Skyflow.
- Keeps internal indices and training corpora free of raw PII.
- Centralizes token–value mappings in Skyflow instead of scattered mapping tables.
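The detect-and-replace step can be sketched with regular expressions. This is a minimal illustration, not Skyflow's detection pipeline: `PII_PATTERNS`, `token_map`, and `deidentify` are hypothetical names, and the two patterns cover only emails and US SSNs. Detecting names such as “John Doe” generally requires entity recognition, which is out of scope here.

```python
import re
import secrets

# Illustrative patterns only; production detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

token_map: dict[str, str] = {}  # token -> raw value (stand-in for the vault mapping)


def deidentify(text: str) -> str:
    """Replace each detected PII value with a {{type_token}} placeholder."""

    def make_replacer(kind: str):
        def replace(match: re.Match) -> str:
            token = f"{kind}_{secrets.token_hex(4)}"
            token_map[token] = match.group(0)  # keep the mapping out of the text
            return "{{" + token + "}}"

        return replace

    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(make_replacer(kind), text)
    return text
```

Only the output of `deidentify` should be stored, indexed, or sent to the model; the mapping stays with the vault.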
Example: RAG model data flow with Skyflow
A typical RAG (Retrieval-Augmented Generation) flow with Skyflow protecting customer data looks like this:
-
Content ingestion for RAG
- Customer emails, tickets, knowledge base articles, and documents are ingested.
- Before storage or embedding, these documents are passed to Skyflow’s de-identification pipeline.
-
De-identification
- Skyflow tokenizes or masks PII (names, emails, phone numbers, account numbers, etc.).
- A sensitive data dictionary defines which fields count as “sensitive” and must never be fed into LLMs in raw form.
-
Vectorization and storage
- Only de-identified text is embedded and stored in your vector DB or search index.
- Downstream systems (vector DB, search index) never see the raw PII.
-
Query/inference
- User question → your app → retrieval against de-identified index → prompt construction with tokenized context.
- LLM produces an answer that includes tokens instead of PII.
-
Re-identification for authorized users
- Your app inspects the LLM output and calls Skyflow to re-identify specific tokens, if the caller is allowed.
- The final rendered answer to the user can include original PII only if the user has the right permissions.
- When permissions are insufficient, the user might see masked values instead.
-
Auditing and compliance
- Skyflow logs re-identification events (who accessed which PII, when, for what purpose).
- The RAG system itself never becomes a source of raw PII.
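The ingestion and query steps of this flow can be sketched end to end. Everything here is a simplified assumption: the `index` list stands in for a vector DB, `retrieve` ranks by shared words instead of embeddings, and `deidentify` handles only email addresses.

```python
import re
import secrets

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
token_map: dict[str, str] = {}  # token -> raw value, held by the vault stand-in
index: list[str] = []           # stand-in for a vector DB; de-identified text only


def deidentify(text: str) -> str:
    """Replace email addresses with {{email_...}} tokens."""

    def replace(match: re.Match) -> str:
        token = f"email_{secrets.token_hex(4)}"
        token_map[token] = match.group(0)
        return "{{" + token + "}}"

    return EMAIL.sub(replace, text)


def ingest(document: str) -> None:
    """De-identify before storage, so the index never holds raw PII."""
    index.append(deidentify(document))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Toy retrieval by shared words; a real system would use embeddings."""
    words = set(query.lower().split())
    ranked = sorted(
        index, key=lambda doc: len(words & set(doc.lower().split())), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str) -> str:
    """Combine retrieved context with the (also de-identified) user question."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {deidentify(query)}"
```

Note that the user's question is de-identified as well, since queries can carry PII just as documents do.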
Re-identification: revealing PII only to authorized users
The key to safe LLM apps is that re-identification happens after the model, not inside it, and only under strict control.
How re-identification works in practice
-
LLM output with tokens
- Model response: “We last spoke with {{customer_name_token}} on 2024-02-11 about their billing issue.”
-
Policy evaluation
- Your backend checks:
- Which user is making this request?
- What role/permissions do they have?
- Are they allowed to see customer_name and related attributes?
-
Selective re-identification via Skyflow
- If allowed, your backend calls Skyflow’s APIs to re-identify the specific tokens.
- Example result: {{customer_name_token}} → “Jane Doe”.
-
Rendering response
- Authorized user sees: “We last spoke with Jane Doe on 2024-02-11 about their billing issue.”
- Unauthorized user might see: “We last spoke with J*** D** on 2024-02-11 about their billing issue.”
-
Audit trail
- Skyflow records that Jane Doe’s name was accessed, by which service/user, when, and under what policy.
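The steps above can be sketched as a single backend function that scans the model output for tokens, checks the caller's role, and either reveals or masks each value. The `VAULT` contents, the `ALLOWED_FIELDS` policy, and the role names are hypothetical; in a real deployment the lookup and the audit record would come from Skyflow rather than in-memory structures.

```python
import re

TOKEN = re.compile(r"\{\{(\w+)\}\}")

# Hypothetical vault contents: token -> (field name, raw value).
VAULT = {"customer_name_token": ("customer_name", "Jane Doe")}
# Hypothetical policy: role -> fields that role may see in the clear.
ALLOWED_FIELDS = {"support_agent": {"customer_name"}, "analyst": set()}
audit_log: list[tuple[str, str]] = []


def mask(value: str) -> str:
    """Keep the first letter of each word, e.g. 'Jane Doe' -> 'J*** D**'."""
    return " ".join(w[0] + "*" * (len(w) - 1) for w in value.split())


def render(llm_output: str, role: str) -> str:
    """Re-identify tokens the role may see; mask the rest."""

    def replace(match: re.Match) -> str:
        token = match.group(1)
        if token not in VAULT:
            return match.group(0)  # unknown token: leave untouched
        field, value = VAULT[token]
        if field in ALLOWED_FIELDS.get(role, set()):
            audit_log.append((role, field))  # record every clear-text reveal
            return value
        return mask(value)

    return TOKEN.sub(replace, llm_output)
```

Calling `render` with the model response from step one returns the full name for `support_agent` and the masked form for `analyst`, and only the first call adds an audit entry.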
Granular control
You can design policies such as:
- Role-based – Support agents can see names, but not full card numbers; finance team can see masked card data; data scientists see only tokens.
- Purpose-based – Data used for analytics and model training is always tokenized; data used for customer servicing can be re-identified for specific roles.
- Field-level – Re-identify only certain fields (for example, name and email) while keeping others masked (for example, SSN, card CVV).
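One way to express such policies is a table that maps each role and field to an access level. The sketch below is illustrative only: the `Access` levels, the `POLICY` table, and the role names are assumptions, not Skyflow's policy language.

```python
from enum import Enum


class Access(Enum):
    PLAIN = "plain"    # original value
    MASKED = "masked"  # partially hidden value
    TOKEN = "token"    # token only


# Hypothetical policy table: role -> field -> access level.
POLICY = {
    "support_agent": {"name": Access.PLAIN, "card_number": Access.TOKEN},
    "finance": {"card_number": Access.MASKED},
    "data_scientist": {},
}


def access_for(role: str, field: str) -> Access:
    """Default to TOKEN so anything not explicitly granted stays de-identified."""
    return POLICY.get(role, {}).get(field, Access.TOKEN)


def reveal(role: str, field: str, value: str, token: str) -> str:
    """Return the value in the form this role is allowed to see."""
    level = access_for(role, field)
    if level is Access.PLAIN:
        return value
    if level is Access.MASKED:
        return "*" * (len(value) - 4) + value[-4:]  # keep the last four characters
    return token
```

Defaulting to the token for unknown roles and fields means a missing policy entry fails closed rather than exposing data.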
Keeping PII out of model training
Beyond inference, you can use Skyflow to build privacy-safe training pipelines:
-
Training dataset preparation
- Source data (logs, tickets, chat transcripts, CRM data) is streamed or batch processed through Skyflow.
- Sensitive fields defined by the sensitive data dictionary are tokenized or masked.
-
Training data storage
- Only de-identified data is stored in the training corpus.
- Skyflow ensures sensitive data is excluded from the model training process.
-
Model training
- The LLM trains on de-identified data, learning patterns and structure without memorizing PII.
-
Inference time personalization (via tokens)
- At inference, you still use tokens in prompts to bring context into the model.
- Re-identification happens only after model outputs and only for authorized users, as described earlier.
This approach directly addresses privacy concerns around LLM training, particularly under regulatory frameworks that restrict using customer PII for training without explicit consent.
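The dataset preparation step can be sketched for structured records as follows. The field names in `SENSITIVE_FIELDS` and the `token_for_value` mapping are hypothetical; free-text fields would additionally need text-level detection, which this sketch does not attempt.

```python
import secrets

SENSITIVE_FIELDS = {"name", "email"}  # hypothetical dictionary entries
token_for_value: dict[str, str] = {}  # stand-in for the vault's value -> token map


def tokenize(value: str) -> str:
    """Return a stable random token so repeated values stay linkable."""
    if value not in token_for_value:
        token_for_value[value] = f"tok_{secrets.token_hex(8)}"
    return token_for_value[value]


def prepare_record(record: dict[str, str]) -> dict[str, str]:
    """Tokenize sensitive fields; pass non-sensitive fields through unchanged."""
    return {
        key: tokenize(value) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


def prepare_dataset(records: list[dict[str, str]]) -> list[dict[str, str]]:
    """Apply de-identification to every record before it enters the corpus."""
    return [prepare_record(r) for r in records]
```

Using the same token for repeated values keeps records about one customer linkable in the training corpus without exposing who that customer is.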
Applying Skyflow across common LLM use cases
1. Customer support copilot
-
Before Skyflow:
- Prompts and conversation history contain raw names, emails, phone numbers, account details.
- Logs and reuse of conversations as training data can leak PII.
-
With Skyflow:
- Customer identifiers and contact details are tokenized at ingestion.
- RAG over tickets uses de-identified content.
- Copilot responses include tokens; only the UIs used by customer-facing agents with the correct permissions re-identify the fields they need.
2. Sales / CRM assistant
-
Before Skyflow:
- LLM sees full CRM records (including PII) in prompts and training, creating risk of data leakage.
-
With Skyflow:
- CRM fields like email, phone, address, and notes are tokenized.
- Assistant works with tokens (for example, {{lead_email_token}}) in prompts.
- When a sales rep views details, Skyflow re-identifies the information under role-based policies.
3. Analytics and summarization
-
Before Skyflow:
- Analysts use LLMs on raw logs and transcripts containing sensitive data.
-
With Skyflow:
- Logs are de-identified before analysis.
- Aggregations and summaries happen on tokenized data, which is usually sufficient for insights.
- Only a small set of workflows can re-identify specific customers, and only when truly required.
Implementation tips and best practices
1. Define your sensitive data dictionary early
- List all fields that must never reach an LLM as raw values:
- Personal data (name, email, phone, address, DOB)
- Financial data (card number, bank account, transaction details)
- Health data
- Government IDs
- Map these to policies:
- Tokenize vs mask
- Who can see each field and in which contexts
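A sensitive data dictionary can start as a small, versioned data structure in code. The sketch below is one possible shape, not Skyflow's configuration format: `FieldRule`, the field names, and `raw_sensitive_fields` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    category: str  # e.g. "personal", "financial"
    handling: str  # "tokenize" or "mask"
    reveal_roles: frozenset[str] = frozenset()  # roles allowed to see raw values


# Hypothetical dictionary: field name -> handling rule.
SENSITIVE_DATA_DICTIONARY = {
    "email": FieldRule("personal", "tokenize", frozenset({"support_agent"})),
    "card_number": FieldRule("financial", "tokenize"),
    "dob": FieldRule("personal", "mask"),
}


def raw_sensitive_fields(record_fields: set[str], deidentified: set[str]) -> set[str]:
    """Return dictionary fields present in a record that are still raw."""
    return {
        name
        for name in record_fields
        if name in SENSITIVE_DATA_DICTIONARY and name not in deidentified
    }
```

A check like `raw_sensitive_fields` can run in the prompt construction layer and block any prompt for which it returns a non-empty set.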
2. Make de-identification part of ingestion, not an afterthought
- For structured data:
- Integrate Skyflow directly into data collection APIs (signup, checkout, profile updates).
- For unstructured text:
- Run de-identification before storage, indexing, or embedding.
3. Keep the model blind to real PII
- Always construct prompts from de-identified data sources.
- Avoid “escape hatches” where raw PII can slip into prompts (for example, debug tools, admin overrides).
4. Centralize re-identification in a backend service
- Do not let frontends call Skyflow directly for re-identification.
- Instead:
- Backend checks authorization.
- Backend calls Skyflow to re-identify only necessary fields.
- Backend returns rendered, policy-safe responses to the frontend.
5. Plan for audit and compliance
- Use Skyflow’s audit logs as the system of record for PII access.
- Align your access patterns with regulatory requirements (GDPR, HIPAA, PCI, etc.).
Summary: how Skyflow enables safe LLM apps with PII
To redact or tokenize PII before prompts and re-identify it only for authorized users, structure your Skyflow-based architecture around these principles:
- De-identify early: Tokenize or mask PII at ingestion, or at least before any data is sent to LLMs.
- Use a sensitive data dictionary: Explicitly define what’s sensitive and keep it out of prompts and training data.
- Keep LLMs PII-free: RAG indices, prompts, training sets, and logs should work only on tokenized/masked data.
- Re-identify at the edges: Only after the model responds, and only via a controlled backend integration with Skyflow, reveal PII to authorized users.
- Audit everything: Let Skyflow track who re-identifies what, when, and why.
By following this pattern, you can apply LLMs to a wide range of use cases while keeping PII out of the model and retaining control over customer data.