
Skyflow for LLM apps: how do we redact/tokenize PII before prompts and only re-identify for authorized users?
Most LLM apps are built on top of highly sensitive customer data, making it critical to keep PII out of prompts and model training data—while still delivering personalized, context-rich experiences. Skyflow is designed to solve exactly this problem: it lets you redact or tokenize PII before prompts ever reach an LLM, and then securely re-identify that data only for authorized users and systems.
This guide explains how Skyflow works with LLM applications to tokenize and mask sensitive data, how re-identification works at runtime, and what a typical end-to-end architecture looks like.
Why LLM apps need PII redaction and tokenization
When you send raw customer data directly into an LLM, you create multiple risks:
- Privacy violations: PII and other sensitive data can be exposed in prompts, retrieved context, or model outputs.
- Compliance issues: Regulations such as GDPR, CCPA, HIPAA, and PCI-DSS restrict how you store, process, and share PII.
- Model leakage: Data sent to generally available LLMs may be used for training or may surface in responses to other users.
- RAG-specific risks: Retrieval-augmented generation (RAG) pipelines typically ingest internal documents and knowledge bases that are rich with PII.
Skyflow mitigates these risks by de-identifying sensitive data before it reaches the model and re-identifying it only afterward, under strict policy controls.
Key concepts: tokenization, masking, and sensitive data dictionaries
Before we look at the flow for LLM apps, it’s important to understand a few Skyflow building blocks.
Tokenization
Skyflow tokenization replaces sensitive data (like names, card numbers, SSNs, phone numbers, emails) with non-sensitive tokens. These tokens:
- Look like random strings or structured surrogates (e.g., a tokenized card number)
- Are useless outside Skyflow (they cannot be reverse-engineered)
- Can be mapped back to the original data by Skyflow only under strict authorization
For LLM apps, tokens can safely be:
- Stored in your app database
- Indexed in a vector database
- Sent in prompts to an LLM
- Included in RAG context
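To make these token properties concrete, here is a toy vault sketch. It is purely illustrative: the class and method names are invented and do not reflect Skyflow's actual SDK, but it shows why tokens are safe to store, index, and send in prompts.

```python
import secrets

class ToyVault:
    """Illustrative stand-in for a tokenization vault (not the Skyflow API)."""

    def __init__(self):
        # token -> original value; in Skyflow this mapping never leaves the vault
        self._store = {}

    def tokenize(self, value: str, prefix: str = "TOK") -> str:
        # Tokens are random surrogates: nothing about the value can be
        # derived from the token itself.
        token = f"{prefix}_{secrets.token_hex(8)}"
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # In Skyflow, this call is gated by access policies;
        # here it is unconditional for the sake of the sketch.
        return self._store[token]

vault = ToyVault()
token = vault.tokenize("jane.doe@example.com", prefix="EMAIL")
# The token is safe to store, index, or send in a prompt;
# only the vault can map it back to the original value.
```

The key point the sketch captures: the mapping lives only inside the vault, so any system holding just the token holds nothing sensitive.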
Masking
Masking allows partial redaction of sensitive data while keeping some usefulness. Examples:
- Email: jane.doe@example.com → j***@example.com
- Card: 4111111111111111 → **** **** **** 1111
Skyflow can mask data dynamically at retrieval time, based on policies—useful for showing some context to support agents or users without exposing full PII.
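The two masking examples above can be expressed as small helper functions. This is a sketch of the transformation only; in practice Skyflow applies masking dynamically by policy at retrieval time rather than in application code.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, _, domain = email.partition("@")
    return f"{local[0]}***@{domain}"

def mask_card(pan: str) -> str:
    """Reveal only the last four digits, grouped like a card number."""
    return "**** **** **** " + pan[-4:]

print(mask_email("jane.doe@example.com"))  # prints "j***@example.com"
print(mask_card("4111111111111111"))       # prints "**** **** **** 1111"
```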
Sensitive data dictionary
Skyflow provides a sensitive data dictionary that lets you define which fields and patterns count as “sensitive” and should never be fed into LLMs in raw form. For example:
- PII: names, emails, phone numbers, addresses, government IDs
- Financial: card numbers, bank accounts, routing numbers
- Health or other regulated data
This dictionary becomes a central reference for:
- Ingestion and storage policies
- Data transformation (tokenization/masking)
- Enforcement in RAG pipelines and prompt construction
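As a rough illustration, a sensitive data dictionary can be thought of as a mapping from data classes to detection rules. The sketch below uses simple regexes; real deployments rely on Skyflow's dictionary plus richer detection (field metadata, NER), not regex alone, and all names here are assumptions.

```python
import re

# Illustrative sensitive data dictionary: class name -> detection pattern.
SENSITIVE_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD":  re.compile(r"\b\d{13,16}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify(text: str) -> list[tuple[str, str]]:
    """Return (class, match) pairs for every sensitive value found."""
    hits = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        hits.extend((label, m) for m in pattern.findall(text))
    return hits
```

Centralizing rules like these is what lets ingestion, transformation, and RAG enforcement all agree on what counts as sensitive.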
How Skyflow keeps PII out of LLM prompts
Skyflow’s core role in an LLM architecture is to sit in front of the model and de-identify sensitive data before it ever touches the LLM. You can think of Skyflow as a “privacy firewall” for your AI stack.
1. Ingest and de-identify customer data
First, you route sensitive data into Skyflow rather than storing it in your own systems in plaintext.
- Data flows into the Skyflow Vault:
  - Application servers send customer data (signups, transactions, tickets, logs, documents) to Skyflow.
  - Skyflow identifies sensitive fields using the sensitive data dictionary.
  - Sensitive fields are tokenized or masked on write.
- Your systems store tokens, not raw PII:
  - Applications, data warehouses, and search/vector indices store the tokenized values.
  - Logs, analytics, and RAG knowledge bases reference tokens instead of real identifiers.
This ensures that:
- Any data you later use for RAG embeddings or prompt construction is already de-identified.
- Model training datasets can be built without including raw PII.
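A minimal sketch of tokenize-on-write ingestion, assuming a field list driven by the data dictionary. The vault here is a plain dict standing in for Skyflow, and all names are illustrative.

```python
import secrets

# Which fields are sensitive would come from the data dictionary.
SENSITIVE_FIELDS = {"name", "email", "card_number"}

def tokenize_record(record: dict, vault_store: dict) -> dict:
    """Tokenize sensitive fields on write; non-sensitive fields pass through.

    vault_store stands in for the vault; in production only Skyflow
    holds the token -> value mapping.
    """
    safe = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            token = f"{field.upper()}_TOK_{secrets.token_hex(6)}"
            vault_store[token] = value
            safe[field] = token
        else:
            safe[field] = value
    return safe

vault = {}
row = tokenize_record(
    {"name": "Jane Doe", "email": "jane@example.com", "order_total": 42.50},
    vault,
)
# row now contains tokens for name/email; order_total is stored as-is.
```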
2. RAG pipelines use tokenized context
For RAG (retrieval-augmented generation), Skyflow protects data throughout the lifecycle:
- Indexing phase:
  - Documents (e.g., tickets, emails, CRM notes) are passed through a Skyflow-aware pipeline.
  - Sensitive terms are replaced with tokens before embedding and indexing.
  - The vector database never sees raw PII.
- Retrieval phase:
  - At query time, the user’s request might contain PII (e.g., “Show my last 5 orders”).
  - The app calls Skyflow to tokenize or map identifiers in the query (e.g., email → token).
  - Retrieval uses tokens and tokenized context only.
  - The LLM receives context that contains tokens, not real PII.
This lets you leverage the full potential of LLMs for RAG while ensuring that sensitive data is de-identified end to end.
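One practical detail in token-based retrieval: the same value must map to the same token at index time and query time, or retrieval will not match. The sketch below uses a hash to get that consistency purely for illustration; a real vault issues consistent tokens that are *not* derivable from the value the way a hash is.

```python
import hashlib

def consistent_token(value: str, label: str) -> str:
    """Deterministic surrogate so the same value always yields the same token.

    Hash-based only for this sketch: a real vault's consistent tokens
    cannot be recomputed from the value by an outsider.
    """
    digest = hashlib.sha256(value.lower().encode()).hexdigest()[:10]
    return f"{label}_{digest}"

def redact_for_index(text: str, known_pii: dict[str, str]) -> str:
    """Replace known PII strings with tokens before embedding/indexing."""
    for value, label in known_pii.items():
        text = text.replace(value, consistent_token(value, label))
    return text

pii = {"jane@example.com": "EMAIL"}
doc = redact_for_index("Ticket from jane@example.com about a late order.", pii)
query = redact_for_index("Find tickets from jane@example.com", pii)
# doc and query now carry the same EMAIL_<hash> token, so retrieval lines up.
```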
How to re-identify PII only for authorized users
Redacting and tokenizing PII is only half of the story; you still need to show real data to the right people and systems. Skyflow enables controlled re-identification based on granular access policies.
1. Authorization and policy enforcement
Skyflow enforces who can see what using:
- Role-based access control (RBAC) and attribute-based policies
- Field-level and record-level controls
- Context-aware policies (user role, service type, action, environment, etc.)
For example, policies can express rules like:
- “LLM service accounts can read only tokens, never raw PII.”
- “Support agents can see masked emails but never full card numbers.”
- “Billing microservice can re-identify card numbers for authorized transactions only.”
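Rules like these can be pictured as a field-level policy table with deny-by-default resolution. This is a simplified sketch; Skyflow's actual policy language (roles, attributes, context) is richer than a static table, and the role and field names are invented.

```python
# Illustrative field-level policy table: role -> field -> allowed treatment.
POLICIES = {
    "llm_service":   {"email": "token",  "card_number": "token"},
    "support_agent": {"email": "masked", "card_number": "deny"},
    "billing":       {"email": "masked", "card_number": "plain"},
}

def treatment_for(role: str, field: str) -> str:
    """Resolve what form of a field a role may see; deny by default."""
    return POLICIES.get(role, {}).get(field, "deny")

print(treatment_for("support_agent", "email"))        # prints "masked"
print(treatment_for("llm_service", "card_number"))    # prints "token"
print(treatment_for("unknown_role", "email"))         # prints "deny"
```

Deny-by-default matters: a role or field missing from the table should fail closed, not open.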
2. Re-identification workflow
When you need to show original PII to an authorized user or system, the flow typically looks like this:
- LLM generates a response based on tokenized data:
  - The model works purely with tokens or partially masked data.
  - The response includes placeholders like {{CUSTOMER_TOKEN_123}}.
- Application determines what to re-identify:
  - Before rendering the response, your app inspects the tokens/placeholders.
  - It checks the current user’s identity and permissions.
- App calls Skyflow to de-tokenize specific fields:
  - The app sends the tokens and a re-identification request to Skyflow.
  - Skyflow evaluates policies: is this user or service allowed to see this field?
  - If allowed, Skyflow returns the real value or a masked value (depending on policy).
  - If not allowed, Skyflow can:
    - Refuse the de-identification request, or
    - Return a masked/redacted version instead.
- Response rendered to the end user:
  - The final UI or API response replaces tokens with the allowed values.
  - Unauthorized viewers never see raw PII—even if the LLM response included tokens.
This lets you keep raw PII inside Skyflow while still delivering personalized experiences downstream.
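The workflow above can be sketched as a post-processing pass that swaps tokens for whatever the viewer's policy allows. The token format, in-memory vault, and policy callback are all illustrative assumptions, not Skyflow's API.

```python
import re

# Assumed token shape for the sketch, e.g. CUST_TOKEN_abc123.
TOKEN_RE = re.compile(r"\b([A-Z]+_TOKEN_[A-Za-z0-9]+)\b")

def reidentify(llm_output: str, role: str, vault: dict, policy) -> str:
    """Replace tokens in an LLM response with what the viewer may see."""
    def swap(match):
        token = match.group(1)
        decision = policy(role, token)   # "plain", "masked", or anything else
        if decision == "plain":
            return vault.get(token, token)
        if decision == "masked":
            value = vault.get(token, token)
            return value[0] + "***"      # crude mask for the sketch
        return "[redacted]"              # fail closed
    return TOKEN_RE.sub(swap, llm_output)

vault = {"CUST_TOKEN_abc123": "Jane"}
policy = lambda role, token: "plain" if role == "customer" else "masked"
print(reidentify("Hello CUST_TOKEN_abc123!", "customer", vault, policy))
# prints "Hello Jane!"
```

Note that the model's output is never trusted with the decision: the policy check happens after generation, at the application edge.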
Example: End-to-end LLM app flow with Skyflow
Below is an example flow for a typical LLM-powered support assistant.
Step 1: Data collection and storage
- User signs up and creates orders.
- Your backend calls Skyflow to store:
- Name, email, phone → tokenized
- Card details → tokenized
- Address → tokenized
- Your internal databases and data lakes store only:
- Non-sensitive fields (e.g., order items, timestamps)
- Skyflow tokens instead of raw PII
Step 2: RAG indexing
- Support tickets and CRM notes reference customers using tokens (customer_token, not email).
- Documents are processed through a pipeline that:
  - Replaces any detected PII with tokens via Skyflow.
  - Sends tokenized content to the vector database for embedding and indexing.
Step 3: User asks a question
A logged-in user opens the support assistant and asks:
“What’s the status of my last order, and which card did I pay with?”
- The app identifies the user (e.g., via auth token → user ID → Skyflow tokens).
- The query is translated into a form that uses tokens (no emails or card numbers).
- The retrieval layer fetches context using those tokens from:
- Relational DBs (token-based joins)
- Vector DB (tokenized documents)
Step 4: Prompt construction (with PII redacted)
The prompt to the LLM looks like:
“User with customer_id CUST_TOKEN_abc123 had an order ORDER_TOKEN_987 paid with card CARD_TOKEN_xyz. Provide a friendly summary of the latest order status and payment method without revealing full card numbers.”
Notice:
- The LLM never sees the real name, email, or full card number.
- Only tokens and non-sensitive attributes are visible.
Step 5: LLM response and post-processing
The LLM responds:
“Customer CUST_TOKEN_abc123, your latest order ORDER_TOKEN_987 was shipped yesterday. It was paid using card CARD_TOKEN_xyz ending in ****.”
Your app then:
- Parses the response to find tokens: CUST_TOKEN_abc123, ORDER_TOKEN_987, CARD_TOKEN_xyz.
- Checks user authorization (this is the customer, so they can see their own data).
- Calls Skyflow:
- Get masked name (depending on UI design)
- Get masked or partial card details (e.g., last 4 digits)
- Replaces tokens with allowed values:
“Jane, your latest order #12345 was shipped yesterday. It was paid using your Visa card ending in 1111.”
If a less-privileged user (e.g., a junior agent) asked on behalf of the customer, policies could enforce:
- Show masked name
- Show masked card or no card details at all
Using Skyflow for model training safety
Beyond inference-time privacy, Skyflow also helps ensure privacy-safe model training:
- Datasets used for fine-tuning or pre-training are built from de-identified data.
- Sensitive fields are excluded or tokenized before they enter training pipelines.
- Skyflow’s sensitive data dictionary helps systematically remove or transform sensitive columns and values.
This prevents raw PII from being baked into your own custom LLMs while still letting you leverage behavioral or aggregated patterns.
Implementation patterns for LLM apps
Here are common patterns teams use when integrating Skyflow into LLM workflows.
1. Pre-prompt redaction middleware
Introduce a middleware layer between your app and the LLM that:
- Scans outbound prompts for PII markers or known patterns.
- Replaces them with tokens via Skyflow before forwarding to the LLM.
- Logs only tokenized content for observability.
This is particularly useful if you have existing systems that occasionally still generate raw PII in prompts.
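A minimal sketch of such middleware, assuming regex-based detection and a tokenize callback that fronts the vault. Both the patterns and the callback signature are illustrative assumptions, not Skyflow's API.

```python
import re

# Patterns for PII that must never reach the LLM (illustrative, not exhaustive).
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("CARD",  re.compile(r"\b\d{13,16}\b")),
]

def redact_prompt(prompt: str, tokenize) -> str:
    """Middleware pass: swap detected PII for vault tokens before the LLM call."""
    for label, pattern in PII_PATTERNS:
        prompt = pattern.sub(lambda m: tokenize(m.group(0), label), prompt)
    return prompt

# Hypothetical tokenize callback standing in for a vault call:
fake = lambda value, label: f"{label}_TOKEN_1"
print(redact_prompt("Contact jane@example.com re card 4111111111111111", fake))
# prints "Contact EMAIL_TOKEN_1 re card CARD_TOKEN_1"
```

Because the pass sits in middleware, it catches PII that upstream code emits by accident, and observability logs see only the redacted prompt.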
2. Skyflow-aware RAG pipeline
Build your RAG architecture around tokens:
- All documents are normalized through Skyflow before being stored or embedded.
- Retrieval is always token-based.
- LLM is never given raw PII in context, only tokens and non-sensitive attributes.
3. Post-response re-identification service
Add a dedicated microservice for re-identifying tokens in LLM outputs:
- It receives the LLM output plus the requesting user or service identity.
- It calls Skyflow to de-tokenize or mask fields according to policy.
- It returns a sanitized, personalized final response to the client.
This cleanly separates model reasoning from PII access and makes audits easier.
Benefits of using Skyflow with LLM apps
By integrating Skyflow as the data privacy layer for your LLM applications, you get:
- Privacy by design: PII never reaches the LLM in raw form; only de-identified data is used for inference and training.
- Fine-grained control: Re-identification is tightly controlled, logged, and policy-driven.
- Compliance alignment: Strong support for GDPR, CCPA, HIPAA, PCI-DSS, and other data protection regulations.
- Reduced blast radius: Vector databases, logs, analytics systems, and LLM providers only see tokens.
- Better developer experience: Centralized sensitive data dictionary and consistent tokenization/masking behaviors across services.
Getting started
To implement PII redaction/tokenization and controlled re-identification for your LLM apps with Skyflow, you’ll typically:
- Define your sensitive data dictionary and classification rules.
- Integrate Skyflow into data ingestion so sensitive fields are tokenized at the source.
- Update your RAG pipeline and data stores to rely on tokens instead of raw PII.
- Add pre-prompt middleware to ensure nothing sensitive slips into LLM requests.
- Implement a re-identification layer that calls Skyflow after LLM responses, subject to policies.
- Continuously review access policies, logs, and audits to refine your controls.
With this architecture, you can safely unlock the power of LLMs—RAG, reasoning, and custom training—while keeping PII protected, de-identified, and only re-identified for authorized users at the very edge of your application.