Preventing PII leakage when using LLMs: An introduction to Microsoft’s Presidio

Eduardo Blancas

Jan 23, 2025 - 9 Min read

LLMs have demonstrated impressive capabilities to increase enterprise productivity: from writing emails to automating workflows, LLMs are transforming every aspect of business operations.

However, LLM APIs (such as OpenAI’s or Anthropic’s) come with significant risks: employees might inadvertently send sensitive information, such as personally identifiable information (PII), to these models, which is a serious security concern. Keeping data safe is essential for complying with data protection laws (such as GDPR) and for overall business security. Depending on the product and your account settings, providers (OpenAI included) may also reserve the right to use your data to train their models, which further amplifies these risks.

In this blog post, we’ll show you how to use Presidio, an open-source Python framework from Microsoft that detects and anonymizes sensitive data. You can use Presidio as a safety measure to prevent your company’s data from leaving your control.

We’ll start by covering the basics of Presidio and then demonstrate how to integrate it with OpenAI’s API.

Keep in mind that the code in this blog post is a proof-of-concept. We offer enterprise-grade solutions for keeping your sensitive data secure when using LLMs. If you’re interested in learning more, contact us.

Installation

Run the following commands to get Presidio working locally:

# install the two required packages
pip install presidio-analyzer presidio-anonymizer

# download the spacy model, presidio uses it internally
python -m spacy download en_core_web_lg
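
To confirm the installation before moving on, a quick check like the one below can help. It is a minimal sketch: it only looks up the three importable names (presidio_analyzer, presidio_anonymizer, and en_core_web_lg, which is the name the spaCy model is installed under) and does not load any of them.

```python
import importlib.util

# importable names for the two Presidio packages and the spaCy model
required = ["presidio_analyzer", "presidio_anonymizer", "en_core_web_lg"]

# find_spec returns None when a top-level package cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

If any entry is reported as missing, re-run the corresponding command above in the same Python environment.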

Basic usage

Presidio has two main components: the analyzer and the anonymizer. The analyzer identifies the position and type of PII data in a string, while the anonymizer replaces that data with non-identifiable information. Let’s look at a basic example:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My email is person@corporation.com and phone number is 212-555-5555"

analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text=text,
                           entities=["PHONE_NUMBER"],
                           language='en')

print(f"Found {len(results)} element")
print(results)

Console output (1/1):

Found 1 element
[type: PHONE_NUMBER, start: 55, end: 67, score: 0.75]

The analyzer detected one PHONE_NUMBER entity starting at character 55 and ending at character 67, with a confidence score of 0.75. Higher scores indicate greater certainty that the matched text is PII.
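
To see what these offsets refer to, we can slice the original string with them. The sketch below hard-codes the start and end values copied from the output above instead of calling Presidio, so it runs on its own:

```python
text = "My email is person@corporation.com and phone number is 212-555-5555"

# offsets copied from the analyzer output above
start, end = 55, 67

# start is inclusive and end is exclusive, like a regular Python slice
matched = text[start:end]
print(matched)
```

This prints 212-555-5555, which is exactly the span the anonymizer replaces later on.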

Now let’s use the original text and analyzer results to create an anonymized version:

anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text.text)

Console output (1/1):

My email is person@corporation.com and phone number is <PHONE_NUMBER>

Note that the PHONE_NUMBER has been redacted. The email address remains unmodified because we only passed entities=["PHONE_NUMBER"] to the analyze method. To get a list of the supported entities, we can do:

analyzer.get_supported_entities()

Console output (1/1):

['PERSON',
 'UK_NINO',
 'MEDICAL_LICENSE',
 'IN_VEHICLE_REGISTRATION',
 'CREDIT_CARD',
 'IN_AADHAAR',
 'IBAN_CODE',
 'LOCATION',
 'IN_PAN',
 'PHONE_NUMBER',
 'US_PASSPORT',
 'DATE_TIME',
 'US_SSN',
 'IN_PASSPORT',
 'EMAIL_ADDRESS',
 'AU_ABN',
 'SG_NRIC_FIN',
 'AU_ACN',
 'US_BANK_NUMBER',
 'CRYPTO',
 'UK_NHS',
 'AU_TFN',
 'IP_ADDRESS',
 'URL',
 'IN_VOTER',
 'NRP',
 'US_DRIVER_LICENSE',
 'AU_MEDICARE',
 'US_ITIN']

Alternatively, we can pass entities=None to detect all supported entities. Let’s try that:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My email is person@corporation.com and phone number is 212-555-5555"

analyzer = AnalyzerEngine()

results = analyzer.analyze(text=text,
                           entities=None,
                           language='en')

anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text.text)

Console output (1/1):

My email is <EMAIL_ADDRESS> and phone number is <PHONE_NUMBER>

You’ll see that both the email and phone number are redacted.

Extending the analyzer with regular expressions

Presidio comes with powerful built-in recognizers for PII. However, they are unlikely to cover all our needs. Let’s assume we work at an e-commerce company and want to prevent customer IDs from being leaked. For the sake of the example, let’s say our customer IDs have a CUST prefix followed by 10 digits. We can create a custom recognizer like this:

from presidio_analyzer import Pattern, PatternRecognizer
from presidio_analyzer import AnalyzerEngine

# create a regex-based recognizer
customer_id_pattern = Pattern(name="customer_id_pattern", regex="CUST\\d{10}", score=0.5)
customer_id_recognizer = PatternRecognizer(
    supported_entity="CUSTOMER_ID", patterns=[customer_id_pattern]
)


# create an analyzer and add the customer ID recognizer
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(customer_id_recognizer)

# sample text
text = "Generate an email for the customer with ID CUST0123456789"


# analyze text and anonymize
results = analyzer.analyze(text=text,
                           entities=None,
                           language='en')

anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text.text)

Console output (1/1):

Generate an email for the customer with ID <CUSTOMER_ID>
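
Before registering a pattern, it can help to try it against a few strings. The sketch below uses Python’s built-in re module, which may differ in details (such as flags) from how Presidio compiles the pattern. The stricter variant with word boundaries is our own suggestion, not part of the example above; it avoids matching the first 10 digits of a longer number.

```python
import re

# the pattern from the recognizer above, and a stricter variant (our assumption)
loose = re.compile(r"CUST\d{10}")
strict = re.compile(r"\bCUST\d{10}\b")

samples = [
    "Generate an email for the customer with ID CUST0123456789",
    "Ticket CUST01234567890 has eleven digits",
    "No identifier here",
]

for sample in samples:
    loose_hits = [m.group() for m in loose.finditer(sample)]
    strict_hits = [m.group() for m in strict.finditer(sample)]
    print(f"{sample!r}: loose={loose_hits} strict={strict_hits}")
```

The second sample shows the difference: the loose pattern matches CUST0123456789 inside the 11-digit value, while the strict one matches nothing.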

Rule-based recognizers

In many cases, it might be difficult to identify PII with a regular expression. For these cases, Presidio also allows us to define a recognizer with code. For example, let’s say we want to prevent leaking internal URLs under https://internal.corporation.com. We can write a custom recognizer that redacts such URLs by subclassing EntityRecognizer:

from typing import List
from presidio_analyzer import EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts
from presidio_analyzer import AnalyzerEngine

class CorporateURLRecognizer(EntityRecognizer):
    expected_confidence_level = 0.7
    corporate_domain = "https://internal.corporation.com"


    # Presidio requires us to implement a loading method, in this case, we don't
    # need to load anything so we leave it empty
    def load(self) -> None:
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        results = []

        # presidio passes spaCy tokens in (nlp_artifacts.tokens)
        for token in nlp_artifacts.tokens:
            # we check if the corporate domain is contained
            if token.like_url and self.corporate_domain in token.text:
                # and return a result
                result = RecognizerResult(
                    entity_type="CORPORATE_URL",
                    start=token.idx,
                    end=token.idx + len(token),
                    score=self.expected_confidence_level,
                )
                results.append(result)
        return results

Let’s now use our CorporateURLRecognizer:

text = "Write an email to my boss: the URL https://internal.corporation.com/clients/1234 is not responding"

corp_url_recognizer = CorporateURLRecognizer(supported_entities=["CORPORATE_URL"])
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(corp_url_recognizer)
results = analyzer.analyze(text=text, language="en", entities=["CORPORATE_URL"])

anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text.text)

Console output (1/1):

Write an email to my boss: the URL <CORPORATE_URL> is not responding
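
The substring check in the recognizer is deliberately simple. It would miss the same host reached over http://, and it would also match a lookalike host such as internal.corporation.com.evil.example. A slightly stricter check could compare the parsed hostname instead. The sketch below uses only the standard library; is_corporate_url and CORPORATE_HOST are hypothetical names introduced for illustration.

```python
from urllib.parse import urlparse

# hypothetical constant: the host we consider internal
CORPORATE_HOST = "internal.corporation.com"


def is_corporate_url(candidate: str) -> bool:
    """Return True when the URL's host is exactly the corporate host."""
    host = urlparse(candidate).hostname
    return host == CORPORATE_HOST


urls = [
    "https://internal.corporation.com/clients/1234",
    "http://internal.corporation.com/clients/1234",
    "https://internal.corporation.com.evil.example/login",
    "https://www.example.com",
]

for url in urls:
    print(url, is_corporate_url(url))
```

Inside the recognizer, a function like this could replace the `self.corporate_domain in token.text` condition, while keeping the `token.like_url` check.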

Leveraging surrounding words to detect PII data

In the regular expressions example, we assumed that we could detect customer IDs with the format CUST followed by 10 digits. Let’s work on a more challenging example to demonstrate Presidio’s advanced recognition capabilities.

Let’s assume that customer IDs are 10-digit numbers. We could write a rule to detect any 10-digit number, but it would produce false positives: the number might just be a dollar amount or a non-sensitive ID. Intuitively, a customer ID is likely to appear near words like “customer” or “customer ID”. We call these surrounding words context. Here’s an example where the number is likely a customer ID:

Let’s send an email to customer 1234567890 since it’s their birthday!

And here’s an example where a 10-digit number is probably not a customer ID:

The company made $1234567890 in revenue last quarter.

We can leverage the context (surrounding words) to decide whether something matches or not. Let’s see a code example:

from presidio_analyzer import Pattern, PatternRecognizer
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer


# create the basic regex (any 10-digit number) recognizer and assign a low score
customer_id_pattern = Pattern(name="customer_id_pattern", regex="\\d{10}", score=0.01)

# create a pattern recognizer and pass the context
customer_id_recognizer_w_context = PatternRecognizer(
    supported_entity="CUSTOMER_ID",
    patterns=[customer_id_pattern],
    context=["customer", "customer id", "customer with id"],
)

# create an analyzer, add the customer ID recognizer, and the context enhancer
registry = RecognizerRegistry()
registry.add_recognizer(customer_id_recognizer_w_context)

context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45,
    min_score_with_context_similarity=0.4,
)

analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=context_aware_enhancer,
    # set a threshold so we only detect customer IDs when we're more confident
    default_score_threshold=0.4
)

# sample text
text = "0123456789 Generate an email for the customer with ID 0123456789"

# analyze text and anonymize
results = analyzer.analyze(text=text,
                           entities=["CUSTOMER_ID"],
                           language='en')

anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text.text)

Console output (1/1):

0123456789 Generate an email for the customer with ID <CUSTOMER_ID>

We can see that the first 10-digit number (0123456789) wasn’t redacted: no context word appears close enough to it, so its score stays at 0.01, well below the 0.4 threshold. The second occurrence was redacted because it appears right after the phrase “customer with ID”, and the context match raises its score above the threshold. This demonstrates how context-aware recognition helps avoid false positives while still catching sensitive information when it appears in the expected context.
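
To make the numbers in this example concrete, here is a simplified model of the scoring. It assumes the enhancer adds context_similarity_factor to the base score when a context word is found, raises the result to at least min_score_with_context_similarity, and caps it at 1.0. The boosted_score function is hypothetical and only mirrors that arithmetic; it does not call Presidio.

```python
def boosted_score(
    base: float,
    has_context: bool,
    factor: float = 0.45,  # context_similarity_factor from the example
    floor: float = 0.4,    # min_score_with_context_similarity from the example
) -> float:
    """Simplified model (an assumption) of context-based score boosting."""
    if not has_context:
        return base
    return min(max(base + factor, floor), 1.0)


threshold = 0.4  # default_score_threshold from the example

first = boosted_score(0.01, has_context=False)
second = boosted_score(0.01, has_context=True)

print(f"no context:   {first:.2f} -> kept: {first >= threshold}")
print(f"with context: {second:.2f} -> kept: {second >= threshold}")
```

Under this model, the number without context stays at 0.01 and is dropped, while the one with context reaches 0.46 and passes the threshold.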

Customizing Anonymization

So far, we’ve used the default anonymization configuration, which redacts identified PII by replacing it with the entity name (e.g., phone numbers become <PHONE_NUMBER>). However, we can customize how each entity is anonymized. To demonstrate this customization, let’s install the faker library, which generates realistic fake data.

pip install faker

Now, let’s write an example where we’ll anonymize data in different ways: we’ll redact phone numbers by replacing them with ************ and we’ll replace email addresses with fake ones (generated by Faker):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from faker import Faker

fake = Faker()


text = "My email is person@corporation.com and phone number is 212-555-5555"
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text,
                           entities=["PHONE_NUMBER", "EMAIL_ADDRESS"],
                           language='en')

# configure the operators: mask phone numbers and generate fake email addresses
operators = {
    "PHONE_NUMBER": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "*",
            "chars_to_mask": 12,
            "from_end": True,
        },
    ),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.email()})
}

anonymizer = AnonymizerEngine()
anonymized_results = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators=operators
)

print(anonymized_results.text)

Console output (1/1):

My email is stevemurphy@example.net and phone number is ************

To learn more about anonymization customization, see Presidio’s documentation.
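
One thing to keep in mind with the custom operator above: fake.email() returns a new random address on every call, so the same email appearing twice would be replaced by two different values. If consistency matters, a small caching wrapper is one option. The sketch below is an assumption-laden illustration: consistent is a hypothetical helper, and a numbered generator stands in for fake.email so the code runs without Faker installed.

```python
import itertools
from typing import Callable, Dict


def consistent(generate: Callable[[], str]) -> Callable[[str], str]:
    """Wrap a generator so each distinct original maps to one replacement."""
    cache: Dict[str, str] = {}

    def replace(original: str) -> str:
        # only generate a new value the first time we see this original
        if original not in cache:
            cache[original] = generate()
        return cache[original]

    return replace


# stand-in for fake.email: user1@example.com, user2@example.com, ...
ids = itertools.count(1)
fake_email = consistent(lambda: f"user{next(ids)}@example.com")

print(fake_email("person@corporation.com"))
print(fake_email("person@corporation.com"))
print(fake_email("other@corporation.com"))
```

With Faker, the wrapper could be passed to the operator as OperatorConfig("custom", {"lambda": consistent(fake.email)}). Note that the cache only lives as long as the process does.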

Putting it all together

Presidio is an extremely powerful framework that comes with built-in recognizers for PII data such as phone numbers, email addresses, passport numbers, and more. It’s also flexible enough to allow us to write custom recognizers.

Let’s now show a quick example of how we can integrate Presidio with OpenAI. First, install the OpenAI package and set your API key:

pip install openai

export OPENAI_API_KEY=<YOUR OPENAI KEY>

We’ll write a small wrapper for client.chat.completions.create that anonymizes messages before sending them to OpenAI:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from openai import OpenAI

client = OpenAI()
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def anonymize_message(message):
    """Anonymize the content of a single chat message."""
    text = message["content"]
    results = analyzer.analyze(text=text, language='en')
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results).text
    print("Original: ", text)
    print("Anonymized: ", anonymized)
    return {"role": message["role"], "content": anonymized}


def chat_completions_create(user_messages):
    """Wrapper for client.chat.completions.create that anonymizes messages."""
    anonymized_messages = [anonymize_message(message) for message in user_messages]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that can write emails."},
        ] + anonymized_messages,
    )

    return response


user_messages = [{
    "role": "user",
    "content": """
Draft an email to person@corporation.com, mention that I've been trying to
reach at 212-555-5555 without success and we'd like her to reach out asap
""",
}]

print(chat_completions_create(user_messages).choices[0].message.content)

Console output (1/1):

Original:  
Draft an email to person@corporation.com, mention that I've been trying to
reach at 212-555-5555 without success and we'd like her to reach out asap

Anonymized:  
Draft an email to <EMAIL_ADDRESS>, mention that I've been trying to
reach at <PHONE_NUMBER> without success and we'd like her to reach out asap

Subject: Urgent: Request for Contact

Dear [Recipient's Name],

I hope this message finds you well. I have been trying to reach you at [PHONE_NUMBER] but have been unsuccessful in connecting. 

We would greatly appreciate if you could reach out to us at your earliest convenience. 

Thank you, and I look forward to hearing from you soon.

Best regards,

[Your Name]  
[Your Position]  
[Your Company]  
[Your Contact Information]

Note: Check out our blog post on writing a reverse proxy with FastAPI and Presidio where we show an alternative integration between Presidio and OpenAI’s API.

Enterprise-grade Solutions

The code provided here is a proof-of-concept. If you’re looking for enterprise-grade solutions to prevent PII leakage, contact us. Our solution addresses many of the challenges of using Presidio on its own: it provides a user interface to define new rules, advanced logging so you can audit all messages sent to the OpenAI API, round-trip conversion so OpenAI sees redacted values but your LLM consumers get responses with the original values they sent, and high performance to avoid slowing down your LLM applications.
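
To illustrate the idea behind round-trip conversion, here is a minimal sketch. It is not how any particular product implements it. It assumes non-overlapping spans given as (start, end, entity_type) tuples, shaped like the analyzer results shown earlier, and the helper names redact_with_mapping and restore are hypothetical.

```python
from typing import Dict, List, Tuple

# (start, end, entity_type), shaped like the analyzer results shown earlier
Span = Tuple[int, int, str]


def redact_with_mapping(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace each span with a numbered placeholder and remember the original."""
    mapping: Dict[str, str] = {}
    counters: Dict[str, int] = {}
    pieces: List[str] = []
    cursor = 0
    # assumes spans do not overlap; process them left to right
    for start, end, entity_type in sorted(spans):
        counters[entity_type] = counters.get(entity_type, 0) + 1
        placeholder = f"<{entity_type}_{counters[entity_type]}>"
        mapping[placeholder] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back wherever a placeholder appears."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


text = "My email is person@corporation.com and phone number is 212-555-5555"
spans = [(12, 34, "EMAIL_ADDRESS"), (55, 67, "PHONE_NUMBER")]

redacted, mapping = redact_with_mapping(text, spans)
print(redacted)

# pretend the model echoed the placeholders back in its reply
reply = "I will write to <EMAIL_ADDRESS_1> and call <PHONE_NUMBER_1>."
print(restore(reply, mapping))
```

A real implementation has to cope with the model rewriting placeholders. In the sample response above, for instance, the model wrote [PHONE_NUMBER] with square brackets, which a plain string replacement like this one would miss.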
