Kermadec Journal
2025-03-22 Pydantic & PII
Today while driving I had a pretty decent conversation with ChatGPT, working through some details of a hybrid environment I can deploy to AWS but work through functionally at home in Southwall.
Interesting that I'm naturally polite with ChatGPT. Trust but verify. I've already worked through some local toys with DuckDB outputting to Unity Catalog, but not yet with S3 Tables.
MDCB: I'm trying to pick a good overall architecture to deal with ingestion from a variety of sources, and I want to include Delta Lake or Iceberg, and I can't decide between the two.
ChatGPT: Both Delta Lake and Apache Iceberg have their strengths. Delta Lake is great if you’re already using the Spark ecosystem and need ACID transactions and schema enforcement. Iceberg is more flexible with support for various query engines like Trino and Flink. Are you leaning towards one ecosystem more than the other?
ChatGPT: Delta Lake is a solid choice, especially if you’re already in the Databricks ecosystem. It’s got strong community support and seamless integration with Spark. If you want to stay open, you could design your architecture to be adaptable, so switching or integrating with Iceberg later won’t be too complex. Does that align with your goals?
MDCB: That sounds good. Now tell me about AWS Glue and its catalog versus the Unity Catalog of Databricks.
ChatGPT: AWS Glue’s Data Catalog is a managed metadata repository that integrates well with other AWS services. It automatically discovers and catalogs data from various sources, making it easy to query using AWS services like Athena and Redshift.
Databricks’ Unity Catalog, on the other hand, offers fine-grained governance and security across your data and AI assets, with features like unified access policies. It’s tightly integrated with Databricks, providing a consistent view of data across different clouds.
If you’re already using AWS, Glue might be more straightforward. Unity Catalog, however, is more robust for comprehensive data governance. Do you have a preference for one over the other?
MDCB: I can't be sure that I'm going to be working in a Databricks environment, and I also want to be able to do a lot of this architecture offline, which is to say I don't have a Databricks subscription, but I want to be compatible with it at a later date.
ChatGPT: Got it. AWS Glue is a solid starting point since you can build your architecture offline and maintain compatibility with other systems, including Databricks. Glue’s Data Catalog can help you organize and manage metadata, which is crucial for future integration. Plus, it’s flexible enough to adapt if you decide to move to Databricks later. Does that sound like a good plan for you?
MDCB: Now, the next piece of this architecture is DuckDB. Are you familiar with DuckDB?
ChatGPT: Yes, DuckDB is an in-process SQL OLAP database designed for fast analytics. It’s great for local data processing and can run entirely within your applications. It’s a good fit if you want a lightweight, embeddable solution that supports SQL queries directly on local files. How are you thinking of integrating DuckDB?
MDCB: What I'd like to do is build a hybrid ELT pipeline that uses DuckDB instead of Spark.
ChatGPT: Using DuckDB for a hybrid ELT pipeline can be a great choice. It’s efficient for handling transformations and analytics locally, making it faster and more resource-efficient for certain workloads. You can use DuckDB to handle the heavy lifting of data transformations before loading it into your final storage. How do you plan to orchestrate the pipeline?
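Note to self, to make this concrete later: a minimal sketch of what a DuckDB transform step could look like in place of Spark. Paths, table, and column names are placeholders; with the httpfs extension loaded and credentials configured, the same COPY would work against s3:// paths.
```python
# Sketch of a DuckDB transform step standing in for Spark. Paths, table,
# and column names are placeholders.
import duckdb

con = duckdb.connect()

# Read raw landing-zone Parquet, apply a light transformation, and write a
# partitioned, curated Parquet dataset.
con.execute("""
    COPY (
        SELECT
            order_id,
            customer_id,
            CAST(order_ts AS DATE) AS order_date,
            amount
        FROM read_parquet('landing/orders/*.parquet')
        WHERE amount IS NOT NULL
    )
    TO 'curated/orders' (FORMAT PARQUET, PARTITION_BY (order_date))
""")
```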
MDCB: I want to use either ECS for continuous or Lambda for short transactions.
ChatGPT: Got it. ECS is great for long-running tasks and complex workflows, while Lambda is perfect for short, event-driven transactions. If you want flexibility and automation, combining ECS for the heavier lifting and Lambda for quick tasks could be a good approach. How does that sound?
MDCB: I'd like to build images that can work in both. What kind of images can I use in both Lambda and ECS?
ChatGPT: To build images that work in both ECS and Lambda, you’d typically use Docker containers. You can create a lightweight, minimal image that includes your application and dependencies. For Lambda, ensure the image is small and starts up quickly. This way, you can maintain consistency and deploy the same image across both platforms. Does that approach suit your needs?
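Rough sketch of how a single Python module could serve as both the Lambda handler and the ECS entrypoint. Module, job, and environment-variable names are my own placeholders; the comments reflect the usual CMD convention of the AWS Lambda Python base images.
```python
# app.py - one module that can run as a Lambda handler or as an ECS task.
# With the AWS Lambda Python base image (public.ecr.aws/lambda/python), the
# image CMD names the handler, e.g. "app.handler"; for ECS, the task
# definition overrides the command to "python app.py" and the same code
# runs as a batch job.
import json
import os


def run_pipeline(job_name: str, params: dict) -> dict:
    """Placeholder for the actual DuckDB transform logic."""
    print(f"running {job_name} with {params}")
    return {"job": job_name, "status": "ok"}


def handler(event, context):
    """Lambda entrypoint: short, event-driven invocations."""
    return run_pipeline(event.get("job_name", "default"), event.get("params", {}))


def main():
    """ECS entrypoint: parameters arrive via environment variables."""
    job_name = os.environ.get("JOB_NAME", "default")
    params = json.loads(os.environ.get("JOB_PARAMS", "{}"))
    print(json.dumps(run_pipeline(job_name, params)))


if __name__ == "__main__":
    main()
```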
MDCB: It does. But here's a question. Can I use Podman instead of Docker?
ChatGPT: Yes, you can use Podman instead of Docker. Podman is a daemonless container engine that provides a Docker-compatible command-line interface. It supports building and running containers, and the images you create with Podman are compatible with Docker, ECS, and Lambda. It’s a solid choice if you prefer a more secure, rootless container environment.
MDCB: That sounds good. So we have Podman for ECS and Lambda to contain DuckDB. Now, one of the things I'm going to have to do is send parameters and configuration specs to these containers. I've used DynamoDB before. Does that seem to be a viable strategy?
ChatGPT: Yes, using DynamoDB is a viable strategy for sending parameters and configuration specs to your containers. It’s highly scalable, integrates well with AWS services, and provides low-latency access to configuration data. You can store your configs in DynamoDB and have your containers retrieve them as needed. Does that approach work for you?
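A quick sketch of what the config lookup might look like with boto3 at container startup; the table name and key attribute are hypothetical.
```python
# Sketch of a container pulling its configuration from DynamoDB at startup.
# Table name ("pipeline-config") and key attribute ("job_name") are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("pipeline-config")


def load_config(job_name: str) -> dict:
    """Fetch the configuration item for this job, or fail loudly."""
    response = config_table.get_item(Key={"job_name": job_name})
    item = response.get("Item")
    if item is None:
        raise KeyError(f"no configuration found for job {job_name!r}")
    return item


if __name__ == "__main__":
    print(load_config("daily_orders_load"))
```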
MDCB: So tell me, I'm probably going to use Python for all of this, and I like uv for configuring my Python environments. Does uv have any capacity for packaging these up, as if they were compiled?
ChatGPT: uv is great for managing Python environments and dependencies. While it focuses on streamlining environment management, you might look into tools like PyInstaller or Nuitka for packaging your Python applications into standalone executables. They can bundle your scripts and dependencies into a single package, making deployment easier. Does that align with what you’re looking for?
MDCB: Yeah, that's good. And I think, well, tell me what is going to be the average latency to boot up that Python environment, especially if I want to use Lambda.
ChatGPT: The cold start latency for a Lambda function with a Python environment can vary. For smaller, lightweight images, it might range from a few hundred milliseconds to a couple of seconds. To reduce latency, you can optimize your image size and use provisioned concurrency in Lambda. Are you aiming to minimize the cold start times for a particular use case?
MDCB: But I don't know what that use case might be because I don't have customer specifications. But then here's a question. If I use a standard AWS Linux image, is that going to run any faster with the pods that I create with Podman? Or does it sound more reasonable to use a bare metal instance for Lambda?
ChatGPT: Using a standard Amazon Linux base image can offer consistency and potentially faster startup times, especially if you keep it lightweight. Lambda doesn’t expose bare metal instances, though; if cold starts become a problem, provisioned concurrency is the usual way to keep your functions warm.
For your use case, starting with the standard Amazon Linux image seems reasonable. You can always refine based on performance needs later. Does that make sense for now?
MDCB: I'm confused about the new feature called S3 Tables, and I don't know what generates them and why I would use them separate from raw Parquet files or Delta Lake.
ChatGPT: S3 Tables is a newer AWS feature that gives you fully managed Apache Iceberg tables stored in purpose-built S3 table buckets. AWS handles maintenance tasks like compaction and snapshot cleanup, and the tables can be queried through engines such as Athena via the AWS analytics integration.
Compared to raw Parquet files, S3 Tables add table semantics: ACID transactions, schema evolution, and better query performance from automatic optimization. If you’re using Delta Lake, though, there’s overlap, since Delta already gives you ACID transactions and versioning as a table format. It ultimately depends on which table format and catalog integration you want to standardize on. Does that help clarify?
MDCB: It does, but it does make things a little bit more complicated. Tell me this, what does Delta Lake support in terms of configuration by Terraform?
ChatGPT: Delta Lake itself doesn’t have direct Terraform support, but you can use Terraform to manage the infrastructure around it. For instance, you can use Terraform to provision and configure Databricks workspaces, clusters, and other resources where Delta Lake is used. This includes setting up storage accounts, access policies, and integrating with other services. Using Terraform, you can automate and manage these configurations, making it easier to deploy and maintain a Delta Lake environment. Does that align with what you’re looking to do?
MDCB: Then I need to compare that to what, let's say, AWS Amplify would do in terms of maintaining that stuff, automatically provisioned by either CloudFormation or Terraform.
ChatGPT: AWS Amplify is a platform designed to simplify the process of building and deploying applications, often focusing on front-end and mobile applications. It abstracts much of the underlying infrastructure management, providing a more streamlined experience.
Comparing Amplify to using Terraform or CloudFormation for infrastructure-as-code management:
• Amplify: Automates many tasks, offering built-in CI/CD and easy integration with front-end frameworks. It’s more about convenience and rapid development.
• Terraform/CloudFormation: Provides fine-grained control over resources, allowing more customization and infrastructure management.
If you’re looking for more automated provisioning and management, Amplify might be simpler. If you need detailed control, Terraform or CloudFormation are better. Does that help?
MDCB: Yes, it does. I'm prepared to use Terraform and/or CloudFormation to manage the IAM access that would manifest as ACLs for access to whatever data is in my pipeline. So the question remains: what about CI/CD? Can I use GitHub Actions with AWS?
ChatGPT: Yes, you can definitely use GitHub Actions for CI/CD in AWS. GitHub Actions can automate your workflows, deploy infrastructure using Terraform or CloudFormation, and manage your application deployments. It integrates smoothly with AWS services, allowing you to run automated tests, build Docker containers, and deploy to ECS, Lambda, or other services. It’s a powerful tool to streamline your DevOps processes. Does that cover what you need?
MDCB: It does. All right, now let's talk about the data streams themselves. I'm going to need some data management, specifically something like Pydantic. Tell me, what are the competitors to Pydantic, and how does it compare to, say, SQLAlchemy or dbt?
ChatGPT: Pydantic is known for data validation and settings management in Python, focusing on ease of use and performance. Some competitors include Marshmallow and Cerberus, which also provide data serialization and validation.
Compared to SQLAlchemy, which is an ORM (Object-Relational Mapping) tool for interacting with databases, Pydantic is more focused on data validation and settings.
dbt (Data Build Tool) is used for transforming data in your warehouse, focusing on analytics engineering.
In summary:
• Pydantic: Great for data validation and settings.
• SQLAlchemy: Focused on database interaction via ORM.
• dbt: Specialized in transforming data in warehouses.
Each tool serves a different purpose in data management. Does that help clarify?
MDCB: I want to set up Pydantic in combination with DuckDB to tell me which rejects are going against which fields. And I also want to know if Pydantic can help me identify PII.
ChatGPT: Combining Pydantic with DuckDB can be a powerful way to validate and clean your data. You can use Pydantic to define schemas and validate data as it flows through your ELT pipeline. If a record doesn’t meet the schema, you can log it as a reject and identify the specific fields causing issues.
As for identifying PII, Pydantic doesn’t natively support PII detection, but you can build custom validators or use additional libraries to check for common PII patterns (like email, SSN, etc.). Would you like more details on creating custom validators or integrating other tools?
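Something to try: a small sketch (assuming Pydantic v2) that pulls rows out of DuckDB, validates them against a model, and counts rejects per failing field. The table and columns are made up.
```python
# Sketch: validate DuckDB rows with Pydantic and record which fields reject.
# Table and column names are hypothetical; EmailStr needs the
# "email-validator" extra installed.
from collections import Counter

import duckdb
from pydantic import BaseModel, EmailStr, ValidationError


class Customer(BaseModel):
    customer_id: int
    email: EmailStr
    signup_date: str


con = duckdb.connect("local.duckdb")
con.execute("SELECT customer_id, email, signup_date FROM raw_customers")
columns = [d[0] for d in con.description]

rejects_by_field = Counter()
valid_rows, rejects = [], []

for row in con.fetchall():
    record = dict(zip(columns, row))
    try:
        valid_rows.append(Customer.model_validate(record))
    except ValidationError as exc:
        rejects.append(record)
        for err in exc.errors():              # each error names the failing field
            rejects_by_field[err["loc"][0]] += 1

print(f"{len(valid_rows)} valid, {len(rejects)} rejected")
print("rejects by field:", dict(rejects_by_field))
```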
MDCB: What tools are out there to help identify PII in Python?
ChatGPT: There are several Python libraries that can help identify PII:
1. Presidio: An open-source tool by Microsoft that detects and anonymizes PII. It’s very customizable and supports multiple types of PII detection.
2. FlashText: This library is useful for fast keyword extraction and can be used to identify PII patterns.
3. Regex: You can use regular expressions to manually create patterns for detecting PII like email addresses, phone numbers, etc.
4. pii-data: A library that helps detect and mask PII data using customizable patterns.
These tools can be integrated with your Pydantic and DuckDB setup to automatically identify and handle PII. Does any of these sound like a good fit?
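Since I'm leaning toward the regex route, a tiny sketch of a pattern library that flags likely-PII columns from sampled values; the patterns are illustrative only, not exhaustive.
```python
# Sketch of a small regex library for flagging likely-PII columns from
# sampled values. Patterns are illustrative placeholders.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}


def scan_column(values) -> set:
    """Return the names of PII patterns that match any sampled value."""
    hits = set()
    for value in values:
        if value is None:
            continue
        text = str(value)
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                hits.add(name)
    return hits


sample = {"contact": ["a@example.com", "none"], "note": ["call 555-867-5309"]}
for column, values in sample.items():
    print(column, scan_column(values) or "no PII patterns found")
```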
MDCB: It does sound like a good fit, so let's get into the weeds with this. Let's say I build a library of regexes, and then I identify, by column, which data streams or which sections of data are going to come through that need to be masked or encrypted. What is a good symmetric encryption scheme where I can keep the key? Or can AWS manage, with KMS, a symmetric encryption key that I can apply to PII columns in my data?
ChatGPT: Yes, AWS Key Management Service (KMS) is a great option for managing symmetric encryption keys. You can use AWS KMS to create and manage keys securely, and apply them to encrypt and decrypt your PII data.
For symmetric encryption, AES (Advanced Encryption Standard) is commonly used and well-supported. AWS KMS can handle the encryption and decryption process with AES, so you only need to manage the keys without dealing with the encryption details.
Does that approach sound like it would fit your needs?
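A sketch of the per-value KMS calls with boto3; the key alias is a placeholder. KMS Encrypt caps plaintext at 4 KB, which is fine for individual column values; larger payloads would normally use envelope encryption via generate_data_key.
```python
# Sketch of encrypting/decrypting a single PII value with a KMS symmetric key.
# The key alias is a hypothetical placeholder.
import base64

import boto3

kms = boto3.client("kms")
KEY_ID = "alias/pii-column-key"  # hypothetical key alias


def encrypt_value(plaintext: str) -> str:
    blob = kms.encrypt(KeyId=KEY_ID, Plaintext=plaintext.encode())["CiphertextBlob"]
    return base64.b64encode(blob).decode()        # store as text in the table


def decrypt_value(ciphertext_b64: str) -> str:
    blob = base64.b64decode(ciphertext_b64)
    # The ciphertext embeds key metadata, so decrypt needs no KeyId.
    return kms.decrypt(CiphertextBlob=blob)["Plaintext"].decode()


token = encrypt_value("123-45-6789")
print(token)
print(decrypt_value(token))
```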
MDCB: It depends on where I can embed this in my code. So if I'm building a producer with Pydantic from my source, can I use it that way?
ChatGPT: Yes, you can definitely embed encryption logic in your code. While using Pydantic to validate your data, you can include steps to encrypt specific fields before they are persisted or transmitted. AWS KMS can be invoked within your Python code to encrypt and decrypt these fields seamlessly.
You would typically:
1. Validate data with Pydantic.
2. Encrypt PII fields using AWS KMS.
3. Store or process the encrypted data.
This ensures your PII is protected throughout the data pipeline. Does that sound like what you’re aiming for?
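Putting those three steps together, a sketch of the producer ordering with hypothetical field names; the PII field list would come out of the regex scan above.
```python
# Sketch of the validate -> encrypt -> emit ordering in a Pydantic producer.
# Field names, key alias, and the PII field list are placeholders.
import base64

import boto3
from pydantic import BaseModel

kms = boto3.client("kms")
KEY_ID = "alias/pii-column-key"   # hypothetical
PII_FIELDS = {"email", "ssn"}     # columns flagged by the regex scan


class CustomerIn(BaseModel):
    customer_id: int
    email: str
    ssn: str
    plan: str


def produce(raw: dict) -> dict:
    record = CustomerIn.model_validate(raw)           # 1. validate
    out = record.model_dump()
    for field in PII_FIELDS:                          # 2. encrypt PII fields
        blob = kms.encrypt(KeyId=KEY_ID, Plaintext=str(out[field]).encode())
        out[field] = base64.b64encode(blob["CiphertextBlob"]).decode()
    return out                                        # 3. hand off for storage


print(produce({"customer_id": 7, "email": "a@example.com",
               "ssn": "123-45-6789", "plan": "basic"}))
```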
MDCB: It does. Now, if I'm using AWS Glue Catalog or the Unity Catalog, will it contain metadata that helps me know which columns are encrypted?
ChatGPT: Yes, both AWS Glue Catalog and Databricks Unity Catalog can store metadata that indicates which columns are encrypted. You can add custom annotations or tags to specify encryption details. This makes it easier to manage and keep track of which columns need special handling during data processing. Is that what you’re looking to do?
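For the Glue side, a sketch of registering a table whose column-level Parameters carry the encryption annotations. Database, table, bucket, and key names are placeholders, and Glue treats these Parameters as free-form metadata rather than enforcing anything.
```python
# Sketch of a Glue table whose column Parameters record which columns are
# encrypted and with which key. All names are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="curated",
    TableInput={
        "Name": "customers",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Location": "s3://my-curated-bucket/customers/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
            "Columns": [
                {"Name": "customer_id", "Type": "bigint"},
                {
                    "Name": "email",
                    "Type": "string",
                    "Parameters": {          # custom annotations, not enforced by Glue
                        "pii": "true",
                        "encryption": "aws-kms",
                        "kms_key_alias": "alias/pii-column-key",
                    },
                },
            ],
        },
    },
)
```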
MDCB: It is what I'm looking to do, but also, if I'm rotating my KMS keys on, let's say, a 90-day basis, I will have different encryption keys for each column over time, and I'll need to rotate those out and somehow have that communicated to the catalog.
ChatGPT: In that case, you can manage key rotations and updates in your catalog by:
1. Storing Key Metadata: Include metadata in your catalog that specifies the current encryption key used for each column. This way, you can track which key is applied at any given time.
2. Automating Updates: Use automation scripts or Lambda functions to update this metadata in your catalog whenever a key rotation occurs.
3. Versioning: Maintain a version history in your catalog so you can reference which key was used for encryption at different points in time.
By doing this, you ensure smooth key management and seamless encryption updates. Does that sound like a workable plan?
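And a sketch of the rotation hook itself, say a scheduled Lambda, stamping the new key into the Glue table parameters. Names are placeholders; note that get_table returns read-only fields that have to be dropped before update_table will accept the input.
```python
# Sketch of a rotation hook that stamps the new key and rotation date into the
# Glue table metadata. Database, table, and key ARN are placeholders.
import datetime

import boto3

glue = boto3.client("glue")

# Subset of TableInput fields that update_table accepts back from get_table.
ALLOWED_TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}


def record_key_rotation(database: str, table: str, new_key_arn: str) -> None:
    current = glue.get_table(DatabaseName=database, Name=table)["Table"]
    table_input = {k: v for k, v in current.items() if k in ALLOWED_TABLE_INPUT_KEYS}

    params = table_input.setdefault("Parameters", {})
    params["pii_kms_key_arn"] = new_key_arn
    params["pii_key_rotated_at"] = datetime.date.today().isoformat()

    glue.update_table(DatabaseName=database, TableInput=table_input)


record_key_rotation("curated", "customers",
                    "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE")
```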
MDCB: It does, and that's all for now. Thank you very much.


