GCP – DLP provisioning with Terraform

This post covers provisioning the DLP service in GCP using Terraform. Terraform is a great IaC provisioning tool that works across cloud providers, and because it is maintained by a third party (HashiCorp), it is a good choice for multi-cloud configurations, or simply for the decoupling it provides from any single provider's tooling.

Now, because the Terraform GCP provider is not updated in lockstep with the GCP services themselves (DLP in our case), there won't be exact 1:1 feature parity when using it to provision DLP (for example, no Firestore support). For simple configurations, however, it works well for setting up basic inspection templates, job triggers, and stored info types. In addition, the Terraform modules published by Google may or may not exist for the service you intend to provision; currently that is the case for DLP, so we will demonstrate one form of configuration ourselves.

THE SETUP

Prerequisites here are a GCP project with the DLP API enabled, a GCS bucket to hold the Terraform state, and a local Terraform installation.

To start, let’s first grab the source code we will be using and describe it starting from main.tf.

provider "google" {
  project     = "{REPLACE_WITH_YOUR_PROJECT}"
}

terraform {
  backend "gcs" {
    bucket  = "{REPLACE_WITH_YOUR_UNIQUE_BUCKET}"
    prefix  = "terraform/state"
  }
}

// 1. Service account(s)
module "iam" {
  source                              = "./modules/iam"
}

// 2a. Storage bucket (DLP source input #1)
module "storage_input" {
  source                              = "./modules/dlp_input_sources/cloudstorage"
  cloudstorage_input_bucket_name      = var.cloudstorage_input_bucket_name
  cloudstorage_input_bucket_location  = var.location
  unique_label_id                     = var.unique_label
}

// 2b. BQ table (DLP source input #2)
module "bigquery_input" {
  source                              = "./modules/dlp_input_sources/bigquery"
  bq_dataset_id                       = var.input_bq_dataset_id
  bq_dataset_location                 = var.location
  bq_table_id                         = var.input_bq_table_id
  unique_label_id                     = var.unique_label
}

// 2c. Datastore table indexes (DLP source input #3)
module "datastore_input" {
  source                              = "./modules/dlp_input_sources/datastore"
  datastore_kind                      = var.datastore_input_kind
}

// 3. DLP output config
module "bigquery_output" {
  source                              = "./modules/bigquery"
  bq_dataset_id                       = var.output_bq_dataset_id
  bq_dataset_location                 = var.location
  bq_dlp_owner                        = module.iam.bq_serviceaccount
  unique_label_id                     = var.unique_label
}

// 4. DLP... finally
module "dlp" {
  source                              = "./modules/dlp"
  project                             = var.project
  dlp_job_trigger_schedule            = var.dlp_job_trigger_schedule
  bq_output_dataset_id                = module.bigquery_output.bq_dataset_id
  bq_input_dataset_id                 = module.bigquery_input.bq_dataset_id
  bq_input_table_id                   = module.bigquery_input.bq_table_id
  cloudstorage_input_storage_url      = module.storage_input.cloudstorage_bucket
  datastore_input_kind                = var.datastore_input_kind
}

This file sets up Terraform to use the GCP provider (the provider block) and then specifies a GCS bucket to store the Terraform state file in (the terraform block), which is best practice when working with multiple team members or when you need revision management of the state.
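
One note: the state bucket itself has to exist before terraform init is run. Below is a minimal sketch of how it could be created in a separate bootstrap configuration; the name placeholder matches the backend block above, and the location and versioning settings are assumptions, not taken from the repo.

resource "google_storage_bucket" "terraform_state" {
  name                        = "{REPLACE_WITH_YOUR_UNIQUE_BUCKET}"
  location                    = "US" // assumption; use whatever location fits your project
  uniform_bucket_level_access = true

  // Keep prior state versions around in case a rollback is needed
  versioning {
    enabled = true
  }
}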

Next, custom modules are specified to create the various components needed by DLP. This part can be swapped out for Google's published modules if they fit your case better. Each module provided here is deliberately basic but can be extended to other scenarios. What we have is:

  • 1.) Service Account Module – Creates the service account used to access the BigQuery service, specifically with the owner role on the output dataset.
  • 2.) DLP Input Source Modules – Three modules, Cloud Storage, BigQuery, and Datastore (a, b, and c), each of which creates a valid source of data to be inspected. In practice you would most likely scan only one source, whether that is Cloud Storage or a database, so you may choose to keep only the module that applies to your case; here all three are created for demonstration purposes.
  • 3.) BigQuery Dataset Module – The output dataset where the DLP job findings will be written, owned by the service account from #1.
  • 4.) DLP Module – This is where the inspection template and job triggers are created. The inspection template is set to detect likely matches for Email, Person Name, and Phone Number, with sample limits and an exclusion rule set. A job trigger is created for each of the input sources we want to check (from #2), each with a schedule specified. A simplified sketch of the two core resources follows the notes below.
    DLP NOTE #1: Because the DLP service can be expensive, this module should be tweaked to the needs of your specific project for cost control.
    DLP NOTE #2: Although the module contains a stored info type resource, Terraform cannot use it in the inspection template's info type configuration.
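
For reference, the heart of the DLP module comes down to two provider resources: google_data_loss_prevention_inspect_template and google_data_loss_prevention_job_trigger. The sketch below is a simplified stand-in for what the module creates, not the exact code from the repo; the resource names, exclusion pattern, and limit values are illustrative.

// Inspection template: what to look for
resource "google_data_loss_prevention_inspect_template" "basic" {
  parent       = "projects/${var.project}"
  display_name = "basic-inspect-template"
  description  = "Detect likely emails, person names, and phone numbers"

  inspect_config {
    min_likelihood = "LIKELY"

    info_types {
      name = "EMAIL_ADDRESS"
    }
    info_types {
      name = "PERSON_NAME"
    }
    info_types {
      name = "PHONE_NUMBER"
    }

    // Sample limits help with cost control (see NOTE #1)
    limits {
      max_findings_per_item    = 10
      max_findings_per_request = 50
    }

    // Example exclusion rule: ignore emails on an internal domain
    rule_set {
      info_types {
        name = "EMAIL_ADDRESS"
      }
      rules {
        exclusion_rule {
          matching_type = "MATCHING_TYPE_FULL_MATCH"
          regex {
            pattern = ".+@example\\.com"
          }
        }
      }
    }
  }
}

// Job trigger: one per input source; this one scans the Cloud Storage bucket
resource "google_data_loss_prevention_job_trigger" "gcs" {
  parent       = "projects/${var.project}"
  display_name = "gcs-inspect-trigger"

  triggers {
    schedule {
      recurrence_period_duration = var.dlp_job_trigger_schedule // e.g. "86400s" for daily
    }
  }

  inspect_job {
    inspect_template_name = google_data_loss_prevention_inspect_template.basic.id

    storage_config {
      cloud_storage_options {
        file_set {
          url = var.cloudstorage_input_storage_url // e.g. "gs://my-input-bucket/**"
        }
      }
    }

    // Write findings to the output dataset created in #3
    actions {
      save_findings {
        output_config {
          table {
            project_id = var.project
            dataset_id = var.bq_output_dataset_id
          }
        }
      }
    }
  }
}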

Finally, all that is left to configure the execution is to set our input parameters in the vars.tf file.
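
For completeness, here is a minimal sketch of what vars.tf could look like, using the variable names referenced in main.tf (the real file may also declare types, defaults, and descriptions):

variable "project" {
  description = "GCP project ID to deploy into"
}

variable "location" {
  description = "Location used for the buckets and datasets, e.g. US"
}

variable "unique_label" {
  description = "Label value applied to the created resources"
}

variable "cloudstorage_input_bucket_name" {
  description = "Cloud Storage bucket DLP will scan"
}

variable "input_bq_dataset_id" {
  description = "BigQuery dataset containing the input table"
}

variable "input_bq_table_id" {
  description = "BigQuery table DLP will scan"
}

variable "datastore_input_kind" {
  description = "Datastore kind DLP will scan"
}

variable "output_bq_dataset_id" {
  description = "BigQuery dataset the DLP findings are written to"
}

variable "dlp_job_trigger_schedule" {
  description = "Recurrence period for the job triggers, e.g. 86400s"
}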

If everything has gone well and you run the Terraform steps listed in the Readme.md, you will see all of your resources created.

Apply complete! Resources: 13 added, 0 changed, 0 destroyed.

TESTING IT OUT

Great! But how do we know everything worked?
First, we can go into GCP and check that our jobs are created.

DLP Jobs

Then we need to add some data to our input sources. We can add both positive and negative test-case data, in accordance with the inspection template.
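
For example (made-up sample rows; the first should produce findings for the email, name, and phone number info types, the second should not):

Positive: "Reach out to Jane Doe at jane.doe@example.com or (555) 867-5309"
Negative: "Order 82731 shipped from warehouse 4 on 2020-06-01"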

Finally, let’s run the triggers (rather than waiting for the schedule) to test them out.

BQ DLP Run
GCS DLP Run

As we can see, everything is working as expected. Awesome!

GCP – Data Loss Prevention (DLP)

Data Loss Prevention (DLP) is a great feature available in Google Cloud (GCP) that, at the time of writing, is unmatched in capability among the other leading cloud providers, AWS and Azure. In essence, DLP lets you find specific information types (i.e. sensitive information such as passwords, identification numbers, credentials, etc.) in your sources (storage and databases), then report on and redact that information. It can operate on multiple file types including text, image, binary, and PDF. This is an excellent way to keep information of interest secure.

This post will discuss a simple resolution to Error Code 7 ("Not authorized to access requested inspect template.") that may save you time when starting out with the DLP service.

DLP Trigger error – Not authorized

This error can occur when the inspection template was created in a resource location different from where the job trigger was created. To fix it, make sure the trigger and template are in the same location. If, on the other hand, roles were modified on the service account used by the DLP API, then the permission to read templates (see role: roles/dlp.inspectTemplatesReader) needs to be added back.
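
If the role does need to be added back, that can also be done through Terraform. A minimal sketch, assuming the email of the service account your DLP jobs run as is available in a (hypothetical) variable:

resource "google_project_iam_member" "dlp_template_reader" {
  project = var.project
  role    = "roles/dlp.inspectTemplatesReader"
  member  = "serviceAccount:${var.dlp_service_account_email}" // hypothetical variable
}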

DLP configuration template

Overall, the issues encountered when enabling and starting out with the DLP service are minimal, and as a whole it is intuitive to use. When errors do occur (e.g. 'Permission missing', 'resource doesn't exist/not found'), it is usually obvious how to resolve them. More on DLP coming soon!

How to become an AWS Certified Developer

Becoming an AWS Certified Developer Associate will allow you to showcase your conceptual knowledge of AWS services to others and give you an edge in today's era of cloud computing. The weighted, multiple-choice exam lasts 130 minutes and contains 65 questions that test your understanding of the AWS platform from a developer's perspective (see the "Exam Resources" here for more info). You should be familiar with:

  • How to encrypt / decrypt secure data
  • Using IAM policies, identities vs ACLs
  • Why and how to utilize KMS
  • Web Identity and SAML Federation
  • User authentication & authorization
  • What a VPC is and how it can be used
  • Shared Responsibility Model
  • Important Service limits
  • Horizontal (auto) and vertical (compute) scaling
  • Deploying services through CI/CD systems
  • Elastic Beanstalk deployment strategies
  • How to achieve redundancy
  • Best practices in design and refactoring
  • Read/Write Dynamo Capacity Unit calculations (see the worked example after this list)
  • Service APIs
  • The following services and how they can be used together:
    • API Gateway
    • Elastic Beanstalk
    • CloudFormation
    • CloudWatch
    • Cognito
    • EC2
    • ELB
    • Elasticache
    • IAM
    • KMS
    • Lambda
    • Kinesis
    • DynamoDB
    • Code Commit / Build / Deploy / Pipeline
    • Step Functions
    • S3
    • SNS
    • SQS
    • SWF
    • STS
    • X-Ray
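
On the capacity unit point above: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one write capacity unit covers one write per second of an item up to 1 KB, with item sizes rounded up. As a worked example, reading 10 items of 6 KB per second with strong consistency needs ceil(6/4) = 2 RCUs per read, so 20 RCUs total; writing those same 10 items per second needs ceil(6/1) = 6 WCUs per write, so 60 WCUs total.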

The exam is not easy, and rote memorization without experience and understanding of AWS services will guarantee failure; you need 740 / 1000 to pass. If there is a particular AWS service you have not used, it is highly recommended to dive in and experiment with it, while also taking into account how it can be used with other services. For example:

A) As a developer, you could use CloudFormation to provision a stack with a DynamoDB NoSQL database and an EC2 M5 instance as the server hosting your web service, instrumented with X-Ray and using Cognito to manage user identities. All with the proper IAM policies in place.

or

B) As a developer, you could deploy a Lambda package stored on S3, with API Gateway providing the endpoints/event triggers, and configure CloudWatch to send a message to an SNS topic that notifies subscribers when a certain metric threshold has been reached. All with the proper IAM policies in place.

When preparing for the exam, it is recommended to take the practice exam from the AWS training site to familiarize yourself with the question format. This is in contrast to taking practice exams from third-party training sites, whose questions are limited to the course's scope and are often not difficult enough. If you encounter any questions you don't know, or that seem too broad (there are often multiple plausible answers but one "best" answer), take a step back and review that area to gain a better understanding.

After you pass the exam, you can share your achievement with the world by generating a badge, email signature, or transcript in the AWS Certification portal (CertMetrics). Additionally, you can buy branded swag, gain access to AWS Certification Lounges at events, receive discounts on future exams, and more.

Good luck on your journey to becoming an AWS Certified Developer Associate!