How to design a seamless self-service installation experience for Hybrid SaaS


Lariat Data operates a Hybrid SaaS architecture for integrating with customer data stacks. This evolving architecture has been popularized by SaaS vendors like Monte Carlo and Tecton, where connecting to customer data stores in a secure and compliant fashion is a key requirement of a successful integration. Broadly, this architecture consists of:

  • A multi-tenant control plane hosted by the vendor (ourselves)

  • A single-tenant data plane hosted by the customer in their own cloud environment

Today Lariat Data integrates with databases and object storage across AWS, GCP and Azure. This blog post discusses the design choices we made in providing a self-service installation and management experience for the data plane portion. We hope this proves useful to SaaS vendors with similar requirements for accessing sensitive customer data while ensuring that only non-sensitive metadata (such as telemetry and data quality metrics) is sent back to the vendor.

Background

Lariat Data connects with several data sources to compute data quality metrics. For example, users opt in to monitoring an S3 prefix or a Snowflake table. The Lariat Data data plane then connects to these sources and collects diagnostic metrics, which are forwarded to our own cloud. Most analytics providers function similarly, although they may distribute the work between vendor and customer in different ways. The broad approaches are:

  1. Vendor hosts everything: customers give the vendor credentials for connecting to their data stack, and the vendor takes care of the rest

  2. Vendor hosts nothing: the vendor provides software, and usually a web UI, for collecting and analyzing data, hosted entirely in the customer's cloud

  3. (our choice) Vendor and customer share the workload: the vendor provides a straightforward method for the customer to set up third-party software in the customer environment. This software then sends relevant data back to the vendor, who exposes a cloud-based analytics UI. This allows customers to maintain full control of sensitive data, while giving us the efficiencies of operating multi-tenant storage and analytics infrastructure.

By choosing option 3, we took on the responsibility of providing customers a method to install Lariat Data software into their environment. And our solution had to be palatable to Dev, Ops, and Security teams. With this in mind, we outlined the following design goals for our installation experience:

  • Installation should be self-service

  • Customers should understand what they are installing (and what it costs them to operate)

  • Any software written by us, and operating in the customer’s cloud, should have least-privilege permissions

  • Customers should be able to easily apply updates and security patches to our software

  • Don’t re-invent the wheel for undifferentiated parts of the architecture (e.g. cron scheduling)

Below are the key design choices we made to deliver a self-service installation experience for the Lariat Data data plane:

  • Use Infrastructure-as-code as the basis for the installer

  • Prefer cloud provider primitives where possible

  • Use Native Data Egress

  • Store Installation State

  • Establish a base installer framework


The rest of this post will dig into each of these choices.

Design Choice 1: Use Infrastructure-as-code as the basis for the installer

Self-service means different things for different products. If you're installing an application on your laptop, say a PDF reader, you expect to visit a web page and then download and use the software with a couple of clicks. If you're installing cloud infrastructure, you expect to get your hands dirty with the command line, provide access credentials, and refer to vendor documentation to make sure you do things right. The complexity of the installation process tends to mirror the complexity of the software.

Our data plane software started off as a single long-running process, but it grew in complexity in response to customer needs. We realized as we added this complexity that the number of steps for users to follow to complete the installation was increasing as well, to the point where potential users might question whether it was worth the effort. We made the decision, at this point, to base our installation experience around Terraform (which we are now porting to OpenTofu). The advantage of this was that the surface area exposed to the installing user was a straightforward “terraform apply” - regardless of how complex the defined infrastructure actually was. With this approach, the installation process need not take on additional complexity just because your software has become more complex.
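
To illustrate the pattern, here is a minimal sketch (the variable name and module path are illustrative, not our actual code): however many resources the child module defines, the user-facing surface remains a single `terraform apply`.

    # Hypothetical root module for an installer. All complexity lives in
    # child modules; the user only runs "terraform init && terraform apply".

    variable "lariat_api_key" {
      type      = string
      sensitive = true
    }

    module "data_plane" {
      source         = "./modules/aws-data-plane" # illustrative module path
      lariat_api_key = var.lariat_api_key
    }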

Alternative: Using AWS CloudFormation / Google Deployment Manager etc.

Cloud providers now come with baked-in tools for Infrastructure-as-code workflows, such as CloudFormation for AWS and Google Deployment Manager for GCP. These are useful tools for modelling and provisioning infrastructure based on templates provided by the vendor. However, the boundaries of these tools are restricted to cloud provider resources, and our installation often requires modifying additional pieces of infra - such as creating warehouses in AWS-hosted Snowflake. We could create the additional resources via scripting, but in the process we would lose the guarantees of consistency and state management provided by the IaC framework. For this reason, we chose a third-party IaC tool in Terraform, which provides declarative modules for all major cloud providers and for the variety of data sources we integrate with.
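
The sketch below illustrates the difference (the providers are real Terraform providers, but the specific resources and names are illustrative rather than taken from our configuration): a single declarative plan spans both a cloud-provider resource and a Snowflake warehouse, which a cloud-native template language alone cannot express.

    # One Terraform configuration spanning a cloud provider and a data source.
    terraform {
      required_providers {
        aws       = { source = "hashicorp/aws" }
        snowflake = { source = "Snowflake-Labs/snowflake" }
      }
    }

    # A cloud-provider resource...
    resource "aws_s3_bucket" "agent_config" {
      bucket = "lariat-agent-config-example" # illustrative name
    }

    # ...and a Snowflake resource, managed in the same plan and state.
    resource "snowflake_warehouse" "monitoring" {
      name           = "LARIAT_MONITORING_WH" # illustrative name
      warehouse_size = "XSMALL"
    }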

Design Choice 2: Prefer cloud provider primitives where possible

Cloud providers offer several building blocks for running serverless function code, cron- or event-driven scheduling, and workflow management. You could choose to ignore these building blocks in favour of custom cloud-agnostic code. For example, a long-running data plane process packaged as a Docker container will run in pretty much any cloud environment with minimal variance, and allows you to build from a shared codebase rather than maintain cloud- and runtime-specific versions. However, we chose not to do this. We found that relying on cloud provider building blocks greatly improved the reliability of our data plane, and the resulting infra was far cheaper for our customers to run. Concretely, this means we maintain separate Infrastructure-as-Code specifications for AWS, Google Cloud and Azure. For example, the AWS installation creates Lambda, EventBridge, and S3 resources, while the Google Cloud installation creates Cloud Functions, Eventarc, and GCS resources. This also makes the installed infrastructure easier to grok for customers, who tend to already have tools and processes that are opinionated towards their choice of cloud provider.
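
As a rough sketch of what "cloud primitives" means on AWS (resource names and the schedule below are illustrative, not our production configuration), scheduling and compute are delegated entirely to EventBridge and Lambda rather than a custom long-running scheduler:

    # A scheduled Lambda in place of a custom cron process.
    resource "aws_lambda_function" "agent" {
      function_name = "lariat-agent"          # illustrative name
      role          = aws_iam_role.agent.arn  # least-privilege role, defined elsewhere
      package_type  = "Image"
      image_uri     = var.agent_image_uri     # assumed input variable
    }

    resource "aws_cloudwatch_event_rule" "agent_schedule" {
      name                = "lariat-agent-schedule"
      schedule_expression = "rate(5 minutes)" # illustrative cadence
    }

    resource "aws_cloudwatch_event_target" "agent" {
      rule = aws_cloudwatch_event_rule.agent_schedule.name
      arn  = aws_lambda_function.agent.arn
    }

    # Allow EventBridge to invoke the function.
    resource "aws_lambda_permission" "allow_eventbridge" {
      action        = "lambda:InvokeFunction"
      function_name = aws_lambda_function.agent.function_name
      principal     = "events.amazonaws.com"
      source_arn    = aws_cloudwatch_event_rule.agent_schedule.arn
    }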


Design Choice 3: Use Native Data Egress

A corollary to preferring cloud-provider building blocks is to delegate data transfer to native mechanisms. When moving data between customer and vendor clouds, we avoided the temptation to push data through our own application-level APIs. We opted instead to enable cross-account access to dedicated object storage buckets (think S3 or GCS) hosted in our cloud. Instead of sending large dataframes over HTTP (and bearing onerous data egress costs), we cheaply copy files from the customer's object storage into our own.
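
A minimal sketch of the vendor-side half of this arrangement (the account ID, role name, and bucket are illustrative): the vendor bucket grants the customer's data-plane role write access, so files move via S3's own cross-account copy path rather than a custom HTTP API.

    # Vendor-side bucket policy allowing cross-account writes from the
    # customer's data-plane role.
    resource "aws_s3_bucket_policy" "customer_dropbox" {
      bucket = aws_s3_bucket.customer_dropbox.id
      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Principal = { AWS = "arn:aws:iam::111122223333:role/lariat-data-plane" } # illustrative ARN
          Action    = ["s3:PutObject"]
          Resource  = "${aws_s3_bucket.customer_dropbox.arn}/*"
        }]
      })
    }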

Design Choice 4: Store Installation State

Where regulations allow it, it is advisable to collect some metadata about the state of your software when it runs in an external environment like a customer cloud. At a minimum, you should collect the version of the software running, as well as any information that can be used to diagnose the health of the installation. Our previous design choices made this even more important: by relying on cloud-provider services as a backbone, you open yourself up to failure modes that are beyond your control - for example, a customer exhausting their quota of Lambda invocations, or lacking permissions on an S3 bucket. In our case, installation state is represented in remote tfstate files living in our cloud (using S3 as a Terraform state backend), as well as metadata from periodic "phone-homes" sent by the data plane. Since remote tfstate is persisted even for incomplete `terraform apply` runs, this has a nice side effect: failed installations can be picked up where they left off. For example, if an installation fails because the installing user lacks privileges to create a cloud resource defined in our IaC manifest, they can re-run our installation script after fixing their permissions, without duplicating previous work.
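
The remote-state half of this is just Terraform's standard S3 backend, pointed at a vendor-hosted bucket. A sketch (bucket and key names are illustrative):

    # Keep tfstate in the vendor cloud so a failed "terraform apply" can
    # be resumed from where it left off.
    terraform {
      backend "s3" {
        bucket = "lariat-installer-state"                  # vendor-hosted state bucket
        key    = "customers/example-org/terraform.tfstate" # illustrative per-customer key
        region = "us-east-1"
      }
    }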


Design Choice 5: Establish a base installer framework

As we onboarded more data sources and expanded support across multiple clouds, we found ourselves falling back on a few reliable patterns in designing our installers. One was to abstract away permissions policies, such as IAM JSON, so that these could evolve without touching Terraform code. Another was to ensure that configuration files for our data plane were always stored in native object storage, so that users could modify them easily and our data plane could pick them up at runtime. We encoded these conventions in a base installer framework specific to each cloud provider. This gives us a foundation of shared practices and code for any data source hosted in that cloud provider - for example, we share a lot of code between installers for AWS S3, RDS, and Snowflake on AWS. After consolidating code in this framework, we saw a dramatic increase in velocity for shipping new integrations and installers.
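
As a sketch of the first pattern (file paths, names, and the module interface are illustrative, not our framework's actual API), the IAM policy lives in a JSON template outside the Terraform code, and each per-source installer builds on a shared base module:

    # Permissions policy kept as a JSON template, outside Terraform code,
    # so it can evolve independently.
    resource "aws_iam_policy" "agent" {
      name = "lariat-agent-policy" # illustrative name
      policy = templatefile("${path.module}/policies/agent.json.tpl", {
        config_bucket_arn = aws_s3_bucket.agent_config.arn
      })
    }

    # Per-source installers reuse a shared base module.
    module "s3_installer" {
      source       = "./modules/aws-base" # illustrative shared base
      agent_policy = aws_iam_policy.agent.arn
    }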


Putting it all Together: Our Solution

Today, any installation of Lariat Data software in a customer cloud follows this procedure:

  • A privileged user in the customer org acquires a set of short-lived local credentials on their machine

  • The user downloads our installer, packaged as a Docker image. The entrypoint for the Docker image triggers a `terraform apply` after some validation of inputs.

  • The user spins up a container from the installer image, injecting their credentials as environment variables or a file mount. These values are forwarded to the IaC module as Terraform variables (tfvars), authorizing our installer to provision infrastructure in the customer cloud.

  • Once the container exits successfully, the customer has a fully provisioned Lariat data plane in their cloud. All provisioned infrastructure is tagged with `Vendor:Lariat` so that customers can break down the costs associated with operating our data plane.

With this setup, privileged cloud provider credentials never leave the installing user's machine. The heavy lifting is done within the installer Docker image. We can get pretty close to a one-line install experience by providing a templated `docker run` for the installing user to execute on the command line, e.g.

    docker run -it --pull=always \
      --mount type=bind,source=$PWD/gcs_agent.yaml,target=/workspace/gcs_agent.yaml,readonly \
      --mount type=bind,source=$PWD/gcloud_access_token,target=/workspace/gcloud_access_token,readonly \
      -e LARIAT_API_KEY=<redacted> \
      -e LARIAT_APPLICATION_KEY=<redacted> \
      lariatdata/install-gcp-gcs-agent:latest install

Conclusion

Providing a self-service installation flow will always expose you to failure modes that you would not encounter if you, as the vendor, managed the entire installation process yourself. Accounting for these failure modes and exposing the right degree of complexity to the end user are important to ensure that users complete your self-service process with a clear understanding of what software has been installed and what they can expect from it.

We’re pleased with the installation experience we have designed at Lariat Data, and as self-service is rapidly becoming ‘table stakes’ for enterprise software, we hope that this post proves useful for others building similar flows for SaaS.

You can check out these GitHub repositories for examples of what our installers look like:

If you haven’t tried Lariat out, you can be up and running with our Continuous Data Quality monitoring platform in minutes! Check us out here.


 