How to Enforce Data Retention Policies in Looker
Keeping every piece of data forever isn't just impractical - it's expensive and can introduce serious security and compliance risks. Setting up clear data retention policies is essential, but enforcing them within a powerful BI tool like Looker requires a specific approach. This guide will show you exactly how to manage data access and implement retention policies, focusing on where the work needs to happen: in your database and within Looker's caching and PDT settings.
Why Data Retention Policies Matter
Before jumping into the "how," it’s important to understand the "why." A well-defined data retention policy isn't about arbitrarily deleting data. It's a strategic plan that offers several significant benefits for your business.
- Compliance and Legal Requirements: Regulations like GDPR in Europe and CCPA in California mandate that personal data should only be kept for as long as necessary. Failing to comply can result in hefty fines and damage to your brand's reputation.
- Cost Management: Data storage isn't free. The more data you warehouse in platforms like BigQuery, Snowflake, or Redshift, the more you pay. Purging old, irrelevant data is a straightforward way to control escalating cloud costs.
- Improved Performance: Querying a massive, cluttered database is slower and less efficient. A leaner database leads to faster queries in Looker, which means your dashboards load quicker and your team can get answers without drumming their fingers on the desk.
- Reduced Security Risk: Every piece of data you store is a potential liability in the event of a data breach. Minimizing your data footprint reduces the amount of sensitive information that could be exposed.
Understanding Looker's Role in Data Retention
Here’s a common point of confusion: Looker does not store your core business data. It’s a modeling and visualization layer that sits on top of your existing database or data warehouse. When you run a query in Looker, it sends SQL code to your database (like BigQuery or Snowflake) and then visualizes the results that the database sends back.
This means the primary enforcement of your data retention policy - the actual deletion of old records - must happen directly in your source database. Looker will simply reflect whatever data is available there.
However, Looker does save some data temporarily for performance reasons. This is where your focus should be within the Looker platform itself. Specifically, you need to manage:
- Looker's Query Cache: To speed things up, Looker temporarily stores the results of recent queries. If you delete data from your database but Looker serves an answer from its cache, users might see stale information.
- Persistent Derived Tables (PDTs): These are special tables that Looker materializes (builds and writes) in your database on a set schedule. If you have PDTs that summarize old data, they won't automatically update just because you deleted the source records.
The goal is to align Looker’s caching and PDT refresh schedules with your database's cleanup schedule so that Looks and dashboards always reflect the current state of your data.
Setting the Policy in Your Data Warehouse
Since your database holds the data, the process starts there. Enforcing a retention policy involves defining it, scripting the deletion logic, and automating the process.
Step 1: Define Your Policy
First, decide on your rules. Don't do this in a vacuum; involve stakeholders from legal, marketing, and operations. Ask fundamental questions:
- What data needs a retention rule? (e.g., user PII, event logs, marketing campaign results).
- How long should different types of data be kept? (e.g., "delete user-level web events after 18 months," "anonymize customer PII 30 days after an account is closed").
- What is the legal or business justification for this timeframe? Having this documented is crucial for compliance audits.
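As a concrete sketch, a rule like "anonymize customer PII 30 days after an account is closed" might translate into SQL such as the following. The table and column names here are hypothetical; adapt them to your own schema.

```sql
-- Hypothetical example: redact PII for accounts closed more than 30 days ago.
-- Table and column names are illustrative, not from a real schema.
UPDATE crm.customers
SET email = NULL,
    full_name = 'REDACTED',
    phone = NULL
WHERE account_closed_at < DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND email IS NOT NULL;
```

Anonymizing in place like this preserves non-identifying aggregates (order counts, revenue) while satisfying a PII rule, which is often preferable to a hard delete.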
Step 2: Implement Deletion Logic at the Database Level
Once you have a defined policy, you need to create a script that executes it. This is usually a SQL script containing a DELETE statement. For example, if your policy is to delete user session data older than two years, your pseudo-SQL might look like this:
DELETE FROM web_analytics.sessions
WHERE session_start_timestamp < DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR);
Some data warehouses offer features to make this easier. Google BigQuery, for instance, allows you to set a "partition expiration," which automatically deletes data partitions after a specified time. This is far more efficient than running a massive DELETE query.
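In BigQuery, partition expiration can be set with a single DDL statement. This is a sketch that assumes the sessions table is partitioned on a date or timestamp column:

```sql
-- Assumes web_analytics.sessions is a date/timestamp-partitioned table.
-- 730 days is roughly 2 years; expired partitions are dropped automatically.
ALTER TABLE web_analytics.sessions
SET OPTIONS (partition_expiration_days = 730);
```

Once set, the expiration applies continuously with no scheduled job to maintain, and dropping whole partitions avoids the scan and slot costs of a large DELETE.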
Step 3: Schedule a Regular Cleanup Job
Manually running your deletion script is a recipe for forgetting. You need to automate this process. Nearly every modern data warehouse and workflow management tool provides a way to schedule jobs. This could be:
- A daily or weekly stored procedure.
- A cron job that executes your SQL script.
- A scheduled task within a cloud service like Google Cloud Scheduler or AWS Lambda.
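As a minimal sketch, a cron entry for a nightly cleanup might look like the following. The schedule, file path, and choice of the BigQuery bq CLI are all illustrative; adapt them to your warehouse and tooling.

```shell
# Illustrative crontab entry: run the retention cleanup daily at 02:00,
# after the nightly ETL finishes. Path and CLI are placeholders.
0 2 * * * bq query --use_legacy_sql=false < /opt/etl/retention_cleanup.sql
```

Whatever scheduler you choose, make sure the job logs its outcome somewhere visible so a silent failure doesn't leave expired data in place.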
Managing Data and Caching within Looker
Now that your database is being regularly cleaned, you need to make sure Looker doesn't hold onto old results. You'll do this in your LookML project by configuring datagroups.
What is a Datagroup?
A datagroup is Looker's primary tool for managing caching: an object you define in a LookML model file that sets a caching policy. You can then apply this policy to Explores, models, and PDTs to ensure they all refresh on the same schedule.
Controlling Looker Queries with Datagroups
Let's say your data warehouse purges old data every night at 2 AM. You want Looker to clear its cache shortly after to fetch fresh results. You can define a datagroup that triggers this refresh.
In your LookML model file (e.g., my_model.model.lkml), you would add:
datagroup: hourly_refresh {
  max_cache_age: "1 hour"
  sql_trigger: SELECT MAX(id) FROM etl_log
}
Here's what that does:
- max_cache_age sets a hard limit. In this example, Looker will not serve cached results older than one hour, no matter what.
- sql_trigger is the star of the show. Looker runs this query periodically; if the result differs from the last run, the datagroup is triggered and Looker invalidates all cached data associated with it.
To apply this to your Explores, you would set a persist_with parameter:
explore: orders {
persist_with: hourly_refresh
}
explore: users {
persist_with: hourly_refresh
}
Now, any dashboards or Looks built from the orders or users Explores will adhere to the hourly_refresh cache policy.
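If most Explores in a model share the same policy, you can also declare persist_with once at the model level; individual Explores can still override it with their own setting:

```lookml
# Model-level default: every Explore in this file uses hourly_refresh
# unless it declares its own persist_with.
persist_with: hourly_refresh
```

This keeps the caching policy in one place, so adding a new Explore doesn't require remembering to attach the datagroup.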
Handling Persistent Derived Tables (PDTs)
PDTs that summarize data over long periods can easily become stale after your cleanup scripts run. You can tie their rebuild schedule to your datagroup as well.
In your PDT's view file, you would use datagroup_trigger:
view: user_lifetime_value {
derived_table: {
datagroup_trigger: hourly_refresh
sql: SELECT user_id, SUM(order_total) as total_revenue FROM orders GROUP BY 1
}
}
With this configuration, whenever the hourly_refresh datagroup is triggered, Looker will not only clear the cache but also rebuild the user_lifetime_value table from scratch, ensuring it's always based on the current data in your warehouse.
Best Practices for a Healthy Data Environment
Implementing the steps above is a solid start. To maintain a truly clean and compliant environment, consider these additional practices.
- Document Everything: Your data retention policy should live in a central, accessible document. It should detail what data is purged, when, and why. This is incredibly important for training new team members and satisfying any compliance requests.
- Perform Regular Audits: Periodically verify that your deletion scripts are running correctly and that your Looker caches are behaving as expected. Check your database costs to see if your policies are having the intended financial impact.
- Align Teams: Ensure that your marketing, sales, and product teams are aware of the retention policies. If the marketing team knows that user-level granular data is only available for 18 months, they won't try to build a 24-month lookback report and wonder why the data is incomplete.
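For the audit step, a quick query against a purged table can confirm the retention rule is holding. Reusing the session example from earlier, this should return zero rows older than the cutoff if the nightly purge is working:

```sql
-- Audit check: count rows older than the 2-year retention cutoff.
-- A non-zero result means the cleanup job is not keeping up.
SELECT COUNT(*) AS stale_rows
FROM web_analytics.sessions
WHERE session_start_timestamp < DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR);
```

Running a check like this on a schedule, and alerting when stale_rows is non-zero, turns the audit from a periodic chore into an automatic safeguard.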
Final Thoughts
Enforcing a data retention policy in Looker is a two-part process. The most critical part happens in your data warehouse, where you define and automate the deletion of records. The second part, handled within Looker, is about aligning your caching policies and PDT rebuilds to ensure users are always seeing fresh, accurate data that reflects your rules.
Managing LookML files, sql_trigger queries, and datagroups for data governance is powerful, but it's also a reflection of the technical configurations needed to maintain traditional BI tools. At Graphed, we created a different experience. We believe you should be able to get answers from all your marketing and sales platforms instantly, without the engineering overhead. Instead of building complex models, you can connect your data sources in seconds and ask questions in plain English to create live, interactive dashboards that are always up-to-date.