Data Reality¶

Your data determines your strategy more than your ambition does.

Part of From Strategy to Production

The Data Assumption¶

Every AI strategy assumes data. "We'll use AI to analyse customer behaviour." "We'll build an AI-powered risk model." "We'll automate document processing with AI."

Each of these assumes that the required data exists, is accessible, is clean enough to use, and can legally be processed. In most organisations, at least one of these assumptions is wrong.

Data isn't a supporting resource for AI strategy. Data is the strategy's foundation. An organisation with excellent data and a modest model will outperform an organisation with poor data and the most capable model available. This is well-established in machine learning literature and routinely ignored in strategic planning.

The Data Readiness Assessment¶

Before committing to any AI initiative, assess data reality across five dimensions:

1. Existence¶

Does the data actually exist?

This sounds absurd to ask, but it's the most common failure point. Organisations frequently plan AI initiatives around data they assume they have but don't.

Scenario	What's Assumed	What's Actually True
"Customer interaction history"	Complete records of all customer contacts	Phone calls aren't transcribed; email is in 3 systems; chat logs were deleted after 90 days
"Product performance data"	Detailed usage metrics	Basic pageviews exist; feature-level usage isn't tracked
"Employee skills database"	Structured skills inventory	Freeform text in CV uploads, last updated 3 years ago
"Fraud labels"	Known fraud cases with outcomes	Fraud team uses spreadsheets; labelling is inconsistent; most cases are "suspected" not confirmed
"Equipment sensor data"	Continuous readings from all machines	60% of machines have sensors; data gaps during maintenance windows; sampling rate varies

Before you plan the AI, find the data. Literally. Go look at it. Open the database. Export a sample. If someone says "we have that data," ask them to show you.

2. Accessibility¶

Can we get to the data?

Data existing somewhere in the organisation is not the same as data being available for AI use.

Barrier	How Common	Impact
Siloed systems	Very common	Data in system A can't be joined with data in system B without a major integration project
Legacy platforms	Common	Data exists but in formats, systems, or databases that are difficult to extract from
No API access	Common	Data is in a SaaS platform with limited or no programmatic access
Access permissions	Very common	Data team doesn't have access; getting it requires approvals from multiple business units
Real-time vs. batch	Common	Data exists but is only available in nightly batch extracts, not real-time

A common strategic mistake: planning an AI initiative that requires real-time access to data that is only available in a data warehouse refreshed overnight. The AI is ready; the data pipeline isn't. Building that pipeline typically adds 3-6 months that were not in the original plan.

3. Quality¶

Is the data good enough?

"Good enough" depends on the use case and risk tier. A CRITICAL-tier system making autonomous financial decisions needs higher data quality than a MEDIUM-tier internal assistant.

Quality Dimension	What to Check	Why It Matters for AI
Completeness	What percentage of records have all required fields?	Missing data leads to biased models and unreliable outputs
Accuracy	How much of the data is actually correct?	Incorrect training data produces incorrect models
Consistency	Are the same things represented the same way?	"UK", "United Kingdom", "GB", "GBR" — same country, four representations
Timeliness	How current is the data?	Models trained on stale data make decisions about a world that no longer exists
Duplication	How many records are duplicated?	Duplicates skew distributions and bias models
Labelling quality	For supervised learning: are labels consistent and correct?	Inconsistent labelling is often the largest single quality problem in supervised ML
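
Some of these dimensions can be measured directly on an exported sample. The sketch below is a minimal illustration, not a substitute for a data quality tool: the records, the field names and the `COUNTRY_ALIASES` mapping are all hypothetical. It covers completeness, duplication and consistency only; accuracy, timeliness and labelling quality need an external reference to check against.

```python
from collections import Counter

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "country": "UK", "email": "a@example.com"},
    {"id": 2, "country": "United Kingdom", "email": None},
    {"id": 3, "country": "GB", "email": "c@example.com"},
    {"id": 3, "country": "GB", "email": "c@example.com"},  # exact duplicate
    {"id": 4, "country": "GBR", "email": ""},
]

REQUIRED_FIELDS = ("id", "country", "email")

# Assumed mapping of known spellings to one canonical value.
COUNTRY_ALIASES = {"uk": "GB", "united kingdom": "GB", "gb": "GB", "gbr": "GB"}


def completeness(rows, required=REQUIRED_FIELDS):
    """Share of rows where every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(name) not in (None, "") for name in required) for row in rows
    )
    return complete / len(rows)


def duplication(rows):
    """Share of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(rows)


def distinct_representations(rows, field, aliases):
    """Return (number of raw spellings, number of canonical values) for a field."""
    raw = {row[field] for row in rows if row.get(field)}
    canonical = {aliases.get(value.strip().lower(), value) for value in raw}
    return len(raw), len(canonical)


print(completeness(records))                                          # 0.6
print(duplication(records))                                           # 0.2
print(distinct_representations(records, "country", COUNTRY_ALIASES))  # (4, 1)
```

A gap between the two numbers in the last result is the "UK" versus "GBR" problem from the table: four spellings, one country.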

Real-world example — customer churn prediction:

An insurance company wants to predict customer churn. They have 5 years of customer data. Assessment reveals:

Address fields are 70% complete (customers who moved didn't always update)
"Churn reason" was only tracked for the last 18 months
Product names changed twice, creating three naming conventions in the same dataset
12% of records are duplicates from a system migration
"Renewal" sometimes means "auto-renewed" and sometimes means "customer actively renewed" — no way to distinguish

The model can still be built. But the strategy needs to account for 3-6 months of data cleaning before any AI development starts. And the model will be less accurate than the business case assumed, because the underlying data has limitations that no model can overcome.

The framework connection: The framework's data protection controls (DAT-01 to DAT-08) focus on securing data. Data quality is a prerequisite the framework assumes but doesn't enforce. If your data quality is poor, the framework's controls will faithfully protect poor data — and the AI will faithfully produce poor outputs.

4. Representativeness¶

Does the data represent the real world the AI will operate in?

This is particularly important for high-stakes AI and connects directly to the framework's novel risks #5 (opacity) and #6 (training data influence).

Representativeness Problem	Example	Consequence
Historical bias	Lending data reflects past discrimination	AI perpetuates discriminatory patterns
Survivorship bias	Only data on customers who stayed, not those who left	Model can't predict churn because it's never seen churners
Geographic bias	Data primarily from one region	Model performs poorly in other regions
Temporal bias	Training data from pre-pandemic; deployment in post-pandemic economy	Model assumes conditions that no longer exist
Selection bias	Data from customers who opted in to tracking	Model doesn't represent customers who opted out
Label bias	Fraud labels applied by a team that only investigated certain transaction types	Model only detects fraud it was shown
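
Where a reference population is known, some of these problems (geographic and selection bias in particular) show up as a mismatch in category shares. The following is a rough sketch under that assumption; the region labels are invented and the 10-point `threshold` is an arbitrary illustrative cut-off. It cannot detect historical or label bias, where the training data and the population agree but both reflect a skewed process.

```python
from collections import Counter


def shares(values):
    """Return each category's share of the total as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def representation_gaps(training, population, threshold=0.10):
    """Return categories whose training share differs from the population
    share by more than `threshold`, with the signed difference.

    A positive value means the category is over-represented in training data.
    """
    train_share = shares(training)
    pop_share = shares(population)
    gaps = {}
    for category in sorted(set(train_share) | set(pop_share)):
        diff = train_share.get(category, 0.0) - pop_share.get(category, 0.0)
        if abs(diff) > threshold:
            gaps[category] = round(diff, 2)
    return gaps


# Hypothetical region labels: the training records, and the customers the
# model is expected to serve once deployed.
training_regions = ["north"] * 80 + ["south"] * 15 + ["west"] * 5
population_regions = ["north"] * 40 + ["south"] * 35 + ["west"] * 25

print(representation_gaps(training_regions, population_regions))
```

Here the training data over-represents one region by 40 points, which is the geographic bias row in the table expressed as a number.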

Strategic implication: If your data has representativeness problems, the AI will have bias problems. This isn't a technical issue to fix with model tuning — it's a data problem that requires data solutions. For CRITICAL-tier systems (lending, hiring, insurance pricing), representativeness isn't optional. Regulators will ask.

5. Permissibility¶

Are we allowed to use this data for this purpose?

Legal/Ethical Dimension	Question	Consequence of Getting It Wrong
Data protection	Does our GDPR/privacy basis cover AI processing?	Regulatory enforcement, fines
Consent	Did customers consent to AI-based decisions?	Complaints, regulatory action
Contractual	Do data supplier contracts permit AI use?	Contract breach, data loss
Purpose limitation	Was data collected for a different purpose?	Breach of GDPR Article 5(1)(b)
Automated decision-making	Are we making solely automated decisions with legal or similarly significant effects?	GDPR Article 22 rights apply
Intellectual property	Are we training on copyrighted content?	IP litigation risk
Ethical	Even if legal, should we use this data this way?	Reputational risk

Real-world scenario: A retailer wants to use loyalty card data to train a personalised recommendation AI. The loyalty card terms say data will be used "to provide personalised offers." Legal advises that training an AI model is a different kind of processing from running a database query, and may require either updated consent or a documented legitimate interests assessment under GDPR. This isn't a technical blocker; it's a 3-month legal process that wasn't in the project plan.
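
The five dimensions can be recorded as a simple checklist per initiative. This is a minimal sketch, assuming a yes/no answer per dimension; the class name `DataReadiness` and its methods are hypothetical, and a real assessment would also record evidence and an owner for each answer.

```python
from dataclasses import dataclass, fields


@dataclass
class DataReadiness:
    """One yes/no answer for each of the five readiness dimensions."""

    existence: bool           # Has someone actually looked at a sample?
    accessibility: bool       # Can the team reach it in the form the use case needs?
    quality: bool             # Good enough for the use case and risk tier?
    representativeness: bool  # Does it reflect the world the AI will operate in?
    permissibility: bool      # Are we allowed to use it for this purpose?

    def blockers(self):
        """Names of the dimensions that are not yet satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self):
        """True only when every dimension is satisfied."""
        return not self.blockers()


# Hypothetical answers for the loyalty card scenario: the data exists and is
# usable, but the legal basis for AI training is unresolved.
loyalty_recommendations = DataReadiness(
    existence=True,
    accessibility=True,
    quality=True,
    representativeness=True,
    permissibility=False,
)

print(loyalty_recommendations.ready())     # False
print(loyalty_recommendations.blockers())  # ['permissibility']
```

The point of listing blockers rather than returning a score is that a single failed dimension is enough to stall the initiative.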

How Data Reality Shapes Strategy¶

The Data-Strategy Matrix¶

Organisations fall into one of four quadrants based on their data position and their strategic ambition:

Strategy	Data ready	Data not ready
Ambitious strategy (HIGH/CRITICAL tier)	Possible but expensive. Data is ready; controls are significant. Viable if funded and governed properly.	Dangerous. High ambition on weak foundations. Most likely outcome: expensive failure or silent quality problems.
Conservative strategy (LOW/MEDIUM tier)	Best starting position. Good data, low risk. Quick wins that build capability and confidence.	Manageable. Low risk means data limitations have lower consequences. Start here, improve data in parallel.

The common mistake: Organisations with "data not ready" pursue ambitious strategies because the business case is compelling. The business case assumes the data problem will be solved during the project. It usually isn't.

The better approach: Start in the bottom-left quadrant (conservative strategy, data ready). Build capability. Improve data quality in parallel. Move to the top-left (ambitious strategy, data ready) when both data and organisational readiness support it. This is the Progression path.

Data Quality as Risk Multiplier¶

The framework's risk tiers classify systems by their potential impact. Data quality should modify this classification:

Risk Tier	Good Data	Poor Data
LOW	Standard controls sufficient	Standard controls sufficient (errors are low-impact)
MEDIUM	Standard controls sufficient	Consider upgrading to HIGH — poor data increases error probability
HIGH	HIGH controls appropriate	Consider upgrading to CRITICAL — poor data on high-impact decisions
CRITICAL	Maximum controls	Question the deployment. CRITICAL decisions on poor data may not be appropriate.
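
The table can be read as a small lookup rule. The sketch below encodes it under two simplifying assumptions: data quality is reduced to a single good/poor flag, and the function name `controls_for` is hypothetical. It illustrates this page's suggestion, not behaviour the framework implements.

```python
TIERS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


def controls_for(tier, data_is_good):
    """Return (tier whose controls to consider, note) following the table."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    if data_is_good or tier == "LOW":
        # LOW stays LOW even with poor data because errors are low-impact.
        return tier, "controls for the assigned tier"
    if tier == "CRITICAL":
        # There is no higher tier to move to.
        return tier, "question the deployment"
    upgraded = TIERS[TIERS.index(tier) + 1]
    return upgraded, f"consider upgrading from {tier}"


print(controls_for("MEDIUM", data_is_good=False))
print(controls_for("CRITICAL", data_is_good=False))
```

The output is a prompt for a human decision ("consider", "question"), matching the table's wording, rather than an automatic reclassification.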

The framework doesn't currently incorporate data quality into risk classification. This is a gap. A CRITICAL-tier system built on well-curated, representative data has a different risk profile from a CRITICAL-tier system built on incomplete, biased data. The controls are the same; the residual risk is not.

Data Strategies That Work¶

Strategy 1: Start With What You Have¶

Don't wait for perfect data. Identify what's available now, assess its quality honestly, and design AI use cases that work within those constraints.

Example: A bank wants AI-powered customer insights. The ideal dataset would integrate all customer touchpoints. The available data is transactional data from core banking — complete, accurate, but limited in scope. Strategy: start with transaction-based insights (spending patterns, anomaly detection) where the data is strong. Add other data sources over time.

Framework alignment: This naturally produces a Fast Lane or Tier 1 deployment. Internal users, read-only, leveraging data you already control.

Strategy 2: Build Data Capability Alongside AI Capability¶

Treat data improvement as a parallel workstream, not a prerequisite. But be honest about which AI use cases are viable now and which need better data first.

Timeline	AI Capability	Data Capability
Months 1-3	Fast Lane deployments using existing, well-understood data	Audit data landscape; identify gaps; start data quality programme
Months 4-6	Tier 1 deployments with cleaned data	First data quality improvements delivered; new data pipelines built
Months 7-12	Tier 2 deployments with enriched data	Data platform operational; quality monitoring automated
Year 2	Tier 3 (if needed) with high-quality, well-governed data	Mature data governance; continuous quality improvement

Strategy 3: Buy Data Capability, Don't Build It¶

For many organisations, building a data platform from scratch is a multi-year, multi-million-pound programme. Consider:

Cloud data platforms (Snowflake, Databricks, BigQuery) that provide infrastructure you don't have to build yourself
Data quality tools (Great Expectations, Monte Carlo, Anomalo) that automate quality monitoring
Data cataloguing tools (Alation, Collibra, Atlan) that help answer the "does this data exist?" question

These don't eliminate the need for internal data governance, but they accelerate time to usable data.

Strategy 4: Accept Data Limitations Explicitly¶

Not every data problem needs solving. For LOW and MEDIUM tier AI deployments, imperfect data may be acceptable — as long as:

The limitations are documented
The impact of data errors is understood
Human review catches the cases where data quality causes AI errors
The risk tier accounts for data quality (not just use case impact)

The framework supports this. The three-layer model (Guardrails → Judge → Human) is designed to catch errors regardless of their cause. Data quality issues manifest as AI output errors, which the Judge can detect and humans can review. The controls don't need to know why the output is wrong — they need to know that it's wrong.

Data Anti-Patterns¶

Anti-Pattern	Why It Happens	What Goes Wrong
"We'll fix data quality during the project"	Project timelines don't include data work	Data cleaning takes longer than expected; AI development stalls
"Our data warehouse has everything"	Assumption based on warehouse existence, not content audit	Data warehouse has aggregated data; AI needs raw data; they're in different systems
"We'll use synthetic data"	Attractive solution to data availability	Synthetic data may not match real-world distributions; the model then learns synthetic patterns
"More data is always better"	Quantity over quality mindset	Large, poor-quality datasets can produce worse models than small, good-quality ones
"The vendor handles our data"	Outsource AI to a vendor; assume data is their problem	Vendor needs your data in a usable form; data preparation still falls on you
"Our data is fine"	No one has actually assessed quality	Reality is only discovered when the model performs poorly in production

The Uncomfortable Truth¶

Most organisations' data is not ready for the AI strategy they want to pursue. This is not a failure — it's a starting condition. The organisations that succeed are the ones that:

Assess data reality before committing to AI strategy — not after
Match ambition to data maturity — pursue what's possible now, not what's possible with data you don't yet have
Invest in data as a strategic asset — not as an AI project dependency
Accept that some use cases need to wait — until the data supports them
Use the framework's risk tiers to calibrate expectations — higher tiers need better data, not just better controls

The framework can secure AI systems built on poor data. It cannot make them accurate. That distinction is critical for strategy.

AI Runtime Behaviour Security, 2026 (Jonathan Gill).