What is AI Safety?
AI safety is the multidisciplinary field focused on building AI systems that don't cause harm: spanning technical alignment research (making models do what we want), robustness (making models behave well on novel inputs), interpretability (understanding what models learn), governance (policies and norms for AI development), and existential safety (concerns about future AI systems whose capabilities exceed human oversight).
Overview
AI safety has evolved from an academic concern in the 2010s into one of the most important technical and policy fields of the 2020s. The discipline spans many subfields: alignment research (RLHF, Constitutional AI, DPO), robustness research (adversarial training, distribution shift handling), interpretability (Anthropic, MIRI, Apollo Research, Goodfire), evaluation and red-teaming (UK AISI, US AISI, METR), governance (NIST AI RMF, EU AI Act, US AI executive orders), and existential safety research (focused on risks from much more capable future AI). Anthropic, DeepMind, OpenAI's safety team, MATS, ARC, and many academic groups produce the technical research; policy work happens at national AI Safety Institutes and within governments. For production AI engineering, the practical layer of AI safety is what BearPlex implements daily: alignment via prompts and guardrails, evaluation harnesses, red-team testing, monitoring, and incident response.
Subfields of AI safety
(1) Alignment: making AI systems do what humans intend; covers RLHF, Constitutional AI, DPO, scalable oversight research. (2) Robustness: making AI systems perform reliably on novel inputs, including adversarial inputs; covers adversarial training, distribution shift handling, prompt injection defense. (3) Interpretability: understanding what AI systems learn and why they make decisions; covers mechanistic interpretability, attribution methods, model probing. (4) Evaluation and red-teaming: measuring AI capability and safety properties; covers benchmark design, dangerous capability evaluations, red-team frameworks. (5) Governance: policies and norms for AI development and deployment; covers AI executive orders, EU AI Act, sectoral regulations, voluntary commitments. (6) Existential safety: research on risks from much more capable future AI systems; smaller field but high-profile, covers control research, alignment scalability, and societal preparedness.
Production AI safety practice
For BearPlex production engagements, safety practice is concrete: (1) Pre-deployment red-team evaluation against known attack patterns (OWASP LLM Top 10, prompt injection suites, jailbreak datasets); (2) Programmatic guardrails on inputs (PII detection, content moderation, prompt injection detection) and outputs (structured validation, safety filtering, citation verification); (3) Tool design with privilege separation: read operations unprivileged, destructive operations gated behind human approval; (4) Audit logging on every input, output, and action for incident review; (5) Monitoring for safety-relevant signals: refusal rate, escalation rate, user complaints, anomalous outputs; (6) Incident response plans, who responds, how the system gets paused, how the issue gets fixed; (7) Documentation matching the client's compliance framework (NIST AI RMF, ISO 42001, sectoral requirements). The work is unsexy but it's what separates production-grade AI from demos.
AI safety governance landscape (2026)
Current production-relevant frameworks: (1) NIST AI Risk Management Framework (AI RMF), voluntary US framework widely adopted; covers govern, map, measure, manage functions; (2) EU AI Act: risk-tiered regulatory framework; high-risk AI systems have specific obligations including risk management, data governance, transparency, oversight, accuracy/robustness/cybersecurity; entered into force August 2024 with phased compliance; (3) ISO 42001: international standard for AI management systems; voluntary but increasingly required in enterprise procurement; (4) Sectoral regulations: FDA SaMD for healthcare, FINRA / SEC / OCC for financial services, FTC guidance for consumer products; (5) National AI Safety Institutes (UK AISI, US AISI, EU AI Office): government bodies conducting frontier model evaluations and developing standards. For production AI engagements in 2026, governance integration is increasingly mandatory rather than optional.
Use cases
- Pre-deployment safety evaluation for production AI systems
- Compliance with NIST AI RMF, EU AI Act, ISO 42001 frameworks
- Red-team testing for prompt injection, jailbreaking, and adversarial robustness
- Building incident response capabilities for AI system failures
- Procurement and vendor evaluation for enterprise AI adoption
Examples in production
NIST AI Risk Management Framework
NIST AI RMF provides voluntary US framework for managing AI risk; widely adopted in enterprise and federal contexts.
SourceEU AI Act
First comprehensive regulatory framework for AI; entered into force August 2024 with risk-tiered obligations on AI systems.
SourceUK AI Safety Institute
Government body conducting independent frontier model evaluations; publishes safety research and capabilities assessments.
SourceAnthropic Responsible Scaling Policy
Voluntary commitment framework where Anthropic deploys safety measures scaled to model capability; influenced industry voluntary commitments.
SourceAI Safety compared to alternatives
| Alternative | Choose AI Safety when | Choose alternative when |
|---|---|---|
AI alignment Specifically: getting AI systems to do what humans intend | AI safety is broader: encompasses alignment plus robustness, interpretability, evaluation, governance | AI alignment is one (important) subfield within AI safety |
AI ethics Concerns about AI's social, moral, and societal implications | AI safety focuses more on technical risk management and harm prevention | AI ethics covers broader societal questions; the fields overlap but emphasize different concerns |
Common pitfalls
- Conflating AI safety with bias / fairness: related but distinct concerns with different mitigation approaches
- Treating safety as a checkbox at deployment time: needs to be designed in throughout development
- Focusing only on alignment without robustness, evaluation, monitoring: incomplete defense
- Underinvesting in incident response: when (not if) safety failures occur, response speed matters
- Ignoring governance frameworks until forced to comply: much more expensive to retrofit than build in
Questions about AI Safety.
Practically, every day. Production AI safety means: pre-deployment red-team evaluation, guardrails on inputs and outputs, tool design with privilege separation, audit logging, monitoring for safety signals, incident response capabilities, and compliance with applicable governance frameworks. These aren't research questions: they're engineering deliverables on every BearPlex production engagement.
If you have any users or customers in the EU, yes: the EU AI Act has extraterritorial application similar to GDPR. The compliance complexity depends on the risk tier of your AI system. Most consumer-facing and enterprise B2B AI applications fall into limited-risk or minimal-risk categories with lighter obligations; high-risk applications (employment, education, critical infrastructure, law enforcement, healthcare, financial services) have substantial compliance requirements.
Standard enterprise AI procurement frameworks include: NIST AI RMF (most common in US), ISO 42001 (increasingly required for international procurement), sectoral requirements (HIPAA for healthcare, FINRA for financial services). Ask vendors for: (1) safety evaluation reports, (2) red-team testing results, (3) governance documentation aligned to the framework you're using, (4) incident response capabilities, (5) ongoing monitoring and update procedures.
Need help implementing AI Safety?
BearPlex builds production AI systems that use AI Safety for Fortune 500s and high-growth scale-ups. Outcome-based pricing. 90-day embedded sprints.