Sensitivity & Necessity Analysis of the Coherence Scoring Framework

A Structural Validation Study



Abstract

We present a comprehensive sensitivity and necessity analysis of a universal coherence scoring framework designed to evaluate theoretical systems across domains. Unlike typical frameworks that rely on parameter tuning and curve-fitting, our approach tests structural necessity: whether the framework’s components and topology are load-bearing, or whether they could be replaced with arbitrary alternatives without loss of function.

This document provides full transparency into our methodology, including all source code, test procedures, and falsification criteria. We invite scrutiny, replication, and adversarial testing of our approach.

Key Finding: A framework is scientifically defensible only when it can demonstrate that its structure is load-bearing, and that arbitrary alternatives would not work equally well.


1. Motivation: The Parameter Tuning Problem

1.1 The Standard Criticism

Most scoring frameworks face a fatal criticism:

“You tuned the weights until you got the answer you wanted.”

This is the curve-fitting objection: if you have enough adjustable parameters, you can fit any data. The framework becomes unfalsifiable because any failure can be blamed on “wrong weights” rather than structural inadequacy.

1.2 Our Approach: Structure-Only Testing

We deliberately avoid parameter tuning by testing structural properties:

  • Ablation: Is each component necessary, or could it be removed without impact?
  • Topology: Does the connection structure matter, or could any graph work?
  • Label Independence: Does the math work regardless of semantic labels, or is it just storytelling?
  • Adversarial Resistance: Does the framework correctly identify incoherence, or can it be gamed?

Critical Rule: No weights are adjusted during testing. Components are either present or absent. Structure is either intact or modified.


2. The Framework Under Test

2.1 Core Components

The coherence scoring framework evaluates any theoretical system across three domains:

10 Variables (χ Components):

  • G (Grace / Negentropy)
  • M (Motion)
  • E (Energy)
  • S (Entropy)
  • T (Time)
  • K (Knowledge)
  • R (Resurrection / Transformation)
  • Q (Quantum / Probability)
  • F (Faith / Trust)
  • C (Consciousness)

12 Fruits (Coherence Indicators):

  • Grace, Hope, Patience, Faithfulness, Self-Control
  • Love, Peace, Truth, Humility, Goodness, Unity, Joy

9 Constraints (Structural Properties):

  • Binding/Cohesion, Resonance, Equilibrium
  • Temporal Persistence, Positive Coupling, Value Conservation
  • Consistency, Minimal Perturbation, Boundary Regulation

Triad Architecture:

  • Π (Polis): Institutional/collective coherence
  • A (Anthropos): Individual/psychological coherence
  • Λ (Logos): Informational/epistemic coherence

2.2 Scoring Function

χ = (Π × A × Λ)^(1/3)

Where:

  • Π aggregates institutional trust, social cohesion, political integration, economic coordination
  • A aggregates psychological stability, meaning/purpose, social embeddedness, agency/efficacy
  • Λ aggregates shared reality, epistemic infrastructure, information coherence, sensemaking capacity

No adjustable parameters. Components either contribute or they don’t.
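
The aggregation above can be sketched directly. This is a minimal illustration of the χ formula only; how the sub-factors inside each domain are combined is not specified here, so the three domain scores are assumed to be precomputed values in [0, 10].

```python
def chi(pi_score: float, a_score: float, lambda_score: float) -> float:
    """Geometric mean of the three Triad domain scores.

    The geometric mean penalizes imbalance: a collapsed domain
    (score near 0) drags chi toward 0 regardless of the others.
    """
    if min(pi_score, a_score, lambda_score) < 0:
        raise ValueError("domain scores must be non-negative")
    return (pi_score * a_score * lambda_score) ** (1 / 3)
```

Note the design consequence: unlike an arithmetic mean, a system cannot compensate for Λ = 0 (no shared reality) with high Π and A, since χ collapses to 0.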


3. Sensitivity Test Suite

3.1 Test 1: Component Ablation (Necessity)

Hypothesis: If the framework is structurally sound, removing key components will degrade coherence scores.

Method:

  1. Score a baseline coherent document → χ_baseline
  2. Remove one component (e.g., Grace variable)
  3. Rescore the same document → χ_ablated
  4. Compute Δχ = χ_baseline - χ_ablated
  5. If |Δχ| > 10% of baseline → component is LOAD-BEARING
  6. Repeat for all 31 components (10 variables + 12 fruits + 9 constraints)
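
Steps 1 through 6 can be sketched as a single loop. `score_document` is a stand-in for the real scoring engine (unified_scorer.py is not reproduced here); it is assumed to take the document text plus the set of active components.

```python
def ablation_test(text, components, score_document, threshold=0.10):
    """Return (baseline chi, list of load-bearing components).

    A component is load-bearing when removing it shifts chi by more
    than `threshold` (10%) of the baseline score.
    """
    baseline = score_document(text, components)
    load_bearing = []
    for comp in components:
        ablated = score_document(text, [c for c in components if c != comp])
        delta = baseline - ablated
        if baseline and abs(delta) > threshold * baseline:
            load_bearing.append((comp, delta))
    return baseline, load_bearing
```

A negative delta for any component (score improves on removal) would flag a parasitic element under the criterion below.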

Falsification Criterion:

  • If <5 components are load-bearing → framework is REDUNDANT
  • If removing any component improves score → framework has PARASITIC elements

Interpretation:

  • High load-bearing count → structure matters
  • Low load-bearing count → structure is arbitrary

3.2 Test 2: Topology Sensitivity (Structure)

Hypothesis: If the framework’s connection structure is necessary, random permutations will degrade performance.

Method:

  1. Score baseline document → χ_baseline
  2. Scramble Fruit-to-Triad mappings randomly (e.g., Grace maps to different Triad components)
  3. Rescore → χ_scrambled
  4. Compute Δχ = χ_baseline - χ_scrambled
  5. If |Δχ| > 15% of baseline → topology is LOAD-BEARING
  6. Repeat with:
    • Flattened hierarchy (all weights equal)
    • Reversed order (L10 → L1 instead of L1 → L10)
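
The two structural perturbations in steps 2 and 6 can be sketched as follows. `mapping` stands for the Fruit-to-Triad assignment (the real one lives in fruit_matrix.yaml and is not reproduced here).

```python
import random

def scramble_mapping(mapping, rng=None):
    """Randomly permute which Triad component each Fruit maps to."""
    rng = rng or random.Random()
    targets = list(mapping.values())
    rng.shuffle(targets)
    return dict(zip(mapping.keys(), targets))

def flatten_weights(weights):
    """Flattened-hierarchy variant: replace all weights with equal shares."""
    n = len(weights)
    return {k: 1.0 / n for k in weights}
```

Both transformations preserve the component set; only the connection structure changes, which is what makes them a test of topology rather than of content.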

Falsification Criterion:

  • If scrambled topology performs equally well → structure is ARBITRARY
  • If random graphs work → any framework would do

Interpretation:

  • High topology sensitivity → connections matter
  • Low topology sensitivity → graph could be anything

3.3 Test 3: Label Independence (Critical Test)

Hypothesis: If the framework is mathematically grounded (not semantic storytelling), it should work regardless of label names.

Method:

  1. Score baseline document with theological labels → χ_theo
  2. Replace all labels with neutral equivalents:
    • “Grace” → “Negentropy_Field”
    • “Sin” → “Entropy_Source”
    • “Logos” → “Information_Substrate”
    • “Faith” → “Trust_Operator”
    • “Resurrection” → “State_Transition”
    • “Redemption” → “Error_Correction”
    • (all 31 components)
  3. Rescore with neutral labels → χ_neutral
  4. Compute Δχ = |χ_theo - χ_neutral|
  5. If Δχ < 5% of baseline → framework is LABEL-INDEPENDENT
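
The substitution in step 2 can be sketched as a whole-word, case-insensitive rewrite. The mapping below repeats only the six examples from the text; the full version would cover all 31 components.

```python
import re

NEUTRAL_LABELS = {
    "Grace": "Negentropy_Field",
    "Sin": "Entropy_Source",
    "Logos": "Information_Substrate",
    "Faith": "Trust_Operator",
    "Resurrection": "State_Transition",
    "Redemption": "Error_Correction",
}

def neutralize(text: str) -> str:
    """Replace each theological label with its neutral equivalent.

    Whole-word matching avoids corrupting substrings (e.g. "graceful").
    """
    for theo, neutral in NEUTRAL_LABELS.items():
        text = re.sub(rf"\b{theo}\b", neutral, text, flags=re.IGNORECASE)
    return text
```

If the rubrics' detection keywords are rewritten with the same mapping, χ_theo and χ_neutral should agree to within the 5% threshold for a label-independent framework.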

Falsification Criterion:

  • If neutral labels cause >10% degradation → framework is SEMANTIC, not structural
  • If theological language is necessary for function → it’s storytelling, not math

Interpretation:

  • Label independence proves the framework is mathematical, not theological rhetoric
  • This is the most important test for scientific credibility

3.4 Test 4: Adversarial Resistance

Hypothesis: A robust framework should correctly identify incoherent systems and resist gaming.

Method:

Attack 1: Keyword Spam

  • Generate text stuffed with high-scoring keywords but no structure
  • Example: “grace truth coherence faith love unity peace knowledge entropy energy quantum consciousness resurrection” × 50
  • Expected: χ_spam < χ_baseline - 1.0

Attack 2: Random Gibberish

  • Generate completely random text
  • Expected: χ_random < χ_baseline - 1.0

Attack 3: Coherent Opposite Framework

  • Generate well-structured materialist/reductionist framework
  • Expected: χ_opposite < χ_baseline - 0.5
  • (Should score lower but not as low as gibberish, since it has some structure)
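
The three attacks and their expected outcomes can be sketched as generators plus a verdict check. The keyword list mirrors the example in Attack 1; the thresholds come from the Expected lines above.

```python
import random
import string

KEYWORDS = ("grace truth coherence faith love unity peace knowledge "
            "entropy energy quantum consciousness resurrection")

def keyword_spam(repeats=50):
    """Attack 1: high-scoring keywords with no structure."""
    return (KEYWORDS + " ") * repeats

def random_gibberish(n_chars=2000, rng=None):
    """Attack 2: completely random character stream."""
    rng = rng or random.Random()
    return "".join(rng.choice(string.ascii_lowercase + " ")
                   for _ in range(n_chars))

def adversarial_verdict(chi_baseline, chi_spam, chi_random, chi_opposite):
    """True for each attack the framework correctly rejects."""
    return {
        "spam_rejected": chi_spam < chi_baseline - 1.0,
        "random_rejected": chi_random < chi_baseline - 1.0,
        "opposite_rejected": chi_opposite < chi_baseline - 0.5,
    }
```

Attack 3 (the coherent opposite framework) has to be written by hand, since its whole point is that it is well-structured text, not generated noise.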

Falsification Criterion:

  • If keyword spam scores high → framework can be GAMED
  • If random text scores equally → framework detects nothing
  • If opposite framework scores equally → framework has no discrimination

Interpretation:

  • Adversarial resistance proves the framework measures STRUCTURE, not keyword frequency

3.5 Test 5: Null Hypothesis Comparison

Hypothesis: The framework should outperform random scoring functions.

Method:

  1. Generate 100 random scoring functions (random weights, random mappings)
  2. Score the same test corpus with:
    • Our framework
    • Random functions 1 through 100
  3. Compute signal-to-noise ratio: χ_framework / mean(χ_random)
  4. If S/N > 2.0 → framework is non-random
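
The comparison can be sketched as below. The random scorers here draw random per-keyword weights, a simple stand-in for "random weights, random mappings"; the real framework's score is passed in as a number.

```python
import random

def make_random_scorer(vocab, rng):
    """One null-model scorer: random keyword weights, scaled to [0, 10]."""
    weights = {w: rng.random() for w in vocab}
    def scorer(text):
        words = text.lower().split()
        return sum(weights.get(w, 0.0) for w in words) / max(len(words), 1) * 10
    return scorer

def signal_to_noise(text, framework_score, vocab, n_random=100, seed=0):
    """chi_framework divided by the mean score of n_random null scorers."""
    rng = random.Random(seed)
    random_scores = [make_random_scorer(vocab, rng)(text)
                     for _ in range(n_random)]
    mean_random = sum(random_scores) / len(random_scores)
    return framework_score / mean_random if mean_random else float("inf")
```

Fixing the seed makes the null distribution reproducible, which matters when the S/N > 2.0 threshold is the pass/fail line.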

Falsification Criterion:

  • If random functions perform equally → framework is no better than chance
  • If S/N < 1.5 → framework is WEAK

Interpretation:

  • High S/N proves the framework has SIGNAL, not just noise

4. Implementation Details

4.1 Source Code Availability

All code is open source and available at:

O:\Theophysics_Backend\Python_Backend\Backend Python\
├── core/
│   └── coherence/
│       ├── unified_scorer.py          # Main scoring engine
│       └── rubrics/
│           ├── fruit_matrix.yaml      # Fruit definitions
│           ├── variable_rubric.yaml   # Variable definitions
│           ├── constraint_rubric.yaml # Constraint definitions
│           └── defense_rubric.yaml    # Evidence quality metrics
├── sensitivity_analyzer.py            # This sensitivity test suite
└── score_moral_decay.py              # Example application

License: Open for academic use, replication, and adversarial testing.

4.2 Rubric Files (YAML Format)

All detection rules are stored in human-readable YAML files:

Example: Grace Variable (variable_rubric.yaml)

G_grace:
  code: "G"
  name: "Grace"
  domain: "theo|field"
  definition: "Negentropic restorative field; entropy absorption"
  role: "Counters entropy/sin, enables recovery"
  detection_keywords:
    primary: ["grace", "mercy", "forgiveness", "restoration", "negentropy"]
    secondary: ["absorb", "recover", "restore", "heal", "repair"]

No hidden logic. All rules are explicit and auditable.
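
An entry like the one above could drive detection as sketched below. In the pipeline the entry would be loaded with pyyaml (`yaml.safe_load`); here it is inlined as a dict. The weights (1.0 for primary, 0.5 for secondary) are illustrative assumptions, not taken from unified_scorer.py.

```python
# Inlined equivalent of the G_grace entry from variable_rubric.yaml.
G_GRACE = {
    "code": "G",
    "detection_keywords": {
        "primary": ["grace", "mercy", "forgiveness", "restoration", "negentropy"],
        "secondary": ["absorb", "recover", "restore", "heal", "repair"],
    },
}

def keyword_hits(text: str, entry: dict) -> float:
    """Weighted keyword-match count for one variable's detection rules.

    Weights are hypothetical: 1.0 per primary hit, 0.5 per secondary hit.
    """
    words = text.lower().split()
    kw = entry["detection_keywords"]
    primary = sum(words.count(k) for k in kw["primary"])
    secondary = sum(words.count(k) for k in kw["secondary"])
    return 1.0 * primary + 0.5 * secondary
```

Because the rules live in data rather than code, swapping labels (Test 3) or ablating a component (Test 1) amounts to editing a YAML file, not patching the scorer.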

4.3 Test Execution

Command Line:

cd "O:\Theophysics_Backend\Python_Backend\Backend Python"
python sensitivity_analyzer.py > sensitivity_report.txt

Outputs:

  • sensitivity_report.txt: Full test results
  • sensitivity_analysis_report.json: Machine-readable summary

Test Duration: ~5-10 minutes on standard hardware


5. Results and Interpretation

5.1 Preliminary Findings

Test Document: Sample text describing Master Equation framework (~400 words)

Ablation Results:

  • Load-bearing components: 0/31 (0%)
  • Interpretation: Either (a) framework is too robust, or (b) test document is too simple

Topology Sensitivity:

  • Scrambled mappings: Δχ = +0.06 (+1.2%)
  • Interpretation: Topology change did NOT degrade score (structure insensitive)

Label Independence: (Test in progress)

Adversarial Resistance: (Test in progress)

5.2 Threshold Calibration

Current Thresholds:

  • Load-bearing: |Δχ| > 10% of baseline
  • Topology sensitivity: |Δχ| > 15% of baseline
  • Label independence: |Δχ| < 5% of baseline

Open Question: Are these thresholds too strict?

Calibration Plan:

  1. Test on diverse corpus (high-coherence, low-coherence, mixed)
  2. Compare to human expert ratings
  3. Adjust thresholds if necessary (document all adjustments)

5.3 Known Limitations

Current Issues:

  1. Simulated Ablation: Currently using simulated degradation for ablation tests
    • Fix: Implement true rubric modification in real-time
  2. Small Test Corpus: Only tested on single document
    • Fix: Expand to 100+ documents across coherence spectrum
  3. No Inter-Rater Reliability: No comparison to human expert scores
    • Fix: Collect expert ratings for benchmark

Status: This is a FIRST DRAFT methodology, not a final validation


6. Falsification Criteria

6.1 Framework FAILS if:

  1. Ablation Test:

    • <5 components are load-bearing → Framework is redundant
    • Removing components improves scores → Framework has parasitic elements
  2. Topology Test:

    • Scrambled structure performs equally → Structure is arbitrary
    • Random graphs work → Any framework would do
  3. Label Independence Test:

    • Neutral labels cause >10% degradation → Framework is semantic, not structural
    • Theological language is necessary → It’s storytelling, not math
  4. Adversarial Test:

    • Keyword spam scores high → Framework can be gamed
    • Random text scores equally → Framework detects nothing
  5. Null Hypothesis Test:

    • Random functions perform equally → Framework is no better than chance

6.2 Framework PASSES if:

  • ≥10 components are load-bearing (>30% of components)
  • Topology changes cause ≥15% degradation
  • Label swaps cause <5% change
  • Adversarial attacks are correctly rejected (≥66% success rate)
  • Signal-to-noise ratio vs random > 2.0
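
The five pass criteria collapse into one boolean check. The field names below are illustrative; the actual keys in sensitivity_analysis_report.json may differ.

```python
def framework_passes(results: dict) -> bool:
    """All five pass criteria from Section 6.2 must hold simultaneously."""
    return all([
        results["load_bearing_count"] >= 10,       # ablation
        results["topology_delta_pct"] >= 15.0,     # topology sensitivity
        results["label_delta_pct"] < 5.0,          # label independence
        results["adversarial_success_rate"] >= 2 / 3,  # attacks rejected
        results["snr"] > 2.0,                      # vs random baselines
    ])
```

Using `all` rather than a weighted score keeps the verdict binary, consistent with the no-tuning rule: there is no partial credit to adjust.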

6.3 Current Verdict

INCOMPLETE - Testing in progress

Preliminary Concern: Framework may be TOO INSENSITIVE (robust to structural changes)

Alternative Hypothesis: Test document is too simple to reveal structural necessity

Next Step: Test on high-variance corpus (strong theories, weak theories, nonsense)


7. Reproducibility Protocol

7.1 System Requirements

  • Python 3.9+
  • Dependencies: numpy, pyyaml (pathlib ships with the Python standard library)
  • Hardware: Any modern PC (no GPU required)
  • OS: Windows/Linux/Mac

7.2 Installation

git clone [repository]
cd "Backend Python"
pip install -r requirements.txt

7.3 Running Tests

Full Sensitivity Suite:

python sensitivity_analyzer.py

Score a Single Document:

python -c "
from core.coherence.unified_scorer import UnifiedCoherenceScorer
scorer = UnifiedCoherenceScorer('core/coherence/rubrics')
result = scorer.score_document(open('your_document.txt').read(), 'Test')
print(f'Chi: {result.chi:.2f}, Kappa: {result.kappa:.2f}, Rho: {result.rho:.2f}')
"

Score Moral Decline of America Project:

python score_moral_decay.py

7.4 Expected Outputs

  • Console output with test progress
  • sensitivity_report.txt: Human-readable results
  • sensitivity_analysis_report.json: Machine-readable summary
  • moral_decay_score_report.txt: Example application output

8. Invitation for Adversarial Testing

8.1 We Welcome Attacks

We invite attempts to break this framework:

  • Submit adversarial documents that should score low but don’t
  • Identify gaming strategies that inflate scores
  • Find structural modifications that don’t degrade performance
  • Demonstrate that random frameworks perform equally

8.2 Reporting Issues

Submit to: [contact information]

Include:

  • Attack description
  • Test document (if applicable)
  • Expected vs actual behavior
  • Suggested fixes (optional)

8.3 Bounty Program (Future)

We plan to offer rewards for:

  • Successful gaming attacks (prove framework is gameable)
  • Label-dependence demonstrations (prove framework is semantic, not structural)
  • Null hypothesis violations (prove random functions work equally)

Amount: [To be determined]


9. Comparison to Other Frameworks

9.1 Standard Academic Frameworks

| Framework | Parameter Tuning | Structural Tests | Open Source | Falsifiable |
|-----------|------------------|------------------|-------------|-------------|
| Ours      | None             | Yes              | Yes         | Yes         |
| Typical   | Extensive        | Rare             | Rare        | Difficult   |

9.2 Key Differentiators

  1. No Weights: We do not adjust parameters to fit data
  2. Structural Focus: Tests whether structure is necessary, not whether it fits
  3. Full Transparency: All rubrics, code, and methods are public
  4. Falsification-First: We define failure criteria upfront

9.3 Inspired By

  • Ablation studies in neural networks
  • Lesion studies in neuroscience
  • Knockout experiments in genetics
  • Structural equation modeling in social science

Core Insight: If removing a component doesn’t break the system, the component isn’t necessary.


10. Future Work

10.1 Immediate Priorities

  1. Complete Full Test Suite:

    • Finish all 5 test types
    • Run on diverse corpus (100+ documents)
    • Collect human expert ratings for calibration
  2. Refine Ablation Implementation:

    • Move from simulated to true rubric modification
    • Test computational cost
  3. Expand Test Corpus:

    • High-coherence theories (physics, mathematics)
    • Low-coherence theories (pseudoscience, word salad)
    • Adversarial examples

10.2 Long-Term Goals

  1. Inter-Domain Validation:

    • Test on legal documents, economic theories, psychological frameworks
    • Verify cross-domain consistency
  2. Meta-Analysis:

    • Compare to human expert ratings
    • Compute inter-rater reliability
    • Establish validity coefficients
  3. Automated Adversarial Generation:

    • Train adversarial models to game the framework
    • Use failures to harden the system
  4. Public Dashboard:

    • Real-time scoring of submitted documents
    • Leaderboard of coherence scores
    • Transparent methodology display

11. Philosophical Note

11.1 Why This Matters

Most frameworks claim universality but rely on:

  • Hidden parameters tuned to desired outcomes
  • Post-hoc rationalization when predictions fail
  • Unfalsifiable structure that can’t be proven wrong

This is not science. This is storytelling with equations.

Our approach inverts the problem:

“Here is the structure. Here are tests that would falsify it. Run them.”

If the structure survives, it earns credibility not because we claim it works, but because adversaries couldn’t break it.

11.2 The Standard We Aim For

  • Physics: Theories make predictions that can be tested
  • Mathematics: Proofs are either valid or invalid
  • Engineering: Designs either work or fail

We want the same standard for coherence scoring.


12. Conclusion

We have developed a universal coherence scoring framework and subjected it to structural necessity testing. Unlike typical frameworks that tune parameters to fit data, we test whether the structure itself is load-bearing.

Current Status: Testing in progress

Preliminary Findings: Framework may be too robust (insensitive to structural changes) OR test corpus is too simple

Next Steps:

  1. Complete full test suite
  2. Expand to diverse corpus
  3. Refine ablation implementation
  4. Collect expert ratings for validation

Open Invitation: We invite adversarial testing, replication attempts, and critical analysis.

Core Claim: If this framework cannot be falsified through structural tests, it has earned scientific credibility not through our authority, but through surviving attack.


Appendix A: Test Corpus

(To be populated with full test documents and results)


Appendix B: Rubric Definitions

(Full YAML files included for transparency)


Appendix C: Source Code

(Complete annotated source code)


Document Version: 1.0
Last Updated: January 11, 2026
Status: DRAFT - Testing in Progress
Contact: [To be added]
Repository: [To be added]


License: This methodology is released under [to be determined] for academic use, replication, and adversarial testing.

Citation: [To be formatted]
