MASTER DEVELOPMENT PROMPT

Triad Coherence Data Ingestion + Normalization + Multi-Format Processing System

SYSTEM GOAL: You will design and implement a full data ingestion + cleaning + normalization + preview + export system for the Coherence Index Project. The system must:

  1. Import raw data from multiple formats:

    • CSV
    • Excel (XLS/XLSX)
    • PDF (text extraction required)
    • Markdown (.md)
    • Text (.txt)
    • HTML
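For illustration, format routing could start as simply as an extension lookup (the helper name `detect_format` and the extension map below are assumptions, not a required API):

```python
from pathlib import Path

# Hypothetical first-pass router: map file extensions to parser keys.
# Content sniffing (e.g. magic bytes) could refine this later.
EXTENSION_MAP = {
    ".csv": "csv",
    ".xls": "excel", ".xlsx": "excel",
    ".pdf": "pdf",
    ".md": "markdown",
    ".txt": "text",
    ".html": "html", ".htm": "html",
}

def detect_format(filename: str) -> str:
    """Return the parser key for a file, or 'unknown' if unrecognized."""
    return EXTENSION_MAP.get(Path(filename).suffix.lower(), "unknown")
```

Each key would map to a parser module in the format-specific parser folder described under the architectural requirements.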
  2. Automatically detect:

    • Year columns
    • Values
    • Units
    • Missing data
    • Inconsistent formatting
    • Header anomalies
    • Multi-row headers
    • Broken or shifted tables
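Year-column detection could begin with a simple heuristic like the sketch below (`looks_like_year` and its 0.7 threshold are illustrative assumptions; real dirty data will need more rules):

```python
import pandas as pd

def looks_like_year(series: pd.Series, threshold: float = 0.7) -> bool:
    """Treat a column as a year column when most entries parse as
    whole numbers in a plausible range (1800-2100)."""
    numeric = pd.to_numeric(series, errors="coerce")
    valid = numeric.between(1800, 2100) & (numeric % 1 == 0)
    return bool(valid.mean() >= threshold)

# Missing values ("n/a") lower the score but do not break detection.
df = pd.DataFrame({"Year": ["1990", "1991", "n/a", "1993"],
                   "Value": [3.1, 2.9, 3.4, 3.0]})
year_cols = [c for c in df.columns if looks_like_year(df[c])]  # -> ["Year"]
```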
  3. Generate a data preview for the user:

    • Show the detected columns
    • Show a cleaned 5-10 row sample
    • Ask for confirmation or corrections
    • If wrong, allow the user to specify the correct mapping
    • Re-run the cleaning pipeline with the corrected mapping
  4. Normalize all indicators into a unified structure: every dataset must be transformed to the following schema:

    year (INT)
    indicator_name (TEXT)
    indicator_domain (TEXT) -- society / individual / physics/logos
    raw_value (FLOAT)
    normalized_value (FLOAT)
    source (TEXT)
    notes (TEXT)
    
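The target schema could be expressed as a SQLAlchemy model roughly like this (the table name `indicators` and the surrogate `id` key are assumptions):

```python
from sqlalchemy import Column, Float, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class IndicatorRecord(Base):
    """One row of the unified indicator schema."""
    __tablename__ = "indicators"            # assumed table name
    id = Column(Integer, primary_key=True)  # surrogate key, an assumption
    year = Column(Integer, nullable=False)
    indicator_name = Column(Text, nullable=False)
    indicator_domain = Column(Text)  # society / individual / physics/logos
    raw_value = Column(Float)
    normalized_value = Column(Float)
    source = Column(Text)
    notes = Column(Text)
```

Calling `Base.metadata.create_all(engine)` then gives the automated table creation required for the PostgreSQL export.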
  5. Export results to:

    • PostgreSQL (with automated table creation)
    • Obsidian-compatible Markdown summaries
    • Clean CSV files
    • Optional: JSON bundles for analytics pipelines
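A minimal export sketch, assuming a helper named `export_all`, a table called `indicators`, and a hand-rolled Markdown layout (Obsidian reads plain Markdown, so no special format is needed; the same `create_engine` call accepts a `postgresql://` URL):

```python
import pandas as pd
from sqlalchemy import create_engine

def export_all(df: pd.DataFrame, db_url: str,
               csv_path: str, md_path: str) -> None:
    """Write the normalized frame to a database, a CSV file,
    and a small Markdown summary."""
    engine = create_engine(db_url)
    # pandas creates the table automatically when it does not exist
    df.to_sql("indicators", engine, if_exists="append", index=False)
    df.to_csv(csv_path, index=False)
    with open(md_path, "w", encoding="utf-8") as f:
        f.write("# Indicator summary\n\n")
        f.write("| " + " | ".join(df.columns) + " |\n")
        f.write("|" + "---|" * len(df.columns) + "\n")
        for row in df.itertuples(index=False):
            f.write("| " + " | ".join(str(v) for v in row) + " |\n")
```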
  6. GUI Requirement: Build a Streamlit GUI with:

    • Drag-and-drop file upload
    • Automatic format detection
    • Step-by-step cleaning wizard
    • Preview panels
    • An export menu
    • A logging window
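Because Streamlit re-runs the script on every interaction, the wizard's current step has to live in `st.session_state`; the transition logic behind the step-by-step wizard might look like this (the step names and `next_step` are invented for illustration):

```python
# Hypothetical wizard state machine backing the Streamlit GUI.
WIZARD_STEPS = ["upload", "detect", "preview", "export"]

def next_step(current: str, needs_correction: bool = False) -> str:
    """Advance the wizard; a rejected preview loops back to detection
    so the cleaning pipeline re-runs with the corrected mapping."""
    if current == "preview" and needs_correction:
        return "detect"
    i = WIZARD_STEPS.index(current)
    return WIZARD_STEPS[min(i + 1, len(WIZARD_STEPS) - 1)]
```

In the GUI itself, `st.session_state["step"] = next_step(...)` would be set inside the confirm/correct button callbacks.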
  7. Architectural Requirements:

    • Modular Python package structure
    • Dedicated folder for format-specific parsers
    • Dedicated folder for cleaning/normalization functions
    • A “rules engine” that encodes the Triad domains
    • A configuration file for expanding indicator mappings later
    • A “human-in-the-loop” correction flow for invalid auto-detections
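At its simplest, the rules engine plus configuration file could reduce to a keyword-to-domain mapping loaded from disk (every keyword below is a hypothetical placeholder, not a real indicator list):

```python
# Hypothetical rules-engine core. In the real system this mapping
# would live in the configuration file so new indicator mappings
# can be added without code changes.
DOMAIN_RULES = {
    "society": ["gini", "trust", "crime"],
    "individual": ["life_satisfaction", "literacy"],
    "physics/logos": ["energy", "entropy"],
}

def classify_domain(indicator_name: str) -> str:
    """Return the Triad domain for an indicator, or 'unknown' so the
    human-in-the-loop correction step can catch it."""
    name = indicator_name.lower()
    for domain, keywords in DOMAIN_RULES.items():
        if any(k in name for k in keywords):
            return domain
    return "unknown"
```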

WHAT I WANT FROM YOU FIRST:

  1. A complete high-level architecture
  2. File/folder structure
  3. Class/module design
  4. The Streamlit GUI layout plan
  5. A test plan using sample dirty CSV/PDF/MD files
  6. After approval, generate the full code base in steps.

ADDITIONAL RULES:

  • Code must be Python 3.10+ compatible.
  • Use only widely supported libraries (pandas, numpy, pdfplumber, python-docx if needed, markdown2, BeautifulSoup4, SQLAlchemy, Streamlit).
  • For PDFs: extract tables if possible, fallback to text-block parsing.
  • For Markdown: identify tables if present, fallback to YAML + body parsing.
  • For HTML: extract <table>s; ignore styling.
  • All errors must be caught and displayed cleanly in the Streamlit GUI.
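The HTML rule (extract <table>s, ignore styling) could be sketched with BeautifulSoup as follows (`html_tables` is an assumed helper name; `pandas.read_html` is a heavier alternative with its own dependencies):

```python
import pandas as pd
from bs4 import BeautifulSoup

def html_tables(html: str) -> list[pd.DataFrame]:
    """Extract every <table> as a DataFrame, discarding all styling."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for table in soup.find_all("table"):
        rows = [[cell.get_text(strip=True)
                 for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")]
        rows = [r for r in rows if r]          # drop empty rows
        if len(rows) > 1:                      # first row becomes the header
            frames.append(pd.DataFrame(rows[1:], columns=rows[0]))
    return frames
```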

BEGIN

“Begin by designing the full system architecture. Do not write code yet. Produce the blueprint first.”