MASTER DEVELOPMENT PROMPT
Triad Coherence Data Ingestion + Normalization + Multi-Format Processing System
SYSTEM GOAL: You will design and implement a full data ingestion + cleaning + normalization + preview + export system for the Coherence Index Project. The system must:
Import raw data from multiple formats:
- CSV
- Excel (XLS/XLSX)
- PDF (text extraction required)
- Markdown (.md)
- Text (.txt)
- HTML
Automatically detect:
- Year columns
- Values
- Units
- Missing data
- Inconsistent formatting
- Header anomalies
- Multi-row headers
- Broken or shifted tables
Generate a data preview for the user:
- Show the detected columns
- Show a cleaned 5–10 row sample
- Ask for confirmation or corrections
- If wrong, allow the user to specify the correct mapping
- Re-run the cleaning pipeline with the corrected mapping
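The correction loop above could be sketched as a function that re-runs cleaning with a user-supplied column mapping. Column names and the coercion choices here are illustrative only:

```python
import pandas as pd

def rerun_with_mapping(df: pd.DataFrame, mapping: dict[str, str]) -> pd.DataFrame:
    """Re-run the cleaning step with a user-corrected column mapping.

    `mapping` maps raw column names to canonical field names;
    unmapped columns are dropped. Purely illustrative.
    """
    cleaned = df.rename(columns=mapping)[list(mapping.values())]
    # Coerce obvious numeric fields; parse failures become NaN
    # so the GUI can surface them for review.
    for col in ("year", "raw_value"):
        if col in cleaned:
            cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    return cleaned
```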
Normalize all indicators into a unified structure. Every dataset must be transformed into the following schema:
- year (INT)
- indicator_name (TEXT)
- indicator_domain (TEXT) -- society / individual / physics/logos
- raw_value (FLOAT)
- normalized_value (FLOAT)
- source (TEXT)
- notes (TEXT)
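For reference, the unified schema maps directly onto a Python dataclass. A minimal sketch; the class name is an assumption:

```python
from dataclasses import dataclass

@dataclass
class IndicatorRecord:
    """One row of the unified schema (hypothetical name)."""
    year: int
    indicator_name: str
    indicator_domain: str  # "society" | "individual" | "physics/logos"
    raw_value: float
    normalized_value: float
    source: str
    notes: str = ""  # optional free-text notes
```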
Export results to:
- PostgreSQL (with automated table creation)
- Obsidian-compatible Markdown summaries
- Clean CSV files
- Optional: JSON bundles for analytics pipelines
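A sketch of the export step, assuming a table named `indicators` and using pandas' `to_sql` for automated table creation. Table, heading, and file names are illustrative, not fixed by the spec:

```python
import pandas as pd
from sqlalchemy import create_engine

def export_all(df: pd.DataFrame, db_url: str, csv_path: str, md_path: str) -> None:
    """Illustrative export step; df must already follow the unified schema."""
    # Database export (any SQLAlchemy URL): to_sql creates the table
    # automatically when it does not exist yet.
    engine = create_engine(db_url)
    df.to_sql("indicators", engine, if_exists="append", index=False)

    # Clean CSV.
    df.to_csv(csv_path, index=False)

    # Obsidian-compatible Markdown summary with a pipe table.
    cols = list(df.columns)
    lines = ["# Indicator Export", ""]
    lines.append("| " + " | ".join(cols) + " |")
    lines.append("|" + "|".join("---" for _ in cols) + "|")
    for row in df.head(10).itertuples(index=False):
        lines.append("| " + " | ".join(str(v) for v in row) + " |")
    with open(md_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

The same `db_url` parameter works for PostgreSQL in production and SQLite in tests, which keeps the export path easy to exercise without a live database.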
GUI Requirement: Build a Streamlit GUI with:
- Drag-and-drop file upload
- Automatic format detection
- Step-by-step cleaning wizard
- Preview panels
- An export menu
- A logging window
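Independent of Streamlit specifics, the step-by-step wizard can be modeled as a small state machine whose current step lives in `st.session_state`. A minimal sketch; the step names are assumptions:

```python
from enum import Enum, auto

class WizardStep(Enum):
    """Hypothetical wizard stages; store the current value in st.session_state."""
    UPLOAD = auto()
    DETECT = auto()
    CLEAN = auto()
    PREVIEW = auto()
    CONFIRM = auto()
    EXPORT = auto()

ORDER = list(WizardStep)

def next_step(current: WizardStep) -> WizardStep:
    """Advance the wizard one stage; stays on EXPORT at the end."""
    i = ORDER.index(current)
    return ORDER[min(i + 1, len(ORDER) - 1)]
```

Keeping navigation in a plain function like this makes the wizard logic unit-testable without running the GUI.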
Architectural Requirements:
- Modular Python package structure
- Dedicated folder for format-specific parsers
- Dedicated folder for cleaning/normalization functions
- A “rules engine” that encodes the Triad domains
- A configuration file for expanding indicator mappings later
- A “human-in-the-loop” correction workflow for invalid auto-detections
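One possible shape for the rules engine: a keyword-to-domain mapping that would later be loaded from the expandable configuration file. The keyword lists below are placeholders, not real indicator mappings:

```python
# Placeholder rules; in the real system these come from the config file.
DOMAIN_RULES: dict[str, list[str]] = {
    "society": ["gdp", "literacy", "trust", "inequality"],
    "individual": ["wellbeing", "life expectancy", "education"],
    "physics/logos": ["energy", "entropy", "information"],
}

def classify_domain(indicator_name: str) -> str:
    """Assign an indicator to a Triad domain by keyword match."""
    name = indicator_name.lower()
    for domain, keywords in DOMAIN_RULES.items():
        if any(kw in name for kw in keywords):
            return domain
    # Unmatched indicators go to the human-in-the-loop correction step.
    return "unclassified"
```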
WHAT I WANT FROM YOU FIRST:
- A complete high-level architecture
- File/folder structure
- Class/module design
- The Streamlit GUI layout plan
- A test plan using sample dirty CSV/PDF/MD files
- After approval, generate the full code base in steps.
ADDITIONAL RULES:
- Code must be Python 3.10+ compatible.
- Use only widely supported libraries (pandas, numpy, pdfplumber, python-docx if needed, markdown2, BeautifulSoup4, SQLAlchemy, Streamlit).
- For PDFs: extract tables if possible; fall back to text-block parsing.
- For Markdown: identify tables if present; fall back to YAML front matter + body parsing.
- For HTML: extract <table> elements; ignore styling.
- All errors must be caught and displayed cleanly in the Streamlit GUI.
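To illustrate the HTML rule, here is a stdlib-only extractor that keeps `<table>` cell text and discards styling. The real pipeline would likely use BeautifulSoup; this is just a sketch of the behavior:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect <table> cell text as rows of strings, ignoring all styling."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell or []).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def extract_tables(html: str) -> list[list[str]]:
    """Return all table rows found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows
```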
BEGIN
“Begin by designing the full system architecture. Do not write code yet. Produce the blueprint first.”