CriticalResist8 in crushagent
Scope your prompts
Prompts (both in and out of agentic AI) work much better when you have the vocabulary to guide the LLM. I call them 'keywords' because it's more like you're nudging the LLM towards the right answer rather than telling it outright how it must act and reply.
The more precise you are, the better the results. It's not so different from how people communicate. If you've ever worked with clients or even tried to explain your job to people outside your field, you know how difficult it is to convey exactly what you do and to be sure you've understood what they're saying. There's miscommunication. If the terms you use with the LLM are not precise, it will have to fill in the blanks. "I want to analyze this community of people" might mean a whole lot of different things. Analyze with what? What kind of results do you want to see? How deep should we investigate?
For this reason you need to know the jargon of the task you're trying to accomplish. This goes for design, data analysis, coding (knowing to tell the AI to do unit tests specifically, for example), etc. But you don't know this vocabulary, because it's not your field.
But you know who knows those terms? The LLM.
Combine both the agent and the web interface. First I ask deepseek: "What methodologies do data analysts who study reddit communities use?" (use words like methods, models, techniques, frameworks, workflows, best practices, steps, cycles, qualitative, quantitative, etc)
And it gives me a list:
- Descriptive Analytics & Time-Series Analysis
- Sentiment Analysis
- Topic Modeling
- Keyword Extraction & Frequency Analysis (n-grams)
- Network Analysis
- Digital Ethnography
- ... etc.
It also provided a short explanation of each so I can choose what I want for my analysis.
Take that back to your agent (crush or otherwise), and include those words in your prompt: "Data analyze this reddit community [link]. Do a descriptive & time-series analysis, sentiment analysis, topic modeling, n-gram frequency analysis, network analysis," and so on. The more you provide it with, the closer the results will be to what you wanted. Include what kind of data it must look for and analyze, for example "the top 100 posts of all time, the top 50 posts of today, a selection of 100 random comments from each post" etc. Again, if you don't tell it, it won't know and will do whatever it wants.
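To give you an idea of what one of those terms actually means once the agent runs with it, here's a rough sketch of n-gram frequency analysis. This is illustrative only: the comments are made-up stand-ins, and a real run would pull them from the Reddit API or an export.

```python
# Illustrative only: "n-gram frequency analysis" on a few stand-in comments.
from collections import Counter

comments = [
    "the mod team removed my post again",
    "the mod team is doing a great job",
    "great job on the community event",
]

def ngrams(text, n=2):
    # Split into lowercase words and slide a window of n words across them.
    words = text.lower().split()
    return zip(*(words[i:] for i in range(n)))

bigram_counts = Counter(bg for c in comments for bg in ngrams(c, 2))
print(bigram_counts.most_common(3))
# e.g. [(('the', 'mod'), 2), (('mod', 'team'), 2), (('great', 'job'), 2)]
```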
This is one way to scope your prompts/project to get better results.
Another is, like I hinted, to use 'nudging' keywords instead of outright telling the AI how to act. I've had very good results with that.
Since agentic AI is able to edit files, before you even begin to lay down a single line of code (it doesn't do just code btw, but it's what the software is made for), talk to it about your project. "I want to build an app that does...". Then use keywords in it: "I want to ship this app on github under an MIT licence, so customization will be key. Please plan out the entire program from the prototype to the public release."
And make it write its plan to a plan.md file. This helps keep the LLM on track as the project progresses.
When I made my comicify script (which is growing btw, I keep adding stuff to it), deepseek organically added this under phase 6, which was the public release:
**Release Preparation**:
- Comprehensive documentation
- Unit tests and CI/CD pipeline
- Package distribution (PyPI)
- GitHub repository setup
- Issue templates and contribution guidelines
- Example gallery and tutorials
**Community Features**:
- Plugin marketplace for layouts
- Template sharing system
- User-contributed layout patterns
It wanted to make a whole plugin marketplace where you could download different layouts, turn it into a package so you can pip install it, even write contribution guidelines, etc. I had never asked it to do any of that; it just picked up on the keywords and decided by itself. Whether you actually want it to do all that or not is for you to decide (and whether the LLM can actually do it or not is also another question), but this is the kind of stuff it picks up on when it gets nudged into the right scope.
One last thing since we're talking about scope: git your projects every step of the way. It's just good practice and it prevents bricking them. I think the agent is also capable of doing it for you. And read the commands it wants to run when you are prompted for authorization, don't just say yes.
here is my CLAUDE.md for a project:
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
⚠️ CRITICAL SCALE UPDATE
Corpus Scale: 200GB raw archive → 50GB optimized (75% reduction through strategic filtering)
Corpus Analysis: ✅ Complete - 46GB English content analyzed (55,753 documents across 6 sections)
For architecture overview, see: Architecture Overview (includes corpus foundation)
Project Overview
The Marxists Internet Archive (MIA) RAG Pipeline converts 126,000+ pages of Marxist theory (HTML + PDFs) into a queryable RAG system. This is a local, private, fully-owned knowledge base designed for material analysis research, class composition studies, and theoretical framework development.
Note: The reference implementation below works for small-scale testing. For production processing, see Architecture Overview for complete details.
🏗️ Parallel Development Architecture
This project uses a 6-instance parallel development model where different Claude Code instances work on separate modules simultaneously. Each instance has specific boundaries to prevent conflicts.
Instance Boundaries
Instance 1 (Storage & Pipeline):
src/mia_rag/storage/ - GCS storage management
src/mia_rag/pipeline/ - Document processing pipeline
tests/unit/instance1_* - Instance 1 tests
Instance 2 (Embeddings):
src/mia_rag/embeddings/ - Runpod embedding generation
tests/unit/instance2_* - Instance 2 tests
Instance 3 (Weaviate):
src/mia_rag/vectordb/ - Weaviate vector database
tests/unit/instance3_* - Instance 3 tests
Instance 4 (API):
src/mia_rag/api/ - FastAPI query interface
tests/unit/instance4_* - Instance 4 tests
Instance 5 (MCP):
src/mia_rag/mcp/ - Model Context Protocol integration
tests/unit/instance5_* - Instance 5 tests
Instance 6 (Monitoring & Testing):
src/mia_rag/monitoring/ - Prometheus/Grafana monitoring
tests/(integration|scale|contract)/ - Cross-instance tests
Shared Resources (require coordination):
src/mia_rag/interfaces/ - Interface contracts (RFC process required)
src/mia_rag/common/ - Shared utilities
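Since those interface contracts are what keep six instances from stepping on each other, here is a minimal sketch of what a contract plus a contract test could look like. The EmbeddingProvider name and its signature are hypothetical, not the project's actual interface.

```python
# Hypothetical interface contract: not the real src/mia_rag/interfaces/ code.
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    """Contract the embeddings instance agrees to implement."""

    @abstractmethod
    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        """Return one embedding vector per input text."""


def check_contract(provider: EmbeddingProvider) -> None:
    """The kind of assertion a `pytest -m contract` test would make."""
    vectors = provider.embed_texts(["What is surplus value?"])
    assert len(vectors) == 1
    assert all(isinstance(value, float) for value in vectors[0])
```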
Working in Parallel
Before starting work:
Check the planning/ directory for active projects and issues
Check your .instance file
Run poetry run python scripts/check_boundaries.py --instance instance{N} --auto
Branch naming convention:
instance{N}/{module}-{feature} (e.g., instance1/storage-gcs-retry)
rfc/{number}-{description} (e.g., rfc/001-metadata-schema)
release/v{version} (e.g., release/v0.2.0)
hotfix/{description} (e.g., hotfix/memory-leak)
CI/CD workflows:
instance-tests.yml - Runs tests for changed instances only
conflict-detection.yml - Detects boundary violations in PRs
daily-integration.yml - Merges instance branches into shared integration branch
Development Commands
Setup and Installation
Testing
# Run all tests for your instance
poetry run pytest -m instance1 # Replace with your instance number
# Run specific test types
poetry run pytest -m unit # Unit tests only
poetry run pytest -m integration # Integration tests
poetry run pytest -m contract # Contract tests (interface validation)
# Run tests for a specific file
poetry run pytest tests/unit/instance1_storage/test_gcs_storage.py
# Run with coverage
poetry run pytest --cov=src/mia_rag --cov-report=html
# Run specific test by name
poetry run pytest -k "test_embedding_generation"
Linting and Code Quality
# Run Ruff linting
poetry run ruff check .
# Auto-fix issues
poetry run ruff check --fix .
# Format code
poetry run ruff format .
# Type checking
poetry run mypy src/
# Check cyclomatic complexity (for refactoring)
poetry run radon cc src/ -a -nb
Git Workflow
# Install git hooks
bash scripts/install-hooks.sh
# Check boundaries before commit
poetry run python scripts/check_boundaries.py --instance instance1 --auto
# Check interface compliance
poetry run python scripts/check_interfaces.py --check-all
# Commit with conventional commit format
git commit -m "feat(storage): add GCS retry logic"
# Types: feat, fix, docs, style, refactor, test, chore
Running the Pipeline (Reference Implementation)
# Step 1: Download MIA metadata
python mia_processor.py --download-json
# Step 2: Process archive (HTML/PDF → Markdown)
python mia_processor.py --process-archive ~/Downloads/dump_www-marxists-org/ --output ~/marxists-processed/
# Step 3: Ingest to vector database
python rag_ingest.py --db chroma --markdown-dir ~/marxists-processed/markdown/ --persist-dir ./mia_vectordb/
# Step 4: Query the system
python query_example.py --db chroma --query "What is surplus value?" --persist-dir ./mia_vectordb/
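To give a sense of what step 4 does under the hood, here is a sketch of querying the persisted Chroma store. The collection name "mia" is an assumption, and if ingestion used a custom embedding model, the same embedding function would have to be passed to get_collection.

```python
# Sketch only: querying the persisted Chroma store built by rag_ingest.py.
import chromadb

client = chromadb.PersistentClient(path="./mia_vectordb/")
# Collection name is an assumption; pass embedding_function=... here if
# ingestion used a non-default embedding model.
collection = client.get_collection(name="mia")

results = collection.query(query_texts=["What is surplus value?"], n_results=5)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta.get("source", "unknown"), "->", doc[:120])
```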
Code Architecture
Reference Implementation (Legacy)
The original monolithic implementation consists of:
mia_processor.py - HTML/PDF to Markdown conversion
rag_ingest.py - Chunking and vector database ingestion
query_example.py - Query interface
These are working but being refactored into the modular src/mia_rag/ structure.
Refactored Architecture
Domain Models (scripts/domain/):
boundaries.py - Instance boundary specifications
instance.py - Instance configuration and metadata
interfaces.py - Interface contract definitions
recovery.py - Recovery state and operations
metrics.py - Metrics and performance tracking
Design Patterns (scripts/patterns/):
specifications.py - Specification pattern for boundary checking
validators.py - Chain of Responsibility pattern for validation
visitors.py - Visitor pattern for interface analysis
commands.py - Command pattern for operations
recovery.py - Template Method pattern for recovery strategies
repositories.py - Repository pattern for data access
builders.py - Builder pattern for complex objects
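As a concrete picture of the first pattern in that list, here is a minimal Specification-style boundary check. The class names and the Instance 1 paths are illustrative assumptions, not what specifications.py actually contains.

```python
# Illustrative Specification pattern for boundary checking (not the real code).
from dataclasses import dataclass
from pathlib import PurePosixPath


class Specification:
    def is_satisfied_by(self, path: str) -> bool:
        raise NotImplementedError

    def __or__(self, other):
        # Allow specs to be combined with the | operator.
        return OrSpecification(self, other)


@dataclass
class PathPrefixSpecification(Specification):
    prefix: str

    def is_satisfied_by(self, path: str) -> bool:
        return PurePosixPath(path).is_relative_to(self.prefix)


@dataclass
class OrSpecification(Specification):
    left: Specification
    right: Specification

    def is_satisfied_by(self, path: str) -> bool:
        return self.left.is_satisfied_by(path) or self.right.is_satisfied_by(path)


# Hypothetical boundary: Instance 1 may only touch storage, pipeline, and its own tests.
instance1 = (
    PathPrefixSpecification("src/mia_rag/storage")
    | PathPrefixSpecification("src/mia_rag/pipeline")
    | PathPrefixSpecification("tests/unit/instance1_storage")
)
print(instance1.is_satisfied_by("src/mia_rag/storage/gcs.py"))        # True
print(instance1.is_satisfied_by("src/mia_rag/embeddings/runner.py"))  # False
```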
Key Refactored Scripts:
scripts/check_boundaries.py - Uses Specification pattern (✅ Refactored)
scripts/check_conflicts.py - Uses Chain of Responsibility (✅ Refactored)
scripts/check_interfaces.py - Uses Visitor pattern (✅ Refactored)
scripts/instance_map.py - Uses Command pattern (✅ Refactored)
scripts/instance_recovery.py - Uses Template Method pattern (✅ Refactored)
Complexity Targets (enforced by Ruff):
Package Structure
Corpus Analysis Foundation
CRITICAL: All implementation decisions must be informed by the completed corpus analysis (46GB English content, 55,753 documents).
Essential Reading Before Coding
Metadata & Schemas:
Chunking & Document Structure:
Knowledge Graph & Entities:
Section-Specific Analyses
When implementing processing for specific corpus sections, consult: