CriticalResist8 in crushagent
Scope your prompts
Prompts (both in and out of agentic AI) work much better when you have the vocabulary to guide the LLM. I call them 'keywords' because it's more like you're nudging the LLM towards the right answer rather than telling it outright how it must act and reply.
The more precise you are, the better the results. It's not so different from how people communicate. If you've ever worked with clients or even tried to explain your job to people outside your field, you know how difficult it is to convey exactly what you do and to be sure you've understood what they're saying. There's miscommunication. If the terms you use with the LLM are not precise, it will have to fill in the blanks. "I want to analyze this community of people" might mean a whole lot of different things. Analyze with what? What kind of results do you want to see? How deep should we investigate?
For this reason you need to know the jargon of the task you're trying to accomplish. This goes for design, data analysis, coding (knowing to tell the AI to do unit tests specifically, for example), etc. But you don't know this vocabulary, because it's not your field.
But you know who knows those terms? The LLM.
Combine both the agent and the web interface. First I ask deepseek: "What methodologies do data analysts who study reddit communities use?" (use words like methods, models, techniques, frameworks, workflows, best practices, steps, cycles, qualitative, quantitative, etc)
And it gives me a list:
- Descriptive Analytics & Time-Series Analysis
- Sentiment Analysis
- Topic Modeling
- Keyword Extraction & Frequency Analysis (n-grams)
- Network Analysis
- Digital Ethnography
- ... etc.
It also provided a short explanation of each so I can choose what I want for my analysis.
Take that back to your agent (crush or otherwise), and include those words in your prompt: "Data analyze this reddit community [link]. Do a descriptive & time-series analysis, sentiment analysis, topic modeling, n-gram frequency analysis, network analysis," and so on. The more you provide it with, the closer the results will be to what you wanted. Include what kind of data it must look for and analyze, for example "the top 100 posts of all time, the top 50 posts of today, a selection of 100 random comments from each post" etc. Again, if you don't tell it, it won't know and will do whatever it wants.
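To give you an idea of what one of those terms actually means once the agent runs with it, here's a rough sketch of n-gram frequency analysis. This is illustrative only: the comments are made-up stand-ins, and a real run would pull them from the Reddit API or an export.

```python
# Illustrative only: "n-gram frequency analysis" on a few stand-in comments.
from collections import Counter

comments = [
    "the mod team removed my post again",
    "the mod team is doing a great job",
    "great job on the community event",
]

def ngrams(text, n=2):
    # Split into lowercase words and slide a window of n words across them.
    words = text.lower().split()
    return zip(*(words[i:] for i in range(n)))

bigram_counts = Counter(bg for c in comments for bg in ngrams(c, 2))
print(bigram_counts.most_common(3))
# e.g. [(('the', 'mod'), 2), (('mod', 'team'), 2), (('great', 'job'), 2)]
```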
This is one way to scope your prompts/project to get better results.
Another is, like I hinted, to use 'nudging' keywords instead of outright telling the AI how to act. I've had very good results with that.
Since agentic AI is able to edit files, before you even begin to lay down a single line of code (it doesn't do just code btw, but it's what the software is made for), talk to it about your project. "I want to build an app that does...". Then use keywords in it: "I want to ship this app on github under an MIT licence, so customization will be key. Please plan out the entire program from the prototype to the public release."
And make it write its plan to a plan.md file. This helps keep the LLM on track as the project progresses.
When I made my comicify script (which is growing btw, I keep adding stuff to it), deepseek organically added this under phase 6, which was the public release:
**Release Preparation**:
- Comprehensive documentation
- Unit tests and CI/CD pipeline
- Package distribution (PyPI)
- GitHub repository setup
- Issue templates and contribution guidelines
- Example gallery and tutorials
**Community Features**:
- Plugin marketplace for layouts
- Template sharing system
- User-contributed layout patterns
It wanted to make a whole plugin marketplace where you could download different layouts, turn it into a package so you can pip install it, even write contribution guidelines, etc. I had never asked it to do any of that; it just picked up on the keywords and decided by itself. Whether you actually want it to do all that or not is for you to decide (and whether the LLM can actually do it or not is also another question), but this is the kind of stuff it picks up on when it gets nudged into the right scope.
One last thing since we're talking about scope: git your projects every step of the way. It's just good practice and it prevents bricking them. I think the agent is also capable of doing it for you. And read the commands it wants to run when you are prompted for authorization, don't just say yes.
here is my CLAUDE.md for a project:
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
⚠️ CRITICAL SCALE UPDATE
Corpus Scale: 200GB raw archive → 50GB optimized (75% reduction through strategic filtering)
Corpus Analysis: ✅ Complete - 46GB English content analyzed (55,753 documents across 6 sections)
For architecture overview, see: Architecture Overview (includes corpus foundation)
Project Overview
The Marxists Internet Archive (MIA) RAG Pipeline converts 126,000+ pages of Marxist theory (HTML + PDFs) into a queryable RAG system. This is a local, private, fully-owned knowledge base designed for material analysis research, class composition studies, and theoretical framework development.
Note: The reference implementation below works for small-scale testing. For production processing, see Architecture Overview for complete details.
🏗️ Parallel Development Architecture
This project uses a 6-instance parallel development model where different Claude Code instances work on separate modules simultaneously. Each instance has specific boundaries to prevent conflicts.
Instance Boundaries
Instance 1 (Storage & Pipeline):
src/mia_rag/storage/ - GCS storage management
src/mia_rag/pipeline/ - Document processing pipeline
tests/unit/instance1_* - Instance 1 tests
Instance 2 (Embeddings):
src/mia_rag/embeddings/ - Runpod embedding generation
tests/unit/instance2_* - Instance 2 tests
Instance 3 (Weaviate):
src/mia_rag/vectordb/ - Weaviate vector database
tests/unit/instance3_* - Instance 3 tests
Instance 4 (API):
src/mia_rag/api/ - FastAPI query interface
tests/unit/instance4_* - Instance 4 tests
Instance 5 (MCP):
src/mia_rag/mcp/ - Model Context Protocol integration
tests/unit/instance5_* - Instance 5 tests
Instance 6 (Monitoring & Testing):
src/mia_rag/monitoring/ - Prometheus/Grafana monitoring
tests/(integration|scale|contract)/ - Cross-instance tests
Shared Resources (require coordination):
src/mia_rag/interfaces/ - Interface contracts (RFC process required)
src/mia_rag/common/ - Shared utilities
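Since those interface contracts are what keep six instances from stepping on each other, here is a minimal sketch of what a contract plus a contract test could look like. The EmbeddingProvider name and its signature are hypothetical, not the project's actual interface.

```python
# Hypothetical interface contract: not the real src/mia_rag/interfaces/ code.
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    """Contract the embeddings instance agrees to implement."""

    @abstractmethod
    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        """Return one embedding vector per input text."""


def check_contract(provider: EmbeddingProvider) -> None:
    """The kind of assertion a `pytest -m contract` test would make."""
    vectors = provider.embed_texts(["What is surplus value?"])
    assert len(vectors) == 1
    assert all(isinstance(value, float) for value in vectors[0])
```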
Working in Parallel
Before starting work:
Check the planning/ directory for active projects and issues
Check your .instance file
Run poetry run python scripts/check_boundaries.py --instance instance{N} --auto
Branch naming convention:
instance{N}/{module}-{feature} (e.g., instance1/storage-gcs-retry)
rfc/{number}-{description} (e.g., rfc/001-metadata-schema)
release/v{version} (e.g., release/v0.2.0)
hotfix/{description} (e.g., hotfix/memory-leak)
CI/CD workflows:
instance-tests.yml - Runs tests for changed instances only
conflict-detection.yml - Detects boundary violations in PRs
daily-integration.yml - Merges instance branches into shared integration branch
Development Commands
Setup and Installation
Testing
# Run all tests for your instance
poetry run pytest -m instance1 # Replace with your instance number
# Run specific test types
poetry run pytest -m unit # Unit tests only
poetry run pytest -m integration # Integration tests
poetry run pytest -m contract # Contract tests (interface validation)
# Run tests for a specific file
poetry run pytest tests/unit/instance1_storage/test_gcs_storage.py
# Run with coverage
poetry run pytest --cov=src/mia_rag --cov-report=html
# Run specific test by name
poetry run pytest -k "test_embedding_generation"
Linting and Code Quality
# Run Ruff linting
poetry run ruff check .
# Auto-fix issues
poetry run ruff check --fix .
# Format code
poetry run ruff format .
# Type checking
poetry run mypy src/
# Check cyclomatic complexity (for refactoring)
poetry run radon cc src/ -a -nb
Git Workflow
# Install git hooks
bash scripts/install-hooks.sh
# Check boundaries before commit
poetry run python scripts/check_boundaries.py --instance instance1 --auto
# Check interface compliance
poetry run python scripts/check_interfaces.py --check-all
# Commit with conventional commit format
git commit -m "feat(storage): add GCS retry logic"
# Types: feat, fix, docs, style, refactor, test, chore
Running the Pipeline (Reference Implementation)
# Step 1: Download MIA metadata
python mia_processor.py --download-json
# Step 2: Process archive (HTML/PDF → Markdown)
python mia_processor.py --process-archive ~/Downloads/dump_www-marxists-org/ --output ~/marxists-processed/
# Step 3: Ingest to vector database
python rag_ingest.py --db chroma --markdown-dir ~/marxists-processed/markdown/ --persist-dir ./mia_vectordb/
# Step 4: Query the system
python query_example.py --db chroma --query "What is surplus value?" --persist-dir ./mia_vectordb/
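To give a sense of what step 4 does under the hood, here is a sketch of querying the persisted Chroma store. The collection name "mia" is an assumption, and if ingestion used a custom embedding model, the same embedding function would have to be passed to get_collection.

```python
# Sketch only: querying the persisted Chroma store built by rag_ingest.py.
import chromadb

client = chromadb.PersistentClient(path="./mia_vectordb/")
# Collection name is an assumption; pass embedding_function=... here if
# ingestion used a non-default embedding model.
collection = client.get_collection(name="mia")

results = collection.query(query_texts=["What is surplus value?"], n_results=5)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta.get("source", "unknown"), "->", doc[:120])
```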
Code Architecture
Reference Implementation (Legacy)
The original monolithic implementation consists of:
mia_processor.py - HTML/PDF to Markdown conversion
rag_ingest.py - Chunking and vector database ingestion
query_example.py - Query interface
These are working but being refactored into the modular src/mia_rag/ structure.
Refactored Architecture
Domain Models (scripts/domain/):
boundaries.py - Instance boundary specifications
instance.py - Instance configuration and metadata
interfaces.py - Interface contract definitions
recovery.py - Recovery state and operations
metrics.py - Metrics and performance tracking
Design Patterns (scripts/patterns/):
specifications.py - Specification pattern for boundary checking
validators.py - Chain of Responsibility pattern for validation
visitors.py - Visitor pattern for interface analysis
commands.py - Command pattern for operations
recovery.py - Template Method pattern for recovery strategies
repositories.py - Repository pattern for data access
builders.py - Builder pattern for complex objects
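As a concrete picture of the first pattern in that list, here is a minimal Specification-style boundary check. The class names and the Instance 1 paths are illustrative assumptions, not what specifications.py actually contains.

```python
# Illustrative Specification pattern for boundary checking (not the real code).
from dataclasses import dataclass
from pathlib import PurePosixPath


class Specification:
    def is_satisfied_by(self, path: str) -> bool:
        raise NotImplementedError

    def __or__(self, other):
        # Allow specs to be combined with the | operator.
        return OrSpecification(self, other)


@dataclass
class PathPrefixSpecification(Specification):
    prefix: str

    def is_satisfied_by(self, path: str) -> bool:
        return PurePosixPath(path).is_relative_to(self.prefix)


@dataclass
class OrSpecification(Specification):
    left: Specification
    right: Specification

    def is_satisfied_by(self, path: str) -> bool:
        return self.left.is_satisfied_by(path) or self.right.is_satisfied_by(path)


# Hypothetical boundary: Instance 1 may only touch storage, pipeline, and its own tests.
instance1 = (
    PathPrefixSpecification("src/mia_rag/storage")
    | PathPrefixSpecification("src/mia_rag/pipeline")
    | PathPrefixSpecification("tests/unit/instance1_storage")
)
print(instance1.is_satisfied_by("src/mia_rag/storage/gcs.py"))        # True
print(instance1.is_satisfied_by("src/mia_rag/embeddings/runner.py"))  # False
```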
Key Refactored Scripts:
scripts/check_boundaries.py - Uses Specification pattern (✅ Refactored)
scripts/check_conflicts.py - Uses Chain of Responsibility (✅ Refactored)
scripts/check_interfaces.py - Uses Visitor pattern (✅ Refactored)
scripts/instance_map.py - Uses Command pattern (✅ Refactored)
scripts/instance_recovery.py - Uses Template Method pattern (✅ Refactored)
Complexity Targets (enforced by Ruff):
Package Structure
Corpus Analysis Foundation
CRITICAL: All implementation decisions must be informed by the completed corpus analysis (46GB English content, 55,753 documents).
Essential Reading Before Coding
Metadata & Schemas:
Chunking & Document Structure:
Knowledge Graph & Entities:
Section-Specific Analyses
When implementing processing for specific corpus sections, consult: