CRAWLER
theunpartycrawler (Analytics Intelligence)
Location: unparty-app/theunpartycrawler
Status: Active Development
Primary Purpose: AI conversation analytics suite and content processing for development intelligence
CRAWLER is a comprehensive AI conversation analytics suite that evolved from a simple NLP perceptron pipeline. It processes ChatGPT and Claude conversation exports to surface insights into usage patterns, topic evolution, and user behavior.
Overview
This repository serves as the analytics intelligence engine for the UNPARTY ecosystem, providing deep insights into AI-assisted development workflows and conversation patterns. It combines legacy NLP capabilities with modern analytics processors to deliver actionable intelligence for development teams.
Tech Stack
- Language: Python 3.12
- Framework: Hybrid Architecture (Legacy NLP + Modern Analytics)
- ML Libraries: PyTorch, Transformers, scikit-learn
- Visualization: Plotly, Matplotlib, Pandas
- Data Processing: NumPy, Pandas
- Testing: pytest
- Build Tool: Bash scripts for markdown conversion
- Deployment: Vercel (with security headers and route protection)
Enhanced convert.sh Tool
The repository now includes an enhanced markdown-to-HTML conversion tool with production-ready features:
Key Features
- Dark/Light Mode Toggle: Automatic theme detection with manual override
- Direct Print Functionality: Optimized print styles with professional layout
- Intelligent Path Handling: Outputs converted files adjacent to source markdown files
- Customizable Branding: 5 brand colors with mandatory accessibility compliance
- Responsive Design: Mobile-friendly with accessible controls
Quick Start
```bash
# Make executable
chmod +x convert.sh

# Convert markdown to enhanced HTML
./convert.sh your-document.md

# Output will be created as: your-document.html (same directory)
```

Documentation
- Local Development Guide - Step-by-step setup and usage
- Deployment Documentation - Production deployment guide
- Convert Steps Analysis - Technical architecture details
Path Handling Examples
```bash
./convert.sh file.md                      # → file.html (same directory)
./convert.sh docs/readme.md               # → docs/readme.html
./convert.sh /tmp/notes.md                # → /tmp/notes.html
./convert.sh file.md custom/output.html   # → custom/output.html
```

Key Features
AI Conversation Analytics Suite
- Conversation Analytics Suite: 67+ specialized Python processors for ChatGPT/Claude conversation data
- Platform Comparison: Cross-platform usage analysis and behavioral insights
- Topic Evolution: Monthly and weekly trend analysis with keyword detection
- Interactive Dashboards: HTML visualizations with responsive design and dark/light mode
- Business Calendar Integration: Custom fiscal, sprint, and work calendar analytics
- Memory-Efficient Processing: Streaming generators handle datasets of 1,000+ conversations (see the sketch after this list)
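The memory-efficiency claim rests on generator-based iteration: conversations are yielded one at a time, so downstream aggregations never hold derived lists for the whole export. A minimal sketch of the pattern, assuming a Claude-style export and hypothetical helper names (the repository's actual processors may differ):

```python
import json
from typing import Iterator

def iter_conversations(path: str) -> Iterator[dict]:
    """Yield conversations one at a time rather than returning a list."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # parse once; consumers work on one item at a time
    yield from data

def total_messages(path: str) -> int:
    """Aggregate over the stream without building intermediate lists."""
    return sum(len(c.get("chat_messages", []))
               for c in iter_conversations(path))
```

For exports too large to parse in one call, an incremental JSON parser such as ijson could replace `json.load` while keeping the same generator interface.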
Enhanced Convert Tool
- Production-Ready Conversion: Markdown-to-HTML with dark/light mode toggle
- Print Optimization: Professional print styles with direct print functionality
- Responsive Design: Mobile-friendly with accessible controls
- Customizable Branding: 5 brand colors with accessibility compliance
Legacy NLP Pipeline
- Perceptron-Based Classification: Binary text classification for action/non-action content
- Pattern Recognition: Token co-occurrence and semantic relationship discovery (see the sketch after this list)
- Structured Data Output: LLM-ready formatted clusters
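To make the token co-occurrence idea concrete, here is a generic windowed co-occurrence counter; it illustrates the technique only and is not the implementation in action.py:

```python
from collections import Counter

def cooccurrence_counts(tokenized_docs, window=5):
    """Count unordered token pairs appearing within `window` tokens of
    each other; frequent pairs hint at recurring action patterns."""
    counts = Counter()
    for tokens in tokenized_docs:
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts

docs = [["click", "the", "save", "button"],
        ["click", "the", "submit", "button"]]
print(cooccurrence_counts(docs).most_common(3))
```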
Architecture
```text
Analytics Intelligence (Python)
├── Legacy Pipeline (NLP Perceptron)
│   ├── main.py (HTML crawler)
│   ├── preprocessor.py (Text processing)
│   ├── perceptron.py (Binary classification)
│   ├── action.py (Pattern recognition)
│   └── store.py (Data clustering)
├── Modern Analytics Suite (67+ processors)
│   ├── conversation_timeline.py (ChatGPT analysis)
│   ├── claude_timeline.py (Claude analysis)
│   ├── platform_comparison.py (Cross-platform insights)
│   ├── weekly_usage_analyzer.py (Behavioral patterns)
│   ├── monthly_topic_analyzer.py (Topic evolution)
│   └── analytics_validator.py (Quality assurance)
├── Enhanced Convert Tool
│   ├── convert.sh (Markdown-to-HTML converter)
│   ├── convert.css (Responsive styling)
│   └── onboarding-convert.md (Documentation)
├── Calendar System
│   ├── generate_calendar_files.py (Business calendar generator)
│   └── data/ (Fiscal, sprint, and work calendars)
└── Utilities
    ├── utils/html_styling.py (Centralized styling)
    └── tests/ (5 test files, 3 passing)
```

Data Processing Patterns
Conversation Format Handling
- ChatGPT Format: Complex nested mapping structure with conversation nodes
- Claude Format: Simple array structure with direct message access
- Topic Detection: Keyword-based classification with configurable dictionaries (a normalization and topic-tagging sketch follows this list)
- Streaming Processing: Memory-efficient generators for large conversation datasets
- Output Formats: HTML dashboards, CSV exports, JSON dumps, Markdown reports
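Because the two export layouts differ (spelled out under Input Data Expectations below), a common pattern is to normalize both into flat (timestamp, text) records and then tag topics against a keyword dictionary. A hedged sketch with hypothetical helper names and topic keywords, not the repository's actual code:

```python
def iter_messages(conversation: dict):
    """Yield (timestamp, text) pairs from either export format."""
    if "mapping" in conversation:  # ChatGPT: nested node mapping
        for node in conversation["mapping"].values():
            msg = node.get("message") or {}
            parts = (msg.get("content") or {}).get("parts") or []
            if parts:
                yield msg.get("create_time"), " ".join(map(str, parts))
    else:  # Claude: flat chat_messages array
        for msg in conversation.get("chat_messages", []):
            yield msg.get("created_at"), msg.get("text", "")

# Hypothetical configurable keyword dictionary
TOPICS = {"deployment": ["vercel", "deploy"], "testing": ["pytest", "assert"]}

def detect_topics(text: str) -> list[str]:
    lowered = text.lower()
    return [topic for topic, keywords in TOPICS.items()
            if any(kw in lowered for kw in keywords)]
```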
Input Data Expectations
```python
# ChatGPT: conversations.json with 'mapping' field
{
    "mapping": {
        "node_id": {
            "message": {
                "create_time": timestamp,
                "content": {"parts": ["message text"]}
            }
        }
    }
}

# Claude: conversations.json with 'chat_messages' array
{
    "chat_messages": [
        {
            "created_at": timestamp,
            "text": "message text",
            "sender": "human" | "assistant"
        }
    ]
}
```

Integration Points
- GitHub Actions: CI/CD with Claude Code and Claude Code Review workflows
- Vercel: Deployment platform with security headers and route protection
- Calendar Systems: Business calendar integration for fiscal and sprint analytics (see the sketch after this list)
- Cross-Platform Analysis: Integrates ChatGPT and Claude conversation data
- UNPARTY Ecosystem: Provides analytics intelligence to theunpartyrunway for development metrics
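As an illustration of the calendar integration, conversation timestamps can be bucketed into sprints and fiscal quarters before aggregation. The sprint epoch, sprint length, and fiscal-year start below are assumed values for the sketch, not figures taken from this repository's calendar data:

```python
from datetime import date

SPRINT_ZERO = date(2024, 1, 1)   # assumed start of sprint 0
SPRINT_DAYS = 14                 # assumed two-week sprints
FISCAL_START_MONTH = 2           # assumed fiscal year begins in February

def sprint_number(d: date) -> int:
    """0-based index of the sprint containing d."""
    return (d - SPRINT_ZERO).days // SPRINT_DAYS

def fiscal_quarter(d: date) -> str:
    """Fiscal quarter label such as 'FY2024-Q1'."""
    offset = (d.month - FISCAL_START_MONTH) % 12
    fiscal_year = d.year if d.month >= FISCAL_START_MONTH else d.year - 1
    return f"FY{fiscal_year}-Q{offset // 3 + 1}"

print(sprint_number(date(2024, 3, 15)), fiscal_quarter(date(2024, 3, 15)))
```

Grouping usage metrics by these keys is what lets technical activity line up with business rhythms rather than calendar months.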
Business Value
ABOUT → BUILD → CONNECT

- ABOUT: Analyzes conversation patterns to understand user engagement, topic evolution, and development workflows. Helps teams understand how they use AI assistance and where time is spent.
- BUILD: Provides development insights through technical discussion analysis, code extraction, and pattern recognition. Enables data-driven decisions about development practices and AI tool usage.
- CONNECT: Generates comprehensive reports and visualizations for stakeholder communication. Creates shareable HTML dashboards, CSV exports, and markdown reports that facilitate team discussions and retrospectives.
Protecting Creator Ownership, Privacy, and Cost-Sensitivity
- Privacy-First: All processing happens locally on conversation exports; no data sent to external services
- Cost-Aware: Streaming generators minimize memory usage and compute costs for large datasets
- Creator Ownership: All analytics outputs are local files that remain under creator control
- Transparency: Open-source codebase allows full audit of data processing logic
Key Differentiators
- Dual Nature System: Maintains both legacy NLP capabilities and modern analytics in a single codebase
- Conversation Intelligence: Deep analysis of ChatGPT and Claude conversation exports with platform-specific optimizations
- Production-Ready Conversion: Enhanced markdown-to-HTML tool with professional styling and accessibility features
- Business Integration: Custom calendar systems for fiscal and sprint-based analytics align technical metrics with business rhythms
- Memory Efficiency: Streaming architecture processes datasets of 1,000+ conversations without performance degradation
Relationship to Ecosystem
theunpartycrawler serves as the analytics intelligence engine for the UNPARTY ecosystem:
- → theunpartyrunway: Provides conversation analytics and AI usage metrics for development velocity tracking
- → theunpartyapp: Supplies content processing capabilities and markdown-to-HTML conversion
- ← theunpartyrunway: Receives development workflow data and cost tracking information
- Team Integration: Generates insights that inform development practices across all UNPARTY repositories
NLP Perceptron Pipeline
A lightweight NLP processing pipeline for analyzing text from Unparty documentation. This project crawls documentation content, processes text, classifies it using a perceptron model, and generates structured data for further analysis.
Project Architecture
The pipeline consists of the following stages:
- Web Crawling: Fetches HTML content from Unparty documentation pages.
- Text Preprocessing: Cleans and tokenizes text, builds vocabulary.
- Classification: Uses a perceptron model to classify text as "action" or "non-action" (a minimal sketch follows this list).
- Pattern Recognition: Identifies action patterns through token co-occurrence and semantic relationships.
- Storage Formatting: Organizes processed data into structured clusters for LLM inference.
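For readers new to the technique, a binary perceptron over bag-of-words features reduces to a weight vector, a bias, and an error-driven update rule. The following is a generic sketch, not the LightPerceptron implemented in perceptron.py:

```python
import numpy as np

class TinyPerceptron:
    """Generic binary perceptron over fixed-size feature vectors."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x: np.ndarray) -> int:
        return 1 if x @ self.w + self.b > 0 else 0  # 1 = "action"

    def fit(self, X: np.ndarray, y: np.ndarray, epochs: int = 10) -> None:
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                error = yi - self.predict(xi)       # -1, 0, or 1
                self.w += self.lr * error * xi      # standard perceptron update
                self.b += self.lr * error

# Toy bag-of-words over the vocabulary ["click", "read", "submit"]
X = np.array([[1, 0, 1], [0, 1, 0]], dtype=float)
y = np.array([1, 0])  # first sample reads as an action
model = TinyPerceptron(n_features=3)
model.fit(X, y)
print(model.predict(np.array([1.0, 0.0, 0.0])))
```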
Components
- UnpartyDocCrawler (main.py): HTML crawler that extracts meaningful text content from documentation pages.
- LightPreprocessor (preprocessor.py): Text preprocessor for converting raw text into numerical format.
- LightPerceptron (perceptron.py): Lightweight perceptron for binary text classification.
- ActionPatternProcessor (action.py): Discovers action patterns through token co-occurrence.
- StorageClusterProcessor (store.py): Formats action analysis data into structured clusters.
- TextClusteringPipeline (main.py): Orchestrates the entire processing pipeline.
Output Files
The pipeline generates three main JSON output files:
- clusters_output1.json: Contains initial text clustering into action/non-action categories.
- action_output1.json: Contains detailed action pattern analysis.
- storage_output1.json: Contains structured data formatted for LLM inference.
Requirements
- Python 3.x
- Dependencies listed in requirements.txt:
  - nltk==3.8.1
  - numpy>=2.1.3
  - pytest==7.4.0
  - beautifulsoup4 (not listed but required)
  - requests (not listed but required)
Installation
```bash
# Clone the repository, then navigate to the project directory
cd /path/to/nlp_perceptron

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Running the Application
```bash
# Activate the virtual environment (if not already activated)
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Run the main script
python src/main.py
```

Development
The project includes a testing directory for unit tests:
```bash
# Run tests
pytest tests/
```

Data
Sample data is provided in the data/ directory for testing purposes.