CRAWLER

The UNPARTY tool for crawling content and analyzing AI conversation data.

theunpartycrawler (Analytics Intelligence)

Location: unparty-app/theunpartycrawler
Status: Active Development
Primary Purpose: AI conversation analytics suite and content processing for development intelligence

CRAWLER is a comprehensive AI conversation analytics suite that has evolved from a simple NLP perceptron pipeline. It processes ChatGPT and Claude conversation exports to generate insights about usage patterns, topic evolution, and behavioral analysis.

Overview

This repository serves as the analytics intelligence engine for the UNPARTY ecosystem, providing deep insights into AI-assisted development workflows and conversation patterns. It combines legacy NLP capabilities with modern analytics processors to deliver actionable intelligence for development teams.

Tech Stack

  • Language: Python 3.12
  • Framework: Hybrid Architecture (Legacy NLP + Modern Analytics)
  • ML Libraries: PyTorch, Transformers, scikit-learn
  • Visualization: Plotly, Matplotlib, Pandas
  • Data Processing: NumPy, Pandas
  • Testing: pytest
  • Build Tool: Bash scripts for markdown conversion
  • Deployment: Vercel (with security headers and route protection)

Enhanced convert.sh Tool šŸš€

The repository now includes an enhanced markdown-to-HTML conversion tool with production-ready features:

✨ Key Features

  • šŸŒ“ Dark/Light Mode Toggle: Automatic theme detection with manual override
  • šŸ–Øļø Direct Print Functionality: Optimized print styles with professional layout
  • šŸ“ Intelligent Path Handling: Outputs converted files adjacent to source markdown files
  • šŸŽØ Customizable Branding: 5 brand colors with mandatory accessibility compliance
  • šŸ“± Responsive Design: Mobile-friendly with accessible controls

šŸš€ Quick Start

bash

# Make executable
chmod +x convert.sh

# Convert markdown to enhanced HTML
./convert.sh your-document.md

# Output will be created as: your-document.html (same directory)

šŸ“– Documentation

šŸŽÆ Path Handling Examples

bash

./convert.sh file.md                    # → file.html (same directory)
./convert.sh docs/readme.md             # → docs/readme.html
./convert.sh /tmp/notes.md              # → /tmp/notes.html
./convert.sh file.md custom/output.html # → custom/output.html

Key Features

AI Conversation Analytics Suite

  • Conversation Analytics Suite: 67+ specialized Python processors for ChatGPT/Claude conversation data
  • Platform Comparison: Cross-platform usage analysis and behavioral insights
  • Topic Evolution: Monthly and weekly trend analysis with keyword detection
  • Interactive Dashboards: HTML visualizations with responsive design and dark/light mode
  • Business Calendar Integration: Custom fiscal, sprint, and work calendar analytics
  • Memory-Efficient Processing: Streaming generators handle datasets of 1,000+ conversations (see the sketch below)
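
The streaming approach is generator chaining: parse the export once, then pass messages through lazy stages so only counters and the current message stay in memory. A minimal sketch of the pattern, assuming a ChatGPT-style export that is a JSON array of conversation objects (function names are illustrative, not the suite's actual API):

python

import json
from collections import Counter
from datetime import datetime, timezone
from typing import Iterator

def stream_messages(path: str) -> Iterator[dict]:
    """Yield messages one at a time instead of building intermediate lists."""
    with open(path, encoding="utf-8") as f:
        conversations = json.load(f)  # parsed once; every later stage is lazy
    for conversation in conversations:
        for node in conversation.get("mapping", {}).values():
            message = node.get("message")
            if message:
                yield message

def monthly_counts(messages: Iterator[dict]) -> dict[str, int]:
    """Consume the stream; only small per-month counters are held in memory."""
    counts: Counter = Counter()
    for m in messages:
        ts = m.get("create_time")
        if ts:
            counts[datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")] += 1
    return dict(counts)

# usage: monthly = monthly_counts(stream_messages("conversations.json"))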

Enhanced Convert Tool

  • Production-Ready Conversion: Markdown-to-HTML with dark/light mode toggle
  • Print Optimization: Professional print styles with direct print functionality
  • Responsive Design: Mobile-friendly with accessible controls
  • Customizable Branding: 5 brand colors with accessibility compliance

Legacy NLP Pipeline

  • Perceptron-Based Classification: Binary text classification for action/non-action content
  • Pattern Recognition: Token co-occurrence and semantic relationship discovery (sketched after this list)
  • Structured Data Output: LLM-ready formatted clusters
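
The co-occurrence idea behind the pattern recognition bullet fits in a few lines; this sketch illustrates the technique only and is not the actual action.py implementation:

python

from collections import Counter
from itertools import combinations

def cooccurrence_counts(token_lists: list[list[str]]) -> Counter:
    """Count how often unordered token pairs share a text unit."""
    pairs: Counter = Counter()
    for tokens in token_lists:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

sentences = [["deploy", "the", "app"], ["deploy", "app", "tonight"]]
print(cooccurrence_counts(sentences).most_common(2))
# [(('app', 'deploy'), 2), (('app', 'the'), 1)]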

Architecture

Code

Analytics Intelligence (Python)
ā”œā”€ā”€ Legacy Pipeline (NLP Perceptron)
│   ā”œā”€ā”€ main.py (HTML crawler)
│   ā”œā”€ā”€ preprocessor.py (Text processing)
│   ā”œā”€ā”€ perceptron.py (Binary classification)
│   ā”œā”€ā”€ action.py (Pattern recognition)
│   └── store.py (Data clustering)
ā”œā”€ā”€ Modern Analytics Suite (67+ processors)
│   ā”œā”€ā”€ conversation_timeline.py (ChatGPT analysis)
│   ā”œā”€ā”€ claude_timeline.py (Claude analysis)
│   ā”œā”€ā”€ platform_comparison.py (Cross-platform insights)
│   ā”œā”€ā”€ weekly_usage_analyzer.py (Behavioral patterns)
│   ā”œā”€ā”€ monthly_topic_analyzer.py (Topic evolution)
│   └── analytics_validator.py (Quality assurance)
ā”œā”€ā”€ Enhanced Convert Tool
│   ā”œā”€ā”€ convert.sh (Markdown-to-HTML converter)
│   ā”œā”€ā”€ convert.css (Responsive styling)
│   └── onboarding-convert.md (Documentation)
ā”œā”€ā”€ Calendar System
│   ā”œā”€ā”€ generate_calendar_files.py (Business calendar generator)
│   └── data/ (Fiscal, sprint, and work calendars)
└── Utilities
    ā”œā”€ā”€ utils/html_styling.py (Centralized styling)
    └── tests/ (5 test files, 3 passing)

Data Processing Patterns

Conversation Format Handling

  • ChatGPT Format: Complex nested mapping structure with conversation nodes
  • Claude Format: Simple array structure with direct message access
  • Topic Detection: Keyword-based classification with configurable dictionaries (sketched after this list)
  • Streaming Processing: Memory-efficient generators for large conversation datasets
  • Output Formats: HTML dashboards, CSV exports, JSON dumps, Markdown reports
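
A minimal sketch of that keyword-based topic detection; the topic dictionary below is hypothetical, standing in for the configurable dictionaries the real analyzers load:

python

# Hypothetical topic dictionary; the real processors load these from config.
TOPICS = {
    "deployment": {"deploy", "vercel", "ci", "pipeline"},
    "testing": {"pytest", "test", "assert", "coverage"},
}

def detect_topics(text: str) -> list[str]:
    """Return every topic whose keyword set intersects the message tokens."""
    tokens = set(text.lower().split())
    return [topic for topic, keywords in TOPICS.items() if tokens & keywords]

print(detect_topics("Fix the pytest coverage gate before deploy"))
# ['deployment', 'testing']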

Input Data Expectations

python

# ChatGPT: conversations.json with 'mapping' field
{
  "mapping": {
    "node_id": {
      "message": {
        "create_time": timestamp,
        "content": {"parts": ["message text"]}
      }
    }
  }
}

# Claude: conversations.json with 'chat_messages' array
{
  "chat_messages": [
    {
      "created_at": timestamp,
      "text": "message text",
      "sender": "human" | "assistant"
    }
  ]
}

Integration Points

  • GitHub Actions: CI/CD with Claude Code and Claude Code Review workflows
  • Vercel: Deployment platform with security headers and route protection
  • Calendar Systems: Business calendar integration for fiscal and sprint analytics
  • Cross-Platform Analysis: Integrates ChatGPT and Claude conversation data
  • UNPARTY Ecosystem: Provides analytics intelligence to theunpartyrunway for development metrics

Business Value

ABOUT → BUILD → CONNECT

  • ABOUT: Analyzes conversation patterns to understand user engagement, topic evolution, and development workflows. Helps teams understand how they use AI assistance and where time is spent.

  • BUILD: Provides development insights through technical discussion analysis, code extraction, and pattern recognition. Enables data-driven decisions about development practices and AI tool usage.

  • CONNECT: Generates comprehensive reports and visualizations for stakeholder communication. Creates shareable HTML dashboards, CSV exports, and markdown reports that facilitate team discussions and retrospectives.

Protecting Creator Ownership, Privacy, and Cost-Sensitivity

  • Privacy-First: All processing happens locally on conversation exports; no data sent to external services
  • Cost-Aware: Streaming generators minimize memory usage and compute costs for large datasets
  • Creator Ownership: All analytics outputs are local files that remain under creator control
  • Transparency: Open-source codebase allows full audit of data processing logic

Key Differentiators

  • Dual Nature System: Maintains both legacy NLP capabilities and modern analytics in a single codebase
  • Conversation Intelligence: Deep analysis of ChatGPT and Claude conversation exports with platform-specific optimizations
  • Production-Ready Conversion: Enhanced markdown-to-HTML tool with professional styling and accessibility features
  • Business Integration: Custom calendar systems for fiscal and sprint-based analytics align technical metrics with business rhythms
  • Memory Efficiency: Handles datasets of 1,000+ conversations without performance degradation through streaming architecture

Relationship to Ecosystem

theunpartycrawler serves as the analytics intelligence engine for the UNPARTY ecosystem:

  • → theunpartyrunway: Provides conversation analytics and AI usage metrics for development velocity tracking
  • → theunpartyapp: Supplies content processing capabilities and markdown-to-HTML conversion
  • ← theunpartyrunway: Receives development workflow data and cost tracking information
  • Team Integration: Generates insights that inform development practices across all UNPARTY repositories

NLP Perceptron Pipeline

A lightweight NLP processing pipeline for analyzing text from Unparty documentation. This project crawls documentation content, processes text, classifies it using a perceptron model, and generates structured data for further analysis.

Project Architecture

The pipeline consists of the following stages:

  1. Web Crawling: Fetches HTML content from Unparty documentation pages.
  2. Text Preprocessing: Cleans and tokenizes text, builds vocabulary.
  3. Classification: Uses a perceptron model to classify text as "action" or "non-action" (see the sketch after this list).
  4. Pattern Recognition: Identifies action patterns through token co-occurrence and semantic relationships.
  5. Storage Formatting: Organizes processed data into structured clusters for LLM inference.
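
Stage 3 is a classic perceptron update rule. A minimal sketch, assuming numeric feature vectors from the preprocessing stage; perceptron.py's actual features and hyperparameters may differ:

python

import numpy as np

class TinyPerceptron:
    """Binary classifier: 1 = action text, 0 = non-action text."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x: np.ndarray) -> int:
        return int(np.dot(self.w, x) + self.b > 0)

    def fit(self, X: np.ndarray, y: np.ndarray, epochs: int = 10) -> None:
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                error = yi - self.predict(xi)   # -1, 0, or +1
                self.w += self.lr * error * xi  # move the boundary toward errors
                self.b += self.lr * error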

Components

  • UnpartyDocCrawler (main.py): HTML crawler that extracts meaningful text content from documentation pages.
  • LightPreprocessor (preprocessor.py): Text preprocessor for converting raw text into numerical format.
  • LightPerceptron (perceptron.py): Lightweight perceptron for binary text classification.
  • ActionPatternProcessor (action.py): Discovers action patterns through token co-occurrence.
  • StorageClusterProcessor (store.py): Formats action analysis data into structured clusters.
  • TextClusteringPipeline (main.py): Orchestrates the entire processing pipeline.
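
The orchestration amounts to a linear hand-off between these five components. A hedged sketch of the control flow, with the components passed in as plain callables (the real TextClusteringPipeline wiring may differ):

python

from typing import Callable, Sequence

def run_pipeline(
    url: str,
    fetch: Callable[[str], list[str]],           # stage 1: UnpartyDocCrawler
    vectorize: Callable[[list[str]], Sequence],  # stage 2: LightPreprocessor
    classify: Callable[[object], int],           # stage 3: LightPerceptron
    mine: Callable[[list[str]], dict],           # stage 4: ActionPatternProcessor
    cluster: Callable[[dict], dict],             # stage 5: StorageClusterProcessor
) -> dict:
    """Each stage feeds the next; only action-labeled texts reach pattern mining."""
    texts = fetch(url)
    vectors = vectorize(texts)
    labels = [classify(v) for v in vectors]
    action_texts = [t for t, label in zip(texts, labels) if label == 1]
    return cluster(mine(action_texts))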

Output Files

The pipeline generates three main JSON output files:

  • clusters_output1.json: Contains initial text clustering into action/non-action categories.
  • action_output1.json: Contains detailed action pattern analysis.
  • storage_output1.json: Contains structured data formatted for LLM inference.

Requirements

  • Python 3.x
  • Dependencies listed in requirements.txt:
    • nltk==3.8.1
    • numpy>=2.1.3
    • pytest==7.4.0
    • beautifulsoup4 (not listed but required)
    • requests (not listed but required)

Installation

bash

# Clone the repository
# Navigate to project directory
cd /path/to/nlp_perceptron

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Running the Application

bash

# Activate virtual environment (if not already activated)
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Run the main script
python src/main.py

Development

The project includes a testing directory for unit tests:

bash

# Run tests
pytest tests/

Data

Sample data is provided in the data/ directory for testing purposes.