CRAWLER

The UNPARTY tool for crawling content and analyzing AI conversation data.

theunpartycrawler (Analytics Intelligence)

Location: unparty-app/theunpartycrawler
Status: Active Development
Primary Purpose: AI conversation analytics suite and content processing for development intelligence

CRAWLER is a comprehensive AI conversation analytics suite that has evolved from a simple NLP perceptron pipeline. It processes ChatGPT and Claude conversation exports to generate insights about usage patterns, topic evolution, and behavioral analysis.

Overview

This repository serves as the analytics intelligence engine for the UNPARTY ecosystem, providing deep insights into AI-assisted development workflows and conversation patterns. It combines legacy NLP capabilities with modern analytics processors to deliver actionable intelligence for development teams.

Tech Stack

  • Language: Python 3.12
  • Framework: Hybrid Architecture (Legacy NLP + Modern Analytics)
  • ML Libraries: PyTorch, Transformers, scikit-learn
  • Visualization: Plotly, Matplotlib, Pandas
  • Data Processing: NumPy, Pandas
  • Testing: pytest
  • Build Tool: Bash scripts for markdown conversion
  • Deployment: Vercel (with security headers and route protection)

Enhanced convert.sh Tool šŸš€

The repository now includes an enhanced markdown-to-HTML conversion tool with production-ready features:

✨ Key Features

  • šŸŒ“ Dark/Light Mode Toggle: Automatic theme detection with manual override
  • šŸ–Øļø Direct Print Functionality: Optimized print styles with professional layout
  • šŸ“ Intelligent Path Handling: Outputs converted files adjacent to source markdown files
  • šŸŽØ Customizable Branding: 5 brand colors with mandatory accessibility compliance
  • šŸ“± Responsive Design: Mobile-friendly with accessible controls

šŸš€ Quick Start

bash

# Make executable
chmod +x convert.sh

# Convert markdown to enhanced HTML
./convert.sh your-document.md

# Output will be created as: your-document.html (same directory)

šŸ“– Documentation

šŸŽÆ Path Handling Examples

bash

./convert.sh file.md                    # → file.html (same directory)
./convert.sh docs/readme.md             # → docs/readme.html
./convert.sh /tmp/notes.md              # → /tmp/notes.html
./convert.sh file.md custom/output.html # → custom/output.html

Key Features

AI Conversation Analytics Suite

  • Conversation Analytics Suite: 67+ specialized Python processors for ChatGPT/Claude conversation data
  • Platform Comparison: Cross-platform usage analysis and behavioral insights
  • Topic Evolution: Monthly and weekly trend analysis with keyword detection
  • Interactive Dashboards: HTML visualizations with responsive design and dark/light mode
  • Business Calendar Integration: Custom fiscal, sprint, and work calendar analytics
  • Memory-Efficient Processing: Streaming generators handle datasets of 1,000+ conversations (see the sketch below)
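
The streaming approach is generator chaining: parse the export once, then pass messages through lazy stages so only counters and the current message stay in memory. A minimal sketch of the pattern, assuming a ChatGPT-style export that is a JSON array of conversation objects (function names are illustrative, not the suite's actual API):

python

import json
from collections import Counter
from datetime import datetime, timezone
from typing import Iterator

def stream_messages(path: str) -> Iterator[dict]:
    """Yield messages one at a time instead of building intermediate lists."""
    with open(path, encoding="utf-8") as f:
        conversations = json.load(f)  # parsed once; every later stage is lazy
    for conversation in conversations:
        for node in conversation.get("mapping", {}).values():
            message = node.get("message")
            if message:
                yield message

def monthly_counts(messages: Iterator[dict]) -> dict[str, int]:
    """Consume the stream; only small per-month counters are held in memory."""
    counts: Counter = Counter()
    for m in messages:
        ts = m.get("create_time")
        if ts:
            counts[datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")] += 1
    return dict(counts)

# usage: monthly = monthly_counts(stream_messages("conversations.json"))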

Enhanced Convert Tool

  • Production-Ready Conversion: Markdown-to-HTML with dark/light mode toggle
  • Print Optimization: Professional print styles with direct print functionality
  • Responsive Design: Mobile-friendly with accessible controls
  • Customizable Branding: 5 brand colors with accessibility compliance

Legacy NLP Pipeline

  • Perceptron-Based Classification: Binary text classification for action/non-action content
  • Pattern Recognition: Token co-occurrence and semantic relationship discovery (sketched after this list)
  • Structured Data Output: LLM-ready formatted clusters
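
The co-occurrence idea behind the pattern recognition bullet fits in a few lines; this sketch illustrates the technique only and is not the actual action.py implementation:

python

from collections import Counter
from itertools import combinations

def cooccurrence_counts(token_lists: list[list[str]]) -> Counter:
    """Count how often unordered token pairs share a text unit."""
    pairs: Counter = Counter()
    for tokens in token_lists:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

sentences = [["deploy", "the", "app"], ["deploy", "app", "tonight"]]
print(cooccurrence_counts(sentences).most_common(2))
# [(('app', 'deploy'), 2), (('app', 'the'), 1)]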

Architecture

Code

Analytics Intelligence (Python)
ā”œā”€ā”€ Legacy Pipeline (NLP Perceptron)
│   ā”œā”€ā”€ main.py (HTML crawler)
│   ā”œā”€ā”€ preprocessor.py (Text processing)
│   ā”œā”€ā”€ perceptron.py (Binary classification)
│   ā”œā”€ā”€ action.py (Pattern recognition)
│   └── store.py (Data clustering)
ā”œā”€ā”€ Modern Analytics Suite (67+ processors)
│   ā”œā”€ā”€ conversation_timeline.py (ChatGPT analysis)
│   ā”œā”€ā”€ claude_timeline.py (Claude analysis)
│   ā”œā”€ā”€ platform_comparison.py (Cross-platform insights)
│   ā”œā”€ā”€ weekly_usage_analyzer.py (Behavioral patterns)
│   ā”œā”€ā”€ monthly_topic_analyzer.py (Topic evolution)
│   └── analytics_validator.py (Quality assurance)
ā”œā”€ā”€ Enhanced Convert Tool
│   ā”œā”€ā”€ convert.sh (Markdown-to-HTML converter)
│   ā”œā”€ā”€ convert.css (Responsive styling)
│   └── onboarding-convert.md (Documentation)
ā”œā”€ā”€ Calendar System
│   ā”œā”€ā”€ generate_calendar_files.py (Business calendar generator)
│   └── data/ (Fiscal, sprint, and work calendars)
└── Utilities
    ā”œā”€ā”€ utils/html_styling.py (Centralized styling)
    └── tests/ (5 test files, 3 passing)

Data Processing Patterns

Conversation Format Handling

  • ChatGPT Format: Complex nested mapping structure with conversation nodes
  • Claude Format: Simple array structure with direct message access
  • Topic Detection: Keyword-based classification with configurable dictionaries (sketched after this list)
  • Streaming Processing: Memory-efficient generators for large conversation datasets
  • Output Formats: HTML dashboards, CSV exports, JSON dumps, Markdown reports
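
A minimal sketch of that keyword-based topic detection; the topic dictionary below is hypothetical, standing in for the configurable dictionaries the real analyzers load:

python

# Hypothetical topic dictionary; the real processors load these from config.
TOPICS = {
    "deployment": {"deploy", "vercel", "ci", "pipeline"},
    "testing": {"pytest", "test", "assert", "coverage"},
}

def detect_topics(text: str) -> list[str]:
    """Return every topic whose keyword set intersects the message tokens."""
    tokens = set(text.lower().split())
    return [topic for topic, keywords in TOPICS.items() if tokens & keywords]

print(detect_topics("Fix the pytest coverage gate before deploy"))
# ['deployment', 'testing']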

Input Data Expectations

python

# ChatGPT: conversations.json with 'mapping' field
{
  "mapping": {
    "node_id": {
      "message": {
        "create_time": timestamp,
        "content": {"parts": ["message text"]}
      }
    }
  }
}

# Claude: conversations.json with 'chat_messages' array
{
  "chat_messages": [
    {
      "created_at": timestamp,
      "text": "message text",
      "sender": "human" | "assistant"
    }
  ]
}

Integration Points

  • GitHub Actions: CI/CD with Claude Code and Claude Code Review workflows
  • Vercel: Deployment platform with security headers and route protection
  • Calendar Systems: Business calendar integration for fiscal and sprint analytics
  • Cross-Platform Analysis: Integrates ChatGPT and Claude conversation data
  • UNPARTY Ecosystem: Provides analytics intelligence to theunpartyrunway for development metrics

Business Value

ABOUT → BUILD → CONNECT

  • ABOUT: Analyzes conversation patterns to understand user engagement, topic evolution, and development workflows. Helps teams understand how they use AI assistance and where time is spent.

  • BUILD: Provides development insights through technical discussion analysis, code extraction, and pattern recognition. Enables data-driven decisions about development practices and AI tool usage.

  • CONNECT: Generates comprehensive reports and visualizations for stakeholder communication. Creates shareable HTML dashboards, CSV exports, and markdown reports that facilitate team discussions and retrospectives.

Protecting Creator Ownership, Privacy, and Cost-Sensitivity

  • Privacy-First: All processing happens locally on conversation exports; no data sent to external services
  • Cost-Aware: Streaming generators minimize memory usage and compute costs for large datasets
  • Creator Ownership: All analytics outputs are local files that remain under creator control
  • Transparency: Open-source codebase allows full audit of data processing logic

Key Differentiators

  • Dual Nature System: Maintains both legacy NLP capabilities and modern analytics in a single codebase
  • Conversation Intelligence: Deep analysis of ChatGPT and Claude conversation exports with platform-specific optimizations
  • Production-Ready Conversion: Enhanced markdown-to-HTML tool with professional styling and accessibility features
  • Business Integration: Custom calendar systems for fiscal and sprint-based analytics align technical metrics with business rhythms
  • Memory Efficiency: Handles datasets of 1,000+ conversations without performance degradation through streaming architecture

Relationship to Ecosystem

theunpartycrawler serves as the analytics intelligence engine for the UNPARTY ecosystem:

  • → theunpartyrunway: Provides conversation analytics and AI usage metrics for development velocity tracking
  • → theunpartyapp: Supplies content processing capabilities and markdown-to-HTML conversion
  • ← theunpartyrunway: Receives development workflow data and cost tracking information
  • Team Integration: Generates insights that inform development practices across all UNPARTY repositories

NLP Perceptron Pipeline

A lightweight NLP processing pipeline for analyzing text from Unparty documentation. This project crawls documentation content, processes text, classifies it using a perceptron model, and generates structured data for further analysis.

Project Architecture

The pipeline consists of the following stages:

  1. Web Crawling: Fetches HTML content from Unparty documentation pages.
  2. Text Preprocessing: Cleans and tokenizes text, builds vocabulary.
  3. Classification: Uses a perceptron model to classify text as "action" or "non-action" (see the sketch after this list).
  4. Pattern Recognition: Identifies action patterns through token co-occurrence and semantic relationships.
  5. Storage Formatting: Organizes processed data into structured clusters for LLM inference.
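
Stage 3 is a classic perceptron update rule. A minimal sketch, assuming numeric feature vectors from the preprocessing stage; perceptron.py's actual features and hyperparameters may differ:

python

import numpy as np

class TinyPerceptron:
    """Binary classifier: 1 = action text, 0 = non-action text."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x: np.ndarray) -> int:
        return int(np.dot(self.w, x) + self.b > 0)

    def fit(self, X: np.ndarray, y: np.ndarray, epochs: int = 10) -> None:
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                error = yi - self.predict(xi)   # -1, 0, or +1
                self.w += self.lr * error * xi  # move the boundary toward errors
                self.b += self.lr * error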

Components

  • UnpartyDocCrawler (main.py): HTML crawler that extracts meaningful text content from documentation pages.
  • LightPreprocessor (preprocessor.py): Text preprocessor for converting raw text into numerical format.
  • LightPerceptron (perceptron.py): Lightweight perceptron for binary text classification.
  • ActionPatternProcessor (action.py): Discovers action patterns through token co-occurrence.
  • StorageClusterProcessor (store.py): Formats action analysis data into structured clusters.
  • TextClusteringPipeline (main.py): Orchestrates the entire processing pipeline.
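
The orchestration amounts to a linear hand-off between these five components. A hedged sketch of the control flow, with the components passed in as plain callables (the real TextClusteringPipeline wiring may differ):

python

from typing import Callable, Sequence

def run_pipeline(
    url: str,
    fetch: Callable[[str], list[str]],           # stage 1: UnpartyDocCrawler
    vectorize: Callable[[list[str]], Sequence],  # stage 2: LightPreprocessor
    classify: Callable[[object], int],           # stage 3: LightPerceptron
    mine: Callable[[list[str]], dict],           # stage 4: ActionPatternProcessor
    cluster: Callable[[dict], dict],             # stage 5: StorageClusterProcessor
) -> dict:
    """Each stage feeds the next; only action-labeled texts reach pattern mining."""
    texts = fetch(url)
    vectors = vectorize(texts)
    labels = [classify(v) for v in vectors]
    action_texts = [t for t, label in zip(texts, labels) if label == 1]
    return cluster(mine(action_texts))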

Output Files

The pipeline generates three main JSON output files:

  • clusters_output1.json: Contains initial text clustering into action/non-action categories.
  • action_output1.json: Contains detailed action pattern analysis.
  • storage_output1.json: Contains structured data formatted for LLM inference.

Requirements

  • Python 3.x
  • Dependencies listed in requirements.txt:
    • nltk==3.8.1
    • numpy>=2.1.3
    • pytest==7.4.0
    • beautifulsoup4 (not listed but required)
    • requests (not listed but required)

Installation

bash

# Clone the repository
# Navigate to project directory
cd /path/to/nlp_perceptron

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Running the Application

bash

# Activate virtual environment (if not already activated)
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Run the main script
python src/main.py

Development

The project includes a testing directory for unit tests:

bash

# Run tests
pytest tests/

Data

Sample data is provided in the data/ directory for testing purposes.