Fana AI Insight Forge - Your WebGenius Scraper and FAQ Engine v0.2.0

Table of Contents

  1. What's New

  2. Core Improvements

  3. Content Processing Pipeline

  4. Technical Architecture

  5. Best Practices

  6. Configuration Guide

  7. Directory Structure

What's New

  • Enhanced HTML content cleaning using Fana's robust content cleaner module

  • Improved markdown conversion with cleaner output

  • Better handling of dynamic content and JavaScript elements

  • More accurate content extraction from complex web pages

  • Smarter form and boilerplate content removal

  • New: Customizable directory structure for organized content storage

Core Improvements

Content Cleaning

  • Integrated Fana's production-grade HTML content cleaner

  • Advanced regex patterns for better content filtering

  • Intelligent handling of website builder artifacts

  • Removal of tracking scripts and analytics code

  • Better preservation of meaningful content

Content Processing Flow

Raw HTML → Main Content Extraction → Unwanted Section Removal → 
Content Cleaning → Markdown Conversion → FAQ Generation → Organized Directory Storage
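
The flow above is a chain of text-to-text stages applied in a fixed order. A minimal sketch of that idea follows, assuming each stage can be modelled as a plain function. The `Stage` alias, `run_pipeline`, and the two toy stages are hypothetical names used for illustration, not part of the Fana codebase.

```rust
/// Hypothetical stage type: text in, cleaned text out.
type Stage = fn(&str) -> String;

/// Toy stage: trim leading and trailing whitespace on every line.
fn trim_lines(input: &str) -> String {
    input.lines().map(|l| l.trim()).collect::<Vec<_>>().join("\n")
}

/// Toy stage: drop lines that are empty after trimming.
fn strip_blank_lines(input: &str) -> String {
    input
        .lines()
        .filter(|l| !l.trim().is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

/// Run the stages in order, feeding each stage's output into the next.
fn run_pipeline(input: &str, stages: &[Stage]) -> String {
    stages
        .iter()
        .fold(input.to_string(), |acc, stage| stage(&acc))
}
```

Because every stage shares one signature, reordering or adding a stage only changes the slice passed to `run_pipeline`.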

HTML Cleaning Features

  • Comprehensive script and style removal

  • Form and contact information filtering

  • Footer and cookie notice elimination

  • URL and tracking code cleanup

  • Duplicate content detection and removal
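
As an illustration of script and style removal, the sketch below strips whole elements by tag name using only the standard library. It is a simplified stand-in for the cleaner's pattern-based approach, not its actual code: `strip_element` is a hypothetical helper, and it does not handle closing tags that appear inside script string literals.

```rust
/// Remove every `<tag ...>...</tag>` block from `html`, ignoring ASCII case.
/// `tag` is expected in lowercase, for example "script" or "style".
fn strip_element(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to `html`.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;

    while let Some(rel) = lower[pos..].find(open.as_str()) {
        let start = pos + rel;
        let name_end = start + open.len();
        // Require a delimiter after the name so "<style" does not match "<styles".
        let delimited = match lower.as_bytes().get(name_end) {
            Some(b) => *b == b'>' || *b == b'/' || b.is_ascii_whitespace(),
            None => false,
        };
        if !delimited {
            out.push_str(&html[pos..name_end]);
            pos = name_end;
            continue;
        }
        // Keep the text before the element, then skip to just past its closing tag.
        out.push_str(&html[pos..start]);
        match lower[name_end..].find(close.as_str()) {
            Some(rel_end) => pos = name_end + rel_end + close.len(),
            None => {
                // Unterminated element: drop the remainder of the input.
                pos = html.len();
                break;
            }
        }
    }
    out.push_str(&html[pos..]);
    out
}
```

Calling it once per tag, for example with "script" and then "style", removes both kinds of block.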

Content Processing Pipeline

1. Initial Content Extraction

  • Smart main content detection using prioritized selectors

  • Better handling of content containers

  • Improved body content processing

  • Removal of non-content elements
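
The prioritized-selector idea can be sketched as "try the most specific container first, then fall back". The example below is a rough approximation that matches on tag names with plain string search instead of real CSS selectors. `extract_main_content` and the priority list in the usage note are assumptions made for illustration, and the prefix match means a tag such as `<mainframe>` would also match "main".

```rust
/// Return the inner HTML of the first container found, trying `priority`
/// in order. Falls back to the whole input when nothing matches.
/// Tag names in `priority` are expected in lowercase.
fn extract_main_content<'a>(html: &'a str, priority: &[&str]) -> &'a str {
    // ASCII lowercasing keeps byte offsets identical to `html`.
    let lower = html.to_ascii_lowercase();
    for tag in priority {
        let open = format!("<{}", tag);
        let close = format!("</{}>", tag);
        if let Some(start) = lower.find(open.as_str()) {
            // Skip past the end of the opening tag, attributes included.
            if let Some(gt) = lower[start..].find('>') {
                let body_start = start + gt + 1;
                // Use the last closing tag so nested elements of the same name stay inside.
                if let Some(end) = lower[body_start..].rfind(close.as_str()) {
                    return &html[body_start..body_start + end];
                }
            }
        }
    }
    html
}
```

With a priority list such as `["main", "article", "body"]`, a page without `<main>` falls through to `<article>`, and a page with neither falls back to `<body>`.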

2. Content Cleaning

  • Enhanced website builder pattern detection

  • Improved form and contact information removal

  • Better handling of inline scripts

  • Smarter URL detection and processing

  • Unicode normalization and character handling
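
For URL handling, one simple heuristic is to drop whitespace-separated tokens that start with a URL prefix. The sketch below shows that heuristic only. `remove_urls` is a hypothetical helper, and the real cleaner is described as using regex patterns, which this does not reproduce.

```rust
/// Drop whitespace-separated tokens that look like URLs.
/// Side effect: runs of whitespace collapse to single spaces.
fn remove_urls(line: &str) -> String {
    line.split_whitespace()
        .filter(|tok| {
            let t = tok.to_ascii_lowercase();
            !(t.starts_with("http://") || t.starts_with("https://") || t.starts_with("www."))
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

A URL wrapped in punctuation, such as `(https://example.com)`, is not caught by this prefix test. A pattern-based matcher would be needed for that case.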

3. Output Generation

  • Four output types stored in dedicated directories:

    • full/: Complete original HTML content

    • cleaned/: Processed plain text

    • markdown/: Clean, formatted markdown

    • faq/: Generated FAQs

  • Consistent cleaning across all formats

Technical Architecture

Content Cleaner Module

  • Reused from the Fana Rust backend

  • Extensive pattern matching capabilities

  • Configurable cleaning parameters

  • Robust UTF-8 handling

  • Intelligent content preservation

Key Components

  1. Pattern Matching Engine

    • Website builder patterns

    • Form and contact patterns

    • Script and style patterns

    • URL patterns

    • HTML entity patterns

  2. Content Processing

    • Multi-phase cleaning approach

    • Smart line processing

    • Duplicate detection

    • Substring removal

    • Whitespace normalization

  3. Configuration Options

    • Line length control

    • URL handling

    • Newline preservation

    • Unicode normalization

    • Minimum content thresholds
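
Duplicate detection and substring removal from the Content Processing component can be sketched as two small passes over the extracted lines. Both functions below are hypothetical illustrations of the idea. The comparison rules (trimmed, case-insensitive, first occurrence wins) are assumptions, not documented behavior of the cleaner.

```rust
use std::collections::HashSet;

/// Keep the first occurrence of each non-empty line, comparing lines
/// after trimming and lowercasing.
fn dedupe_lines(lines: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut kept = Vec::new();
    for line in lines {
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        // `insert` returns false when the key was already present.
        if seen.insert(trimmed.to_lowercase()) {
            kept.push(trimmed.to_string());
        }
    }
    kept
}

/// Drop any line that is fully contained in another, longer line.
fn remove_substring_lines(lines: Vec<String>) -> Vec<String> {
    lines
        .iter()
        .filter(|line| {
            !lines
                .iter()
                .any(|other| other.len() > line.len() && other.contains(line.as_str()))
        })
        .cloned()
        .collect()
}
```

The substring pass is quadratic in the number of lines. That is acceptable for a single page, though very large inputs would likely need a different strategy.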

Best Practices

Content Extraction

  • Use specific content selectors first

  • Fall back to broader selectors when needed

  • Remove obvious non-content elements

  • Preserve meaningful content structure

Content Cleaning

  • Apply consistent cleaning across all formats

  • Remove unwanted elements before processing

  • Preserve important formatting

  • Handle special characters appropriately

Output Generation

  • Generate clean, consistent output

  • Maintain content hierarchy

  • Remove duplicate information

  • Preserve important formatting

Directory Organization

  • Use descriptive directory names

  • Maintain consistent file naming conventions

  • Keep related content together

  • Follow the established directory structure

Configuration Guide

Content Cleaner Configuration

ContentCleanerConfig {
    max_line_length: 1000,
    min_text_length: 5,
    remove_urls: true,
    preserve_newlines: true,
    max_consecutive_newlines: 2,
    normalize_unicode: true,
}
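
To make the fields above concrete, here is a sketch of how such a config could be declared and how three of its settings might be applied. The field types, the `Default` values (copied from the listing above), and `apply_line_rules` are assumptions. The real `ContentCleanerConfig` may be defined and interpreted differently, for example in whether `max_consecutive_newlines: 2` means one blank line, as assumed here.

```rust
/// Hypothetical mirror of the fields listed above; the real type may differ.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct ContentCleanerConfig {
    max_line_length: usize,
    min_text_length: usize,
    remove_urls: bool,
    preserve_newlines: bool,
    max_consecutive_newlines: usize,
    normalize_unicode: bool,
}

impl Default for ContentCleanerConfig {
    fn default() -> Self {
        Self {
            max_line_length: 1000,
            min_text_length: 5,
            remove_urls: true,
            preserve_newlines: true,
            max_consecutive_newlines: 2,
            normalize_unicode: true,
        }
    }
}

/// Apply the length and newline settings only. URL removal and Unicode
/// normalization are left out of this sketch.
fn apply_line_rules(text: &str, cfg: &ContentCleanerConfig) -> String {
    let mut out = String::new();
    let mut blank_run = 0usize; // blank lines seen since the last kept line
    for line in text.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() {
            blank_run += 1;
            continue;
        }
        // Drop fragments shorter than the minimum, counted in characters.
        if trimmed.chars().count() < cfg.min_text_length {
            continue;
        }
        if !out.is_empty() {
            if cfg.preserve_newlines {
                // One break per line boundary plus blank lines, capped by the config.
                let breaks = (1 + blank_run).min(cfg.max_consecutive_newlines).max(1);
                out.push_str(&"\n".repeat(breaks));
            } else {
                out.push(' ');
            }
        }
        blank_run = 0;
        // Truncate overly long lines at a character boundary.
        out.extend(trimmed.chars().take(cfg.max_line_length));
    }
    out
}
```

A single setting can be overridden with struct update syntax, for example `ContentCleanerConfig { preserve_newlines: false, ..ContentCleanerConfig::default() }`.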

Directory Structure Configuration

scraped_content/
└── [user-defined-name]/
    ├── faq/         # Generated FAQ files
    ├── markdown/    # Clean markdown content
    ├── full/        # Original HTML content
    └── cleaned/     # Processed plain text
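
The layout above could be created with a few standard-library calls. This is a sketch, not the tool's actual code: `create_content_dirs` and `SUBDIRS` are hypothetical names, and the function does not validate the user-supplied directory name.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// The four subdirectories shown in the tree above.
const SUBDIRS: [&str; 4] = ["faq", "markdown", "full", "cleaned"];

/// Create `<root>/scraped_content/<name>/` with all four subdirectories
/// and return the path of the content directory.
fn create_content_dirs(root: &Path, name: &str) -> io::Result<PathBuf> {
    let base = root.join("scraped_content").join(name);
    for sub in SUBDIRS.iter() {
        // `create_dir_all` also creates missing parents and succeeds
        // if the directory already exists.
        fs::create_dir_all(base.join(sub))?;
    }
    Ok(base)
}
```

Because `create_dir_all` is idempotent, running the scraper twice with the same name reuses the existing directories.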

Usage Example

Enter name for the content directory:
example

How many FAQs do you want per page?
10

Please paste the URLs to scrape, 1 per line. Enter a blank line to finish:
https://example.com/page1
https://example.com/page2
[empty line to finish]
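
The URL prompt above reads one URL per line and stops at the first blank line. A minimal sketch of that input loop follows, written against any `BufRead` source so it can be exercised without a terminal. `read_urls` is a hypothetical helper, not the tool's actual function.

```rust
use std::io::{self, BufRead};

/// Collect trimmed, non-empty lines until the first blank line or end of input.
fn read_urls<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut urls = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let url = line.trim();
        if url.is_empty() {
            break; // a blank line finishes the list
        }
        urls.push(url.to_string());
    }
    Ok(urls)
}
```

In an interactive session this would be called as `read_urls(io::stdin().lock())`.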

Customization Options

  • Adjust content thresholds

  • Configure URL handling

  • Control newline behavior

  • Set Unicode normalization

  • Define minimum content length

  • Customize output directory name

This release improves both the quality of content processing and the organization of its output. Integrating Fana's content cleaner module produces cleaner, more consistent output across all formats, and the new directory structure keeps generated content organized and easier to manage.
