Creating a Brochure from a URL Using an LLM

Ever needed to create a professional brochure from a company's website but dreaded the manual copy-paste process? Large Language Models (LLMs) have made this task remarkably simple. In this guide, I'll show you how to automatically extract website content and generate a polished brochure using AI.

Why This Matters

In my consulting work, I often need to quickly understand a client's business and create presentation materials. Manually browsing websites, extracting key information, and formatting brochures is time-consuming. With LLMs, we can:

  • Save hours of manual work - Automate content extraction and formatting
  • Ensure consistency - Use the same structure across all brochures
  • Focus on what matters - Spend time on strategy, not copy-paste
  • Generate multiple versions - Easily create variations for different audiences

The Architecture

Here's the high-level approach we'll use:

  1. Fetch website content - Extract HTML from the URL
  2. Parse and clean - Remove scripts, styles, and navigation
  3. Extract key information - Use LLM to identify important content
  4. Generate brochure - Format into a professional layout
  5. Export - Output as PDF, Markdown, or HTML

Tools We'll Use

  • Python - Our programming language
  • BeautifulSoup - HTML parsing and content extraction
  • OpenAI API - GPT-4 for content analysis and generation
  • Requests - HTTP requests to fetch web pages
  • Markdown/HTML - Output formats

Step 1: Fetching Website Content

First, we need to retrieve the HTML content from the URL:

import requests
from bs4 import BeautifulSoup

def fetch_website_content(url):
    """Fetch and parse website content."""
    try:
        # Send GET request with headers to mimic a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Parse HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Remove unwanted elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        
        return soup
    
    except requests.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

Key Points:

  • User-Agent header - Some websites block the default python-requests User-Agent, so we send a browser-like one
  • Timeout - Prevent hanging on slow websites
  • Remove noise - Strip scripts, styles, and navigation elements

Step 2: Extracting Structured Content

Now let's extract the meaningful text while preserving structure:

def extract_structured_content(soup):
    """Extract main content with structure preserved."""
    content = {
        'title': '',
        'headings': [],
        'paragraphs': [],
        'lists': []
    }
    
    # Get page title
    title_tag = soup.find('title')
    content['title'] = title_tag.get_text().strip() if title_tag else ''
    
    # Extract headings
    for heading in soup.find_all(['h1', 'h2', 'h3']):
        content['headings'].append({
            'level': heading.name,
            'text': heading.get_text().strip()
        })
    
    # Extract paragraphs
    for p in soup.find_all('p'):
        text = p.get_text().strip()
        if len(text) > 20:  # Filter out short snippets
            content['paragraphs'].append(text)
    
    # Extract lists
    for ul in soup.find_all(['ul', 'ol']):
        items = [li.get_text().strip() for li in ul.find_all('li')]
        if items:
            content['lists'].append(items)
    
    return content

Step 3: Using LLM to Analyze Content

Here's where the magic happens. We'll use GPT-4 to understand the website and extract key information:

import openai
import json

def analyze_with_llm(content, api_key):
    """Use LLM to analyze and extract key information."""
    openai.api_key = api_key
    
    # Prepare the content for analysis
    text_content = f"""
Website Title: {content['title']}

Headings:
{chr(10).join([f"{h['level']}: {h['text']}" for h in content['headings'][:10]])}

Main Content:
{chr(10).join(content['paragraphs'][:15])}
"""
    
    # Create the prompt
    prompt = f"""Analyze this website content and extract key information for a professional brochure.

{text_content}

Please provide the following in JSON format:
1. company_name: The company or website name
2. tagline: A compelling one-line description
3. overview: 2-3 sentence overview of what they do
4. key_features: List of 4-6 main features/services
5. benefits: List of 3-5 key benefits
6. call_to_action: Suggested CTA text

Respond ONLY with valid JSON."""

    try:
        # openai>=1.0 removed openai.ChatCompletion, so this uses the client interface
        client = openai.OpenAI(api_key=api_key)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a professional marketing analyst who extracts key information from websites."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=1000
        )
        
        # Parse the JSON response
        result = json.loads(response.choices[0].message.content)
        return result
        
    except Exception as e:
        print(f"LLM analysis error: {e}")
        return None

Why This Approach Works:

  • Structured output - JSON makes it easy to work with the results
  • Low temperature - More deterministic, repeatable output (it does not by itself guarantee factual accuracy)
  • Clear instructions - Specific format requirements
  • Token limits - max_tokens caps the response length, which bounds the cost of each call
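
One caveat with the "Respond ONLY with valid JSON" instruction: models sometimes wrap the reply in a Markdown code fence or add a sentence around it, and a bare json.loads call then fails. The sketch below is one possible way to make parsing more forgiving. The helper name parse_llm_json is hypothetical and not part of any library.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker, built here to keep this listing readable

def parse_llm_json(raw_text):
    """Best-effort parse of a model reply that should contain one JSON object."""
    text = raw_text.strip()
    # Unwrap a fenced reply (optionally tagged "json") if the model added one
    pattern = "^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + "$"
    fenced = re.match(pattern, text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span in case prose surrounds the JSON
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None
```

In analyze_with_llm, this could stand in for the direct json.loads call; a None result would then be treated like any other analysis failure.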

Step 4: Generating the Brochure

Now let's create a beautifully formatted brochure:

def generate_brochure_markdown(analysis):
    """Generate a professional brochure in Markdown format."""
    
    brochure = f"""# {analysis['company_name']}

## {analysis['tagline']}

---

## About Us

{analysis['overview']}

---

## What We Offer

"""
    
    # Add key features
    for i, feature in enumerate(analysis['key_features'], 1):
        brochure += f"{i}. **{feature}**\n"
    
    brochure += "\n---\n\n## Why Choose Us?\n\n"
    
    # Add benefits
    for benefit in analysis['benefits']:
        brochure += f"✓ {benefit}\n"
    
    brochure += f"\n---\n\n## {analysis['call_to_action']}\n"
    
    return brochure
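
The architecture also lists HTML as an output format. As a rough sketch, the same analysis dictionary could be rendered to an HTML fragment with only the standard library. The function name generate_brochure_html is an assumption for illustration, and the markup is deliberately minimal.

```python
from html import escape

def generate_brochure_html(analysis):
    """Render the analysis dict (same keys as the Markdown version) as an HTML fragment."""
    # Escape every model-supplied string so a stray < or & cannot break the markup
    features = "\n".join(
        f"  <li><strong>{escape(item)}</strong></li>"
        for item in analysis["key_features"]
    )
    benefits = "\n".join(
        f"  <li>{escape(item)}</li>" for item in analysis["benefits"]
    )
    return (
        f"<h1>{escape(analysis['company_name'])}</h1>\n"
        f"<h2>{escape(analysis['tagline'])}</h2>\n"
        f"<h2>About Us</h2>\n<p>{escape(analysis['overview'])}</p>\n"
        f"<h2>What We Offer</h2>\n<ol>\n{features}\n</ol>\n"
        f"<h2>Why Choose Us?</h2>\n<ul>\n{benefits}\n</ul>\n"
        f"<h2>{escape(analysis['call_to_action'])}</h2>\n"
    )
```

Escaping matters here because the strings come from a model that was fed scraped text, so they should be treated as untrusted input.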

Step 5: Complete Working Example

Here's the complete script that ties everything together:

#!/usr/bin/env python3
"""
Brochure Generator from URL using LLM
"""

import requests
from bs4 import BeautifulSoup
import openai
import json
import sys

class BrochureGenerator:
    def __init__(self, api_key):
        self.api_key = api_key
        openai.api_key = api_key
    
    def generate(self, url):
        """Main method to generate brochure from URL."""
        print(f"Fetching content from {url}...")
        soup = self.fetch_website(url)
        
        if not soup:
            return None
        
        print("Extracting structured content...")
        content = self.extract_content(soup)
        
        print("Analyzing with LLM...")
        analysis = self.analyze_with_llm(content)
        
        if not analysis:
            return None
        
        print("Generating brochure...")
        brochure = self.generate_markdown(analysis)
        
        return brochure
    
    def fetch_website(self, url):
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            }
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'html.parser')
            
            for tag in soup(['script', 'style', 'nav', 'footer']):
                tag.decompose()
            
            return soup
        except Exception as e:
            print(f"Error: {e}")
            return None
    
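    def extract_content(self, soup):
        # Delegates to extract_structured_content() from Step 2,
        # which must be defined in this file above the class
        return extract_structured_content(soup)
    
    def analyze_with_llm(self, content):
        # Delegates to the module-level analyze_with_llm() from Step 3
        return analyze_with_llm(content, self.api_key)
    
    def generate_markdown(self, analysis):
        # Delegates to generate_brochure_markdown() from Step 4
        return generate_brochure_markdown(analysis)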

# Usage
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python brochure_generator.py <url>")
        sys.exit(1)
    
    url = sys.argv[1]
    # Prefer an environment variable over a hardcoded key
    import os
    api_key = os.environ.get("OPENAI_API_KEY", "your-openai-api-key")
    
    generator = BrochureGenerator(api_key)
    brochure = generator.generate(url)
    
    if brochure:
        # Save to file
        filename = "brochure.md"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(brochure)
        print(f"Brochure saved to {filename}")
    else:
        print("Failed to generate brochure")

Real-World Example

Let's say we run this on a SaaS company's website:

python brochure_generator.py https://example-saas.com

Output brochure.md:

# CloudMaster Analytics

## Transform Your Data Into Actionable Insights

---

## About Us

CloudMaster Analytics is a leading cloud-based analytics platform that helps 
businesses make data-driven decisions. We combine powerful data visualization 
with AI-powered insights to turn complex datasets into clear, actionable intelligence.

---

## What We Offer

1. **Real-time Dashboard Analytics**
2. **AI-Powered Predictive Models**
3. **Custom Report Builder**
4. **Multi-source Data Integration**
5. **Collaborative Workspace**
6. **Enterprise-grade Security**

---

## Why Choose Us?

✓ 99.9% uptime guarantee with 24/7 support
✓ Seamless integration with 100+ data sources
✓ Advanced AI that learns from your data patterns
✓ GDPR and SOC 2 compliant
✓ Trusted by 10,000+ businesses worldwide

---

## Start Your Free 14-Day Trial Today!

Advanced Enhancements

1. Add PDF Export

from markdown2 import markdown
from weasyprint import HTML

def export_to_pdf(markdown_content, output_file):
    """Convert Markdown to PDF."""
    html_content = markdown(markdown_content)
    
    # Add some CSS styling
    styled_html = f"""
    <html>
    <head>
        <style>
            body {{ font-family: Arial, sans-serif; margin: 40px; }}
            h1 {{ color: #2c3e50; }}
            h2 {{ color: #3498db; border-bottom: 2px solid #3498db; }}
        </style>
    </head>
    <body>
        {html_content}
    </body>
    </html>
    """
    
    HTML(string=styled_html).write_pdf(output_file)

2. Add Image Extraction

from urllib.parse import urljoin

def extract_images(soup, url):
    """Extract main images from the website."""
    images = []
    
    for img in soup.find_all('img'):
        src = img.get('src')
        alt = img.get('alt', '')
        
        if src:
            # urljoin resolves relative paths and leaves absolute URLs unchanged
            images.append({'url': urljoin(url, src), 'alt': alt})
    
    return images[:3]  # Keep the first 3 images in document order

3. Multi-page Crawling

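from urllib.parse import urljoin, urlparse

def crawl_multiple_pages(base_url, max_pages=5):
    """Crawl multiple pages for comprehensive content."""
    visited = set()
    to_visit = [base_url]
    all_content = []
    base_host = urlparse(base_url).netloc
    
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        
        if url in visited:
            continue
        
        visited.add(url)
        # Reuses the helpers from Steps 1 and 2
        soup = fetch_website_content(url)
        
        if soup:
            all_content.append(extract_structured_content(soup))
            
            # Find internal links. Note: fetch_website_content() strips nav,
            # header, and footer tags, so only links in the page body remain.
            for link in soup.find_all('a', href=True):
                # urljoin resolves relative hrefs against the current page;
                # the fragment is dropped so /about and /about#team count once
                full_url = urljoin(url, link['href']).split('#')[0]
                if urlparse(full_url).netloc == base_host and full_url not in visited:
                    to_visit.append(full_url)
    
    return all_content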

Best Practices & Tips

1. Handle Rate Limits

  • Add delays between requests
  • Respect robots.txt
  • Cache API responses to avoid repeated calls
  • Use environment variables for API keys
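
The first two points can be sketched with the standard library's urllib.robotparser. This is a minimal illustration rather than a full crawler policy. The bot name in USER_AGENT and the one-second default delay are assumptions, and load_robots makes a network request.

```python
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "brochure-generator"  # hypothetical bot name, pick your own

def load_robots(base_url):
    """Download and parse the site's robots.txt (this makes a network call)."""
    parser = RobotFileParser()
    parser.set_url(urljoin(base_url, "/robots.txt"))
    parser.read()
    return parser

def allowed_urls(parser, urls, default_delay=1.0, sleep=time.sleep):
    """Yield only the URLs robots.txt permits, pausing between consecutive ones."""
    # Honor Crawl-delay when the site declares one, else use the default
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    first = True
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # disallowed for our user agent
        if not first:
            sleep(delay)
        first = False
        yield url
```

A crawl loop could iterate over allowed_urls(load_robots(base_url), candidate_urls) and fetch each yielded URL in turn.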

2. Error Handling

  • Validate URLs before processing
  • Handle timeouts gracefully
  • Provide fallback content if LLM fails
  • Log errors for debugging
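
Two of these points are easy to sketch without extra dependencies: URL validation and a fallback when the LLM call fails. Both helper names below are hypothetical, and the fallback simply reuses the dictionary shape produced in Step 2, so treat it as a starting point rather than a finished design.

```python
from urllib.parse import urlparse

def is_valid_http_url(url):
    """Return True for absolute http(s) URLs that include a hostname."""
    if not isinstance(url, str):
        return False
    try:
        parts = urlparse(url.strip())
    except ValueError:  # raised for some malformed inputs, e.g. bad IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def fallback_analysis(content):
    """Build a bare-bones analysis dict from scraped content, without the LLM."""
    paragraphs = content.get("paragraphs", [])
    headings = [h["text"] for h in content.get("headings", [])]
    return {
        "company_name": content.get("title") or "Unknown company",
        "tagline": headings[0] if headings else "",
        "overview": " ".join(paragraphs[:2]),
        "key_features": headings[1:7],  # remaining headings stand in for features
        "benefits": [],
        "call_to_action": "Visit the website to learn more",
    }
```

The fallback output is rough, but it keeps the pipeline producing something reviewable instead of stopping with no brochure at all.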

3. Cost Optimization

  • Limit token usage by summarizing content first
  • Use GPT-3.5-turbo for initial analysis, GPT-4 for final polish
  • Cache results for frequently accessed URLs
  • Implement retry logic with exponential backoff
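
Caching and retrying can be sketched as two small helpers. The names, the cache directory, and the delay schedule are all assumptions for illustration. Also note that analyze_with_llm from Step 3 catches exceptions and returns None, so the retry wrapper only helps around a call that actually raises on failure.

```python
import hashlib
import json
import time
from pathlib import Path

def cached_analysis(url, analyze, cache_dir="brochure_cache"):
    """Return the cached analysis for url, calling analyze() only on a cache miss."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_path = Path(cache_dir) / f"{key}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    result = analyze()
    if result is not None:  # never cache a failed analysis
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(result), encoding="utf-8")
    return result

def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)
```

In practice you would probably narrow the except clause to rate-limit and timeout errors, since retrying on an authentication failure only wastes time.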

4. Quality Improvements

  • Add human review step before finalizing
  • Use few-shot examples in prompts for better results
  • Implement content validation rules
  • Allow manual override of extracted information
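
As one example of content validation rules, the check below verifies that the analysis has the fields and item counts requested in the Step 3 prompt. The function name and the exact rules are assumptions; a real project would likely add its own checks, such as maximum lengths or banned phrases.

```python
EXPECTED_FIELDS = {
    "company_name": str,
    "tagline": str,
    "overview": str,
    "key_features": list,
    "benefits": list,
    "call_to_action": str,
}

def validate_analysis(analysis):
    """Return a list of problems; an empty list means every rule passed."""
    if not isinstance(analysis, dict):
        return ["analysis is not a JSON object"]
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        value = analysis.get(field)
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
        elif not value:
            problems.append(f"{field}: empty")
    # Item counts mirror the ranges requested in the Step 3 prompt
    for field, low, high in (("key_features", 4, 6), ("benefits", 3, 5)):
        items = analysis.get(field)
        if isinstance(items, list) and items and not low <= len(items) <= high:
            problems.append(f"{field}: expected {low}-{high} items, got {len(items)}")
    return problems
```

A non-empty result is a natural trigger for the human review step, or for one more LLM call that includes the list of problems in the prompt.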

Use Cases

This technique is incredibly versatile:

  • Sales teams - Quickly create prospect briefs
  • Marketing agencies - Generate client presentations
  • Consultants - Research and summarize competitor offerings
  • Product teams - Analyze feature sets of competitors
  • Recruiters - Understand company culture and benefits

Limitations & Considerations

Technical Limitations:

  • JavaScript-rendered content - May need Selenium/Playwright
  • Authentication required - Can't access logged-in content
  • Dynamic content - APIs or AJAX-loaded data might be missed
  • Token limits - Large websites may exceed context window
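
For the token-limit problem, a simple mitigation is to cap how much scraped text goes into the prompt. The sketch below uses a character budget as a stand-in for tokens. That is only a rough proxy (a real tokenizer would be more accurate), and the default of 12,000 characters is an arbitrary assumption.

```python
def trim_to_budget(paragraphs, max_chars=12000):
    """Keep whole paragraphs, in order, until the character budget is used up."""
    kept, used = [], 0
    for text in paragraphs:
        cost = len(text) + 1  # +1 for the newline that joins paragraphs
        if used + cost > max_chars:
            break
        kept.append(text)
        used += cost
    return kept
```

This could replace the fixed content['paragraphs'][:15] slice in Step 3, so that a page with a few very long paragraphs does not blow past the budget.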

Ethical Considerations:

  • Respect copyright and terms of service
  • Don't scrape excessively or cause server load
  • Always attribute content to original source
  • Use for internal research, not republishing

Future Enhancements

Potential improvements for this system:

  1. Multi-language support - Translate brochures automatically
  2. Brand customization - Apply company colors and logos
  3. Template variations - Different layouts for different industries
  4. Competitive analysis - Compare multiple competitors side-by-side
  5. Auto-updating - Regenerate when website changes
  6. Integration - Connect with CRM or marketing tools

Key Takeaways

  • LLMs can dramatically accelerate content creation workflows
  • Combining web scraping with AI produces powerful automation
  • Structured prompts yield consistent, usable results
  • Start simple, then add sophistication based on needs
  • Always include human oversight for quality control
  • Be mindful of costs, rate limits, and ethical considerations

Conclusion

Automating brochure creation from URLs using LLMs is a game-changer for anyone who regularly needs to understand and present information about businesses. What used to take hours of manual work now takes minutes, and the result is usually a consistent first draft that needs only a quick human review.

The key is combining the right tools—web scraping for data collection, LLMs for intelligent analysis, and good old-fashioned programming for orchestration. Start with the basic implementation I've shown here, then customize it for your specific needs.

I've used variations of this approach for everything from competitive analysis reports to client onboarding materials, and it's saved countless hours while improving consistency. Give it a try and let me know what you create!

Have you automated content creation with LLMs? What challenges did you face? Let's connect and share experiences!