Creating a Brochure from a URL Using an LLM

Ever needed to create a professional brochure from a company's website but dreaded the manual copy-paste process? Large Language Models (LLMs) have made this task remarkably simple. In this guide, I'll show you how to automatically extract website content and generate a polished brochure using AI.

Why This Matters

In my consulting work, I often need to quickly understand a client's business and create presentation materials. Manually browsing websites, extracting key information, and formatting brochures is time-consuming. With LLMs, we can:

Save hours of manual work - Automate content extraction and formatting
Ensure consistency - Use the same structure across all brochures
Focus on what matters - Spend time on strategy, not copy-paste
Generate multiple versions - Easily create variations for different audiences

The Architecture

Here's the high-level approach we'll use:

Fetch website content - Extract HTML from the URL
Parse and clean - Remove scripts, styles, and navigation
Extract key information - Use an LLM to identify important content
Generate brochure - Format into a professional layout
Export - Output as PDF, Markdown, or HTML

Tools We'll Use

Python - Our programming language
BeautifulSoup - HTML parsing and content extraction
OpenAI API - GPT-4 for content analysis and generation
Requests - HTTP requests to fetch web pages
Markdown/HTML - Output formats

Step 1: Fetching Website Content

First, we need to retrieve the HTML content from the URL:

import requests
from bs4 import BeautifulSoup

def fetch_website_content(url):
    """Fetch and parse website content."""
    try:
        # Send GET request with headers to mimic a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Parse HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Remove unwanted elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        
        return soup
    
    except requests.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

Key Points:

User-Agent header - Some websites block the default python-requests User-Agent
Timeout - Prevent hanging on slow websites
Remove noise - Strip scripts, styles, and navigation elements

Step 2: Extracting Structured Content

Now let's extract the meaningful text while preserving structure:

def extract_structured_content(soup):
    """Extract main content with structure preserved."""
    content = {
        'title': '',
        'headings': [],
        'paragraphs': [],
        'lists': []
    }
    
    # Get page title
    title_tag = soup.find('title')
    content['title'] = title_tag.get_text().strip() if title_tag else ''
    
    # Extract headings
    for heading in soup.find_all(['h1', 'h2', 'h3']):
        content['headings'].append({
            'level': heading.name,
            'text': heading.get_text().strip()
        })
    
    # Extract paragraphs
    for p in soup.find_all('p'):
        text = p.get_text().strip()
        if len(text) > 20:  # Filter out short snippets
            content['paragraphs'].append(text)
    
    # Extract lists
    for ul in soup.find_all(['ul', 'ol']):
        items = [li.get_text().strip() for li in ul.find_all('li')]
        if items:
            content['lists'].append(items)
    
    return content

Step 3: Using an LLM to Analyze Content

Here's where the magic happens. We'll use GPT-4 to understand the website and extract key information:

import openai
import json

def analyze_with_llm(content, api_key):
    """Use LLM to analyze and extract key information."""
    openai.api_key = api_key
    
    # Prepare the content for analysis
    text_content = f"""
Website Title: {content['title']}

Headings:
{chr(10).join([f"{h['level']}: {h['text']}" for h in content['headings'][:10]])}

Main Content:
{chr(10).join(content['paragraphs'][:15])}
"""
    
    # Create the prompt
    prompt = f"""Analyze this website content and extract key information for a professional brochure.

{text_content}

Please provide the following in JSON format:
1. company_name: The company or website name
2. tagline: A compelling one-line description
3. overview: 2-3 sentence overview of what they do
4. key_features: List of 4-6 main features/services
5. benefits: List of 3-5 key benefits
6. call_to_action: Suggested CTA text

Respond ONLY with valid JSON."""

    try:
        # openai>=1.0 removed openai.ChatCompletion; calls go through a client object
        client = openai.OpenAI(api_key=api_key)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a professional marketing analyst who extracts key information from websites."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=1000
        )
        
        # Parse the JSON response (raises if the model returned anything but JSON)
        result = json.loads(response.choices[0].message.content)
        return result
        
    except Exception as e:
        print(f"LLM analysis error: {e}")
        return None
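
Even with "Respond ONLY with valid JSON" in the prompt, models sometimes wrap the answer in a Markdown code fence or add a sentence around it, and json.loads then raises. Here is a minimal sketch of a more tolerant parser. The name parse_llm_json is my own hypothetical helper, not part of any library, and it only covers the two failure modes just described.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this listing clean

def parse_llm_json(raw_text):
    """Best-effort extraction of a JSON object from an LLM reply.

    Hypothetical helper: returns a dict, or None if nothing parseable is found.
    """
    text = raw_text.strip()

    # Strip a surrounding Markdown fence, with or without a "json" label
    fenced = re.match(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)

    try:
        result = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, if there is one
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            result = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None

    # The brochure code expects an object, not a list or a bare string
    return result if isinstance(result, dict) else None
```

In analyze_with_llm, this could stand in for the direct json.loads call on response.choices[0].message.content.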

Why This Approach Works:

Structured output - JSON makes it easy to work with the results
Low temperature - More deterministic output that stays closer to the source text
Clear instructions - Specific format requirements
Token limits - Capping the input (first 10 headings, first 15 paragraphs) and max_tokens keeps costs bounded
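
One caveat on the token point: slicing to the first 15 paragraphs caps the number of paragraphs, not their length, so a page with a few very long paragraphs can still produce a large prompt. A rough sketch of a character budget is below. The helper name fit_to_budget and the 6000-character default are assumptions of mine for illustration, and characters are only a loose proxy for tokens.

```python
def fit_to_budget(paragraphs, max_chars=6000):
    """Keep whole paragraphs, in order, until the character budget is used up.

    Hypothetical helper; max_chars=6000 is an arbitrary illustrative default.
    """
    kept = []
    used = 0
    for text in paragraphs:
        cost = len(text) + 1  # +1 for the newline that joins paragraphs
        if used + cost > max_chars:
            break
        kept.append(text)
        used += cost
    return kept
```

It would slot into the prompt construction as chr(10).join(fit_to_budget(content['paragraphs'])) in place of the fixed slice.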

Step 4: Generating the Brochure

Now let's create a beautifully formatted brochure:

def generate_brochure_markdown(analysis):
    """Generate a professional brochure in Markdown format."""
    
    brochure = f"""# {analysis['company_name']}

## {analysis['tagline']}

---

## About Us

{analysis['overview']}

---

## What We Offer

"""
    
    # Add key features
    for i, feature in enumerate(analysis['key_features'], 1):
        brochure += f"{i}. **{feature}**\n"
    
    brochure += "\n---\n\n## Why Choose Us?\n\n"
    
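    # Add benefits (two trailing spaces force a Markdown line break;
    # without them consecutive lines collapse into one paragraph)
    for benefit in analysis['benefits']:
        brochure += f"✓ {benefit}  \n"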
    
    brochure += f"\n---\n\n## {analysis['call_to_action']}\n"
    
    return brochure
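
Note that generate_brochure_markdown indexes the analysis dict directly, so a reply that is missing a key raises KeyError. One way to guard against that is to normalize the dict first. The sketch below is an assumption of mine rather than part of the original pipeline: the helper name normalize_analysis and the placeholder defaults are hypothetical.

```python
# Placeholder defaults; the wording here is illustrative only
DEFAULTS = {
    "company_name": "Untitled Company",
    "tagline": "",
    "overview": "",
    "key_features": [],
    "benefits": [],
    "call_to_action": "Get in touch",
}

def normalize_analysis(analysis):
    """Return a new dict with every expected key present and well typed.

    Expects a dict (or None) as returned by analyze_with_llm.
    """
    source = analysis or {}
    normalized = {}
    for key, default in DEFAULTS.items():
        value = source.get(key, default)
        if isinstance(default, list):
            if isinstance(value, str):
                value = [value]  # accept a single string where a list was expected
            elif not isinstance(value, list):
                value = []
            # Drop empty entries and stray whitespace
            value = [str(item).strip() for item in value if str(item).strip()]
        else:
            value = str(value).strip() if value is not None else ""
            value = value or default
        normalized[key] = value
    return normalized
```

Calling generate_brochure_markdown(normalize_analysis(analysis)) then produces a brochure with empty sections instead of crashing.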

Step 5: Complete Working Example

Here's the complete script that ties everything together:

#!/usr/bin/env python3
"""
Brochure Generator from URL using LLM
"""

import requests
from bs4 import BeautifulSoup
import openai
import json
import sys

class BrochureGenerator:
    def __init__(self, api_key):
        self.api_key = api_key
        openai.api_key = api_key
    
    def generate(self, url):
        """Main method to generate brochure from URL."""
        print(f"Fetching content from {url}...")
        soup = self.fetch_website(url)
        
        if not soup:
            return None
        
        print("Extracting structured content...")
        content = self.extract_content(soup)
        
        print("Analyzing with LLM...")
        analysis = self.analyze_with_llm(content)
        
        if not analysis:
            return None
        
        print("Generating brochure...")
        brochure = self.generate_markdown(analysis)
        
        return brochure
    
    def fetch_website(self, url):
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            }
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'html.parser')
            
            for tag in soup(['script', 'style', 'nav', 'footer']):
                tag.decompose()
            
            return soup
        except Exception as e:
            print(f"Error: {e}")
            return None
    
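    def extract_content(self, soup):
        """Collect the title, headings, paragraphs, and lists (see Step 2)."""
        title_tag = soup.find('title')
        return {
            'title': title_tag.get_text().strip() if title_tag else '',
            'headings': [
                {'level': h.name, 'text': h.get_text().strip()}
                for h in soup.find_all(['h1', 'h2', 'h3'])
            ],
            'paragraphs': [
                p.get_text().strip() for p in soup.find_all('p')
                if len(p.get_text().strip()) > 20  # Filter out short snippets
            ],
            'lists': [
                [li.get_text().strip() for li in lst.find_all('li')]
                for lst in soup.find_all(['ul', 'ol'])
                if lst.find_all('li')
            ],
        }
    
    def analyze_with_llm(self, content):
        """Ask the model for the brochure fields as JSON (see Step 3)."""
        headings = "\n".join(
            f"{h['level']}: {h['text']}" for h in content['headings'][:10]
        )
        paragraphs = "\n".join(content['paragraphs'][:15])
        prompt = f"""Analyze this website content and extract key information for a professional brochure.

Website Title: {content['title']}

Headings:
{headings}

Main Content:
{paragraphs}

Please provide the following in JSON format:
1. company_name: The company or website name
2. tagline: A compelling one-line description
3. overview: 2-3 sentence overview of what they do
4. key_features: List of 4-6 main features/services
5. benefits: List of 3-5 key benefits
6. call_to_action: Suggested CTA text

Respond ONLY with valid JSON."""
        try:
            # openai>=1.0 interface; older releases used openai.ChatCompletion
            client = openai.OpenAI(api_key=self.api_key)
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "You are a professional marketing analyst who extracts key information from websites."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.3,
                max_tokens=1000
            )
            return json.loads(response.choices[0].message.content)
        except Exception as e:
            print(f"LLM analysis error: {e}")
            return None
    
    def generate_markdown(self, analysis):
        """Format the analysis as a Markdown brochure (see Step 4)."""
        lines = [
            f"# {analysis['company_name']}", "",
            f"## {analysis['tagline']}", "", "---", "",
            "## About Us", "", analysis['overview'], "", "---", "",
            "## What We Offer", "",
        ]
        lines += [f"{i}. **{feature}**" for i, feature in enumerate(analysis['key_features'], 1)]
        lines += ["", "---", "", "## Why Choose Us?", ""]
        # Two trailing spaces force a Markdown line break between benefits
        lines += [f"✓ {benefit}  " for benefit in analysis['benefits']]
        lines += ["", "---", "", f"## {analysis['call_to_action']}", ""]
        return "\n".join(lines)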

# Usage
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python brochure_generator.py <url>")
        sys.exit(1)
    
    url = sys.argv[1]
    api_key = "your-openai-api-key"  # Or use environment variable
    
    generator = BrochureGenerator(api_key)
    brochure = generator.generate(url)
    
    if brochure:
        # Save to file
        filename = "brochure.md"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(brochure)
        print(f"Brochure saved to {filename}")
    else:
        print("Failed to generate brochure")

Real-World Example

Let's say we run this on a SaaS company's website:

python brochure_generator.py https://example-saas.com

Output brochure.md (illustrative):

# CloudMaster Analytics

## Transform Your Data Into Actionable Insights

---

## About Us

CloudMaster Analytics is a leading cloud-based analytics platform that helps 
businesses make data-driven decisions. We combine powerful data visualization 
with AI-powered insights to turn complex datasets into clear, actionable intelligence.

---

## What We Offer

1. **Real-time Dashboard Analytics**
2. **AI-Powered Predictive Models**
3. **Custom Report Builder**
4. **Multi-source Data Integration**
5. **Collaborative Workspace**
6. **Enterprise-grade Security**

---

## Why Choose Us?

✓ 99.9% uptime guarantee with 24/7 support
✓ Seamless integration with 100+ data sources
✓ Advanced AI that learns from your data patterns
✓ GDPR and SOC 2 compliant
✓ Trusted by 10,000+ businesses worldwide

---

## Start Your Free 14-Day Trial Today!

Advanced Enhancements

1. Add PDF Export

from markdown2 import markdown
from weasyprint import HTML

def export_to_pdf(markdown_content, output_file):
    """Convert Markdown to PDF."""
    html_content = markdown(markdown_content)
    
    # Add some CSS styling
    styled_html = f"""
    <html>
    <head>
        <style>
            body {{ font-family: Arial, sans-serif; margin: 40px; }}
            h1 {{ color: #2c3e50; }}
            h2 {{ color: #3498db; border-bottom: 2px solid #3498db; }}
        </style>
    </head>
    <body>
        {html_content}
    </body>
    </html>
    """
    
    HTML(string=styled_html).write_pdf(output_file)
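
If you only need the HTML output and would rather skip the extra dependencies, the analysis dict can also be rendered directly with the standard library. This is a sketch under my own assumptions: generate_brochure_html is a hypothetical helper, the markup is deliberately bare, and it expects the same six keys the LLM prompt asks for. The values are escaped because they originate from a third-party website.

```python
from html import escape

def generate_brochure_html(analysis):
    """Render the analysis dict as a standalone HTML page (hypothetical helper)."""
    features = "\n".join(
        f"      <li>{escape(item)}</li>" for item in analysis["key_features"]
    )
    benefits = "\n".join(
        f"      <li>{escape(item)}</li>" for item in analysis["benefits"]
    )
    return f"""<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>{escape(analysis['company_name'])}</title>
  </head>
  <body>
    <h1>{escape(analysis['company_name'])}</h1>
    <p><em>{escape(analysis['tagline'])}</em></p>
    <h2>About Us</h2>
    <p>{escape(analysis['overview'])}</p>
    <h2>What We Offer</h2>
    <ol>
{features}
    </ol>
    <h2>Why Choose Us?</h2>
    <ul>
{benefits}
    </ul>
    <h2>{escape(analysis['call_to_action'])}</h2>
  </body>
</html>
"""
```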

2. Add Image Extraction

from urllib.parse import urljoin

def extract_images(soup, url):
    """Extract main images from the website."""
    images = []
    
    for img in soup.find_all('img'):
        src = img.get('src')
        alt = img.get('alt', '')
        
        if src:
            # urljoin resolves relative paths and leaves absolute URLs unchanged
            src = urljoin(url, src)
            images.append({'url': src, 'alt': alt})
    
    return images[:3]  # Keep the first 3 images in document order
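
Image src values come in several shapes: page-relative, site-relative, protocol-relative, and inline data: URIs that are not links at all. The sketch below shows how urljoin treats each case and filters out the ones that cannot be downloaded. The helper name resolve_image_url is hypothetical.

```python
from urllib.parse import urljoin, urlparse

def resolve_image_url(page_url, src):
    """Return an absolute http(s) URL for an <img> src, or None if unusable."""
    if not src or not src.strip():
        return None
    absolute = urljoin(page_url, src.strip())
    # Inline data: URIs and other schemes are not downloadable image links
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute

base = "https://example.com/products/index.html"
print(resolve_image_url(base, "img/logo.png"))             # relative to the page
print(resolve_image_url(base, "/img/logo.png"))            # relative to the site root
print(resolve_image_url(base, "//cdn.example.com/a.png"))  # inherits the page's scheme
```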

3. Multi-page Crawling

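from urllib.parse import urljoin

def crawl_multiple_pages(base_url, max_pages=5):
    """Crawl multiple pages for comprehensive content."""
    visited = set()
    to_visit = [base_url]
    all_content = []
    
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        
        if url in visited:
            continue
        
        visited.add(url)
        # Reuse the helpers defined in Steps 1 and 2
        soup = fetch_website_content(url)
        
        if soup:
            all_content.append(extract_structured_content(soup))
            
            # Find internal links (site-relative paths only).
            # Note: fetch_website_content() has already removed nav, header,
            # and footer, so only links in the page body are discovered.
            for link in soup.find_all('a', href=True):
                href = link['href']
                if href.startswith('/') and not href.startswith('//'):
                    # urljoin avoids doubled slashes when base_url ends with '/'
                    full_url = urljoin(base_url, href)
                    if full_url not in visited:
                        to_visit.append(full_url)
    
    return all_content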
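
Following only hrefs that start with '/' skips page-relative links such as "team" and absolute links back to the same host, and "#section" fragments can make one page look like several. A slightly more general link filter might look like the sketch below. The helper name normalize_internal_link is hypothetical, and the exact-host comparison is a simplifying assumption (it treats www.example.com and example.com as different sites).

```python
from urllib.parse import urldefrag, urljoin, urlparse

def normalize_internal_link(page_url, href):
    """Resolve href against page_url; return it only if it stays on the same host."""
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    target = urlparse(absolute)
    if target.scheme not in ("http", "https"):
        return None  # skips mailto:, tel:, javascript: and similar
    if target.netloc != urlparse(page_url).netloc:
        return None  # external site
    return absolute
```

Inside the crawl loop, each href would go through this function with the current page URL, and only non-None results would be queued.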

Best Practices & Tips

1. Handle Rate Limits

Add delays between requests
Respect robots.txt
Cache API responses to avoid repeated calls
Use environment variables for API keys
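
Two of these are easy to sketch with the standard library alone. The first function checks a URL against robots.txt rules that have already been downloaded as text, and the second reads the API key from the environment. Both helper names are my own, and "BrochureBot" in the usage line is a made-up user agent. OPENAI_API_KEY is the variable name the openai client looks for by default.

```python
import os
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def get_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment instead of hardcoding it."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "BrochureBot", "https://example.com/private/report"))
```

For the delay between requests, a time.sleep call of a second or so inside the crawl loop is usually enough for small jobs.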

2. Error Handling

Validate URLs before processing
Handle timeouts gracefully
Provide fallback content if the LLM call fails or returns invalid JSON
Log errors for debugging
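
For the URL validation step, a small check before calling requests.get catches the most common mistakes, such as a missing scheme. This is a minimal sketch; is_valid_url is a hypothetical helper, and it only checks the shape of the URL, not whether the host exists.

```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Return True for absolute http(s) URLs with a host, False otherwise."""
    if not isinstance(url, str):
        return False
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse raises on some malformed inputs, e.g. a broken IPv6 host
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```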

3. Cost Optimization

Limit token usage by summarizing content first
Use GPT-3.5-turbo for initial analysis, GPT-4 for final polish
Cache results for frequently accessed URLs
Implement retry logic with exponential backoff
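
The retry idea can be sketched in a few lines. The wrapper below is a hypothetical helper of mine: it retries any callable, doubling the wait after each failure. The sleep parameter exists only so the delay can be swapped out in tests.

```python
import time

def call_with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt and try again.

    Hypothetical helper. In real use, catch only the errors worth retrying
    (timeouts, HTTP 429) rather than every Exception.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

A call such as call_with_retries(lambda: requests.get(url, timeout=10)) would then wait 1, 2, and 4 seconds between the four attempts.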

4. Quality Improvements

Add human review step before finalizing
Use few-shot examples in prompts for better results
Implement content validation rules
Allow manual override of extracted information
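
As an illustration of the few-shot tip, the messages list can carry one worked example before the real request, so the model sees the exact JSON shape expected. Everything about the example pair below (the "Acme Bikes" company and its details) is invented for illustration, and build_messages is a hypothetical helper.

```python
import json

# Invented example pair, used only to show the model the expected shape
FEW_SHOT_INPUT = (
    "Website Title: Acme Bikes\n\n"
    "Main Content:\nWe build and repair city bikes in our Leeds workshop."
)
FEW_SHOT_OUTPUT = {
    "company_name": "Acme Bikes",
    "tagline": "City bikes built to last",
    "overview": "Acme Bikes builds and repairs city bikes in its Leeds workshop.",
    "key_features": ["Custom builds", "Repairs", "Servicing", "Parts"],
    "benefits": ["Local workshop", "Durable bikes", "Fast turnaround"],
    "call_to_action": "Book a workshop visit",
}

def build_messages(prompt):
    """Return chat messages with one worked example before the real prompt."""
    return [
        {"role": "system", "content": "You are a professional marketing analyst who extracts key information from websites."},
        {"role": "user", "content": FEW_SHOT_INPUT},
        {"role": "assistant", "content": json.dumps(FEW_SHOT_OUTPUT)},
        {"role": "user", "content": prompt},
    ]
```

The returned list can be passed as the messages argument of the chat completion call in place of the two-message list from Step 3.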

Use Cases

This technique is incredibly versatile:

Sales teams - Quickly create prospect briefs
Marketing agencies - Generate client presentations
Consultants - Research and summarize competitor offerings
Product teams - Analyze feature sets of competitors
Recruiters - Understand company culture and benefits

Limitations & Considerations

Technical Limitations:

JavaScript-rendered content - May need Selenium/Playwright
Authentication required - Can't access logged-in content
Dynamic content - APIs or AJAX-loaded data might be missed
Token limits - Large websites may exceed context window

Ethical Considerations:

Respect copyright and terms of service
Don't scrape excessively or cause server load
Always attribute content to original source
Use for internal research, not republishing

Future Enhancements

Potential improvements for this system:

Multi-language support - Translate brochures automatically
Brand customization - Apply company colors and logos
Template variations - Different layouts for different industries
Competitive analysis - Compare multiple competitors side-by-side
Auto-updating - Regenerate when website changes
Integration - Connect with CRM or marketing tools

Key Takeaways

LLMs can dramatically accelerate content creation workflows
Combining web scraping with AI produces powerful automation
Structured prompts yield consistent, usable results
Start simple, then add sophistication based on needs
Always include human oversight for quality control
Be mindful of costs, rate limits, and ethical considerations

Conclusion

Automating brochure creation from URLs using LLMs is a game-changer for anyone who regularly needs to understand and present information about businesses. What used to take hours of manual work now takes minutes, and the result is a consistent first draft that usually needs only a light human review.

The key is combining the right tools—web scraping for data collection, LLMs for intelligent analysis, and good old-fashioned programming for orchestration. Start with the basic implementation I've shown here, then customize it for your specific needs.

I've used variations of this approach for everything from competitive analysis reports to client onboarding materials, and it's saved countless hours while improving consistency. Give it a try and let me know what you create!

Have you automated content creation with LLMs? What challenges did you face? Let's connect and share experiences!