Ever needed to create a professional brochure from a company's website but dreaded the manual copy-paste process? Large Language Models (LLMs) have made this task remarkably simple. In this guide, I'll show you how to automatically extract website content and generate a polished brochure using AI.
Why This Matters
In my consulting work, I often need to quickly understand a client's business and create presentation materials. Manually browsing websites, extracting key information, and formatting brochures is time-consuming. With LLMs, we can:
- Save hours of manual work - Automate content extraction and formatting
- Ensure consistency - Use the same structure across all brochures
- Focus on what matters - Spend time on strategy, not copy-paste
- Generate multiple versions - Easily create variations for different audiences
The Architecture
Here's the high-level approach we'll use:
- Fetch website content - Extract HTML from the URL
- Parse and clean - Remove scripts, styles, and navigation
- Extract key information - Use LLM to identify important content
- Generate brochure - Format into a professional layout
- Export - Output as PDF, Markdown, or HTML
Tools We'll Use
- Python - Our programming language
- BeautifulSoup - HTML parsing and content extraction
- OpenAI API - GPT-4 for content analysis and generation
- Requests - HTTP requests to fetch web pages
- Markdown/HTML - Output formats
Step 1: Fetching Website Content
First, we need to retrieve the HTML content from the URL:
import requests
from bs4 import BeautifulSoup

def fetch_website_content(url):
    """Fetch and parse website content."""
    try:
        # Send GET request with headers to mimic a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        # Parse HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        # Remove unwanted elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        return soup
    except requests.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None
Key Points:
- User-Agent header - Some websites block requests without it
- Timeout - Prevent hanging on slow websites
- Remove noise - Strip scripts, styles, and navigation elements
Step 2: Extracting Structured Content
Now let's extract the meaningful text while preserving structure:
def extract_structured_content(soup):
    """Extract main content with structure preserved."""
    content = {
        'title': '',
        'headings': [],
        'paragraphs': [],
        'lists': []
    }
    # Get page title
    title_tag = soup.find('title')
    content['title'] = title_tag.get_text().strip() if title_tag else ''
    # Extract headings
    for heading in soup.find_all(['h1', 'h2', 'h3']):
        content['headings'].append({
            'level': heading.name,
            'text': heading.get_text().strip()
        })
    # Extract paragraphs
    for p in soup.find_all('p'):
        text = p.get_text().strip()
        if len(text) > 20:  # Filter out short snippets
            content['paragraphs'].append(text)
    # Extract lists
    for ul in soup.find_all(['ul', 'ol']):
        items = [li.get_text().strip() for li in ul.find_all('li')]
        if items:
            content['lists'].append(items)
    return content
Step 3: Using LLM to Analyze Content
Here's where the magic happens. We'll use GPT-4 to understand the website and extract key information:
import json
from openai import OpenAI

def analyze_with_llm(content, api_key):
    """Use LLM to analyze and extract key information."""
    # openai.ChatCompletion was removed in version 1.0 of the openai package;
    # requests now go through a client object.
    client = OpenAI(api_key=api_key)
    # Prepare the content for analysis (first 10 headings, first 15 paragraphs)
    headings_text = "\n".join(
        f"{h['level']}: {h['text']}" for h in content['headings'][:10]
    )
    paragraphs_text = "\n".join(content['paragraphs'][:15])
    text_content = f"""
Website Title: {content['title']}
Headings:
{headings_text}
Main Content:
{paragraphs_text}
"""
    # Create the prompt
    prompt = f"""Analyze this website content and extract key information for a professional brochure.
{text_content}
Please provide the following in JSON format:
1. company_name: The company or website name
2. tagline: A compelling one-line description
3. overview: 2-3 sentence overview of what they do
4. key_features: List of 4-6 main features/services
5. benefits: List of 3-5 key benefits
6. call_to_action: Suggested CTA text
Respond ONLY with valid JSON."""
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a professional marketing analyst who extracts key information from websites."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=1000
        )
        # Parse the JSON response
        result = json.loads(response.choices[0].message.content)
        return result
    except Exception as e:
        # Covers API errors as well as replies that are not valid JSON
        print(f"LLM analysis error: {e}")
        return None
Why This Approach Works:
- Structured output - JSON makes it easy to work with the results
- Low temperature - More deterministic, factual responses
- Clear instructions - Specific format requirements
- Token limits - Prevents excessive costs
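Even with "Respond ONLY with valid JSON" in the prompt, a model may wrap its reply in a Markdown code fence or leave out a field. The sketch below shows one way to parse the reply defensively. The helper name `parse_llm_json` and the `REQUIRED_KEYS` tuple are hypothetical, introduced here for illustration; the key names come from the prompt above.

```python
import json

# Keys requested in the prompt; the tuple itself is an illustrative assumption
REQUIRED_KEYS = ("company_name", "tagline", "overview",
                 "key_features", "benefits", "call_to_action")

FENCE = "`" * 3  # Markdown code-fence marker


def parse_llm_json(raw_text):
    """Parse a model reply into a dict, tolerating a surrounding code fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a "json" label
        text = text.split("\n", 1)[1] if "\n" in text else ""
    text = text.rstrip()
    if text.endswith(FENCE):
        text = text[:-len(FENCE)]
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Missing keys in LLM response: {missing}")
    return data
```

Calling this in place of a bare `json.loads` turns a malformed reply into a clear error instead of a `KeyError` later in the brochure step.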
Step 4: Generating the Brochure
Now let's create a beautifully formatted brochure:
def generate_brochure_markdown(analysis):
    """Generate a professional brochure in Markdown format."""
    # Blank lines matter: text directly above '---' would render as a heading
    brochure = f"""# {analysis['company_name']}

## {analysis['tagline']}

---

## About Us

{analysis['overview']}

---

## What We Offer

"""
    # Add key features
    for i, feature in enumerate(analysis['key_features'], 1):
        brochure += f"{i}. **{feature}**\n"
    brochure += "\n---\n\n## Why Choose Us?\n\n"
    # Add benefits
    for benefit in analysis['benefits']:
        brochure += f"✓ {benefit}\n"
    brochure += f"\n---\n\n## {analysis['call_to_action']}\n"
    return brochure
Step 5: Complete Working Example
Here's the complete script that ties everything together:
#!/usr/bin/env python3
"""
Brochure Generator from URL using LLM
"""
import json
import os
import sys

import openai
import requests
from bs4 import BeautifulSoup

# This script assumes extract_structured_content (Step 2), analyze_with_llm
# (Step 3) and generate_brochure_markdown (Step 4) are pasted above this class,
# together with their import lines.

class BrochureGenerator:
    def __init__(self, api_key):
        self.api_key = api_key

    def generate(self, url):
        """Main method to generate brochure from URL."""
        print(f"Fetching content from {url}...")
        soup = self.fetch_website(url)
        if not soup:
            return None
        print("Extracting structured content...")
        content = self.extract_content(soup)
        print("Analyzing with LLM...")
        analysis = self.analyze_with_llm(content)
        if not analysis:
            return None
        print("Generating brochure...")
        brochure = self.generate_markdown(analysis)
        return brochure

    def fetch_website(self, url):
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            }
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            for tag in soup(['script', 'style', 'nav', 'footer']):
                tag.decompose()
            return soup
        except Exception as e:
            print(f"Error: {e}")
            return None

    def extract_content(self, soup):
        # Implementation from Step 2
        return extract_structured_content(soup)

    def analyze_with_llm(self, content):
        # Implementation from Step 3
        return analyze_with_llm(content, self.api_key)

    def generate_markdown(self, analysis):
        # Implementation from Step 4
        return generate_brochure_markdown(analysis)

# Usage
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python brochure_generator.py <url>")
        sys.exit(1)
    url = sys.argv[1]
    # Read the key from the environment rather than hardcoding it
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        print("Set the OPENAI_API_KEY environment variable first")
        sys.exit(1)
    generator = BrochureGenerator(api_key)
    brochure = generator.generate(url)
    if brochure:
        # Save to file
        filename = "brochure.md"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(brochure)
        print(f"Brochure saved to {filename}")
    else:
        print("Failed to generate brochure")
Real-World Example
Let's say we run this on a SaaS company's website:
python brochure_generator.py https://example-saas.com
Output (brochure.md):
# CloudMaster Analytics
## Transform Your Data Into Actionable Insights
---
## About Us
CloudMaster Analytics is a leading cloud-based analytics platform that helps
businesses make data-driven decisions. We combine powerful data visualization
with AI-powered insights to turn complex datasets into clear, actionable intelligence.
---
## What We Offer
1. **Real-time Dashboard Analytics**
2. **AI-Powered Predictive Models**
3. **Custom Report Builder**
4. **Multi-source Data Integration**
5. **Collaborative Workspace**
6. **Enterprise-grade Security**
---
## Why Choose Us?
✓ 99.9% uptime guarantee with 24/7 support
✓ Seamless integration with 100+ data sources
✓ Advanced AI that learns from your data patterns
✓ GDPR and SOC 2 compliant
✓ Trusted by 10,000+ businesses worldwide
---
## Start Your Free 14-Day Trial Today!
Advanced Enhancements
1. Add PDF Export
from markdown2 import markdown
from weasyprint import HTML

def export_to_pdf(markdown_content, output_file):
    """Convert Markdown to PDF."""
    html_content = markdown(markdown_content)
    # Add some CSS styling
    styled_html = f"""
    <html>
    <head>
    <style>
    body {{ font-family: Arial, sans-serif; margin: 40px; }}
    h1 {{ color: #2c3e50; }}
    h2 {{ color: #3498db; border-bottom: 2px solid #3498db; }}
    </style>
    </head>
    <body>
    {html_content}
    </body>
    </html>
    """
    HTML(string=styled_html).write_pdf(output_file)
2. Add Image Extraction
from urllib.parse import urljoin

def extract_images(soup, url):
    """Extract main images from the website."""
    images = []
    for img in soup.find_all('img'):
        src = img.get('src')
        alt = img.get('alt', '')
        if src:
            # urljoin resolves relative and protocol-relative paths and
            # leaves absolute URLs unchanged
            src = urljoin(url, src)
            images.append({'url': src, 'alt': alt})
    return images[:3]  # Keep the first 3 images
3. Multi-page Crawling
from urllib.parse import urljoin

def crawl_multiple_pages(base_url, max_pages=5):
    """Crawl multiple pages for comprehensive content."""
    visited = set()
    to_visit = [base_url]
    all_content = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        soup = fetch_website_content(url)  # from Step 1
        if not soup:
            continue
        all_content.append(extract_structured_content(soup))  # from Step 2
        # Find internal links (site-relative hrefs only)
        for link in soup.find_all('a', href=True):
            href = link['href']
            if href.startswith('/') and not href.startswith('//'):
                # urljoin avoids doubled slashes and copes with a base URL that has a path
                full_url = urljoin(base_url, href)
                if full_url not in visited:
                    to_visit.append(full_url)
    return all_content
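Internal links do not always start with a slash: they can be relative (`team`), absolute on the same host, or carry a `#fragment` that points at the same page. As a rough sketch, a helper like the one below could normalize each `href` before it is queued. The name `resolve_internal_link` is hypothetical and not part of the code above; it relies only on the standard library's `urllib.parse`.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_internal_link(page_url, href):
    """Return an absolute same-host URL for href, or None if it should be skipped."""
    # Resolve against the current page and drop any #fragment
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    target = urlparse(absolute)
    if target.scheme not in ("http", "https"):
        return None  # mailto:, tel:, javascript: and similar
    if target.netloc != urlparse(page_url).netloc:
        return None  # external site
    return absolute


print(resolve_internal_link("https://example.com/about/", "team"))
print(resolve_internal_link("https://example.com/", "https://other.com/x"))
```

Dropping the fragment keeps `/pricing` and `/pricing#plans` from being crawled as two separate pages.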
Best Practices & Tips
1. Handle Rate Limits
- Add delays between requests
- Respect robots.txt
- Cache API responses to avoid repeated calls
- Use environment variables for API keys
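The first two points can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not a full crawler policy: the `USER_AGENT` value and the helper names `load_rules` and `fetch_allowed` are assumptions made for the example, and `fetch` stands in for whatever function retrieves a page.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "BrochureBot"  # hypothetical user-agent name for this example


def load_rules(robots_txt_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


def fetch_allowed(urls, fetch, parser, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each permitted URL, pausing between requests."""
    results = {}
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt
        if results:
            sleep(delay_seconds)  # pause only after a request has been made
        results[url] = fetch(url)
    return results
```

Against a live site, the rules would normally be loaded with `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` rather than from a list of lines.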
2. Error Handling
- Validate URLs before processing
- Handle timeouts gracefully
- Provide fallback content if LLM fails
- Log errors for debugging
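Two of these points are easy to sketch: checking a URL before any request is made, and building a bare-bones brochure input when the LLM call fails. Both helper names below (`is_valid_http_url`, `fallback_analysis`) are hypothetical. The fallback assumes the `content` dictionary shape from Step 2 and the analysis keys from Step 3, and its placeholder strings are arbitrary choices.

```python
from urllib.parse import urlparse


def is_valid_http_url(url):
    """Return True for http(s) URLs that include a host name."""
    try:
        parsed = urlparse(url.strip())
    except (AttributeError, ValueError):
        return False  # not a string, or malformed
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def fallback_analysis(content):
    """Build a minimal analysis dict from scraped content alone."""
    headings = [h["text"] for h in content.get("headings", [])]
    paragraphs = content.get("paragraphs", [])
    return {
        "company_name": content.get("title") or "Unknown company",
        "tagline": headings[0] if headings else "",
        "overview": " ".join(paragraphs[:2]),
        "key_features": headings[1:7],
        "benefits": [],
        "call_to_action": "Visit the website to learn more",
    }
```

A fallback like this produces a plainer brochure, so it is worth flagging the output for human review when it is used.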
3. Cost Optimization
- Limit token usage by summarizing content first
- Use GPT-3.5-turbo for initial analysis, GPT-4 for final polish
- Cache results for frequently accessed URLs
- Implement retry logic with exponential backoff
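Caching and retrying can both be wrapped around the analysis call without changing it. The sketch below is one possible shape: `retry_with_backoff` and `cached_call` are hypothetical helpers, and the delay doubles after each failed attempt (1s, 2s, 4s with the defaults shown).

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


_cache = {}  # in-memory only; results are lost when the process exits


def cached_call(key, compute):
    """Return the cached value for key, computing it on first use."""
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]
```

For example, `cached_call(url, lambda: retry_with_backoff(lambda: analyze(url)))` would retry a flaky call and then reuse its result for the same URL, where `analyze` stands for whatever function performs the LLM request.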
4. Quality Improvements
- Add human review step before finalizing
- Use few-shot examples in prompts for better results
- Implement content validation rules
- Allow manual override of extracted information
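Few-shot prompting fits the chat message format used in Step 3: each example becomes a user/assistant pair placed before the real request. A minimal sketch, where `build_few_shot_messages` is a hypothetical helper and the example pairs are ones you would supply yourself:

```python
def build_few_shot_messages(system_prompt, examples, user_prompt):
    """Assemble chat messages with (input, ideal_output) example pairs first."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    # The real request goes last so the model answers it, not an example
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

The returned list can be passed as the `messages` argument of the chat completion call. Each example adds to the token count, so one or two short pairs are usually a sensible starting point.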
Use Cases
This technique is incredibly versatile:
- Sales teams - Quickly create prospect briefs
- Marketing agencies - Generate client presentations
- Consultants - Research and summarize competitor offerings
- Product teams - Analyze feature sets of competitors
- Recruiters - Understand company culture and benefits
Limitations & Considerations
Technical Limitations:
- JavaScript-rendered content - May need Selenium/Playwright
- Authentication required - Can't access logged-in content
- Dynamic content - APIs or AJAX-loaded data might be missed
- Token limits - Large websites may exceed context window
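For the token-limit problem, a simple first defense is to cap how much text goes into the prompt. The sketch below keeps whole paragraphs until a character budget is reached. The helper name `trim_to_budget` and the default of 6000 characters are assumptions, and character count is only a rough proxy for tokens.

```python
def trim_to_budget(paragraphs, max_chars=6000):
    """Keep leading paragraphs whose combined length fits within max_chars."""
    kept = []
    used = 0
    for text in paragraphs:
        if used + len(text) > max_chars:
            break  # stop at the first paragraph that would exceed the budget
        kept.append(text)
        used += len(text)
    return kept
```

This favors content near the top of the page, which tends to work for landing pages but may miss details further down. An exact count would need the tokenizer for the specific model.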
Ethical Considerations:
- Respect copyright and terms of service
- Don't scrape excessively or cause server load
- Always attribute content to original source
- Use for internal research, not republishing
Future Enhancements
Potential improvements for this system:
- Multi-language support - Translate brochures automatically
- Brand customization - Apply company colors and logos
- Template variations - Different layouts for different industries
- Competitive analysis - Compare multiple competitors side-by-side
- Auto-updating - Regenerate when website changes
- Integration - Connect with CRM or marketing tools
Key Takeaways
- LLMs can dramatically accelerate content creation workflows
- Combining web scraping with AI produces powerful automation
- Structured prompts yield consistent, usable results
- Start simple, then add sophistication based on needs
- Always include human oversight for quality control
- Be mindful of costs, rate limits, and ethical considerations
Conclusion
Automating brochure creation from URLs using LLMs is a game-changer for anyone who regularly needs to understand and present information about businesses. What used to take hours of manual work now takes minutes, and the results are often more consistent than what you'd create manually.
The key is combining the right tools—web scraping for data collection, LLMs for intelligent analysis, and good old-fashioned programming for orchestration. Start with the basic implementation I've shown here, then customize it for your specific needs.
I've used variations of this approach for everything from competitive analysis reports to client onboarding materials, and it's saved countless hours while improving consistency. Give it a try and let me know what you create!
Have you automated content creation with LLMs? What challenges did you face? Let's connect and share experiences!