# Read Webpage

Extract readable text content from web pages using various methods.
## Process

- Choose extraction method - lynx for speed, readability for articles
- Test basic fetch - try curl first to check that the page loads
- Apply extraction tool - use the appropriate tool for the content type
- Handle authentication - add headers/cookies if needed
- Clean up output - remove unwanted elements, format text
- Verify quality - check that content extraction succeeded
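The steps above can be sketched as a small shell helper. This is a minimal sketch: the function names and the 100-character quality threshold are illustrative assumptions, not part of any standard.

```bash
# Sketch of the process: test the fetch, extract, then verify quality.
fetch_and_extract() {
  local url="$1"
  local status
  # Test basic fetch: confirm the page loads before extracting
  status=$(curl -sL -o /dev/null -w "%{http_code}" "$url") || return 1
  if [ "$status" != "200" ]; then
    echo "HTTP $status" >&2
    return 1
  fi
  # Apply the extraction tool
  lynx -dump -nolist "$url"
}

# Verify quality: reject suspiciously short extractions
# (the 100-character threshold is an arbitrary assumption)
verify_quality() {
  [ "${#1}" -ge 100 ]
}
```

Typical use: `text=$(fetch_and_extract "https://example.com") && verify_quality "$text" && printf '%s\n' "$text"`.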
## Examples

| Method | Best For | Command |
|---|---|---|
| curl + html2text | Simple pages | `curl -sL URL \| html2text` |
| lynx | Text extraction | `lynx -dump -nolist URL` |
| w3m | Alternative text | `w3m -dump URL` |
| Python readability | Article extraction | See Python section |
## Method 1: curl + html2text

Best for simple HTML-to-text conversion:

```bash
# Basic usage
curl -sL "https://example.com" | html2text

# With width control
curl -sL "https://news.ycombinator.com" | html2text -width 80

# Save to file
curl -sL "https://example.com/article" | html2text > article.txt
```
## Method 2: lynx

Best for clean text extraction (lynx follows redirects by default):

```bash
# Basic dump (includes link references at the bottom)
lynx -dump "https://example.com"

# Without the link list
lynx -dump -nolist "https://example.com"

# With custom width
lynx -dump -width=100 "https://example.com"

# Accept all cookies (some sites refuse to serve pages without them)
lynx -dump -accept_all_cookies "https://example.com"
```
## Method 3: w3m

An alternative text browser:

```bash
# Basic dump
w3m -dump "https://example.com"

# With cookie support
w3m -dump -cookie "https://example.com"
```
## Method 4: Python with readability

Best for article extraction (removes navigation, ads, etc.). Pass the URL via an environment variable: arguments placed after the `--run` string are interpreted by nix-shell itself, not forwarded to the Python script.

```bash
# Install dependencies via nix-shell
URL="https://example.com/article" \
nix-shell -p python3Packages.readability-lxml python3Packages.requests \
  --run "python3 << 'EOF'
import os

import requests
from readability import Document

url = os.environ.get('URL', 'https://example.com')
response = requests.get(url)
doc = Document(response.text)
print('Title:', doc.title())
print()
print(doc.summary())
EOF"
```
## Method 5: Python with BeautifulSoup

For more control over parsing (the URL is again passed via an environment variable):

```bash
URL="https://example.com" \
nix-shell -p python3Packages.beautifulsoup4 python3Packages.requests \
  --run "python3 << 'EOF'
import os

import requests
from bs4 import BeautifulSoup

url = os.environ.get('URL', 'https://example.com')
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Remove script, style, and navigation elements
for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
    tag.decompose()

# Get text and clean up whitespace
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split('  '))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
EOF"
```
## Extracting Specific Elements

Using jq with curl (if the site returns JSON):

```bash
curl -sL "https://api.example.com/data" | jq .
```

Using XPath with xmllint:

```bash
curl -sL "https://example.com" | \
  xmllint --html --xpath "//article//text()" - 2>/dev/null
```
## Handling Authentication

```bash
# Basic auth
curl -u username:password -sL "https://example.com" | html2text

# Bearer token
curl -H "Authorization: Bearer TOKEN" -sL "https://api.example.com" | jq .

# Cookies
curl -b "session=abc123" -sL "https://example.com" | html2text
```
## Reusable Function

```bash
read_webpage() {
  local url="$1"
  local method="${2:-lynx}"  # Default to lynx

  case "$method" in
    lynx)
      lynx -dump -nolist "$url"
      ;;
    html2text)
      curl -sL "$url" | html2text
      ;;
    w3m)
      w3m -dump "$url"
      ;;
    readability)
      nix-shell -p python3Packages.readability-lxml python3Packages.requests \
        --run "python3 -c \"
from readability import Document
import requests
doc = Document(requests.get('$url').text)
print(doc.title())
print()
print(doc.summary())
\""
      ;;
    *)
      echo "Unknown method: $method" >&2
      return 1
      ;;
  esac
}

# Usage
read_webpage "https://example.com" lynx
```
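When it is unclear which tool will work for a given page, the function above can be wrapped in a fallback loop. This is a sketch under stated assumptions: the `read_webpage_fallback` name, the method ordering, and the non-empty-output success test are all illustrative choices, not part of the original function.

```bash
# Try each extraction method in turn until one yields non-empty output.
read_webpage_fallback() {
  local url="$1" method text
  for method in lynx html2text w3m; do
    if text=$(read_webpage "$url" "$method" 2>/dev/null) && [ -n "$text" ]; then
      printf '%s\n' "$text"
      return 0
    fi
  done
  echo "All extraction methods failed for $url" >&2
  return 1
}
```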
## Tips

- Choose the right tool:
  - lynx: fast, clean text
  - html2text: preserves some formatting
  - readability: best for articles
  - BeautifulSoup: custom parsing
- Handle errors:

  ```bash
  if ! curl -sL "$url" | html2text; then
    echo "Failed to fetch page" >&2
    exit 1
  fi
  ```

- Set a User-Agent if needed:

  ```bash
  curl -A "Mozilla/5.0" -sL "$url" | html2text
  ```

- Follow redirects: curl uses the `-L` flag
- Check the status code:

  ```bash
  status=$(curl -sL -o /dev/null -w "%{http_code}" "$url")
  if [ "$status" != "200" ]; then
    echo "HTTP $status" >&2
    exit 1
  fi
  ```
## Common Issues

JavaScript-heavy sites: these tools cannot execute JavaScript. For dynamic sites, you need a real browser (see the browser skill).
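One rough way to spot such pages before reaching for a browser is to check how much visible text survives tag stripping. This is a heuristic sketch: the `looks_js_rendered` name and the 50-word threshold are assumptions, and the line-based sed tag stripper is deliberately crude.

```bash
# Heuristic: if the HTML on stdin yields very little visible text,
# the page is probably rendered client-side by JavaScript.
looks_js_rendered() {
  local words
  words=$(sed 's/<[^>]*>//g' | wc -w)
  [ "$words" -lt 50 ]  # 50-word threshold is arbitrary
}

# Example: a typical single-page-app shell triggers the heuristic
# curl -sL "$url" | looks_js_rendered && echo "use a real browser" >&2
```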
Rate limiting: add delays between requests:

```bash
for url in url1 url2 url3; do
  lynx -dump -nolist "$url"
  sleep 2
done
```
Character encoding: usually handled automatically, but if needed:

```bash
curl -sL "$url" | iconv -f iso-8859-1 -t utf-8 | html2text
```
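Rather than guessing the source encoding, the server usually declares it in the Content-Type response header. A minimal sketch for reading it (the `charset_from_headers` name and the utf-8 fallback are assumptions):

```bash
# Read HTTP response headers on stdin and print the declared charset,
# falling back to utf-8 when none is declared.
charset_from_headers() {
  local cs
  cs=$(tr -d '\r' | tr 'A-Z' 'a-z' \
    | sed -n 's/^content-type:.*charset=\([^; ]*\).*/\1/p' | head -n1)
  printf '%s\n' "${cs:-utf-8}"
}

# Usage: feed headers from curl, then convert accordingly
# cs=$(curl -sIL "$url" | charset_from_headers)
# curl -sL "$url" | iconv -f "$cs" -t utf-8 | html2text
```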