Read Webpage

Extract readable text content from web pages using various methods.

Process

  1. Choose extraction method - lynx for speed, readability for articles
  2. Test basic fetch - Try curl first to check if page loads
  3. Apply extraction tool - Use appropriate tool for content type
  4. Handle authentication - Add headers/cookies if needed
  5. Clean up output - Remove unwanted elements, format text
  6. Verify quality - Check if content extraction was successful
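The steps above can be sketched as one small shell helper. This is a minimal sketch assuming curl and lynx are on PATH; `fetch_text` is an illustrative name, not a standard tool:

```shell
# Steps 2, 3 and 6 in one helper: test the fetch, extract, report failures.
fetch_text() {
  local url="$1"
  local status
  # Step 2: test basic fetch -- capture only the HTTP status code
  status=$(curl -sL -o /dev/null -w '%{http_code}' "$url") || return 1
  if [ "$status" != "200" ]; then
    echo "HTTP $status for $url" >&2   # Step 6: report the failure
    return 1
  fi
  # Step 3: apply the extraction tool (lynx here for speed)
  lynx -dump -nolist "$url"
}
```

Swap the `lynx` line for any of the methods below to change the extraction tool.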

Examples

Method              Best For            Command
------------------  ------------------  ------------------------------
curl + html2text    Simple pages        curl -sL URL | html2text
lynx                Text extraction     lynx -dump -nolist URL
w3m                 Alternative text    w3m -dump URL
Python readability  Article extraction  See Python section

Method 1: curl + html2text

Best for simple HTML to text conversion:

# Basic usage
curl -sL "https://example.com" | html2text

# With width control
curl -sL "https://news.ycombinator.com" | html2text -width 80

# Save to file
curl -sL "https://example.com/article" | html2text > article.txt

Method 2: lynx

Best for clean text extraction:

# Basic dump (includes link references at bottom)
lynx -dump "https://example.com"

# Without link list
lynx -dump -nolist "https://example.com"

# With custom width
lynx -dump -width=100 "https://example.com"

# Accept all cookies (some sites require this before serving content)
lynx -dump -accept_all_cookies "https://example.com"

Method 3: w3m

Alternative text browser:

# Basic dump
w3m -dump "https://example.com"

# With cookies
w3m -dump -cookie "https://example.com"

Method 4: Python with Readability

Best for article extraction (removes navigation, ads, etc.):

# Install dependencies via nix-shell; `python3 -` reads the script from
# stdin and passes the following argument through as sys.argv[1]
nix-shell -p python3Packages.readability-lxml python3Packages.requests \
  --run "python3 - 'https://example.com/article' << 'EOF'
import sys
from readability import Document
import requests

url = sys.argv[1]

response = requests.get(url)
response.raise_for_status()
doc = Document(response.text)

print('Title:', doc.title())
print()
print(doc.summary())
EOF"

Method 5: Python with BeautifulSoup

For more control over parsing:

# As above, `python3 -` reads the script from stdin with the URL in sys.argv[1]
nix-shell -p python3Packages.beautifulsoup4 python3Packages.requests \
  --run "python3 - 'https://example.com' << 'EOF'
import sys
from bs4 import BeautifulSoup
import requests

url = sys.argv[1]

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Remove script, style and boilerplate navigation elements
for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
    tag.decompose()

# Get text
text = soup.get_text()

# Clean up whitespace: strip each line, split on runs of spaces,
# and drop empty chunks
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split('  '))
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
EOF"

Extracting Specific Elements

Using jq with curl (if site returns JSON):

curl -sL "https://api.example.com/data" | jq .

Using xpath with xmllint:

curl -sL "https://example.com" | \
  xmllint --html --xpath "//article//text()" - 2>/dev/null
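The same XPath approach can target a single element. Here a variable stands in for the curl output so the example is self-contained; in practice, pipe `curl -sL "$url"` in instead:

```shell
# Pull just the <title> text out of an HTML document.
# 2>/dev/null hides xmllint's warnings about imperfect real-world HTML.
html='<html><head><title>Example Page</title></head><body><p>hi</p></body></html>'
printf '%s' "$html" | xmllint --html --xpath '//title/text()' - 2>/dev/null
```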

Handling Authentication

# Basic auth
curl -u username:password -sL "https://example.com" | html2text

# Bearer token
curl -H "Authorization: Bearer TOKEN" -sL "https://api.example.com" | jq .

# Cookies
curl -b "session=abc123" -sL "https://example.com" | html2text
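To keep credentials out of shell history and `ps` output, curl can also read them from a netrc file via `--netrc-file`. A hypothetical helper (the function name and the sed-based host extraction are illustrative, not a standard tool):

```shell
# Write a temporary netrc (mode 600) and fetch with --netrc-file.
fetch_with_netrc() {
  local url="$1" user="$2" pass="$3"
  local host tmp rc
  # Strip the scheme and any path to get the bare hostname
  host=$(printf '%s' "$url" | sed -E 's#^[a-z]+://([^/]+).*#\1#')
  tmp=$(mktemp) && chmod 600 "$tmp"
  printf 'machine %s login %s password %s\n' "$host" "$user" "$pass" > "$tmp"
  curl --netrc-file "$tmp" -sL "$url"
  rc=$?
  rm -f "$tmp"   # don't leave credentials on disk
  return $rc
}
```

Usage: `fetch_with_netrc "https://example.com/secret" alice s3cret | html2text`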

Reusable Function

read_webpage() {
  local url="$1"
  local method="${2:-lynx}"  # Default to lynx
  
  case "$method" in
    lynx)
      lynx -dump -nolist "$url"
      ;;
    html2text)
      curl -sL "$url" | html2text
      ;;
    w3m)
      w3m -dump "$url"
      ;;
    readability)
      nix-shell -p python3Packages.readability-lxml python3Packages.requests \
        --run "python3 -c \"
from readability import Document
import requests
doc = Document(requests.get('$url').text)
print(doc.title())
print()
print(doc.summary())
\""
      ;;
    *)
      echo "Unknown method: $method" >&2
      return 1
      ;;
  esac
}

# Usage
read_webpage "https://example.com" lynx

Tips

  1. Choose the right tool:

    • lynx: Fast, clean text
    • html2text: Preserves some formatting
    • readability: Best for articles
    • BeautifulSoup: Custom parsing
  2. Handle errors:

    if ! curl -sL "$url" | html2text; then
      echo "Failed to fetch page" >&2
      exit 1
    fi
    
  3. Set User-Agent if needed:

    curl -A "Mozilla/5.0" -sL "$url" | html2text
    
  4. Follow redirects: curl's -L flag (used throughout these examples) follows them automatically

  5. Check status code:

    status=$(curl -sL -o /dev/null -w "%{http_code}" "$url")
    if [ "$status" != "200" ]; then
      echo "HTTP $status" >&2
      exit 1
    fi
    

Common Issues

JavaScript-heavy sites: These tools can’t execute JavaScript. For dynamic sites, you need a real browser (see browser skill).

Rate limiting: Add delays between requests:

for url in url1 url2 url3; do
  lynx -dump -nolist "$url"
  sleep 2
done

Character encoding: Usually handled automatically, but if needed:

curl -sL "$url" | iconv -f iso-8859-1 -t utf-8 | html2text
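Rather than hard-coding iso-8859-1, the charset can be read from the Content-Type response header. A sketch under those assumptions (`fetch_utf8` is a hypothetical name; requires curl, sed, and iconv):

```shell
# Convert to UTF-8 only when the server declares a different charset.
fetch_utf8() {
  local url="$1"
  local charset
  # -sIL fetches headers only (following redirects); tr strips CRs;
  # sed pulls out the charset= field; tail keeps the final response's value
  charset=$(curl -sIL "$url" | tr -d '\r' \
    | sed -nE 's/^[Cc]ontent-[Tt]ype:.*charset=([^ ;]+).*/\1/p' | tail -n 1)
  case "$charset" in
    ''|[Uu][Tt][Ff]-8) curl -sL "$url" ;;   # already UTF-8, or undeclared
    *) curl -sL "$url" | iconv -f "$charset" -t utf-8 ;;
  esac
}
```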