# Read Webpage

Extract readable text content from web pages using various methods.
## Process

- Choose extraction method - lynx for speed, readability for articles
- Test basic fetch - try curl first to check that the page loads
- Apply extraction tool - use the appropriate tool for the content type
- Handle authentication - add headers/cookies if needed
- Clean up output - remove unwanted elements, format text
- Verify quality - check that content extraction succeeded
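The steps above can be sketched as a small shell helper. This is a minimal sketch: the function names and the 100-character quality threshold are illustrative assumptions, not part of any standard.

```bash
# Sketch of the process: test the fetch, extract, then verify quality.
fetch_and_extract() {
  local url="$1"
  local status
  # Test basic fetch: confirm the page loads before extracting
  status=$(curl -sL -o /dev/null -w "%{http_code}" "$url") || return 1
  if [ "$status" != "200" ]; then
    echo "HTTP $status" >&2
    return 1
  fi
  # Apply the extraction tool
  lynx -dump -nolist "$url"
}

# Verify quality: reject suspiciously short extractions
# (the 100-character threshold is an arbitrary assumption)
verify_quality() {
  [ "${#1}" -ge 100 ]
}
```

Typical use: `text=$(fetch_and_extract "https://example.com") && verify_quality "$text" && printf '%s\n' "$text"`.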
## Examples

| Method | Best For | Command |
|---|---|---|
| curl + html2text | Simple pages | `curl -sL URL \| html2text` |
| lynx | Text extraction | `lynx -dump -nolist URL` |
| w3m | Alternative text | `w3m -dump URL` |
| Python readability | Article extraction | See Python section |
## Method 1: curl + html2text

Best for simple HTML-to-text conversion:

```bash
# Basic usage
curl -sL "https://example.com" | html2text

# With width control
curl -sL "https://news.ycombinator.com" | html2text -width 80

# Save to file
curl -sL "https://example.com/article" | html2text > article.txt
```
## Method 2: lynx

Best for clean text extraction (lynx follows redirects by default):

```bash
# Basic dump (includes link references at the bottom)
lynx -dump "https://example.com"

# Without the link list
lynx -dump -nolist "https://example.com"

# With custom width
lynx -dump -width=100 "https://example.com"

# Accept all cookies (some sites refuse to serve pages without them)
lynx -dump -accept_all_cookies "https://example.com"
```
## Method 3: w3m

An alternative text browser:

```bash
# Basic dump
w3m -dump "https://example.com"

# With cookie support
w3m -dump -cookie "https://example.com"
```
## Method 4: Python with readability

Best for article extraction (removes navigation, ads, etc.). Pass the URL via an environment variable: arguments placed after the `--run` string are interpreted by nix-shell itself, not forwarded to the Python script.

```bash
# Install dependencies via nix-shell
URL="https://example.com/article" \
nix-shell -p python3Packages.readability-lxml python3Packages.requests \
  --run "python3 << 'EOF'
import os

import requests
from readability import Document

url = os.environ.get('URL', 'https://example.com')
response = requests.get(url)
doc = Document(response.text)
print('Title:', doc.title())
print()
print(doc.summary())
EOF"
```
## Method 5: Python with BeautifulSoup

For more control over parsing (the URL is again passed via an environment variable):

```bash
URL="https://example.com" \
nix-shell -p python3Packages.beautifulsoup4 python3Packages.requests \
  --run "python3 << 'EOF'
import os

import requests
from bs4 import BeautifulSoup

url = os.environ.get('URL', 'https://example.com')
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Remove script, style, and navigation elements
for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
    tag.decompose()

# Get text and clean up whitespace
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split('  '))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
EOF"
```
## Extracting Specific Elements

Using jq with curl (if the site returns JSON):

```bash
curl -sL "https://api.example.com/data" | jq .
```

Using XPath with xmllint:

```bash
curl -sL "https://example.com" | \
  xmllint --html --xpath "//article//text()" - 2>/dev/null
```
## Handling Authentication

```bash
# Basic auth
curl -u username:password -sL "https://example.com" | html2text

# Bearer token
curl -H "Authorization: Bearer TOKEN" -sL "https://api.example.com" | jq .

# Cookies
curl -b "session=abc123" -sL "https://example.com" | html2text
```
## Reusable Function

```bash
read_webpage() {
  local url="$1"
  local method="${2:-lynx}"  # Default to lynx

  case "$method" in
    lynx)
      lynx -dump -nolist "$url"
      ;;
    html2text)
      curl -sL "$url" | html2text
      ;;
    w3m)
      w3m -dump "$url"
      ;;
    readability)
      nix-shell -p python3Packages.readability-lxml python3Packages.requests \
        --run "python3 -c \"
from readability import Document
import requests
doc = Document(requests.get('$url').text)
print(doc.title())
print()
print(doc.summary())
\""
      ;;
    *)
      echo "Unknown method: $method" >&2
      return 1
      ;;
  esac
}

# Usage
read_webpage "https://example.com" lynx
```
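When it is unclear which tool will work for a given page, the function above can be wrapped in a fallback loop. This is a sketch under stated assumptions: the `read_webpage_fallback` name, the method ordering, and the non-empty-output success test are all illustrative choices, not part of the original function.

```bash
# Try each extraction method in turn until one yields non-empty output.
read_webpage_fallback() {
  local url="$1" method text
  for method in lynx html2text w3m; do
    if text=$(read_webpage "$url" "$method" 2>/dev/null) && [ -n "$text" ]; then
      printf '%s\n' "$text"
      return 0
    fi
  done
  echo "All extraction methods failed for $url" >&2
  return 1
}
```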
## Tips

- Choose the right tool:
  - lynx: fast, clean text
  - html2text: preserves some formatting
  - readability: best for articles
  - BeautifulSoup: custom parsing
- Handle errors:

  ```bash
  if ! curl -sL "$url" | html2text; then
    echo "Failed to fetch page" >&2
    exit 1
  fi
  ```

- Set a User-Agent if needed:

  ```bash
  curl -A "Mozilla/5.0" -sL "$url" | html2text
  ```

- Follow redirects: curl uses the `-L` flag
- Check the status code:

  ```bash
  status=$(curl -sL -o /dev/null -w "%{http_code}" "$url")
  if [ "$status" != "200" ]; then
    echo "HTTP $status" >&2
    exit 1
  fi
  ```
## Common Issues

JavaScript-heavy sites: these tools cannot execute JavaScript. For dynamic sites, you need a real browser (see the browser skill).
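One rough way to spot such pages before reaching for a browser is to check how much visible text survives tag stripping. This is a heuristic sketch: the `looks_js_rendered` name and the 50-word threshold are assumptions, and the line-based sed tag stripper is deliberately crude.

```bash
# Heuristic: if the HTML on stdin yields very little visible text,
# the page is probably rendered client-side by JavaScript.
looks_js_rendered() {
  local words
  words=$(sed 's/<[^>]*>//g' | wc -w)
  [ "$words" -lt 50 ]  # 50-word threshold is arbitrary
}

# Example: a typical single-page-app shell triggers the heuristic
# curl -sL "$url" | looks_js_rendered && echo "use a real browser" >&2
```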
Rate limiting: add delays between requests:

```bash
for url in url1 url2 url3; do
  lynx -dump -nolist "$url"
  sleep 2
done
```
Character encoding: usually handled automatically, but if needed:

```bash
curl -sL "$url" | iconv -f iso-8859-1 -t utf-8 | html2text
```
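Rather than guessing the source encoding, the server usually declares it in the Content-Type response header. A minimal sketch for reading it (the `charset_from_headers` name and the utf-8 fallback are assumptions):

```bash
# Read HTTP response headers on stdin and print the declared charset,
# falling back to utf-8 when none is declared.
charset_from_headers() {
  local cs
  cs=$(tr -d '\r' | tr 'A-Z' 'a-z' \
    | sed -n 's/^content-type:.*charset=\([^; ]*\).*/\1/p' | head -n1)
  printf '%s\n' "${cs:-utf-8}"
}

# Usage: feed headers from curl, then convert accordingly
# cs=$(curl -sIL "$url" | charset_from_headers)
# curl -sL "$url" | iconv -f "$cs" -t utf-8 | html2text
```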