Parse Hacker News
Extract structured data from Hacker News posts and comments using the Firebase API.
Process
- Extract item ID - Parse ID from HN URL or use given ID
- Fetch item data - Call Firebase API to get story/comment JSON
- Parse main item - Extract title, author, score, text, timestamp
- Fetch comments - Get top-level comments from kids array
- Parse comments - Extract comment text, authors, timestamps
- Format output - Structure data as readable text or JSON
Examples
Given a HN URL or item ID, fetch and parse to extract:
- Post author, title, text, score, date
- Top comments with authors and content
API Endpoints
Base URL: https://hacker-news.firebaseio.com/v0
# Get a story or comment by ID
curl -s "https://hacker-news.firebaseio.com/v0/item/ITEM_ID.json"
Extract Item ID from URL
HN URLs look like:
https://news.ycombinator.com/item?id=41824973
Extract the ID after id=.
JSON Structure
Story Fields
by- Usernametitle- Post titletext- Post body (for Ask HN, Show HN, etc.)url- For link posts, the external URLscore- Pointsdescendants- Total comment counttime- Unix timestampkids- Array of comment IDs (top-level only)type- “story”, “comment”, “job”, etc.
Comment Fields
by- Usernametext- Comment body (HTML encoded)time- Unix timestampparent- Parent item IDkids- Array of reply IDs
Example Commands
Fetch a story with basic info:
curl -s "https://hacker-news.firebaseio.com/v0/item/41824973.json" | jq '{
author: .by,
title: .title,
text: .text,
score: .score,
comments: .descendants,
time: .time,
url: .url
}'
Fetch top 3 comments:
# First get the story to find comment IDs
KIDS=$(curl -s "https://hacker-news.firebaseio.com/v0/item/41824973.json" | jq -r '.kids[:3][]')
# Then fetch each comment
for kid in $KIDS; do
curl -s "https://hacker-news.firebaseio.com/v0/item/$kid.json" | jq '{author: .by, text: .text}'
done
One-liner to get story + comments:
ID=41824973
echo "=== STORY ===" && \
curl -s "https://hacker-news.firebaseio.com/v0/item/$ID.json" | jq '{author: .by, title, text, score, time}' && \
echo "=== COMMENTS ===" && \
for kid in $(curl -s "https://hacker-news.firebaseio.com/v0/item/$ID.json" | jq -r '.kids[:3] // [] | .[]'); do
curl -s "https://hacker-news.firebaseio.com/v0/item/$kid.json" | jq '{author: .by, text}'
done
Output Format
## Hacker News Post
**username** (DATE, SCORE points, N comments)
Title: POST TITLE
> Post text here (if any)...
URL: https://external-link.com (if link post)
### Top Comments
1. **commenter1**
> Comment text...
2. **commenter2**
> Comment text...
Date Conversion
# Convert Unix timestamp
date -d @1728790511 # Returns: Sun Oct 13 2024
HTML Decoding
HN text contains HTML entities. Common ones:
'→'/→/"→"&→&<p>→ paragraph break<i>...</i>→ italics
Use sed or handle in output:
| sed 's/<p>/\n\n/g; s/<[^>]*>//g; s/'/'"'"'/g; s///\//g'
Rate Limiting
The Firebase API is generous but:
- Don’t hammer it with parallel requests
- Add small delays if fetching many items