Commit 8e17156

example: add the example hn_trending_topics (#1209)
1 parent 4193dfc commit 8e17156

6 files changed

+512
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -201,6 +201,7 @@ It defines an index flow like this:
201201
| [Custom Source HackerNews](examples/custom_source_hn) | Index HackerNews threads and comments, using *CocoIndex Custom Source* |
202202
| [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
203203
| [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |
204+
| [HackerNews Trending Topics](examples/hn_trending_topics) | Extract trending topics from HackerNews threads and comments, using *CocoIndex Custom Source* and an LLM |
204205

205206
More coming and stay tuned 👀!
206207

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
# Postgres database address for cocoindex
2+
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
3+
4+
OPENAI_API_KEY=
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
.env
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
1+
# HackerNews Trending Topics Example
2+
3+
[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
4+
5+
In this example, we use a [CocoIndex Custom Source](https://cocoindex.io/docs/custom_ops/custom_targets) to define a source that fetches recent HackerNews content through the [HackerNews API](https://hn.algolia.com/api).
6+
We build an index of HackerNews threads and their comments, and use an LLM to extract trending topics from the text.
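
As a rough illustration of the fetching step, the sketch below builds a request for recent stories and flattens a search response into plain records. It is not the example's actual `main.py`: the helper names (`build_search_url`, `parse_hits`, `fetch_recent_threads`) and the `Thread` record are hypothetical, and the endpoint and field names (`search_by_date`, `hits`, `objectID`) are assumptions based on the public Algolia HackerNews API.

```python
import json
import urllib.parse
import urllib.request
from dataclasses import dataclass

# Assumed endpoint of the public Algolia HackerNews search API.
HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search_by_date"


@dataclass
class Thread:
    """Hypothetical record for one HackerNews thread."""
    thread_id: str
    title: str
    author: str


def build_search_url(tag: str = "story", hits_per_page: int = 20) -> str:
    """Build the query URL for the most recent items with the given tag."""
    query = urllib.parse.urlencode({"tags": tag, "hitsPerPage": hits_per_page})
    return f"{HN_SEARCH_URL}?{query}"


def parse_hits(payload: dict) -> list[Thread]:
    """Convert the `hits` array of a search response into Thread records."""
    return [
        Thread(
            thread_id=str(hit["objectID"]),
            title=hit.get("title") or "",
            author=hit.get("author") or "",
        )
        for hit in payload.get("hits", [])
    ]


def fetch_recent_threads(hits_per_page: int = 20) -> list[Thread]:
    """Perform the HTTP request (needs network access)."""
    url = build_search_url("story", hits_per_page)
    with urllib.request.urlopen(url) as resp:
        return parse_hits(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline; a custom source would wrap similar logic behind whatever interface CocoIndex expects.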
7+
8+
The pipeline uses `ExtractByLlm` to identify topics like product names, technologies, models, and company names mentioned in threads and comments, storing them in canonical form (avoiding acronyms unless very popular).
9+
10+
We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.
11+
12+
## Features
13+
14+
- **Custom Source Integration**: Fetches HackerNews threads and comments via API
15+
- **LLM Topic Extraction**: Automatically extracts topics using the `ExtractByLlm` function
16+
- **Canonical Topic Forms**: Topics are stored in canonical form (e.g., "Large Language Model" instead of "LLM")
17+
- **Multiple Query Handlers**:
18+
  - `search_by_topic`: Search content by specific topic
19+
  - `get_trending_topics`: Get trending topics ranked by mention count
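
To make the two query handlers concrete, here is a minimal in-memory sketch of their logic. It only assumes that each extracted topic is stored alongside the id of the message that mentions it; the sample rows, the function signatures, and the case-insensitive matching are illustrative assumptions, and the example itself presumably runs equivalent queries against Postgres rather than Python lists.

```python
from collections import Counter

# Hypothetical rows pairing a topic with the id of the message mentioning it.
TOPIC_ROWS = [
    ("Claude", "m1"),
    ("PostgreSQL", "m1"),
    ("Claude", "m2"),
    ("Rust", "m3"),
    ("Claude", "m3"),
    ("Rust", "m4"),
]


def get_trending_topics(rows, limit=20):
    """Return (topic, mention_count) pairs, most-mentioned first."""
    counts = Counter(topic for topic, _ in rows)
    # Sort by count descending, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


def search_by_topic(rows, topic):
    """Return ids of messages that mention the topic (case-insensitive)."""
    wanted = topic.casefold()
    return [message_id for name, message_id in rows if name.casefold() == wanted]


if __name__ == "__main__":
    print(get_trending_topics(TOPIC_ROWS, limit=2))
    print(search_by_topic(TOPIC_ROWS, "claude"))
```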
20+
21+
## Steps
22+
23+
### Indexing Flow
24+
25+
1. We define a custom source connector `HackerNews` that fetches recent HackerNews threads through the HackerNews API.
26+
2. For each thread and comment, we extract topics using an LLM (`ExtractByLlm`).
27+
3. We build two indexes:
28+
   - `hn_messages`: Full text of threads and comments
29+
   - `hn_topics`: Extracted topics with references to their source content, keyed by (topic, message_id)
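
The shape of the two indexes can be sketched with plain dictionaries. This is only an illustration of the keying scheme described in the steps above, not CocoIndex code: the `Message` record and `build_indexes` helper are hypothetical, and topics are assumed to have been extracted already.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical thread or comment with topics already extracted."""
    message_id: str
    text: str
    topics: list[str]


def build_indexes(messages):
    """Return (hn_messages, hn_topics) as plain dictionaries."""
    hn_messages = {}
    hn_topics = {}
    for msg in messages:
        hn_messages[msg.message_id] = msg.text
        for topic in msg.topics:
            # The composite key records a topic once per message,
            # even if the extractor returns it twice.
            hn_topics[(topic, msg.message_id)] = {
                "topic": topic,
                "message_id": msg.message_id,
            }
    return hn_messages, hn_topics
```

Keying `hn_topics` by (topic, message_id) lets mention counts be computed by grouping on the topic, while each row still points back to the text stored in `hn_messages`.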
30+
31+
## Prerequisite
32+
33+
[Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't already have it.
34+
35+
## Run
36+
37+
Install dependencies:
38+
39+
```bash
40+
pip install -e .
41+
```
42+
43+
Update the target:
44+
45+
```bash
46+
cocoindex update --setup main
47+
```
48+
49+
Each time you run the `update` command, CocoIndex re-processes only the threads that have changed and keeps the target in sync with the 500 most recent HackerNews threads.
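
The idea of re-processing only changed threads can be illustrated with a small change-detection sketch. This is not how CocoIndex tracks changes internally; it is a generic technique (content fingerprints compared between runs), and the `fingerprint` and `incremental_update` helpers are hypothetical names.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_update(seen: dict, current: dict):
    """Update `seen` in place; return (changed_ids, removed_ids).

    `seen` maps thread id -> fingerprint from the previous run.
    `current` maps thread id -> text fetched in this run.
    """
    changed = []
    for thread_id, text in current.items():
        digest = fingerprint(text)
        if seen.get(thread_id) != digest:
            changed.append(thread_id)  # new or modified: needs re-processing
            seen[thread_id] = digest
    removed = [thread_id for thread_id in seen if thread_id not in current]
    for thread_id in removed:
        del seen[thread_id]  # no longer in the source window: drop from target
    return changed, removed
```

On a second run with identical input, `changed` comes back empty, so no LLM calls would be needed for unchanged threads.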
50+
51+
You can also run the `update` command in live mode, which keeps the target continuously in sync with the source:
52+
53+
```bash
54+
cocoindex update --setup -L main.py
55+
```
56+
57+
## Query Examples
58+
59+
After running the pipeline, you can query the extracted topics:
60+
61+
```bash
62+
# Get trending topics
63+
cocoindex query main.py get_trending_topics --limit 20
64+
65+
# Search content by specific topic
66+
cocoindex query main.py search_by_topic --topic "Claude"
67+
68+
# Search by text content
69+
cocoindex query main.py search_text --query "artificial intelligence"
70+
```
71+
72+
## CocoInsight
73+
74+
I used CocoInsight (currently in free beta) to troubleshoot the index generation and understand the data lineage of the pipeline.
75+
It connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:
76+
77+
```bash
78+
cocoindex server -ci -L main
79+
```
80+
81+
Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
