johnwatson11218
a month ago
I did something similar: I used pdfplumber to extract text from my PDF book collection and dumped it into PostgreSQL, then chunked the text into 100-character chunks with a 10-character overlap. These chunks were embedded directly into a 384-dimensional space using the Python sentence_transformers library. I then averaged all of a document's chunk vectors and wrote that single vector back to PostgreSQL, and used UMAP + HDBSCAN for dimensionality reduction and clustering. I ended up with a 2D data set that I can plot with Plotly to see my clusters. It is very cool to play with. It takes hours to import 100 PDF files, but I can take one folder that contains a mix of programming titles, self-help, math, science fiction, etc., and after the fully automated analysis you can clearly see the different topic clusters.
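The chunk-and-average step above can be sketched as follows. This is a minimal sketch, not the author's actual code: the real pipeline uses pdfplumber for extraction and sentence_transformers for encoding; here the embedding model is stubbed out with random vectors so the example stands alone.

```python
import numpy as np

def chunk_text(text: str, size: int = 100, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed_chunks(chunks: list[str], dim: int = 384) -> np.ndarray:
    """Stand-in for the sentence_transformers encoder.
    A real run would call something like
    SentenceTransformer('all-MiniLM-L6-v2').encode(chunks) instead."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(chunks), dim))

def document_vector(text: str) -> np.ndarray:
    """Average all chunk embeddings into one 384-D vector per document."""
    chunks = chunk_text(text)
    return embed_chunks(chunks).mean(axis=0)

doc_vec = document_vector("some example document text " * 20)
print(doc_vec.shape)  # (384,)
```

From there, the per-document vectors would be stacked into a matrix and fed to `umap.UMAP(n_components=2)` followed by HDBSCAN clustering, then plotted with Plotly colored by cluster label.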
I just spent time getting it all running on Docker Compose and moved my web UI from Express.js to Flask. I want to get the code cleaned up and open source it at some point.
johnwatson11218
22 days ago
I posted my code https://github.com/johnwatson11218/LatentTopicExplorer
johnwatson11218
a month ago
Thanks for the supportive comments. I'm definitely thinking I should release sooner rather than later. I have been using LLMs for specific tasks, and here is a sample stored procedure I had an LLM write for me.
-- Name: refresh_topic_tables(); Type: PROCEDURE; Schema: public; Owner: postgres
CREATE PROCEDURE public.refresh_topic_tables()
LANGUAGE plpgsql
AS $$
BEGIN
-- Drop tables in reverse dependency order
DROP TABLE IF EXISTS topic_top_terms;
DROP TABLE IF EXISTS topic_term_tfidf;
DROP TABLE IF EXISTS term_df;
DROP TABLE IF EXISTS term_tf;
DROP TABLE IF EXISTS topic_terms;
-- Recreate tables in correct dependency order
CREATE TABLE topic_terms AS
SELECT
dt.term_id,
dot.topic_id,
COUNT(DISTINCT dt.document_id) as document_count,
SUM(frequency) as total_frequency
FROM document_terms dt
JOIN document_topics dot ON dt.document_id = dot.document_id
GROUP BY dt.term_id, dot.topic_id;
CREATE TABLE term_tf AS
SELECT
topic_id,
term_id,
SUM(total_frequency) as term_frequency
FROM topic_terms
GROUP BY topic_id, term_id;
CREATE TABLE term_df AS
SELECT
term_id,
COUNT(DISTINCT topic_id) as document_frequency
FROM topic_terms
GROUP BY term_id;
CREATE TABLE topic_term_tfidf AS
SELECT
tt.topic_id,
tt.term_id,
tt.term_frequency as tf,
tdf.document_frequency as df,
-- cast to numeric so integer division does not truncate the IDF ratio
tt.term_frequency * LN( (SELECT COUNT(id)::numeric FROM topics) / GREATEST(tdf.document_frequency, 1)) as tf_idf
FROM term_tf tt
JOIN term_df tdf ON tt.term_id = tdf.term_id;
CREATE TABLE topic_top_terms AS
WITH ranked_terms AS (
SELECT
ttf.topic_id,
t.term_text,
ttf.tf_idf,
ROW_NUMBER() OVER (PARTITION BY ttf.topic_id ORDER BY ttf.tf_idf DESC) as rank
FROM topic_term_tfidf ttf
JOIN terms t ON ttf.term_id = t.id
)
SELECT
topic_id,
term_text,
tf_idf,
rank
FROM ranked_terms
WHERE rank <= 5
ORDER BY topic_id, rank;
RAISE NOTICE 'All topic tables refreshed successfully';
EXCEPTION
WHEN OTHERS THEN
RAISE EXCEPTION 'Error refreshing topic tables: %', SQLERRM;
END;
$$;
ct0
a month ago
This sounds amazing, totally interested in seeing the approach and repo.
hellisad
a month ago
Sounds a lot like BERTopic. Great library to use.
fittingopposite
a month ago
Yes. Please publish. Sounds very interesting.