Reddit Hashtags

Note: These projects have been defunct since around December 2023, when Reddit clamped down on its public API for retrieving comments and similar data by rate-limiting it heavily. The project was abandoned as a result, including the Matrix firehose room.

  • https://projects.glasgow.social/rglasgow/ - lets you enter a Reddit username and returns a list of interests based on keywords in their post and comment history (if available)
  • https://starflyer.armchairscientist.co.uk/data/reddit/scan.php - returns more keywords and tries to assign them a weight

It's far from perfect, but it does a few things to try and remove irrelevant words:

  • Tokenisation - separates the words by whitespace and removes duplicates
  • Normalisation - ignores tokens that are too short (fewer than five characters) or too long (more than 12 characters; those are usually URLs etc.)
  • Stop word removal - removes common stop words, based on this list of stop words.
  • Lemmatisation - reduces (non-noun) words to their root forms
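
The four steps above can be sketched roughly as follows. This is only an illustration, not the code the projects actually ran: the function names, the tiny STOP_WORDS set, the lowercasing and punctuation stripping, and the suffix-stripping stand-in for lemmatisation are all assumptions made for the example. A real version would more likely use a proper lemmatiser (for instance the one in NLTK) and the full stop word list.

```python
# Hypothetical, very small stop word set; the page links to its own list,
# which is not reproduced here.
STOP_WORDS = {"about", "there", "their", "which", "would",
              "could", "should", "these", "those", "because"}

MIN_LEN = 5   # tokens shorter than this are ignored
MAX_LEN = 12  # tokens longer than this are ignored (usually URLs etc.)


def tokenise(text):
    """Split on whitespace and drop duplicates, keeping first-seen order.

    Lowercasing and trimming edge punctuation are assumptions added here.
    """
    seen = {}
    for raw in text.split():
        token = raw.strip(".,!?;:()[]\"'").lower()
        if token:
            seen.setdefault(token, None)
    return list(seen)


def normalise(tokens):
    """Keep only tokens whose length is between MIN_LEN and MAX_LEN."""
    return [t for t in tokens if MIN_LEN <= len(t) <= MAX_LEN]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_lemmatise(tokens):
    """Stand-in for lemmatisation: strip a few common suffixes.

    This is closer to naive stemming than to real lemmatisation and does
    not try to tell nouns from other words.
    """
    result = []
    for token in tokens:
        for suffix in ("ing", "ed", "ly"):
            stem = token[: -len(suffix)]
            if token.endswith(suffix) and len(stem) >= 4:
                token = stem
                break
        result.append(token)
    # Stripping suffixes can create new duplicates, so remove them again.
    return list(dict.fromkeys(result))


def extract_keywords(text):
    """Run the four steps in the order listed above."""
    tokens = tokenise(text)
    tokens = normalise(tokens)
    tokens = remove_stop_words(tokens)
    return crude_lemmatise(tokens)
```

Calling extract_keywords on the combined text of a user's posts and comments would give a rough keyword list. Note that the length filter runs before the suffix stripping, so a result such as "walk" can end up shorter than five characters.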
reddit_hashtags.1742345295.txt.gz · Last modified: 2025/03/19 00:48 by admin