This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision Next revision | Previous revision | ||
|
reddit_hashtags [2021/07/28 14:30] admin |
reddit_hashtags [2025/03/19 00:48] (current) admin |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Reddit Hashtags ====== | ====== Reddit Hashtags ====== | ||
| - | This link https://projects.glasgow.social/rglasgow/ will let you enter a reddit username and it'll return a list of interests based on keywords in their post and comment history (if available). | + | **Note: These projects have been defunct since around Dec 2023, when Reddit clamped down on its public API for retrieving comments etc (it was massively rate-limited), so the project was abandoned (including the [[Matrix firehose]] room).** |
| - | It does a few things to try and remove irrelevant words: | + | |https://projects.glasgow.social/rglasgow/|Lets you enter a reddit username and it'll return a list of interests based on keywords in their post and comment history (if available)| |
| + | |https://starflyer.armchairscientist.co.uk/data/reddit/scan.php|Returns more keywords and tries to assign them a weight| | ||
| + | |||
| + | It's far from perfect, but it does a few things to try and remove irrelevant words: | ||
| * Tokenisation - separates the words by whitespace and removes duplicates | * Tokenisation - separates the words by whitespace and removes duplicates | ||
| * Normalisation - ignores tokens that are too short (fewer than five characters) or too long (more than 12 characters - those are usually URLs etc) | * Normalisation - ignores tokens that are too short (fewer than five characters) or too long (more than 12 characters - those are usually URLs etc) | ||
| * Stop word removal - Remove common [[wp>Stop_word|stop words]], based on this [[https://www.ranks.nl/stopwords|list of stop words]]. | * Stop word removal - Remove common [[wp>Stop_word|stop words]], based on this [[https://www.ranks.nl/stopwords|list of stop words]]. | ||
| * [[wp>Lemmatisation|Lemmatisation]] - reduces (non-noun) words to their root origins | * [[wp>Lemmatisation|Lemmatisation]] - reduces (non-noun) words to their root origins | ||
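
The filtering steps listed above could be sketched roughly as follows. This is only an illustration, not the code behind either tool: the function name ''extract_keywords'', the tiny ''STOP_WORDS'' set, the lower-casing and the punctuation stripping are all assumptions, and lemmatisation is left as a pluggable callable (identity by default) because the page does not say which lemmatiser was used.

```python
import string

# Hypothetical, tiny stop-word set for illustration only; the page links to a
# much longer list that the real tools were based on.
STOP_WORDS = {"about", "around", "their", "there", "these", "those",
              "which", "would", "could", "should"}

MIN_LEN = 5    # tokens with fewer characters than this are ignored
MAX_LEN = 12   # tokens with more characters than this are ignored (usually URLs etc)


def extract_keywords(text, stop_words=STOP_WORDS, lemmatise=lambda word: word):
    """Return the unique candidate keywords in text, in first-seen order."""
    # Tokenisation: split on whitespace. Lower-casing and stripping surrounding
    # punctuation are assumptions added here so duplicates compare equal.
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]

    seen = set()
    keywords = []
    for token in tokens:
        # Normalisation: drop tokens that are too short or too long.
        if not (MIN_LEN <= len(token) <= MAX_LEN):
            continue
        # Stop word removal.
        if token in stop_words:
            continue
        # Lemmatisation: identity by default; pass in a real lemmatiser if wanted.
        root = lemmatise(token)
        # Duplicate removal.
        if root not in seen:
            seen.add(root)
            keywords.append(root)
    return keywords


sample = ("I would love cycling around Glasgow, cycling there is great "
          "https://example.com/a/very/long/url")
print(extract_keywords(sample))  # ['cycling', 'glasgow', 'great']
```

Duplicates are removed after lemmatisation in this sketch, so that two inflections reducing to the same root count once; the list above removes them at the tokenisation stage, which gives the same result whenever the lemmatiser is the identity.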