mastodon_spam_scanner

Mastodon SEO Spam

(Image: an example of the type of account this script finds.)

I occasionally notice spam accounts being created on my Mastodon instance. If they haven't posted to the timeline then they generally aren't spotted or reported, and the only way I can see them is to manually review new accounts as they sign up. On those rare days when there is a massive spike in signups (we had 11k sign up to https://glasgow.social over a few days in November) manual review just isn't feasible. I made this script so I can review accounts after the fact. This also helps catch spammers who create an account, wait a few days, then modify it.

I'm still working out how best to identify a spammer. At the moment, I'm just looking at the custom profile fields (called 'attachment' in the account's JSON) and counting how many of them contain a URL. If there are four URLs then it's often spam.
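The counting rule can be tried on its own before running it against a whole instance. This is a minimal sketch, assuming the account JSON has an 'attachment' list whose entries carry a 'value' string; count_url_fields and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: counts profile fields whose value contains a URL.
// $actor is the decoded account JSON (the /users/<name>.json document).
function count_url_fields(array $actor): int
{
    $score = 0;
    foreach ($actor['attachment'] ?? [] as $field) {
        // Same test as the script: any field value containing "http" counts once.
        if (is_array($field) && stripos($field['value'] ?? '', 'http') !== false) {
            $score++;
        }
    }
    return $score;
}

// Made-up sample shaped like an account document with two custom fields.
$sample = json_decode('{"attachment":[
  {"type":"PropertyValue","name":"Site","value":"<a href=\"https://example.com\">example.com</a>"},
  {"type":"PropertyValue","name":"Pronouns","value":"they/them"}
]}', true);

echo count_url_fields($sample), "\n"; // prints 1
```

An account with no custom fields at all simply scores 0, because a missing 'attachment' key is treated as an empty list.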

First, I get a list of all the local users by connecting to my postgres database:

-- csv format writes a NULL suspended_at as an empty field (text format would write \N)
COPY (SELECT username, suspended_at FROM accounts WHERE domain IS NULL) TO '/tmp/users.csv' WITH (FORMAT csv);
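How the export is read back matters, because PostgreSQL writes a NULL suspended_at as \N in its default text format and as an empty field in csv format. The sketch below accepts either; active_usernames is a hypothetical helper and the rows are made-up examples.

```php
<?php
// Hypothetical helper: returns the usernames of accounts that are not suspended.
function active_usernames(string $csv): array
{
    $active = [];
    foreach (preg_split('/\R/', $csv) as $line) {
        $fields = explode(',', trim($line));
        if ($fields[0] === '') {
            continue; // blank line, e.g. after the trailing newline of the file
        }
        $suspended_at = $fields[1] ?? '';
        // NULL arrives as \N (text format) or as an empty field (csv format).
        if ($suspended_at === '' || $suspended_at === '\N') {
            $active[] = $fields[0];
        }
    }
    return $active;
}

// Made-up rows: one text-format NULL, one suspended account, one csv-format NULL.
$csv = "alice,\\N\nbob,2023-04-30 12:00:00.000000\ncarol,\n";
print_r(active_usernames($csv)); // lists alice and carol
```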

Then, I run this code to generate a score: '0' for no URLs, '4' for four URLs found.

php scan_for_spammers.php > output.csv

Then I can sort the output so the highest scores come first, review the top accounts, and suspend the spammers:

sort -r output.csv | head

Which outputs something like:

4       xoilac33
4       work
4       waterproofepoxy
4       w88malayu20
4       w88indi18
4       vandanamanturgekar
4       urvam
4       urbanloveulcer
4       underfillepoxy
4       traigavietnet

I can then search for these in the moderation interface and review them.

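The PHP code to generate the scores (remember to create a cache directory with mkdir cache, and to copy /tmp/users.csv into the directory you run the script from, since it reads users.csv from there):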

scan_for_spammers.php
<?php
$filename = "users.csv";
$cachedir = "cache";
$mastodon_host = "https://glasgow.social";
 
$data = file_get_contents($filename);
if($data === false) {
   fwrite(STDERR, "Could not read $filename\n");
   exit(1);
}
 
$lines = explode("\n", $data);
$total = count($lines);
$progress = 0;
 
foreach($lines as $line) {
   $progress++;
   $fields = explode(",", trim($line));
   // skip blank lines (such as the one after the trailing newline of the file)
   if($fields[0] === "")
      continue;
 
   $username = $fields[0];
   // a NULL suspended_at is exported as an empty field (csv format) or \N (text format)
   $suspended_at = $fields[1] ?? "";
 
   if($suspended_at === "" || $suspended_at === '\N') {
      $cache_file = $cachedir."/".$username;
      $content_size = 0;
      if(file_exists($cache_file)) {
         $json_content = file_get_contents($cache_file);
      } else {
         $json_content = @file_get_contents($mastodon_host."/users/".$username.".json");
         if($json_content === false) {
            // don't cache failures, so the account is retried on the next run
            fwrite(STDERR, "Failed to download $username, skipping\n");
            continue;
         }
         $content_size = strlen($json_content);
         file_put_contents($cache_file, $json_content);
      }
      $json = json_decode($json_content, true);
      // 'attachment' holds the custom profile fields; it may be missing
      $attachment = is_array($json) ? ($json['attachment'] ?? []) : [];
      $score = 0;
      foreach($attachment as $values) {
         if(strpos($values['value'] ?? "", "http") !== false)
            $score++;
      }
      $percent_complete = number_format(($progress/$total)*100,1);
      $moderation_link = "<a href='$mastodon_host/admin/accounts?origin=local&username=".$username."'>mod link</a><br />";
      echo $score."\t$username\t$moderation_link\n";
      // this outputs a progress indicator to stderr
      // reporting content size of the json file in case I run into any rate limit issues
      fwrite(STDERR, "Downloaded ".number_format($content_size,0)." bytes... ($percent_complete%)\n");
   } else {
      // do nothing, user already suspended
   }
}
 
?>
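The caching step is what keeps repeat runs from hitting the server again, so it can be useful to test it in isolation. This is a minimal sketch, not part of the script above: cached_fetch is a hypothetical helper, and the downloader is passed in as a callable so the demo can use a stand-in instead of a real HTTP request.

```php
<?php
// Hypothetical helper: returns the cached body for $username, calling $fetch only on a miss.
// $fetch is any callable that takes a username and returns the body, or false on failure.
function cached_fetch(string $cachedir, string $username, callable $fetch)
{
    $cache_file = rtrim($cachedir, '/') . '/' . $username;
    if (is_file($cache_file)) {
        return file_get_contents($cache_file);
    }
    $body = $fetch($username);
    if ($body !== false && $body !== '') {
        file_put_contents($cache_file, $body); // only successful responses are cached
    }
    return $body;
}

// Demo with a stand-in fetcher that counts how often it is called.
$dir = sys_get_temp_dir() . '/spam_scan_cache_' . uniqid();
mkdir($dir);
$calls = 0;
$fake = function (string $u) use (&$calls) {
    $calls++;
    return '{"attachment":[]}';
};
cached_fetch($dir, 'alice', $fake);
cached_fetch($dir, 'alice', $fake); // served from the cache file
echo $calls, "\n"; // prints 1
```

In the real script the callable would wrap the file_get_contents call on $mastodon_host."/users/".$username.".json".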

I added a moderation link to each line of the output so that I can open the results in a browser. For example, this keeps only the accounts with a score of 4:

php scan_for_spammers.php | grep -E "^4" > output.html
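Matching a leading 4 works while 4 is the highest possible score. If the scoring is extended so that scores can go higher, a numeric threshold may be more convenient. This is a minimal sketch, assuming the tab-separated output format of the script; rows_at_or_above is a hypothetical name and the rows are made up.

```php
<?php
// Hypothetical helper: keeps the tab-separated rows whose leading score is >= $min.
function rows_at_or_above(string $output, int $min): array
{
    $kept = [];
    foreach (preg_split('/\R/', $output, -1, PREG_SPLIT_NO_EMPTY) as $row) {
        $score = (int) explode("\t", $row, 2)[0]; // first column is the score
        if ($score >= $min) {
            $kept[] = $row;
        }
    }
    return $kept;
}

// Made-up output rows: score, username, link.
$output = "4\tspammer1\tlink\n0\talice\tlink\n6\tspammer2\tlink\n";
echo implode("\n", rows_at_or_above($output, 4)), "\n"; // prints the spammer1 and spammer2 rows
```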

To answer a question asked on Mastodon: you could add a list of spam keywords or suspicious URLs at the top of the file, for example:

$spam_keywords = array('spam_term', 'spamwebsite.com');

Then add a loop just after the foreach($attachment ...) loop to search the profile text for each keyword or URL. For example, adding this increases the score by one for every keyword that matches:

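      foreach($spam_keywords as $keyword) {
         // preg_quote escapes regex characters such as the '.' in a domain or the '/' in a URL
         if(preg_match("/".preg_quote($keyword, "/")."/i", $json['summary'] ?? ""))
            $score++;
      }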
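The keyword check can also be written as a standalone function and tried without touching the network. This is a sketch under assumptions: keyword_score is a hypothetical name, the summary text is made up, and it uses a plain case-insensitive substring match rather than a regular expression, so keywords containing '.' or '/' are matched literally.

```php
<?php
// Hypothetical helper: adds one point for each keyword found in the profile text.
function keyword_score(string $text, array $keywords): int
{
    $score = 0;
    foreach ($keywords as $keyword) {
        // stripos is a case-insensitive literal match, so no escaping is needed.
        if ($keyword !== '' && stripos($text, $keyword) !== false) {
            $score++;
        }
    }
    return $score;
}

$spam_keywords = array('spam_term', 'spamwebsite.com');
$summary = '<p>Visit SpamWebsite.com for great deals</p>'; // made-up profile text
echo keyword_score($summary, $spam_keywords), "\n"; // prints 1
```

With a literal match, a profile mentioning "spamwebsiteXcom" does not count as a hit for 'spamwebsite.com', which an unescaped regular expression would have allowed.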

Back to the Mastodon page.

mastodon_spam_scanner.txt · Last modified: 2023/05/02 10:11 by neil