
Plugin Development Guide


Support

Support for developing Scrummage plugins is provided for tier 3 and 4 monthly sponsors, and ad-hoc support is provided based on the relevant time limit. For more details, you can contact us via the sponsors page. Anyone is able to request a new plugin idea, and it will be built provided it matches our criteria, explained further below. Plugin suggestions will be prioritised according to the level of support provided, so supporters of a higher tier will naturally have their suggestions addressed first; however, we still encourage members of the community to suggest plugin ideas as well.

For Developers Only

Notes:

  1. This guide assumes you are competent programming in Python 3. Scrummage is a platform that is heavily dependent on its own framework, where most common functionality is customised and handled in the General.py and Common.py libraries. Please refer to these libraries and to existing plugins to ensure that you implement the framework in your custom plugins. While, in theory, you can build plugins without the framework, for a plugin to be approved and merged into the main repository on GitHub, it will be reviewed for framework compliance amongst other things, which is outlined in greater depth below.
  2. Please do not begin developing your plugin before checking that your plugin idea is not in Scrummage's list of rejected plugins here. If you find the plugin has been rejected, but you still believe you have a viable approach for it, please create a discussion so that the Scrummage team can consider it further.

One of the goals of the Scrummage project is to be a framework that members of the open-source community can contribute to by developing plugins. While centralising all OSINT sites is unrealistic, this helps add breadth to the project and minimises the need for users of the platform to pivot to another site due to lack of coverage. Maintainers of the Scrummage project have developed a strong baseline of plugins, with great variety for most OSINT needs. This has been made possible by following a plugin-development cycle that consists of the following:

  1. Selecting new types of plugins, where one popular, third-party application is selected and implemented to increase the coverage of search types. For example, for cryptocurrency searches we started by using a Bitcoin search to create the plugin.
  2. Adding new third-party applications to existing plugins that provide greater breadth to search types already implemented. For example, adding searches for Ethereum and Monero to the cryptocurrencies supported by the Blockchain_Search plugin.

If users of Scrummage find the included plugins don't suit all their needs, we highly encourage them to develop plugins for the gaps they identify, and to share them back with others in the open-source community. Developers are free to develop plugins in their own forked repository and request to have them merged into the main GitHub branch, subject to a revision process, before they are approved and added to the list of plugins. The maintainers of the Scrummage project have a standard revision process to ensure any newly developed plugin follows SSI, which consists of:

  • Security: The code of the plugin does not put the platform at risk, and secure programming practices are followed throughout.
  • Simplicity: The code is as simple and efficient as possible.
  • Compliance: The code follows best practices according to the Scrummage framework, the format of which is laid out below. This is achieved by using the central Scrummage libraries for functionality like API calls, regex pattern searches, and log handling.

This wiki page documents the available functions and classes, as well as their default parameters, in the General.py and Common.py libraries, along with a breakdown of a standard plugin. We realise not all plugins fit into a standard set of requirements, hence we are expanding our framework's capabilities as required.

The Libraries

Both General.py and Common.py are collections of classes and functions for broad use. The reason there are two files is that some functions and classes are needed by the libraries in the plugins/common directory that existed before the creation of Common.py. For example, the function used to set the date needs to be accessed by all plugins and libraries, as well as the core Scrummage.py file, and it is not a good idea for two libraries to be co-dependent. Libraries can depend on other libraries, but if the dependency goes both ways there is the potential for circular imports. Therefore, the Common.py file holds functionality that is used by the General.py library; thus, General.py depends on Common.py, but not the other way around.


General.py

Selenium()
This function is a dependency of the Screenshot class below, and is not a function the user normally needs to call, unless using Selenium will help them scrape data better than the Request_Handler() function provided in the Common.py library.

[CLASS] Screenshot()
This class is only called by the General.py Output class and the main Scrummage.py file, so it will not be covered in this document as it doesn't impact plugin development.

Get_Limit(Limit)
This function receives the Limit argument fed into the plugin and checks that a limit has been provided and is in the correct format. If either of these conditions is not met, it uses the default limit of 10 and returns that value. Otherwise, it returns the limit in the filtered format required by the plugin.

Logging(Directory, Plugin_Name)
Unfortunately, logging has to be done in the plugin file itself; otherwise, the log file would reflect actions in the General.py library. This function receives a given directory and the plugin's name, and uses these to construct the name of a log file in a location specific to the plugin. This file name is returned and used as the location to log events from the plugin.

Get_Plugin_Logging_Name(Plugin_Name)
This function provides formatting rules for the name of a plugin used for logging purposes.

[CLASS] Cache(Directory, Plugin_Name)
Cache files are used to mitigate the risk of plugins overwriting output files and attempting to add items to the database that already exist. Similarly to the Logging() function described above, this class has an init function that constructs a file in the same directory, based on the required parameters; however, it is a text file for caching, not a log file. After this, the Get_Cache() function can be called to retrieve the cached data, and the Write_Cache(Current_Cached_Data, Data_to_Cache) function can be called with the required inputs to update the cached data, or create it if no data currently exists.

Convert_to_List(String)
This function simply converts a string with Comma-Separated Values (CSV) to a list format. The primary use of this is to split a task's query into a list of queries if the query has multiple items in it. For example, a query for Twitter_Search containing "JoeBiden, BarackObama".
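
A minimal illustration of the conversion (the variable name Query_List is just an example):

Query_List = General.Convert_to_List("JoeBiden, BarackObama")
# Query_List is now ["JoeBiden", "BarackObama"]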

[CLASS] Connections(Input, Plugin_Name, Domain, Result_Type, Task_ID, Concat_Plugin_Name)
This class is responsible for outputting the final data to the configured formats, such as the main DB, CSV and DOCX reports, and other configured systems like Elasticsearch, JIRA, RTIR, etc. The initialisation of this class creates a set of variables that represent the data as it is outputted. This includes the Input (or Query) provided by the task, the plugin name, the domain of the third-party site, the type of result, the task ID (provided by the task), and the concatenated plugin name (Twitter_Search would just be twitter, but something like NZ_Business_Search would have a secondary plugin name for this called "nzbusiness"). The type of result has to fit into a pre-defined list, which can be found towards the top of the main Scrummage.py file. The values are listed below for convenience:

[
  "Account",
  "Account Source",
  "Application",
  "BSB Details",
  "Blockchain - Address",
  "Blockchain - Transaction",
  "Certificate",
  "Company Details",
  "Credentials",
  "Darkweb Link",
  "Data Leakage",
  "Domain Information",
  "Domain Spoof",
  "Economic Details",
  "Email Information",
  "Exploit",
  "Forum",
  "Hash",
  "IP Address Information",
  "Malware",
  "Malware Report",
  "News Report",
  "Phishing",
  "Phone Details",
  "Publication",
  "Repository",
  "Search Result",
  "Social Media - Group",
  "Social Media - Media",
  "Social Media - Page",
  "Social Media - Person",
  "Social Media - Place",
  "Torrent",
  "Vehicle Details",
  "Web Application Architecture",
  "Wiki Page"
]

If you require this list to be extended, a separate request needs to be made to the Scrummage team; altering it yourself can cause issues for the Scrummage Dashboard.
Once initialised, the Output(self, Complete_File_List, Link, DB_Title, Directory_Plugin_Name, Dump_Types=[]) function can be called; a usage sketch follows the parameter list below.

  • Complete_File_List: A list of the location of all output files. So the value will mostly look like [Main_File, Output_File], with as many output file names as you like. (The actual file data is not stored in the database).
  • Link: The link for the individual result
  • DB_Title: Don't be thrown off by the DB part of the name, this is just the Title of your result.
  • Directory_Plugin_Name: Just the Plugin_Name, or the Concat_Plugin_Name if there are both.
  • Dump_Types: This option is only used if your plugin uses the Data_Type_Discovery() function, described further below.
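
A minimal sketch of initialising the class and outputting a single result; variable names such as Main_File, Output_file, Link, and Title are placeholders taken from the examples later in this guide:

Output_Connections = General.Connections(Query, self.Plugin_Name, self.Domain, self.Result_Type, self.Task_ID, self.Plugin_Name.lower())
Output_Connections.Output([Main_File, Output_file], Link, Title, self.Plugin_Name.lower())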

Main_File_Create(Directory, Plugin_Name, Output, Query, Main_File_Extension)
This function is responsible for creating the main file for a plugin. The main file usually represents the first data retrieved from the third-party site that the plugin leverages. For example, in Twitter_Search, this file is a JSON file that is returned as a result of searching Twitter for the given query. The main file doesn't always exist in plugins, but it does in most.
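
A sketch of how a plugin with separate main and query file extensions might create its main file from the JSON returned by an API; JSON_Output_Response is an illustrative variable holding the dumped JSON string:

Main_File = General.Main_File_Create(Directory, self.Plugin_Name, JSON_Output_Response, Query, self.The_File_Extensions["Main"])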

Create_Query_Results_Output_File(Directory, Query, Plugin_Name, Output_Data, Query_Result_Name, The_File_Extension)
This function is responsible for creating files for each result. For example, if we follow the Twitter example, let's say we search for "JoeBiden" with a provided limit of 5. The main file will be the returned JSON data containing the last 5 tweets from the account @JoeBiden. The plugin then iterates through the results and makes an HTTP request (using the Request_Handler() function from the Common.py library) for each tweet's link. The returned HTML data is then stored in a query file. As part of this process, HTML filtering is leveraged for the best results, which is explained in more depth on the wiki page here.
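
A minimal sketch of creating a query file for one result; Item_Response and Title are illustrative placeholders, with Item_Response holding the (filtered) HTML returned by Common.Request_Handler() for the result's link:

Output_file = General.Create_Query_Results_Output_File(Directory, Query, self.Plugin_Name, Item_Response, Title, self.The_File_Extension)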

Data_Type_Discovery(Data_to_Search)
This function is quite niche and is currently used by only one plugin, but essentially it is for any plugin that works by scraping data. This is the process of obtaining data and iterating through it to understand what is there. The Data_Type_Discovery() function returns a list of discovered content types, which can include:

  • Hashes (MD5, SHA1, and SHA256)
  • Credentials
  • Email Addresses
  • URLs

This function ultimately helps you better understand the data you have scraped.
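
A minimal sketch, assuming the scraped text is held in a string called Data_to_Search; the returned list can later be passed to the Connections Output() call via its Dump_Types parameter:

Dump_Types = General.Data_Type_Discovery(Data_to_Search)
Output_Connections.Output([Output_file], Link, Title, self.Plugin_Name.lower(), Dump_Types=Dump_Types)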

Make_Directory(Plugin_Name)
This function is imperative to all plugins, as it creates the directory all plugin-specific data is stored in. For any new plugin it will create the following directory structure in the <SCRUMMAGE_DIR>/app/static/protected/output directory:

  • {Plugin_Name}/{Year}/{Month}/{Day}

For example, running Twitter_Search on 01/01/2021 will first create the directory (if it doesn't already exist) and return "twitter/2021/01/01".

Get_Title(URL, Requests=False)
This function is helpful when you have a link representing each result returned in a plugin. Let's say you have the 5 latest tweets from the Twitter account @JoeBiden, and when creating each result, we want the title from each link. While some APIs will return this in the original data, most won't, and that's where this function comes into play. This function will send an HTTP request to the desired link and return its title using the BeautifulSoup web scraping library. The Requests option, when set to True, will leverage the Request_Handler() function from the Common.py library, but sometimes it is preferable to use the urllib library rather than the requests library leveraged by Request_Handler(). There is no correct answer, as results vary on a case-by-case basis.
Note: If you have the choice, you should always use the option with the least load; if you are able to get the title via the initial API request, that would be the preferred option over this function.
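
A one-line sketch of the typical call, where URL is the result's link:

Title = General.Get_Title(URL, Requests=True)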

JSONDict_to_HTML(JSON_Data, JSON_Data_Output, Title)
Note: JSON_Data is the data used to make the conversion; JSON_Data_Output is the data that is being output to a file, and it is placed into a raw data text area in the created HTML file. In rare cases, your plugin will only be able to retrieve JSON data. This might be because you're calling an API that has no website for the same data. This option is provided to convert input JSON data to a more visually pleasing HTML report. For this to work you need to provide a JSON payload that starts with a list, then a dictionary, followed by attributes, similar to the following:

[
  {
    "key1", "value1",
    "key2", "value2"
  }
]

This still doesn't really answer the question of when to use this, so I will refer to current examples. When not to: Plugins like Twitter_Search first create a JSON file as the main file, and an HTML file for each result (query file). As there are already HTML files being produced for the results, there isn't much need for this; while it wouldn't be a problem to use it, it would just be unnecessary. When to: Plugins like IPStack_Search query data for an IP address and receive JSON data, but this JSON data is the full result for the task and no further action is required. There is also no simple way to query the web for this data in an HTML format, so we are stuck with just the JSON data. We would then use this function to create an HTML version of the data for improved reporting.
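
A minimal sketch of that second scenario, assuming the parsed JSON is held in JSON_Response, the dumped string in JSON_Output_Response, and that the function returns the generated HTML as a string (the title text is illustrative):

HTML_Output = General.JSONDict_to_HTML(JSON_Response, JSON_Output_Response, f"IP Address Search Result for {Query}")
Output_file = General.Create_Query_Results_Output_File(Directory, Query, self.Plugin_Name, HTML_Output, Query, self.The_File_Extensions["Query"])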

CSV_to_HTML(CSV_Data, Title)
Same concept as the above function but for CSV data. The raw data is not included in the created HTML report, so it does not need to be provided. The only plugin that currently uses this is Domain_Fuzzer.

CSV_to_JSON(Query, CSV_Data)
Again, currently only used by the Domain_Fuzzer, but this should be used when your only true data is in a CSV format, as JSON is more versatile.

Common.py

Set_Configuration_File()
This function returns the absolute path of the config.json file, which is used to access API secrets as well as other configuration information.

Date(Additional_Last_Days=0, Date_Only=False, Elastic=False, Full_Timestamp=False)
By default this function returns the current date and time in the format (YYYY-MM-DD H:M:S), which is used mostly for logging.

  • Additional_Last_Days: Used to return a list of dates, starting from the current date and working back the number of days specified in this parameter. For example, if it is set to 5, the function returns the dates of the last 5 days. This is mainly used by the Dashboard to get records from the last 5 days to show successful and unsuccessful logins, so it is not very relevant to plugin development.
  • Date_Only: As the name suggests, this only returns the date and not the time.
  • Elastic: Returns the timestamp in the format for the Elasticsearch output option.
  • Full_Timestamp: Returns the raw, unformatted, current timestamp.
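
In plugin code, Date() is most commonly used as the timestamp prefix for log messages, as in the following line (taken from the Load_Configuration example later in this guide):

logging.info(f"{Common.Date()} - {self.Logging_Plugin_Name} - Loading configuration data.")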

[CLASS] JSON_Handler()
This class removes the need for plugins and libraries to each import and manage the json module. Additionally, this helps with standardisation, as the class has defaults that reflect Scrummage standards.

  • [Inner Function] init(raw_data): The initialisation function sets the input value as the objects core value.
  • [Inner Function] Is_JSON(): Returns true if the core value is valid JSON.
  • [Inner Function] To_JSON_Load(): Loads JSON to a Python Dict using the .load method.
  • [Inner Function] To_JSON_Loads(): Loads JSON to a Python Dict using the .loads method.
  • [Inner Function] Dump_JSON(Indentation=2, Sort=True): Uses the .dumps method to output data in a JSON format. By default it beautifies the JSON with an indentation of two and sorts keys in alphabetical and numerical order. (Indentation set to 0 will result in no indentation at all.)
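
A short sketch of a typical flow when handling an API response; the variable names are illustrative:

JSON_Object = Common.JSON_Handler(Response)
JSON_Response = JSON_Object.To_JSON_Loads()
JSON_Output_Response = JSON_Object.Dump_JSON()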

Request_Handler(URL, Method="GET", User_Agent=True, Application_JSON_CT=False, Accept_XML=False, Accept_Language_EN_US=False, Filter=False, Risky_Plugin=False, Full_Response=False, Host="", Data={}, Params={}, Optional_Headers={}, Scrape_Regex_URL={}, Proxies={})
This function removes the need for plugins and libraries to each import and manage the requests module. Additionally, this helps with standardisation, as the function has defaults that reflect Scrummage standards.

  • URL: This is a string with the URL to send the request to.
  • Method: Default Method is GET, but also supports POST. (Other methods can be added as required, with verification of the Scrummage team)
  • User_Agent: Default is True, which means Scrummage sets the User-Agent header to the latest Firefox user agent; this helps make the requests appear normal.
  • Application_JSON_CT: When True, sets a Content-Type header with a value of "application/json"
  • Accept_XML: When True, sets an Accept header with a value of "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  • Accept_Language_EN_US: When True, sets an Accept-Language header with a value of "en-US,en;q=0.5"
  • Filter: When True, this calls the response filter function mentioned below; it must be used in conjunction with a valid value provided to the Host parameter.
  • Host: Only set this when using the Filter parameter.
  • Risky_Plugin: When True, this indicates that data returned in the response can contain malicious JavaScript code. This is also only to be used in conjunction with the Filter parameter set to True.
  • Full_Response: When True, returns the full response object; by default this function only returns the response data.
  • Data: Optional field, to provide data to the HTTP request.
  • Params: Can be used to supply HTTP parameters.
  • JSON_Data: Can be used to supply a JSON data payload.
  • Optional_Headers: Allows the user to set custom headers. If the headers conflict with defaults, the custom headers will override the defaults.
  • Scrape_Regex_URL: Used to scrape URLs from the response data and return them.
  • Proxies: Can be used to provide a proxy server.
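
Two hedged examples of common calls; the search URL and variable names are illustrative:

# Simple GET request that returns only the response data.
Response = Common.Request_Handler(f"https://{self.Domain}/search?q={Query}")

# GET request returning a filtered response, typically used when the HTML will be stored as a query output file.
Responses = Common.Request_Handler(Link, Filter=True, Host=f"https://{self.Domain}")
Filtered_Response = Responses["Filtered"]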

[CLASS] Configuration

  • [Inner Function] init(Input=False, Output=False, Core=False): Specifies type of configuration to load
  • [Inner Function] Load_Keys(): Loads the keys for the specified configuration.
  • [Inner Function] Load_Values(Object=""): Loads the values for the specified configuration for an object within that configuration.
  • [Inner Function] Set_Field(Object="", Config={}): THERE IS NO NEED TO USE THIS FUNCTION AS A PLUGIN DEVELOPER; using it incorrectly can cause irreversible damage to the configuration file.
  • [Inner Function] Load_Configuration(Location=False, Postgres_Database=False, Object="", Details_to_Load=[]): Can be used to load configuration from a plugin. Typically it would be used like Load_Configuration(Object="device", Details_to_Load=["api_key", "api_secret"]). Note the values in the list are in the order they are listed in the config.json file.

Response_Filter(Response, Host, Risky_Plugin=False)
This function goes through the response data and converts any relative links to absolute links using the Host parameter's value. If Risky_Plugin is set to True, then depending on the security settings you have configured in config.json for web scraping (refer to the guide here), this may prevent the function from doing so, in case the data is potentially malicious.

Load_Web_Scrape_Risk_Configuration()
This function loads web scraping configuration settings used by the Response_Filter() function above.

Regex_Handler(Query, Type="", Custom_Regex="", Findall=False, Get_URL_Components=False)
This function performs regular expression searches against a given Query. Type can be used to select a pre-defined regex pattern; otherwise, Custom_Regex can be used to supply your own. Findall, when set to True, returns a list of all matches, as opposed to the default search, which finds the first match. Get_URL_Components can only be used when Type is set to "URL". It breaks any discovered URL into three components (Prefix, Body, and Extension), which can be used to extract domains from URLs, and much more.
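
A couple of hedged examples; the pre-defined "URL" type is mentioned above, while the custom pattern and variable names are purely illustrative:

# Search for the first URL in a block of text using a pre-defined pattern.
URL_Match = Common.Regex_Handler(Data_to_Search, Type="URL")

# Find all matches of a custom pattern.
Matches = Common.Regex_Handler(Data_to_Search, Custom_Regex=r"[\w.-]+@[\w.-]+", Findall=True)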


Breakdown of a Standard Plugin

All plugins start with a standardised base template, which only ever needs to be slightly customised. It is not common to need to extend the imported modules; this is only needed if you require a custom library for accessing your plugin's API, or the API returns data in an unusual format that can only be converted with the help of another module.

import plugins.common.General as General, plugins.common.Common as Common, os, logging

class Plugin_Search:
    # Note: DO NOT CHANGE THE NAME OF THE CLASS AS THIS IS A STANDARD ACROSS ALL PLUGINS.

    def __init__(self, Query_List, Task_ID, Type, Limit=10):
        self.Plugin_Name = "Fake_Search_Engine"
        # When your Plugin_Name is longer than one word and uses _ to separate the words, a second variable needs to be set.
        self.Concat_Plugin_Name = "fakesearchengine"
        self.Logging_Plugin_Name = General.Get_Plugin_Logging_Name(self.Plugin_Name)
        self.Task_ID = Task_ID
        self.Query_List = General.Convert_to_List(Query_List)
        self.The_File_Extension = ".html"
        # If your main file and your output files use different file extensions, the above would end up looking something more like:
        # The_File_Extensions = {"Main": ".json", "Query": ".html"}
        self.Domain = "fakesearchengine.com"
        # ONLY ADD THE BELOW ITEMS IF THEY ARE NEEDED, IF THEY AREN'T, PLEASE REMOVE ARGUMENTS FROM __init__()
        self.Type = Type
        self.Limit = General.Get_Limit(Limit)
        # The below should be added, unless your plugin has multiple search types where the results fall into different categories. (Such as Instagram search)
        self.Result_Type = "Result type from predefined list in the Scrummage.py file."

    def Search(self, Query_List, Task_ID):
        # If, like most plugins, you will be returning more than one result related to the provided query, you must add the argument Limit=10.
        # Additionally, if your plugin has multiple tasks, such as the Instagram plugin having four separate tasks, you must add an argument for Type, and use conditional programming to behave according to the provided Type.

        try:
            # In the following code, the term self.Simplified_Plugin_Name is a placeholder for self.Concat_Plugin_Name if available, otherwise self.Plugin_Name.lower().
            Data_to_Cache = []
            Directory = General.Make_Directory(self.Simplified_Plugin_Name)
            logger = logging.getLogger()
            logger.setLevel(logging.INFO)
            Log_File = General.Logging(Directory, self.Simplified_Plugin_Name)
            handler = logging.FileHandler(os.path.join(Directory, Log_File), "w")
            handler.setLevel(logging.DEBUG)
            handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
            logger.addHandler(handler)
            Cached_Data_Object = General.Cache(Directory, self.Plugin_Name)
            Cached_Data = Cached_Data_Object.Get_Cache()

            for Query in self.Query_List:
                <CUSTOMISED_CODE_GOES_HERE>

            Cached_Data_Object.Write_Cache(Data_to_Cache)

        except Exception as e:
            logging.warning(f"{Common.Date()} - {self.Logging_Plugin_Name} - {str(e)}")

At this point, you are ready to begin the fun. Depending on whether or not your plugin uses an API, you may be required to add a Load_Configuration() function to import the details from the config.json file. This would look similar to the below example from the Twitter_Search plugin (note there is a lot of consistency with this function across all plugins; the main difference exists in the Details_to_Load parameter. Do note that the returned Result will be a list of the values in the order provided in this list):

    def Load_Configuration(self):
        logging.info(f"{Common.Date()} - {self.Logging_Plugin_Name} - Loading configuration data.")
        Result = Common.Configuration(Input=True).Load_Configuration(Object=self.Plugin_Name.lower(), Details_to_Load=["consumer_key", "consumer_secret", "access_key", "access_secret"])

        if Result:
            return Result

        else:
            return None

If this is the case, there will be additional steps required after developing the plugin. Please configure this function exactly the same as in other plugins; this will be reviewed and corrected if submitted incorrectly. From here, use the details, if required, to perform the necessary search against the desired target, and from the result obtain a unique URL for each result, even if it means you have to craft it from something else, as well as a unique identifier such as a title. If the request is made via POST, where the response is the stored result, it is acceptable to create a bogus URL to get around the unique link constraint; however, at the very least the bogus URL should contain the domain, for example: https://www.domain.com?UID. Please note this only occurs in rare circumstances.

If a Limit has been implemented and your API does not allow you to set a limit field in the request to control the number of results, a Current_Step variable will need to be implemented to count how many requests are being made. Furthermore, a for loop should be used to iterate through the results, and the for loop should verify that Current_Step is less than the Limit. If only one result is generated, the for loop and limit parts can be omitted. Twitter_Search is an example where a limit can be included in the request itself, so the plugin permits the use of this in the line shown below:

Latest_Tweets = API.user_timeline(screen_name=Handle, count=self.Limit)

In other cases, the Current_Step iterator is used around the for loop controlling result output. For this, take the following lines from Ahmia_Darkweb_Search as an example:

Current_Step = 0
Output_Connections = General.Connections(Query, self.Tor_Plugin_Name, self.Domain, self.Result_Type, self.Task_ID, self.Plugin_Name.lower())

for URL in Tor_Scrape_URLs:

    if URL not in Cached_Data and URL not in Data_to_Cache and Current_Step < int(self.Limit):
        Title = f"Ahmia Tor | {URL}"
        Output_Connections.Output([Output_file], URL, Title, self.Plugin_Name.lower())
        Data_to_Cache.append(URL)
        Current_Step += 1
  1. Almost every result link should be requested and the response stored in an output file using the Common.Request_Handler() function. Unless you are sure the HTML is going to render perfectly, a filtered response should be requested for the best reporting; this also acts as a form of verification. If we refer back to Twitter_Search again for the below example (note: if the website contains www., be sure to have https://www. before the Domain variable in the Host parameter):

Item_Responses = Common.Request_Handler(Link, Filter=True, Host=f"https://{self.Domain}")
Item_Response = Item_Responses["Filtered"]
  2. The response should be outputted to a local file to create a local copy of the link, using the General.Create_Query_Results_Output_File() function shown below, which returns the output file name used later on. Unique_Result_Identifier can be any text that makes your result unique and is not a link:
Output_file = General.Create_Query_Results_Output_File(Directory, Query, self.Plugin_Name, Filtered_Response, Unique_Result_Identifier, self.The_File_Extension)
  3. If the Output_file is set, then the General.Connections class should be initialised and called as per below:
Output_Connections = General.Connections(Query, self.Plugin_Name, self.Domain, self.Result_Type, self.Task_ID, self.Plugin_Name)
Output_Connections.Output([Main_File, Output_file], URL, Title, self.Simplified_Plugin_Name)

  4. Please use one of the result types from the approved list, which can be viewed towards the top of the Scrummage.py file.
  5. If the Limit is implemented, increase Current_Step by 1; also append the link to the Data_to_Cache list regardless of the limit:

Data_to_Cache.append(URL)
# If using the Current_Step iterator:
Current_Step += 1
  6. Finally, for the plugin to be made available in the Scrummage platform, the file needs to be saved in the plugins directory. In addition, an entry needs to be created in the static/json/plugins_definitions.json file:
{
  "Ahmia I2P Darkweb Search": {"Requires_Configuration": false, "Requires_Limit": true, "Module": "plugins.Ahmia_Darkweb_Search", "Type": "I2P"},
  ...
  "Display Name of Plugin": {"Requires_Configuration": <true if the plugin has an entry in the config.json file>, "Requires_Limit": <true or false>, "Module": "plugins.<plugin file name without the .py extension>"},
  ...
}
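
For instance, an entry for the hypothetical Fake_Search_Engine plugin used in the template above might look like the following (purely illustrative):

"Fake Search Engine Search": {"Requires_Configuration": true, "Requires_Limit": true, "Module": "plugins.Fake_Search_Engine"}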

There are other attributes you can assign to the dictionary as required, such as 'Type' if your plugin has multiple sub-searches, and 'Organisation_Presets' if you would like to link pre-loaded identity information to the plugin, giving users the option to search using that pre-loaded identity information.
