Source corpora:
Licence:
GNU Affero General Public License v3.0
The Web as CoNLL-U (WAC) uses the aforementioned corpora as a source. The corpora are available in the Leipzig Corpora Collection.
They are processed with the spaCy fr_dep_news_trf pipeline and then converted to the CoNLL-U format, following the SUD recommendations.
The source code is available in this repository. GPU_WACoNNLU.py is the main file; it needs the dependencies listed in requirements.txt to run.
This transformation was made so that the corpus can be used with GREW-Match, a tool for matching patterns on graphs.
Since this tool uses the CoNLL-U format, we needed to convert the WAC into this format.
The GPU_WACoNNLU.py script performs this conversion.
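To give an idea of what the target format looks like, here is a minimal sketch of writing one parsed sentence as a CoNLL-U block. It is not the code of GPU_WACoNNLU.py: the `Token` class and the `sentence_to_conllu` function are hypothetical names introduced for illustration, the dependency labels in the usage example are only illustrative, and the XPOS, FEATS, DEPS and MISC columns are simply left empty (`_`).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical stand-in for one parsed token (not the script's own type).

    With spaCy, these fields would roughly correspond to token.text,
    token.lemma_, token.pos_, the position of token.head, and token.dep_.
    """
    form: str
    lemma: str
    upos: str
    head: int    # 1-based index of the governor, 0 for the root
    deprel: str


def sentence_to_conllu(sent_id: str, tokens: List[Token]) -> str:
    """Render one sentence as a CoNLL-U block (10 tab-separated columns)."""
    # Assumption: forms are joined with single spaces; the real spacing of
    # the source text is not reconstructed here.
    text = " ".join(t.form for t in tokens)
    lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
    for i, t in enumerate(tokens, start=1):
        # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), t.form, t.lemma, t.upos, "_", "_",
                str(t.head), t.deprel, "_", "_"]
        lines.append("\t".join(cols))
    # A blank line terminates each sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    sentence = [
        Token("Le", "le", "DET", 2, "det"),
        Token("chat", "chat", "NOUN", 3, "subj"),
        Token("dort", "dormir", "VERB", 0, "root"),
    ]
    print(sentence_to_conllu("example-1", sentence), end="")
```

Each token becomes one line of ten columns, and the `# sent_id` and `# text` comment lines precede the tokens of every sentence.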
One issue remained: the WAC is a huge corpus, and the tool, designed for the UD corpora, could not handle it.
That is why a second script, wackier_wac.py, was written to split the WAC into folders of only 4 files each.
The script also generates a corpora_list.json file, which the GREW-Match tool uses to know which corpora are available and what their properties are (the path to the folder still needs to be completed manually).
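The splitting step described above can be sketched as follows. This is an illustration under assumptions, not the content of wackier_wac.py: the function names, the `WAC_0000`-style folder names, the `.conllu` file extension and the JSON layout (`corpora`, `id`, `directory`) are all hypothetical, and the actual schema expected by GREW-Match should be checked against its documentation.

```python
import json
import shutil
from pathlib import Path
from typing import Dict, List

FILES_PER_FOLDER = 4  # folder size mentioned in the text above


def split_into_folders(src: Path, dst: Path,
                       per_folder: int = FILES_PER_FOLDER) -> List[Path]:
    """Copy the .conllu files of `src` into numbered subfolders of `dst`."""
    files = sorted(src.glob("*.conllu"))
    folders = []
    for start in range(0, len(files), per_folder):
        # Hypothetical naming scheme: WAC_0000, WAC_0001, ...
        folder = dst / f"WAC_{start // per_folder:04d}"
        folder.mkdir(parents=True, exist_ok=True)
        for f in files[start:start + per_folder]:
            shutil.copy2(f, folder / f.name)
        folders.append(folder)
    return folders


def write_corpora_list(folders: List[Path], out_file: Path) -> List[Dict[str, str]]:
    """Write one entry per folder to a JSON file (hypothetical schema)."""
    # Only the folder name is stored; the full path would still be
    # completed by hand, as noted in the text.
    entries = [{"id": folder.name, "directory": folder.name}
               for folder in folders]
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(json.dumps({"corpora": entries}, indent=2),
                        encoding="utf-8")
    return entries


if __name__ == "__main__":
    # Hypothetical input and output locations.
    created = split_into_folders(Path("wac_conllu"), Path("wac_split"))
    write_corpora_list(created, Path("wac_split") / "corpora_list.json")
```

Copying rather than moving the files keeps the original output intact, at the cost of extra disk space; for a corpus of this size, `shutil.move` may be the more practical choice.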