Part 2 HDT file generation
(This is for the Java library)
To index a dataset into an HDT file, you can use the rdf2hdt tool. It comes with several options; I suggest putting all of them into one file and running the tool with the -config option.hdtspec argument, but you can also pass them directly with the -options "options" argument.
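For example, combining the config file with the input and output arguments shown later on this page (the file names are placeholders):
rdf2hdt -config option.hdtspec mydataset.nt mydataset.hdt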
I usually run this one:
option.hdtspec
loader.cattree.futureHDTLocation=cfuture.hdt
loader.cattree.loadertype=disk
loader.cattree.location=cattree
loader.cattree.memoryFaultFactor=1
loader.disk.futureHDTLocation=future_msd.hdt
loader.disk.location=gen
loader.type=cat
parser.ntSimpleParser=true
loader.disk.compressWorker=3
profiler=true
profiler.output=prof.opt
loader.cattree.kcat=40
hdtcat.location=catgen
hdtcat.location.future=catgen.hdt
You can remove or add parts of it as required.
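If you call the Java library directly instead of the rdf2hdt tool, you can reuse the same spec file. Here is a minimal sketch against the hdt-java API (the package names are from the org.rdfhdt.hdt distribution and may differ in forks; the file names are placeholders):
import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class GenerateHDTExample {
    public static void main(String[] args) throws Exception {
        // load the same key=value file passed to rdf2hdt with -config
        HDTSpecification spec = new HDTSpecification("option.hdtspec");
        // generate the HDT from an N-Triples file
        try (HDT hdt = HDTManager.generateHDT(
                "mydataset.nt",         // input RDF file (placeholder)
                "http://example.org/#", // base URI for the header
                RDFNotation.NTRIPLES,   // input notation
                spec,                   // the options described above
                null)) {                // optional ProgressListener
            // save the result (with a future HDT location set, the file
            // may already be on disk and the returned HDT mapped from it)
            hdt.saveToHDT("mydataset.hdt", null);
        }
    }
}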
The canonical parser can be used with this option:
parser.ntSimpleParser=true
This parser is usually 4 to 10 times faster than the default parser (Jena Riot), so if you are parsing a canonical ntriples file (like the Wikidata dump files), you should definitely consider it.
Important: the only formatting rule is the single space between the nodes, and this parser performs no validation of the input.
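For example, these two hypothetical lines are in the expected canonical form, with a single space between each node and before the final dot:
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object> .
<http://example.org/subject> <http://example.org/predicate> "a literal value" .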
You can profile the steps during the HDT generation with this option:
profiler=true
It will write the profiling information after generating the HDT, but you can also write it to disk with this option:
profiler.output=prof.opt
This file can then be read back with the library using the Profiler class.
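As a sketch, reading the written profile back could look like this (the readFromDisk and printProfiling method names are my assumption, check the Profiler class shipped with your version, and its package also differs between distributions):
import java.nio.file.Path;

import org.rdfhdt.hdt.util.Profiler;

public class ProfilerExample {
    public static void main(String[] args) throws Exception {
        // read the profile written thanks to profiler.output=prof.opt
        // (method names are assumptions, verify them against your version)
        Profiler profiler = Profiler.readFromDisk(Path.of("prof.opt"));
        // dump the timing of each generation step
        profiler.printProfiling();
    }
}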
To select an algorithm, you can use this option:
loader.type=ALGORITHM
By default, the HDT generation uses the one-pass algorithm, which is fast but uses a lot of memory.
Id: two-pass
The two-pass algorithm takes less memory, but it needs to read the file twice, so you can't use a streamed input. You can, however, trick your system with a FIFO file:
mkfifo mypipe.nt
# send the cat twice to the pipe
(cat myfile1.nt myfile2.nt > mypipe.nt ; cat myfile1.nt myfile2.nt > mypipe.nt) &
rdf2hdt mypipe.nt myhdt.hdt
# don't forget to remove it ;)
rm mypipe.nt
Id: cat
This is a basic algorithm that creates small HDTs and merges them with HDTCat. You can configure a few things with this algorithm:
Using the k-HDTCat algorithm with the loader.cattree.kcat option sets the maximum number of HDTs merged at the same time; 20 is an acceptable value for an SSD.
loader.cattree.kcat=20
You can then select the sub-HDT generation method with the loader.cattree.loadertype option, disk or memory. The disk generation is the best choice because it can handle a higher number of triples per sub-HDT for a small difference compared to the memory implementation, but you then need to configure the same options as for the disk algorithm.
loader.cattree.loadertype=disk
The location of the sub-HDTs and the future location of the final HDT can be selected. By default the HDT is loaded into memory, so for big datasets a future location is usually required if the final HDT can't be loaded into memory.
loader.cattree.futureHDTLocation=cfuture.hdt
loader.cattree.location=cattree
k-HDTCat also needs a location to cat and map the final HDT, so it's better to add these to the options:
hdtcat.location=catgen
hdtcat.location.future=catgen.hdt
Id: disk
The disk generation algorithm generates the HDT file using the disk instead of the memory. It has a limit, though: if the number of triples is too high, the algorithm will try its best, but it will be slower than creating two HDTs and catting them with k-HDTCat, so the disk implementation is better used through CatTree.
This algorithm runs some parts in parallel, but it contains many sequential parts, so with more than 4 threads the speed increase won't be visible, as predicted by Amdahl's law. You can set this number with the loader.disk.compressWorker option.
You can also select the generation location with the loader.disk.location option; by default a temporary directory is used.
Like with the CatTree algorithm, you can select the future HDT location to map the HDT instead of loading it, for faster results.
The options are:
loader.disk.futureHDTLocation=future_msd.hdt
loader.disk.location=gen
loader.disk.compressWorker=3
If you are using PowerShell, you can use this command to check the HDTs created so far in the loader.cattree.location/hdt-store directory:
$date = [datetime]::Parse("2023-01-31Z16:31:00"); $date; "`ntriples: $("{0:N0}" -f (Get-Content -TotalCount 10 * | Select-String "<http://rdfs.org/ns/void#triples>" -Raw | % {$s = $_.Split(" ")[2] ; [long]($s.substring(1, $s.Length - 2)) } | Measure-Object -Sum).Sum) triples ($(((Get-Content -TotalCount 10 * | Select-String "<http://rdfs.org/ns/void#triples>" -Raw | % {$s = $_.Split(" ")[2] ; [long]($s.substring(1, $s.Length - 2)) })) -join ", "))`ndtime: $((((ls).LastWriteTime | Measure-Object -Maximum).Maximum - $date).TotalHours)h`nsize: $("{0:N}" -f ((ls | % {$_.Length} | Measure-Object -Sum).Sum / 1000000000))GB`nfiles: $((ls | %{"$($_.Name) ($($_.LastWriteTime))"}) -join ", ")`n"
It will give you this result, for example:
mardi 31 janvier 2023 17:31:00
triples: 334 740 035 triples (111464180, 111423283, 111852572)
dtime: 1.12650158547222h
size: 3,27GB
files: hdt-1.hdt (01/31/2023 17:55:54), hdt-2.hdt (01/31/2023 18:18:09), hdt-3.hdt (01/31/2023 18:38:35)
Which can be translated to:
CURRENT_DATE (my system is in French)
triples: NUMBER_OF_TRIPLES_PARSED triples (TRIPLES_IN_HDT1, TRIPLES_IN_HDT2, ... TRIPLES_IN_HDTK)
dtime: HOURS_SINCE_$date
size: SIZE_OF_THE_DIRECTORY GB
files: HDT_1 (DATE_1), HDT_2 (DATE_2), ... HDT_K (DATE_K)
For bash, you can run this command to get the total count of triples:
(for e in *.hdt; do head -n 3 $e | tail -n 1 | cut -d ' ' -f 3 | cut -d "\"" -f 2; done) | paste -sd+ - | bc
Some information might be wrong or misleading; I wrote this page using my personal knowledge of HDT.