Replies: 8 comments 4 replies
-
Hi @maxx-ukoo
That is a surprisingly sharp drop-off. What kind of storage are you using? (Local SSD? Local disk? A remote file store? ...)
Load it all in a single run. The loaders all exploit the fact that the database is empty and manipulate it at a low level on that basis; otherwise loading is unoptimized.
-
Hi @afs
I tried processing up to 20 files in a batch. So technically the database is not empty: it contains 8,806,955 triples in the http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset graph after loading the OWL file.
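For reference, the triple count in that graph can be checked offline with `tdb2.tdbquery`. A minimal sketch, assuming a dataset directory at `/data/pubchem-tdb2` (the location is a placeholder, not from the thread):

```shell
# Count the triples in the ruleset named graph of a TDB2 dataset.
# --loc points at the dataset directory; the path here is an assumption.
tdb2.tdbquery --loc /data/pubchem-tdb2 \
  'SELECT (COUNT(*) AS ?n)
   WHERE { GRAPH <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset> { ?s ?p ?o } }'
```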
-
One point: "-Xmx28g -Xms4g". The way TDB works, running with a large heap will slow down loading. TDB uses the OS file system for accessing and caching its index files; that caching isn't on the heap, and a large heap takes space away from the OS. Try "-Xmx4g -Xms4g".
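If it helps, the Jena command-line scripts pick up JVM options from the `JVM_ARGS` environment variable, so the smaller fixed heap can be set like this (the dataset location and file list are placeholders, not from the thread):

```shell
# Keep the Java heap small so the OS page cache has room for TDB2's
# memory-mapped index files. JVM_ARGS is read by the Jena launch scripts.
export JVM_ARGS="-Xmx4g -Xms4g"

# Bulk-load with the small heap; paths below are illustrative only.
tdb2.tdbloader --loc /data/pubchem-tdb2 pubchem/*.ttl.gz
```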
-
Could this be the VM I/O being throttled?
Correction --
-
I'm having problems with PubChem. After downloading all the files (via FTP, as described on the website), about 5% of them are corrupt gz files. Retrying usually gets a valid file, but at least one needed 3 attempts. I also found some syntax errors, but given the gz problems it isn't yet clear whether they are related or whether the files really are illegal Turtle. Syntax errors are a nuisance when bulk loading: it's hard to know what is actually in the database, and harder to find and fix it. Are you trying to load all 2065 files?
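Corrupt archives can be caught before loading, since `gzip -t` exits non-zero on a truncated or damaged file. A self-contained sketch (the file names and directory are made up for the demonstration):

```shell
# Demonstrate catching a corrupt .gz before bulk loading.
mkdir -p /tmp/gzcheck
cd /tmp/gzcheck

# A valid archive, and a deliberately truncated copy of it.
printf 'example data\n' | gzip > good.ttl.gz
head -c 10 good.ttl.gz > bad.ttl.gz

# `gzip -t` tests integrity without decompressing to disk.
for f in *.ttl.gz; do
  if gzip -t "$f" 2>/dev/null; then
    echo "OK  $f"
  else
    echo "BAD $f"
  fi
done
```

Running this over the downloaded PubChem directory instead of `/tmp/gzcheck` would list exactly which files need re-fetching.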
-
That's the "basic" loader. There is Loading one file of compounds files ( Script:
-
I don't have access to a hardware setup similar to yours. I was using a locally attached NVMe SSD. It does seem to be related to I/O load.
The other choice is a database like Oxigraph, which uses RocksDB (I checked with the project, and it does do incremental loads). RocksDB is more write-oriented. Sorry for the non-answer, but without hardware to recreate this I can only speculate.
-
@afs Thanks for the update. After a few extra tests, I think I found the issue: I need a lot of memory for the upload. I ran everything from scratch, starting with a 16 GB VM and a 4 GB Java heap. After the upload speed decreased, I increased the memory to 32 GB, then to 64 GB. Each memory upgrade improved the upload speed. However, when I reduced the memory from 64 GB back to 32 GB, the upload speed dropped again. I believe memory determines the upload speed.

-
I am going to load the PubChem dataset (https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/, https://pubchem.ncbi.nlm.nih.gov/docs/rdf-load).
I start Fuseki, stop it, and then try to upload the data into the dataset directory using the tdb2.tdbloader utility.
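As a sketch, that load-offline-then-serve sequence looks like the following (the dataset directory, file names, and service name are placeholders, not taken from the thread):

```shell
# 1. With Fuseki stopped, bulk-load directly into the TDB2 dataset directory.
#    Paths below are illustrative assumptions.
tdb2.tdbloader --loc run/databases/pubchem pubchem-rdf/*.ttl.gz

# 2. Restart Fuseki serving the same TDB2 location under /pubchem.
fuseki-server --loc run/databases/pubchem /pubchem
```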
I have a few questions: