Hi, I'm the CTO of ArangoDB, so my comments are most certainly biased. I still would like to tell you about our opinions on the raised issues, namely fulltext indexes and Blueprints.

(1) We do not believe that TP is helpful in a sharded environment. Gremlin is a nice language, but it requires you to move a lot of data into the client. This works very well if you can embed the database and keep it in the same process space. As soon as you need to shard the data and spread it across many servers, you will move a lot of data between Gremlin and the DBservers. Therefore we decided to create a JavaScript version of Gremlin which runs directly on the shards, thus minimising the amount of moved data. It is therefore indeed true that we did not add support for TP3, because we believe it will be of limited use.

(2) Fulltext indexes are not our main expertise. We think that search engines like ElasticSearch and Solr are much better at this - especially when it comes to stemming, different languages, and phonetic searches. There is an ElasticSearch plugin to use ElasticSearch as a fulltext search engine for ArangoDB. The fulltext index is indeed very slow when building. We want to speed up the process and hopefully can improve there over time (see also the next bullet point). I assume that you are using a fulltext index in your example, right?

(3) We decided to keep the indexes only in memory. The options are:
(1) use memory-only indexes (this is currently implemented in ArangoDB)
(2) use disk-based indexes (this is currently implemented in CouchDB)
(3) disk-backed with a file-system-like clean flag
(4) other solutions, like keeping only parts in memory, using memory as a cache, and so on

(2) is the slowest solution, because you need to ensure that there are no inconsistencies even in case of a server crash. If you have a look at what CouchDB does, you will see what I mean. You need to do much more syncing than in (1). (3) depends: with a clean shutdown it is as fast as (2), with a crash as slow as (1). So if you expect your server to crash often, then (1) might not be a good idea. If you expect your server to run stable, then (1) might be much faster during normal operations. ArangoDB currently uses (1), but we want to switch to (3).

Disclaimer: Sorry, I forgot to introduce myself: My name is Max and I also work for ArangoDB.

gunzip -c | grep -v '^\[' | sed -e 's/},$/}/g' | head -n 2961954 | time arangoimp --file - --type json --collection wikidata --overwrite true

- The import took 15 minutes on that machine.
- The database used at most 10.0 GB resident memory during the import.
- After the WAL was flushed after the actual input, it used 9.2 GB.
- 6.811.237.720 bytes actual data (shaped), this is about 2300 b/document.
- 489.390.288 bytes of shape data, this is about 165 b/document.
- 672.169.048 bytes of data for all four indexes together, this is about 226 b/document.
- For comparison: the raw (unzipped) JSON data for these documents was 7.390.007.296 bytes (as reported by arangoimp).
- The actual data files on disk (data shapes without indexes): 7.280.736.336 bytes, which is about 2692 b/document.

When I shut down the database server and restart it with loading this collection (which rebuilds the 4 indexes in memory), this takes about 263 seconds, which is well below 10 minutes. Reloading the collection in the still-running server takes about as long. The 3M documents need around 11 GB of main memory. If you have less, you see a lot of swapping, because the insert operations in the indexes essentially do random accesses to the data files, since the indexed attribute data are not copied into the index but remain in the data files. This explains why you needed over an hour on an 8 GB machine (which almost certainly does not have 8 GB free!).