performance and throughput
can anyone share what kind of throughput they have seen using the opensource 1.0RC software?
i have a 4 slicestor setup (threshold 3 of 4) with 1 accessor. using 1 linux client, with 1500 MTU, i am seeing 4MB/s reads and 15MB/s writes doing a simple lmdd of various 4GB to 9GB files.
i would imagine that performance should be better.
any comments on this?
thanks.
-nick
Hi Nick,
Sorry for the late answer. We need a little more information in order to answer your request:
- Is the dsNet you are using on a local network or dispersed? If dispersed can you please specify the latency and bandwidth?
- Can we get a copy of your properties.xml file (you can find it here: /opt/dsnet-accesser/conf/) and vault descriptor file?
- What kind of initiator are you using?
Thanks,
Sarah.
hi sarah,
i am running the slicestors and the accessors all on a local network. each system has 2 nics. i have created a private network for the slicestors and accessors to talk to each other, and a public network for the iscsi clients.
i am using the linux iscsi initiator running on CentOS5.X.
i have not touched the default properties.xml file that ships with the 1.1 distribution--i can supply that soon.
as for my vault descriptor, i took the example one, and just created 4 slicestors, 100GB total space, 4k block size, and a width 4, threshold 3 configuration--i can supply that soon if necessary.
i have tried using 2 separate accessor systems, one was a dual proc AMD system with 4GB memory, and the other was a quad-core dual socket intel system with 4GB memory as well. both were running 32bit CentOS5.2. the performance was similar with both systems.
the slicestors are also running CentOS5.2, but are dual socket 2.8 GHz Xeons, with 2 to 4GB memory. i am using linux md to create a raid 5 configuration with 3+1 drives. using lmdd i am able to get at least 50MB/s locally to each disk array, so i was hoping to see ~100-150MB/s to the entire dsnet, not 14MB/s.
are there some key values that you are looking for in the properties.xml and the vault-descriptor file?
i'm not that interested in the encryption part, but it would be awesome if performance was decent, and the system had more HA capabilities (i.e. rebuilder, HA accessor, etc). even without the HA, if performance were better, i could start testing out some of our production systems against it.
thanks.
i ran the dsnet-perf-calc tool from the accessor and am wondering about the results it provides, which i've posted below.
the accessor and slicestors are all connected via Gb ethernet, and running iperf shows transfer rates up to 900Mbps.
it looks like the limits based upon these calculations are what i am seeing, with the bottleneck being the network, but i am wondering why the network performance between the slicestors and the accessor is so low. is there something in the code that throttles the bandwidth to 100Mb?
Calculating theoretical maximum performance for vault:
BlockDevice[4096 x 26214400 = 102400.0MB ] type:block[4/3] IDA optimizedcauchy Codecs [] Slice Codecs [crc ] Slice Stores[1. Remote [10.10.99.11:5000]2. Remote[10.10.99.12:5000]3. Remote [10.10.99.13:5000]4. Remote [10.10.99.14:5000]
Maximum Server Latency: 1.0 ms
Accesser to Server Throughput: 157.81990521327015 Mbps
Slicestor #0's throughput: 277.77777777777777 Mbps
Slicestor #1's throughput: 240.3846153846154 Mbps
Slicestor #2's throughput: 248.13895781637714 Mbps
Slicestor #3's throughput: 277.0083102493075 Mbps
Minimum Slicestor Throughput: 240.3846153846154 Mbps
Parallelized Slicestor Throughput: 961.5384615384615 Mbps
Network bottleneck: Accessor Link
Ideal IDA expansion: 1.3333334
Blowup after encoding: 1.3408203125
Blowup after storage on slicestors: 1.3564453125
Storage requirement for 1 GB of data: 1389 MB
Encoding stack performance (Mbps): 3076.923076923077
Decoding stack performance (Mbps): 4604.3165467625895
Slicestor Harddrive Limit (random access) Mbps: 449.9640028797697
Slicestor Harddrive Limit (sequential access) Mbps: 449.9640028797697
Network throughput limit writing (Mbps): 117.70399340013739
Network throughput limit reading (Mbps): 157.81990521327015
Per thread write throughput (Mbps): 73.0316769011339
20 thread write throughput (Mbps): 1460.633538022678
System's Write Bottleneck: Network Throughput
Maximum sequential write speed: 14.712999175017174 MB/s
System's Read Bottleneck: Network Throughput
Maximum sequential read speed: 19.72748815165877 MB/s
Nick,
Thanks for the output of your calculator, it is interesting. Some of the results are more accurate than others, some are more theoretical. I will attempt to explain how each of the values is derived:
Maximum server latency: A series of 30 small no-op messages are sent to each server, which essentially act as pings. The average for each server is taken, and the server that took the longest to reply has its average ping time displayed.
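As a rough illustration of that latency figure, here is a minimal Python sketch. It is not the calculator's code: the function name and the sample timings are made up, and it only shows the "average per server, then report the slowest" step described above.

```python
from statistics import mean


def max_average_latency_ms(samples_by_server):
    """Average the reply times per server and return the slowest server.

    samples_by_server maps a server address to a list of round-trip
    times in milliseconds (one entry per no-op message).
    """
    averages = {server: mean(times) for server, times in samples_by_server.items()}
    slowest = max(averages, key=averages.get)
    return slowest, averages[slowest]


# Hypothetical timings: 30 no-op replies per server, in milliseconds.
samples = {
    "10.10.99.11:5000": [0.8] * 30,
    "10.10.99.12:5000": [1.0] * 30,
    "10.10.99.13:5000": [0.9] * 30,
    "10.10.99.14:5000": [0.7] * 30,
}

server, latency = max_average_latency_ms(samples)
print(f"Maximum Server Latency: {latency} ms ({server})")
```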
Accesser to Server Throughput: This parameter attempts to find the bottleneck of the Accesser's network connection. If, for example, the Accesser was on a gigabit link and the servers were on 100 Mbps links, this would tend to show the 1 Gbps speed. It is derived by using multiple threads to write as much data as possible to each of the server connections within a certain amount of time. In this case it comes out to only 157 Mbps, which is noticeably lower than the throughput achieved writing to single servers. This may indicate a bug in the calculator for this parameter, or some limitation or contention issue when writing to all servers in parallel. Interestingly, this ends up being reported as the overall bottleneck of the system, and the resulting write limit of about 14 MB/s is, as you mentioned, the maximum write speed you achieve. Given the massive encoding throughput, it seems to be a network issue. The best speed we have achieved in our tests on gigabit networks is about 40 MB/s. We are not sure, but we believe this may be a bottleneck in our current Slicestor implementation.
Slicestor# throughput: Accessor sends as much data as possible to a single slice server at a time, in an attempt to see if any are significantly slower than the others. If one is much slower it can slow down the whole operation.
Parallelized Slicestor throughput: Takes the slowest Slicestor's throughput and multiplies it by the number of Slicestors. This represents the bottleneck of the Slicestor connections. If the Slicestors were all on 56K modems and the Accesser on 100 Mbps Ethernet, the Accesser could only write at 4 * 56 Kbps (if there were 4 Slicestors).
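Using the per-Slicestor figures from the posted output, the arithmetic can be reproduced in a few lines of Python. This is only a sketch of the calculation as described, not the calculator's own code; the variable names are illustrative.

```python
# Per-Slicestor throughput in Mbps, copied from the posted output.
slicestor_mbps = [
    277.77777777777777,
    240.3846153846154,
    248.13895781637714,
    277.0083102493075,
]

# The slowest Slicestor sets the pace for all of them.
minimum_mbps = min(slicestor_mbps)

# Parallelized throughput: slowest throughput times the number of Slicestors.
parallelized_mbps = minimum_mbps * len(slicestor_mbps)

print(f"Minimum Slicestor Throughput: {minimum_mbps:.2f} Mbps")
print(f"Parallelized Slicestor Throughput: {parallelized_mbps:.2f} Mbps")
```

The result agrees with the 961.54 Mbps reported above, which is well over the 157 Mbps measured on the Accesser link.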
Ideal IDA expansion: Due to information-theoretic constraints, it is not possible for the dispersed data to expand by less than (width/threshold). In your case, 4/3 = 1.3333.
Blow up after encoding: Includes overhead from all codecs applied, including padding from the IDA and rounding up to have equal sized slices.
Blow up after storage: Each block when stored on a Slicestor has 16 bytes of overhead. This blowup amount incorporates that into the calculation.
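The three expansion figures in the posted output can be checked with the short Python sketch below. It takes the "Blowup after encoding" factor straight from the output rather than modelling the codecs, and it assumes the 16 bytes of storage overhead apply once per slice (4 slices per 4096-byte block). That assumption is consistent with the posted numbers, but it is an inference from them, not something confirmed about the implementation.

```python
BLOCK_SIZE = 4096           # bytes, from the vault descriptor (4k block size)
WIDTH, THRESHOLD = 4, 3     # width 4, threshold 3
SLICE_OVERHEAD = 16         # bytes of storage overhead (assumed: once per slice)

# Lower bound on expansion for any width/threshold dispersal.
ideal_expansion = WIDTH / THRESHOLD

# Taken directly from the calculator output, not derived here.
blowup_after_encoding = 1.3408203125

# Bytes leaving the Accesser for one source block, then bytes held on disk.
encoded_bytes = BLOCK_SIZE * blowup_after_encoding
stored_bytes = encoded_bytes + WIDTH * SLICE_OVERHEAD
blowup_after_storage = stored_bytes / BLOCK_SIZE

# Storage needed for 1 GB (1024 MB) of user data.
storage_mb_per_gb = 1024 * blowup_after_storage

print(f"Ideal IDA expansion: {ideal_expansion:.4f}")
print(f"Blowup after storage on slicestors: {blowup_after_storage}")
print(f"Storage requirement for 1 GB of data: {storage_mb_per_gb:.0f} MB")
```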
Encoding stack performance: The Accessor's CPU is benchmarked in encoding through all datasource codecs, IDA, then the slice codecs.
Decoding stack performance: Same as the encoding benchmark, only the inverse operations are tested.
Slicestor Harddrive Limit (random): This is the only thing that is completely calculated (not benchmarked) because it relates to properties of the Slicestors. These are taken as parameters in the script; if you edit the launch script you can configure the Slicestor hard drive throughput in bytes/second and the seek time.
Harddrive limit (sequential): This is similar to the random test only it ignores the seek time parameter.
System bottleneck: Takes the minimum of Encoding Throughput, Network Throughput, and Harddrive Throughput.
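To make the last step concrete, here is a small Python sketch that applies the "minimum of the three" rule to the figures in the posted output and converts the result from Mbps to MB/s (8 bits per byte). The dictionary keys and function names are illustrative. The final check, that the network write limit equals the Accesser link throughput divided by the encoding blowup, is an observation that happens to fit the posted numbers, not a statement about how the calculator computes it.

```python
def bottleneck(limits_mbps):
    """Return (name, Mbps) of the smallest limit in the mapping."""
    name = min(limits_mbps, key=limits_mbps.get)
    return name, limits_mbps[name]


def mbps_to_mbytes_per_s(mbps):
    """Convert megabits per second to megabytes per second."""
    return mbps / 8


# All values in Mbps, copied from the posted calculator output.
write_limits = {
    "Encoding Throughput": 3076.923076923077,
    "Network Throughput": 117.70399340013739,
    "Harddrive Throughput": 449.9640028797697,
}
read_limits = {
    "Decoding Throughput": 4604.3165467625895,
    "Network Throughput": 157.81990521327015,
    "Harddrive Throughput": 449.9640028797697,
}

write_name, write_mbps = bottleneck(write_limits)
read_name, read_mbps = bottleneck(read_limits)
print(f"Write bottleneck: {write_name}, {mbps_to_mbytes_per_s(write_mbps):.2f} MB/s")
print(f"Read bottleneck: {read_name}, {mbps_to_mbytes_per_s(read_mbps):.2f} MB/s")

# Observation only: Accesser link throughput / blowup after encoding
# reproduces the reported network write limit.
derived_write_mbps = 157.81990521327015 / 1.3408203125
print(f"Derived network write limit: {derived_write_mbps:.2f} Mbps")
```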
I would try setting the hard drive throughput parameter to approximately 8 MB/s and re-running the calculator. It might then turn out that the true bottleneck is not the network but the Slicestors' ability to write the slices.
I hope this helps,
Jason
I believe we had some issues with Linux (CentOS5 in particular) read performance, and as far as I can remember, the solution is to adjust the sector read-ahead for the iSCSI disk on your initiator to a value higher than the default. Look into the hdparm utility, in particular the -a flag.
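For reference, a minimal Python sketch of what that hdparm call could look like. The device name /dev/sdX and the value of 2048 sectors are placeholders, not recommendations, and the conversion assumes 512-byte sectors. The script only builds and prints the command; run it by hand as root once you have picked the right device.

```python
SECTOR_BYTES = 512  # assumed sector size used for read-ahead counts


def readahead_command(device, sectors):
    """Build the hdparm argument list that sets read-ahead to `sectors`."""
    return ["hdparm", "-a", str(sectors), device]


def sectors_to_kib(sectors):
    """Convert a read-ahead sector count to KiB for easier comparison."""
    return sectors * SECTOR_BYTES // 1024


# Placeholder device and value: substitute the iSCSI disk on your initiator.
device = "/dev/sdX"
sectors = 2048

command = readahead_command(device, sectors)
print(" ".join(command))
print(f"{sectors} sectors = {sectors_to_kib(sectors)} KiB of read-ahead")
```

Running hdparm -a with no value against the device shows the current setting, which is worth noting before changing it so the read test can be compared before and after.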
Hope this helps as well,
Zach

