Wednesday, November 25, 2009

Crossbow for the win

Here's a very good paper on Crossbow network virtualization and resource control, which has been available in OpenSolaris since 2009.06: Crossbow Virtual Wire: Network in a Box.

This paper even won the best paper award at USENIX LISA 2009. The blog of one of the authors is available here, where he writes about the award and the BOF at USENIX LISA 2009.

Well worth a read if you don't know what Crossbow is and can do, and if you do, it's still worth a read.

A little news summary

All of this is available in other blogs at blogs.sun.com, but here's a summary of some Solaris-related news.

Oracle 11g Release 2 has been released for Solaris 10 x64; a download is available here. It's nice to see even more Solaris commitment from Oracle.

VirtualBox 3.1 Beta 3 is also available, more details here. VirtualBox 3.1 comes with new features such as live migration between hosts, enhanced USB support in OpenSolaris, better snapshot functionality, faster 2D acceleration and support for EFI.

US Senators Go to Bat for Oracle, Sun Merger: 59 senators also think it's about time to let the Oracle-Sun deal proceed.

While I'm at it, a beta of NetBeans 6.8 is available. Sadly, they do not seem to put any effort into the Python parts. Support for interpreted languages in NetBeans is mostly for Ruby and PHP. More focus on Python would have been nice, since Python seems to be the interpreted language of choice in OpenSolaris: the Image Packaging System (IPS) is built with Python. That said, there is a Python module available for NetBeans, but it doesn't get the same development attention.

Monday, November 23, 2009

ZFS crypto pushed to next year

With only a few weeks left of open build, it might not come as a surprise that crypto for ZFS is not making it into 2010.03. I noticed that the ZFS crypto page has been updated with a new target date: "Integration Target: Q1CY10".

This is probably wise, with lots of fixes and new features for ZFS integrated since the last OpenSolaris release. This means that two out of the four upcoming ZFS features that I wrote about in March made it in time for OSOL 2010.03. Hopefully both crypto and BP rewrite will be ready in time for the next (Open)Solaris release, whenever and however the new masters of Sun* decide to release it.

* Let's hope that the European Commission has finally come to its senses and freed Sun from this limbo by then. I guess they will at least delay this as long as they possibly can (mid-January). Keeping the current pace, the next release would probably be at least 6 months after 2010.03, so about a year from now.

Sunday, November 22, 2009

Faster resilver for zpools

Prior to this putback, ZFS did not do any prefetching of data when resilvering or scrubbing a pool. This made such operations more time consuming than they needed to be. Since resilvering a large pool can take days, anything that can speed up such operations can make quite a difference in the time spent without sufficient replication of data. Fortunately, faster resilvering for zpools is on its way into OpenSolaris with the putback of "6678033 resilver code should prefetch". The gain will of course depend on your pool, but I'll try to find time for testing so that I can get back with some numbers in a later post.

Since scrub and resilver share the same code, this should improve scrubbing performance as well. Scrub prefetch was mentioned in the KCA 2009 keynote.

Friday, November 20, 2009

xVM sync with Xen 3.4 integrated

Good news for those of us who use xVM in OpenSolaris: the sync with Xen 3.4 has been integrated into ONNV. This means that it should be available in build 129, which should be released in mid-December.

Changes from the original 3.4 announcement from Xen:
" - Device passthrough improvements, with particular emphasis on support for
client devices (further support is available as part of the XCI project at
http://xenbits.xensource.com/xenclient/)
- RAS features: cpu and memory offlining
- Power management - improved frequency/voltage controls and deep-sleep
support. Scheduler and timers optimised for peak power savings.
- Support for the Viridian (Hyper-V) enlightenment interface
- Many other x86 and ia64 enhancements and fixes"

It does not look like there is support for passthrough of devices such as PCI devices in Solaris yet, though, so this part of the above announcement is probably irrelevant to xVM at this point.

More info on the putback is available here: http://hg.genunix.org/onnv-gate.hg/rev/fe619717975a

Thursday, November 19, 2009

Deduplication with zones

One of the major strengths of zones in Solaris is that they are very lightweight: since they share the same kernel, they have low CPU, I/O and memory overhead. In Solaris 10 the ability to create "sparse" zones is available. With this option, the local zones share most of their binaries and libraries with the global zone. This not only saves space, it also saves memory, since all zones share the same instances of common binaries and libraries. The downside of sparse zones is that they have a very strong relationship with the global zone, and no modifications unique to any zone can be made to the shared filesystems.

In OpenSolaris and later updates of Solaris 10, the ability to clone a zone is available. A zone is installed on a ZFS filesystem, of which a clone is created for every new zone. Only minor modifications are made to the cloned filesystem to give the zone its unique identity. This works much like deduplication until you patch or upgrade the system, which will make all the clones contain their own copies of the new data even if it's common to other zone instances.

Sparse zones are not supported by the new packaging system in OpenSolaris, and they might never be. But zones in OpenSolaris only install a very basic set of packages, which makes a clean install of a zone very small to begin with. They can then be placed on a compressed filesystem, and in OpenSolaris 2010.03 this filesystem can also be deduplicated.

I've done a small test to see how much space is used by every zone instance with both compression and deduplication. These are freshly installed zones which have been booted once so that everything has been initialized in the zones:

A single zone on an LZJB-compressed and deduped ZFS filesystem:
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zdup01  9.94G   241M  9.70G     2%  1.01x  ONLINE  -
Two zones:
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zdup01  9.94G   253M  9.69G     2%  1.99x  ONLINE  -
Three zones:
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zdup01  9.94G   263M  9.68G     2%  2.96x  ONLINE  -
So every zone uses a little more than 10MB of disk space, and with deduplication you also get the same benefits in memory footprint as with a sparse zone, since there is only one deduplicated instance of the Solaris libraries and binaries for the zones. They are, however, not shared with the global zone, since it boots from a separate pool without compression and deduplication. Unlike zones on a cloned ZFS filesystem, the deduplication will continue to work after upgrading the zones, and for software added to a zone after installation, for example from the pkg repositories.
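The per-zone cost can be read off the ALLOC column above as the difference between consecutive measurements. A minimal sketch of that arithmetic (the variable names are only illustrative, the figures are taken from the zpool list output above):

```python
# ALLOC values in MB from the three `zpool list` outputs above
alloc_mb = [241, 253, 263]

# Extra space allocated for each additional zone
per_zone_mb = [later - earlier for earlier, later in zip(alloc_mb, alloc_mb[1:])]
print(per_zone_mb)  # [12, 10]
```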

It looks like I can continue to run 20 zones on my thirteen-year-old Ultra 2 workhorse even if I upgrade it to OpenSolaris one day.

Monday, November 9, 2009

ZFS send dedup integrated

Moments ago, one week after zpool dedup was integrated, similar functionality was added for zfs send streams. It looks like OSOL 2010.03 is going to get quite a lot of new ZFS features.

zfs send with the new -D option will dedup the streams it creates, thereby possibly reducing the bandwidth or disk space used by the stream. It's not dependent on pool-level dedup.

From PSARC/2009/559:

"OVERVIEW:

"Dedup" is an overall term for technologies that eliminate duplicate
copies of data in storage or memory. This specific application of
dedup is for ZFS send streams, i.e., the output of the 'zfs send' command.
For some kinds of data, much of the content of a send stream consists
of blocks for which identical copies have already been sent earlier
in the stream. This technology replaces later copies of a block with
a reference to the earlier copy. This can significantly reduce the
size of a send stream, which reduces the time it takes to transfer
such a stream over a communication channel."
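The idea in the overview, replacing later copies of a block with a reference to the earlier copy, can be sketched in a few lines. This is only an illustration of the principle and not the actual zfs send stream format: the record layout, the function names and the choice of SHA-256 as checksum are all assumptions made for the example.

```python
import hashlib


def dedup_stream(blocks):
    """Replace repeated blocks with a reference to their first occurrence.

    Returns a list of records: ("data", block) for the first copy and
    ("ref", index_of_first_copy) for every later identical block.
    """
    first_seen = {}  # checksum -> index of the record holding the data
    records = []
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest in first_seen:
            records.append(("ref", first_seen[digest]))
        else:
            first_seen[digest] = len(records)
            records.append(("data", block))
    return records


def restore_stream(records):
    """Rebuild the original block sequence on the receiving side."""
    blocks = []
    for kind, payload in records:
        blocks.append(payload if kind == "data" else records[payload][1])
    return blocks


blocks = [b"A" * 8, b"B" * 8, b"A" * 8, b"A" * 8]
records = dedup_stream(blocks)
print([kind for kind, _ in records])  # ['data', 'data', 'ref', 'ref']
```

Only two of the four blocks carry data in this toy stream, and the receiver can still rebuild all four.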

Here is the changeset: http://hg.genunix.org/onnv-gate.hg/rev/216d8396182e

If all goes well, this will, together with pool-level dedup (and lots of other changes), be part of build 128, which should arrive in early December.

Wednesday, November 4, 2009

Quick spin with ZFS dedup

I've had a quick look at deduplication in ZFS. It works as expected and seems quite fast in my simple tests.

Enabling dedup couldn't be easier:
# zfs set dedup=on zdedup01

Simplest case, same file different name gives a dedup factor of 2:
# cp Solaris/sol-nv-b121-x86-dvd.iso /zdedup01
# cp Solaris/sol-nv-b121-x86-dvd.iso /zdedup01/duplicate.iso
# zfs list zdedup01
NAME USED AVAIL REFER MOUNTPOINT
zdedup01 6.91G 55.6G 6.90G /zdedup01
# zpool list zdedup01
NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT
zdedup01 63.5G 3.47G 60.0G 5% 2.00x ONLINE -
# ls -lh /zdedup01
total 6.9G
-rw-r--r-- 1 root root 3.5G 2009-11-04 22:52 duplicate.iso
-rw-r--r-- 1 root root 3.5G 2009-11-04 22:51 sol-nv-b121-x86-dvd.iso

ZFS dedup is block based: multiple blocks with the same checksum will point to a single block, so if the exact same data appears more than once but with different block alignment, it won't get deduped.
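The effect of alignment is easy to demonstrate outside ZFS. The sketch below is not ZFS code: it splits a byte string into fixed-size blocks, checksums each block with SHA-256, and counts how many block checksums a copy shares with the original. The 128-byte block size and the helper name are assumptions chosen to keep the demo small.

```python
import hashlib

BLOCK_SIZE = 128  # tiny block size for the demo; real ZFS records are far larger


def block_checksums(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return the set of their checksums."""
    return {
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }


# Deterministic, non-repeating sample data (64 blocks of 128 bytes)
data = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(256))

aligned_copy = data            # same data, same block boundaries
shifted_copy = b"\x00" + data  # same data, moved one byte forward

original = block_checksums(data)
shared_aligned = len(original & block_checksums(aligned_copy))
shared_shifted = len(original & block_checksums(shifted_copy))
print(shared_aligned, shared_shifted)
```

Every block of the aligned copy matches a block of the original, while the copy shifted by a single byte shares none, even though it contains exactly the same data.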

Unarchiving a tar archive: here the block alignment will differ, and therefore so will the block checksums, so there is no dedup:
# cp sunstudio.tar /zdedup01
# cd /zdedup01
# tar xf sunstudio.tar
# zpool list zdedup01
NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT
zdedup01 63.5G 1.76G 61.7G 2% 1.00x ONLINE -

Zero-filled files will give a quite nice dedup ratio:
# mkfile 5G testfile
# zpool list zdedup01
NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT
zdedup01 63.5G 1.73M 63.5G 0% 40960.00x ONLINE -
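The ratio of 40960 is not a coincidence. Assuming the default ZFS recordsize of 128K was in effect (it is not shown in the output above), a 5G file of zeros consists of that many identical blocks, of which only one needs to be stored:

```python
file_size = 5 * 1024**3   # mkfile 5G
record_size = 128 * 1024  # assumed default ZFS recordsize of 128K

blocks_written = file_size // record_size
print(blocks_written)  # 40960
```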

In practice it should give a ratio that is on par with the actual duplication when dealing with ordinary files such as binaries, executables, application installations, zones etc. The ratio is harder to estimate with virtual server disk images (or iSCSI LUNs). A very quick test with two VirtualBox Solaris 10 U8 (core installation) images showed a dedup ratio of 1.35x, which corresponds to roughly 26 percent saved disk space:
NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT
zdedup01 63.5G 984M 62.5G 1% 1.35x ONLINE -
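A dedup ratio is, as far as I understand the property, the referenced size divided by the allocated size, so the fraction of space saved is 1 - 1/ratio rather than ratio - 1. A small sketch of the conversion for two of the ratios seen above (the function name is only illustrative):

```python
def space_saved(dedup_ratio):
    """Fraction of space saved for a given dedup ratio (referenced / allocated)."""
    return 1 - 1 / dedup_ratio


for ratio in (2.00, 1.35):
    print(f"{ratio:.2f}x -> {space_saved(ratio):.0%} saved")
# 2.00x -> 50% saved
# 1.35x -> 26% saved
```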

Deduplication of course also works with compression enabled (the checksums used for dedup are computed on the compressed data):
# zfs get compressratio zdedup01
NAME PROPERTY VALUE SOURCE
zdedup01 compressratio 1.43x -
# zpool list zdedup01
NAME SIZE USED AVAIL CAP DEDUP HEALTH ALTROOT
zdedup01 63.5G 709M 62.8G 1% 1.25x ONLINE -

Monday, November 2, 2009

ZFS Deduplication!

It looks like ZFS deduplication has finally arrived! Jeff has made the following putback that will be part of build 128:

PSARC 2009/571 ZFS Deduplication Properties
6677093 zfs should have dedup capability

Have a closer look here: http://hg.genunix.org/onnv-gate.hg/rev/e2081f502306

I'll post more details when I've had some time to look at the change.

Update: No need for me to blog about it, Jeff has his own blog with a brand new entry on deduplication: http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup