Monday, December 28, 2009

Dell RHEL Cluster Wiki

Dell has a wiki page which documents setting up a Red Hat GFS Cluster which covers using the Dell M1000e blade enclosure. Thank you Dell.

Wednesday, December 23, 2009

xen2kvm: Change a xen VM to run on KVM

If you want to move a guest running on a Xen host to a KVM host you can do the following:

1. Create a place-holder guest on the KVM host, note its mac address and shut it down.

2. Make the following changes to the Xen guest while it's still running on the Xen host:

  • Replace the xen kernel with a normal kernel: "yum install kernel"
  • Verify /etc/grub.conf will boot the new non-xen kernel by default
  • Comment out the following line in /etc/inittab:
    co:2345:respawn:/sbin/agetty xvc0 9600 vt100-nav
    
  • Uncomment the following lines in /etc/inittab:
     1:2345:respawn:/sbin/mingetty tty1
     2:2345:respawn:/sbin/mingetty tty2
     3:2345:respawn:/sbin/mingetty tty3
     4:2345:respawn:/sbin/mingetty tty4
     5:2345:respawn:/sbin/mingetty tty5
     6:2345:respawn:/sbin/mingetty tty6
    
  • Modify /etc/sysconfig/network-scripts/ifcfg-eth0 to insert the MAC address from the place-holder guest.
  • Shut the Xen guest down
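The in-guest edits above can be sketched as a small shell function. This is only a sketch, assuming the mingetty lines are commented out with a leading '#'; the file paths and MAC address in the example call are placeholders:

```shell
#!/bin/sh
# Sketch of the step-2 edits; takes the files as arguments so it can be run
# against test copies first. On the real guest, call it as:
#   fix_guest /etc/inittab /etc/sysconfig/network-scripts/ifcfg-eth0 <MAC>
fix_guest() {
    inittab=$1 ifcfg=$2 newmac=$3
    # comment out the Xen console getty on xvc0
    sed -i 's/^co:2345:respawn/#&/' "$inittab"
    # uncomment the six mingetty lines (assumes a leading '#')
    sed -i 's/^#\([1-6]:2345:respawn\)/\1/' "$inittab"
    # set the MAC address noted from the place-holder KVM guest
    sed -i "s/^HWADDR=.*/HWADDR=$newmac/" "$ifcfg"
}
```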

3. Move the disk file from the Xen host to the KVM host:

  scp $USER@xen-host:/var/lib/xen/images/guest.img \
      $USER@kvm-host:/var/lib/libvirt/images/guest.img
In the above step you should be overwriting the disk file for the place-holder guest with the Xen guest's disk file.

4. Boot the guest on KVM.

I have tested the above on RHEL 5.4 and Fedora 12.

Thank you Chris Tyler for suggesting this approach in your blog.

Fedora 13 is scheduled to have a Xen-to-KVM migration script, so we'll see if the above is still necessary in May 2010.

Tuesday, December 22, 2009

xenner

I have been using KVM successfully on Fedora 12 and RHEL 5.4. I have Xen guests which I would like to run on KVM (even if just in testing) so I considered using xenner, but I was not successful. My notes are below if anyone wants to point out what I might be missing.

On an existing system running KVM:

* yum install xenner
(1/3): xen-hypervisor-3.4.1-5.fc12.x86_64.rpm            | 2.8 MB     00:00     
(2/3): xen-runtime-3.4.1-5.fc12.x86_64.rpm               | 4.0 MB     00:00     
(3/3): xenner-0.47-3.fc12.x86_64.rpm                     | 191 kB     00:00     
* /etc/init.d/xenner start
I read an example in RedHat's bugzilla from the Xenner author:
Comment #7 From  Gerd Hoffmann  2008-07-03 03:45:15 EDT: 
"For a quick bootup test you don't need an image, you can grab xen
kernel and initrd from any fedora install dvd (images/xen/), then run
"xenner -kernel vmlinuz -initrd initrd.img -vnc 127.0.0.1:0".
Connecting with a vnc viewer to display :0 should show then the first
stage install screen (probably complaining it couldn't find a disk)."
I couldn't find exactly the files he was talking about on a Fedora 12 DVD, so I used a Xen kernel and initrd from an installed RHEL 5 system's /boot. I then did the following:
 sudo xenner \
 -kernel vmlinuz-2.6.18-164.6.1.el5xen \
 -initrd initrd-2.6.18-164.6.1.el5xen.img \
 -vnc 127.0.0.1:0 \
 -hda guest.img 
Running xenner-stats in another terminal while the above ran reported that it couldn't parse the core file in /var/run/xenner. The results of running the xenner command above follow:
kvm capability clocksource: not enabled
kvm capability nop-iodelay: not enabled
kvm capability mmu-op: not enabled
domid: 14258
-videoram option does not work with cirrus vga device model. Videoram set to 4M.
frontend `/local/domain/14258/device/vbd/51712' devtype `vbd' expected backend `/local/domain/0/backend/tap/14258/51712' got `/local/domain/0/backend/blkbackd/14258/51712', ignoring
frontend `/local/domain/14258/device/vbd/51712' devtype `vbd' expected backend `/local/domain/0/backend/tap/14258/51712' got `/local/domain/0/backend/blkbackd/14258/51712', ignoring
Watching /local/domain/0/device-model/14258/logdirty/next-active
Watching /local/domain/0/device-model/14258/command
/builddir/build/BUILD/xen-3.4.1/tools/ioemu-dir/hw/xen_blktap.c:628: Init blktap pipes
Could not open /var/run/tap/qemu-read-14258
xs_read(): vncpasswd get error. /vm/b254f64e-b232-477c-9285-c693c34fd432/vncpasswd.
==================== setup ====================
xen_guest_parse: ----- parse kernel -----
[xenner,1] domain_builder: memory: emu 4 MB, m2p 4 MB, guest 120 MB
xen_emu_load: ----- load xen emu -----
[xenner,1] xen_load_emu_file: loading /usr/lib64/xenner/emu64.elf (190936 bytes)
xen_emu_setup: ----- memory info -----
xen_emu_setup: ----- emu pgd setup -----
xen_guest_setup: ----- memory setup -----
xen_guest_setup: ----- create start-of-day -----
xen_guest_setup: ----- setup xen hypercall page -----
xen_guest_setup: ----- setup vcpu context -----
domain_builder: ----- kvm: state setup -----
domain_builder: ----- all done -----
xen be: console-0: reading backend state failed
xen be: console-0: reading backend state failed
xen be: console-0: reading frontend path failed
calibrate tsc ... done, 3193 MHz, shift: -1, mul_frac: 2690089865
==================== boot ====================
[xenner,1] setup_regs: 64bit
[xenner,1] vcpu_thread/0: start
[emu] <0>this is emu64 (xenner 0.47), boot cpu #0
[emu] <1>emu64: configuration done
[emu] <1>cpu_alloc: cpu 0
[emu] <1>cpu_init: cpu 0
[emu] <1>pv_init: cpu 0, signature "KVMKVMKVM", features 0x00000000:
[emu] <1>emu64: boot cpu setup done
[emu] <1>emu64: paging setup done
[emu] <0>print_page_fault_info: [kernel-mode] preset read kernel reserved-bit, rip ffff8300000098a3, cr2 ffff820000000020, vcpu 0
[emu] <0>page table walk for va ffff820000000020, root_mfn 1905
[emu] <0>pgd   : ffff830001905000 +260  |  mfn   48  |   global dirty accessed user write present
[emu] <0> pud  : ffff830000048000 +  0  |  mfn   49  |   global dirty accessed user write present
[emu] <0>  pmd : ffff830000049000 +  0  |  mfn   4a  |   global dirty accessed user write present
[emu] <0>   pte: ffff83000004a000 +  0  |  mfn fee00  |   global dirty accessed write present
[emu] <0>panic: ring0 (emu) page fault
[emu] <0>printing registers
[emu] <0>  code   cs:rip e008:ffff8300000098a3
[emu] <0>  stack  ss:rsp e010:ffff83000002bea0
[emu] <0>  rax ffff820000000000 rbx ffff830000033000 rcx ffff830000000000 rdx 00000000fee00163
[emu] <0>  rsi 0000000000000000 rdi 0000000000001000 rsp ffff83000002bea0 rbp ffff830000033000
[emu] <0>  r8  ffff83000004a000 r9  ffff83000000e92a r10 ffff83000000e494 r11 000000000000000a
[emu] <0>  r12 000000000002a000 r13 00000000fee00900 r14 00000000fee00000 r15 0000000000000000
[emu] <0>  cs e008 ds e010 es e010 fs e010 gs e010 ss e010
[emu] <0>  cr0 80050033 cr2 ffff820000000020 cr3 01905000 cr4 000006b0 rflags 00010086
[emu] <0>  cr0: PE MP ET NE WP AM PG
[emu] <0>  cr4: PSE PAE PGE OSFXSR OSXMMEXCPT
[emu] <0>  rflags: ??? PF ??? RF
[emu] <0>printing stack ffff83000002bea0 - ffff83000002bff8
[emu] <0>  ffff83000002bea0: 0000000000000400
[emu] <0>  ffff83000002bea8: 0000000000000800
[emu] <0>  ffff83000002beb0: ffff800000000000
[emu] <0>  ffff83000002beb8: 00000000000001e7
[emu] <0>  ffff83000002bec0: 564b4d564b4d564b
[emu] <0>  ffff83000002bec8: ffff800000000000
[emu] <0>  ffff83000002bed0: 00000f6400000001
[emu] <0>  ffff83000002bed8: 0000e43500010800
[emu] <0>  ffff83000002bee0: ffff8300bfe9c3f1
[emu] <0>  ffff83000002bee8: ffff830000005b70
[emu] <0>  ffff83000002bef0: ffff83000002bf50
[emu] <0>  ffff83000002bef8: ffff830000033000
[emu] <0>  ffff83000002bf00: 000000000002a000
[emu] <0>  ffff83000002bf08: ffff830000028000
[emu] <0>  ffff83000002bf10: 0000000000000000
[emu] <0>  ffff83000002bf18: ffff8300000051cb
[emu] <0>  ffff83000002bf20: 0000000000000000
[emu] <0>  ffff83000002bf28: 0000000000000000
[emu] <0>  ffff83000002bf30: 0000000000000000
[emu] <0>  ffff83000002bf38: 0000000000000000
[emu] <0>  ffff83000002bf40: 0000000000000000
[emu] <0>  ffff83000002bf48: ffff83000000001d
[emu] <0>  ffff83000002bf50: 0000000000000000
[emu] <0>  ffff83000002bf58: 0000000000000000
[emu] <0>  ffff83000002bf60: 0000000000000000
[emu] <0>  ffff83000002bf68: 0000000000000000
[emu] <0>  ffff83000002bf70: 0000000000000000
[emu] <0>  ffff83000002bf78: 0000000000000000
[emu] <0>  ffff83000002bf80: 0000000000000000
[emu] <0>  ffff83000002bf88: 0000000000000000
[emu] <0>  ffff83000002bf90: 0000000000000000
[emu] <0>  ffff83000002bf98: 0000000000000000
[emu] <0>  ffff83000002bfa0: 0000000000000000
[emu] <0>  ffff83000002bfa8: 0000000000000000
[emu] <0>  ffff83000002bfb0: 0000000000000000
[emu] <0>  ffff83000002bfb8: 0000000000000000
[emu] <0>  ffff83000002bfc0: 0000000000000000
[emu] <0>  ffff83000002bfc8: 0000000000000000
[emu] <0>  ffff83000002bfd0: 0000000000000000
[emu] <0>  ffff83000002bfd8: 0000000000000000
[emu] <0>  ffff83000002bfe0: 0000000000000000
[emu] <0>  ffff83000002bfe8: 0000000000000000
[emu] <0>  ffff83000002bff0: 0000000000000000
[emu] <0>  ffff83000002bff8: 0000000000000000
guest requests shutdown, exiting.
xenner_cleanup: ----- statistics -----
hypercalls             :      total     diff
emu faults             :      total     diff
  page fault           :          1        0
  cr3 load             :          2        0
event channels         :      total     diff
xenner_cleanup: ----- cleaning up -----

Friday, December 11, 2009

KVM Ubuntu 2

After installing the basic KVM package as described earlier, I installed:
 apt-get install virt-manager
I was then able to do an install like this to install a Fedora VM:
sudo virt-install \
--accelerate \
--hvm \
--connect qemu:///system \
--network network:default \
--name fedora12 \
--ram 512 \
--file=/var/lib/libvirt/images/fedora12.img \
--vcpus=1 \
--nonsparse \
--check-cpu \
--vnc \
--file-size=6 \
--cdrom=/mnt/isos/datastore1/fedora12_x86_64/Fedora-12-x86_64-DVD.iso

Wednesday, December 9, 2009

Thunderbird 3

I just started using the Ubuntu Mozilla Daily Build Team PPA to upgrade to Thunderbird 3.0 while running Ubuntu 9.10. It's indexing a lot and eating 100% of a CPU. I'll let it crunch away and report tomorrow whether it's actually faster.

Update: Thunderbird 3 is working well. Nice and snappy.

Friday, December 4, 2009

KVM virtual network forward options

If you are using KVM and virsh to net-create a virtual network and want to know what forward options there are, the answer is: "none", "nat", "route".

Details:

If you're editing a virtual network xml definition file that might be contained in /etc/libvirt/qemu/networks on RHEL 5.4:

# cat dev.xml 
<network>
  <name>dev</name>
  <uuid>fa07c9c6-e755-71f0-4312-4db325355c24</uuid>
  <forward mode='nat'/>
  <bridge name='virbr1' stp='on' forwardDelay='0' />
  <ip address='123.456.7.89' netmask='255.255.255.0'>
  </ip>
</network>
# 
and want to know what forward modes there are besides the default nat shown above, the answer doesn't seem to be in the documentation. I was able to get it by looking at the source.

 yum provides "*/virsh"
The above returned libvirt. Which version do I have?
# rpm -qa | grep libvirt
libvirt-0.6.3-20.1.el5_4
libvirt-python-0.6.3-20.1.el5_4
libvirt-0.6.3-20.1.el5_4
# 
Note the first result. I downloaded the src.rpm from RHN:
* rpm -iv libvirt-0.6.3-20.1.el5_4.src.rpm
* cd /usr/src/redhat/SOURCES/
* tar xzf libvirt-0.6.3.tar.gz
* cd libvirt-0.6.3/src
* grep -i forward *
Eventually I found the following on line 51 of network_conf.c:
VIR_ENUM_IMPL(virNetworkForward,
              VIR_NETWORK_FORWARD_LAST,
              "none", "nat", "route" )
So, I'm pretty sure the answer is: "none", "nat", "route".
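For comparison, a routed variant of such a definition just swaps the forward mode (the names and addresses below are invented), and leaving the <forward> element out entirely gives you the isolated "none" behavior:

```xml
<network>
  <name>dev-routed</name>
  <forward mode='route'/>
  <bridge name='virbr2' stp='on' forwardDelay='0'/>
  <ip address='192.168.101.1' netmask='255.255.255.0'/>
</network>
```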

Benford's law

A non-*nix post. I just learned about Benford's law. This reminds me of learning that Fibonacci numbers occur in nature. A friend told me that Benford was tipped off about this law when he was looking through old logarithm books and saw that the pages near the beginning (the ones for numbers starting with one) were more worn. Would this law have gone undiscovered if he had already had a computer?

Tuesday, December 1, 2009

Snow Leopard Upgrade in the Dark

While upgrading my wife's 13" MacBook Pro to Snow Leopard, what happened to the first poster in this forum happened to me. I couldn't believe it, but I was able to see the screen and keep clicking "next" by holding a flashlight to it. I finished the install and it is working again. Strange.

Friday, November 27, 2009

Getting started with KVM Bridging

I have KVM VMs running on RHEL 5.4 but I don't have networking set up yet. My plan was to add another bridge (since virbr0 is a default virtual bridge), connect eth0 to it and then connect my VMs to it with virsh net-define, net-create, etc. However, I seem to have misunderstood something:

[root@kvm-dev0 ~]# brctl addbr virbr1
[root@kvm-dev0 ~]# 

[root@kvm-dev0 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
virbr0          8000.5690886c5cfa       yes             vnet2
                                                        vnet0
                                                        vnet1
virbr1          8000.000000000000       no
[root@kvm-dev0 ~]#

[root@kvm-dev0 ~]# brctl addif virbr1 eth0
[root@kvm-dev0 ~]# 
It looks like I locked myself out with the command above, so there will be no more playing on this box (yes, it's only a development system) until I get back to the office, and I'll have to do more reading.

Haydn Solomon's KVM blog has an entry on bridged networking. Gavin Carr's blog has a quick guide to KVM bridging.

Update: as per wiki.libvirt.org, attaching a physical interface like eth0 to a libvirt-managed virtual bridge is never necessary. I did the above out of confusion.
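For the record, the supported way to put guests on eth0 with RHEL-style network scripts is to declare the bridge there and attach eth0 to it, instead of running brctl by hand. A sketch, where the device names and addresses are examples:

```ini
# /etc/sysconfig/network-scripts/ifcfg-br0 (example)
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 (example)
DEVICE=eth0
BRIDGE=br0
ONBOOT=yes
```

After a network restart, guests attached to br0 sit on the same segment as eth0.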

Monday, November 23, 2009

RedHat Doc updates

The Virtualization Guide for RHEL 5.4 discusses both KVM and Xen, unlike the 5.3 edition. The GFS2 document has also been updated.

Tuesday, November 10, 2009

Vixie article on DNS

A friend sent me this Vixie article on DNS. I haven't read it all, but what I read so far is good so I'll link to it and remember to read more later.

Monday, November 2, 2009

Ubuntu 9.10 boot problem

I upgraded my workstation from Ubuntu 9.04 to 9.10 and got the "ALERT! /dev/disk/by-uuid/... does not exist. Dropping to a shell!" error. The fix was to use the BusyBox shell to mount /dev/sda1 (in the case of my system) on /foo and then "chroot /foo /bin/bash". I then updated:

1. /etc/fstab
Replace "UUID=f3920f76-f982-47b0-b2e7-8d6536482b8f" with "/dev/sda1"

2. /boot/grub/menu.lst
Replace the UUID with the device as above. I also took out "quiet splash".
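Both files get the same replacement, so it can be scripted. A sketch written against an argument file so it can be tried on copies first; the UUID and device are from my system, yours will differ:

```shell
#!/bin/sh
# Swap a root-by-UUID reference for a plain device path in the given file.
uuid_to_dev() {
    file=$1 uuid=$2 dev=$3
    sed -i "s|UUID=$uuid|$dev|g" "$file"
}
# On my system:
# uuid_to_dev /etc/fstab f3920f76-f982-47b0-b2e7-8d6536482b8f /dev/sda1
# uuid_to_dev /boot/grub/menu.lst f3920f76-f982-47b0-b2e7-8d6536482b8f /dev/sda1
```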

I was then able to reboot.

Wednesday, October 28, 2009

LISA in MD coming soon

I'm wondering if I should attend LISA this year. They're having a storage day that looks relevant too.

Saturday, October 24, 2009

Oracle's Sun acquisition, ZFS, BtrFS, ext4

I came across a blog reporting that Apple will not include ZFS in OS X which speculates that Oracle might kill ZFS because it has BtrFS. Another blog speculates that Apple didn't want to get in the middle of a lawsuit; NetApp claims ZFS infringes on their WAFL patent (the more I learn of NetApp the less I like them).

The news above makes me more interested in BtrFS. It's developed by Oracle but GPL'd and there are blogs and articles comparing BtrFS and ZFS. It looks like the next step for GNU/Linux systems which can't have ZFS but need something more advanced than ext4.

For now I welcome ext4 until BtrFS is ready. ext4 is the default file system for Fedora 11 (benefits list), which also has an icantbelieveitsnotbtr option for experimental BtrFS support. I wonder when either will make it to RHEL. ext4 will probably be in RHEL much sooner and I look forward to fast fsck'ing at work (where I have to fsck 1T every now and then).

Wednesday, October 21, 2009

Thunderbird Slowness

My workstation has 4G of RAM, two 3.2GHz dual-core Xeons, and a 3Gbps SATA drive with an 8M buffer, which is much more than adequate. However, every now and then Thunderbird 2.0.0.23 (20090817) demands 100% of a CPU and hangs it. It will even freeze GNOME to the point that I can't spawn new xterms, and I sometimes log in on tty1 to kill Thunderbird. I'm just using it for IMAP and SMTP, and it does this randomly: not when I ask it to search for a buried message, and not while I'm waiting for a large message to be accepted by the SMTP server. I straced it during the slowness and saw:

$ strace -p 12434 2> thun_slow.txt & 
$ tail -f thun_slow.txt
Process 12434 attached - interrupt to quit
lseek(48, 51859456, SEEK_SET)           = 51859456
read(48, "60D]-\n  [-3B60E]-\n  [-3B60F]-\n  ["..., 4096) = 4096
read(48, " [-3B6A0]-\n  [-3B6A1]-\n  [-3B6A2]"..., 4096) = 4096
fstat(48, {st_mode=S_IFREG|0644, st_size=58669928, ...}) = 0
lseek(48, 58667008, SEEK_SET)           = 58667008
read(48, "^96^2E14)(^87=6f3)(^9B=8e27a)(^8E"..., 4096) = 2920
lseek(48, 51863552, SEEK_SET)           = 51863552
read(48, " [-3B6A0]-\n  [-3B6A1]-\n  [-3B6A2]"..., 4096) = 4096
...
To get some stats:
$ wc -l thun_slow.txt 
17160 thun_slow.txt
$ grep read thun_slow.txt | wc -l
9640
$ grep lseek thun_slow.txt | wc -l
5260
$ grep fstat thun_slow.txt | wc -l
1286
$ egrep -v "fstat|read|lseek" thun_slow.txt | wc -l
974
$  
So, it slows down when it decides to read a lot. I have about 1G of mail on the server and 84M in my .mozilla-thunderbird directory. My first thought was that it was indexing, but right-clicking on the folder and asking it to re-index my inbox (with many subdirectories) is very quick. Why does it decide to read a lot (presumably from my disk) at random points during the day?

Update: An article describing similar symptoms blamed corrupt MSF files and suggested stopping Thunderbird, backing up the profile, deleting the MSF files and letting Thunderbird rebuild them upon restart.

$ tar cvfz thunderbird.tar.gz ~/.mozilla-thunderbird/
$ find ~/.mozilla-thunderbird/0mp6haa5.default -name \*.msf -exec rm {} \;
After restarting Thunderbird it seems zippy enough. The real test will be the non-recurrence of the seemingly random sudden spikes in read activity.

Update: Two days later and so far so good. I think it might have had trouble reading a corrupt MSF file so it kept trying to read it. The MSF rebuild seems to have done the trick. I'll update again if it comes back.

Friday, October 16, 2009

Becoming a ZFS Ninja

Ben Rockwood's Becoming a ZFS Ninja.

Google steals the web

Is Google Sidewiki evil? One of my friends is going to see how many days he can go without using Google and will be using Cuil instead.

Update: Switched from Cuil to Ask.

Update: Switched from Ask to Yahoo. (Isn't that just Bing?)

Friday, October 9, 2009

LISA and Change Control

Random thoughts: perhaps I should go to LISA. Sage's Change Control document might have some obvious points but it makes explicit things that every sysadmin should know.

Ubuntu KVM

I installed KVM on my Ubuntu desktop. Documentation made this easy. I then grabbed a KVM virtual appliance, ran: sudo kvm va-lapp-root.img and watched it boot.

Friday, October 2, 2009

SLES kernel updates remove prior kernel

After experiencing a problem at work and verifying it on Novell's forum, I want to make it known that:

When "SUSE Linux Enterprise Server" installs a kernel update, it deletes the kernel on which it is running.

I've never seen something this stupid. What if the old kernel doesn't boot? Why not keep at least one old version available? I've never seen a Unix do this. We're going to write a script:
$ ./cya.sh --before # back up the old kernel somewhere
$ rug upgrade
$ ./cya.sh --after # reinstall the old kernel 
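A minimal sketch of what cya.sh --before could do; the paths are arguments so it is easy to test, and everything here (cya.sh included) is a placeholder, not a finished script:

```shell
#!/bin/sh
# Copy the running kernel's files aside so they can be restored if the
# updated kernel fails to boot. Assumes /boot's usual vmlinuz-/initrd- naming.
backup_kernel() {
    bootdir=$1 version=$2 dest=$3
    mkdir -p "$dest"
    for f in "$bootdir/vmlinuz-$version" "$bootdir"/initrd-"$version"*; do
        if [ -e "$f" ]; then cp -p "$f" "$dest/"; fi
    done
}
# Typical call: backup_kernel /boot "$(uname -r)" /root/kernel-backup
```

The --after half would copy the files back into /boot and re-add a boot loader entry for them.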
Did SUSE always have this problem or did Novell break it?

Monday, September 21, 2009

Sun Storage 7410

I mentioned earlier that Sun seems to be offering the same old SAN as EMC. This is the case for their 6580 but not for their 7410. I mentioned two problems with the EMC, and below is how Sun addresses them.
  • LUN Tetris: The Sun Storage 7410 stripes data across all SATA drives using ZFS to serve blocks in the case of iSCSI or files in the case of NFS. This is similar to how NetApp uses WAFL. If you configure ZFS-NSP (no single point of failure) with a minimum of two drawers of disk, then redundant copies are written so you can lose more drives (an entire shelf?).
  • SP Bottleneck: Sun has the same problem here: one smart SP and several disk-only drawers. However, you can max a 7410 out at 288T.
Other 7410 features:
  • Contains solid-state flash drives, which ZFS manages to improve performance where it's needed, similar to Compellent's automated tiered storage.
  • Does not offer Fibre Channel access, so the only way to read blocks from it is iSCSI.
  • All extra features (snapshots, clones, etc.) are included without separate licenses (unlike NetApp).
  • Allows the admin to SSH into the Solaris-based SPs for management OR use a web client (Ajax).
  • A VirtualBox simulator lets you try out the management system.

Saturday, September 19, 2009

NetApp

I mentioned earlier that NetApp seems to be offering the same old SAN as EMC. This is only partially true. I mentioned two problems with the EMC, and below is how NetApp addresses them.
  • LUN Tetris: NetApp stripes either SATA or SAS drives across a large RAID6 and aggregates them together so that you have large storage pools of a single type, e.g. one for SATA and one for SAS. You can then compute the amount of available disk by subtracting from a total. You can also have nothing but SATA, since the large stripe can address the performance problems associated with SATA; getting IOPS stats first is a good idea, though, as a small pool of SAS might otherwise be needed for fast disk.
  • SP Bottleneck: NetApp has the same problem here: one smart SP and several disk-only drawers. They do offer PAM modules which add up to 256G of cache (depending on which head) to increase performance. When upgrading SPs, remember that swapping out an SP is easier if you have zoned by port number, not WWN.
A FAS3140 will max out at 420T of raw capacity with all SATA. So SP upgrades seem further away for my organization's current usage. Regardless, this design is not as elegant as grid storage and if I'm going to consider it they'll need to offer a good price.

Tuesday, September 15, 2009

google data liberation front

Is this for real? If so I give Google credit for their data liberation front.

Monday, September 14, 2009

Thin Provisioning

According to page 3 of IBM's XIV Redbook, thin provisioning is "the capability to allocate storage to applications on a just-in-time and as needed basis". Wikipedia has more to say. Storage vendors make this sound much better than it is. Sure, you get more blocks when you need them; the problem is that you then have to get your filesystem to use them. Repeat: you don't just run 'df' before and after and say "oh, I grew the LUN, I'm done". How does this behave on *nix file systems?

If you're using ext3 on top of LVM, then you arguably don't even need thin provisioning from your SAN. You could just add a new LUN, add it to the volume group of the volume you want to grow, and then ext2online your ext3 volume into it. I've done this a few times and it's worked fine, but it was, as Ben Rockwood said, sucky. Ben's blog mentions how ZFS can make this process less painful: ZFS and Thin Provisioning. Aside from needing to know what you're doing when 'df' lies to you, this looks handy. If my SAN allocates the extra space easily and ZFS can just pick it up and run with it, then SAN-based thin provisioning seems worthwhile. Looks like I'm going to have to test this feature with ZFS. I'll post an update as I learn more.
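A sketch of that ext3-on-LVM grow path. The device, VG and LV names are invented, and run() only prints each command so the sequence can be reviewed; swap the echo for "$@" to actually execute:

```shell
#!/bin/sh
# Dry-run outline of growing an ext3 filesystem onto a newly presented LUN.
run() { echo "+ $*"; }

grow_onto_new_lun() {
    newlun=$1 vg=$2 lv=$3
    run pvcreate "$newlun"            # label the new LUN as an LVM physical volume
    run vgextend "$vg" "$newlun"      # add it to the volume group
    run lvextend -l +100%FREE "$lv"   # grow the LV into the new free space
    run ext2online "$lv"              # grow ext3 online (resize2fs on newer systems)
}

grow_onto_new_lun /dev/sdc datavg /dev/datavg/datalv
```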

Update: Someone who used to work for XIV told me a little about how the thin provisioning system works. If I've understood correctly, the scenario is:

  • If a project requires x TB over the course of three years, but only y TB this year, then thin provision x TB such that y TB can be accessed now
  • When you create a file system (this includes ext3) on top of that project's LUN, you will see x TB (even though y TB is what will really be there). Thus, the inode table will be built to access blocks which are not yet there and 'df' will lie to you
  • As long as you have x TB available at some point in the future (perhaps in your total SAN) it will be allocated on demand and the file system won't have a problem.
The benefits I see of doing this are:
  • This can save you money if you're planning to purchase x TB within the next three years, but know that you can only afford y TB today
The problems I can see are:
  • If you don't get those extra disks before the user decides to run out of road (hey, you made the road look longer than it was) then you'll have problems
  • When you reach x TB physically and fill them you are back to the original problem: you will have to use ext2online or some other method to grow the filesystem

Sunday, September 13, 2009

Grid Storage

My organization has managed two EMC Clariion cx3-20s for three years. We have had some problems with their overall design, which I'll list below. I'll also list some vendors with the same problems and show some using grid storage to avoid them.

EMC problems

LUN Tetris

We have a mix of LUNs of approximately two types:
  • Fast and small
    • Used only by Databases and Email
    • Currently 10T not likely to grow fast
    • 146G 15k FC in RAID10
  • Fast enough and large
    • all other applications (Live VM images, Web roots, Home dirs, File svcs, Archives, etc)
    • Currently 40T grows by about 10T a year
    • 1T 7200 SATA in RAID5
We have many LUNs of the two types above; they stripe across a number of disks and, if visualized, would look like the end of a Tetris game with differing colors and shapes. The varying colors and shapes represent LUNs of different sizes, meta-LUNs, RAID types, disk types, etc. Some empty space represents unused space that is too small to be of use. When a new project comes along and requires space, we analyze our Tetris board and consider the best way to accommodate the request.

Service Processor bottleneck

Most SANs have Service Processors (SPs), which are computers that run an OS: EMC runs FLARE, NetApp runs Data ONTAP (a BSD derivative), etc. The SPs can be thought of as servers which pass block change operations from clients to the block devices connected directly to them. In EMC's case, this connection is implemented as daisy-chained copper SCSI cables to several drawers of disks. The cx3-20 can hold eight drawers. We want to add an extra drawer this year, but we face an additional cost to upgrade to a cx3-40, which is basically a new SP that can hold 16 drawers. So, every few years you must upgrade the SP. In our case EMC wants us to buy a cx4 instead of a cx3.

Same old SAN

I'm probably oversimplifying the comparison of the products listed below, but since they share the problems I listed above, to me they look like the same old SAN. I'm going to speak with sales reps from each of these companies and let them tell me about other products they offer, so that I might update this page and list them as offering grid storage.

Grid Storage

There are new grid-storage-based systems which don't have these problems. The basic idea is that rather than having one smart SP and several dumb drawers, each drawer is smart and is known as a node. In IBM's XIV each node is an individual server made from commodity parts: one quad-core Intel CPU, 8G of RAM, 12 1T SATA drives and a stripped-down Linux-based OS. These servers are networked to speak with each other via 10G Ethernet instead of daisy-chained SCSI. Each portion of data written to any particular LUN is split across all of the disks, and the large stripe helps the SATA perform as well as fast disk. Redundant portions are also written, so you can lose up to three nodes in a six-node system. Relative to the problems posed in the beginning we have:
  • LUN Tetris: The only property of an XIV LUN is size. Every LUN has the same speed which is fast. Every LUN is made of commodity SATA. Keep a tally of the total size and subtract the requested size for a new project.
  • SP Bottleneck: Since each node, or drawer, is an SP, storage and processing scale at the same rate. There is no sudden need to upgrade the SP during an expansion.
I am trying to build a list of vendors which use grid storage to serve block devices (IBM was the first vendor I found doing exactly this, so my description above is biased towards them). NEC's HydraStor and Isilon use grid storage except they are serving NFS volumes. Please post a comment if you know of storage vendors doing something similar.

Tuesday, September 8, 2009

cleversafe.org

Cleversafe has a Dispersed Storage system.

Thursday, September 3, 2009

Searching for Storage

I'm getting started on rethinking how to deal with storage and cheaper ways to deal with what I already have. Some things on my mind include:
  • Backblaze has a nice article on petabytes on a budget.
  • IBM is going to tell me how great their XIV SAN is. We'll see.
  • EMC support contracts are too expensive. BL Trading supposedly does it for less. We'll see.

Wednesday, August 19, 2009

sendmail smarthost

If many individual *nix systems run sendmail and send mail to hosts within your network, but you don't want them sending mail out to the Internet without relaying through the network's main SMTP host, then you can configure sendmail on each of them to use a smart host. On Red Hat you will need the sendmail-cf package.
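The key piece is the SMART_HOST macro in sendmail.mc; a minimal sketch, where the relay hostname is made up:

```m4
dnl In /etc/mail/sendmail.mc, relay all non-local outbound mail:
define(`SMART_HOST', `smtp.example.com')dnl
```

Then regenerate sendmail.cf from the .mc file with m4 (the macros come from sendmail-cf) and restart sendmail.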

Friday, July 17, 2009

cool cmd

Thomas Habets' signature file (example) features a "cool command" which can bring down a system even when run as a local user:
echo '. ./_&. ./_'>_;. ./_
I'm curious how this works and I'll have to look at it more closely and post anything interesting about it that I find.

Update: A friend of mine pointed out that it is a forkbomb:

It's a forkbomb.  If you're having trouble reading it because of syntax,
it might be clearer as

echo '. foo &
. foo' > foo;
. foo

Note that on most Unices, there are configurable limits on the number of
processes any user may run concurrently, so it's usually possible to
limit the damage of a forkbomb.
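The limit in question is the per-user process count; a quick way to see it, and to start a shell with a lower cap (100 is an arbitrary example value):

```shell
# Print the current max-user-processes limit, then run a child shell capped
# at 100 processes; a forkbomb started under that shell hits the cap quickly
# instead of taking the whole machine down.
ulimit -u
bash -c 'ulimit -u 100 2>/dev/null; ulimit -u'
```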

Monday, June 29, 2009

Xen Kickstart

Below is an example of passing a kickstart config file to virt-install. It took a lot of searching to find how to do this and it is not in the man page. The trick is to use -x (for extra) and then pass it "ks=$URL". The following command will produce a VM named xen-guest with 2 CPUs, 2G of RAM and 8G of disk which will be installed from a kickstart config file:
name="xen-guest"
cpu="2"
ram="2048"
disk="8"
virt-install \
--paravirt \
--file=/var/lib/xen/images/$name.img  \
--name $name \
--vcpus=$cpu \
--ram $ram \
--file-size=$disk \
--nonsparse \
--check-cpu \
--nographics \
--location=http://kickstart.server.com/rhel5.3/ \
-x ks="http://kickstart.server.com/xen.cfg"
Note that the --location option points at the contents of a mounted ISO file, not the ISO file itself. I.e. it was made by mounting the ISO file for the distro as a loopback device inside /var/www/html/rhel5.3. It is from this mounted ISO that virt-install downloads the initrd and kernel to boot from and start the install. The kickstart file is then followed and should state where to get the media for the install. I did this because I wanted a VM based on a config which is standard across all of my servers. However, now that I have a Xen guest that I like, I've just been using virt-clone and variations of my backup script to create copies and move them between servers.

Xen Backups

Earlier I pointed out how rapleaf.com migrated Xen VMs with minimal downtime by using LVM snapshots. I wrote a script to do this very thing to back up my VMs and it works well.

I have three Xen servers and each copies its VMs to the next [backup_target = (server_N + 1) % count(xen_servers)]. I shut down each VM, make a snapshot and bring each back up. Each VM is therefore down for about 60 seconds plus boot time, the cost of a virsh shutdown followed by an xm create. I am copying only the image files from /var/lib/xen/images/ and each VM's config file from /etc/xen/.

The majority of the time is then spent doing an rsync, however it is not time without service because the VMs are already back up and the snapshot serves as a stable, cleanly shut-down image for the backup. My rsync user has the scpsftprsynconly shell, which works nicely and is cleaner than setting up a chroot jail. If there is a hardware problem I then only need to boot the VMs on their other host to restore service.
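The round-robin rotation above can be sketched as plain shell arithmetic (the host names are hypothetical):

```sh
# who backs up to whom: backup_target = (server_N + 1) % count(xen_servers)
count=3
for n in 0 1 2; do
  echo "xen$n backs up to xen$(( (n + 1) % count ))"
done
# xen0 backs up to xen1
# xen1 backs up to xen2
# xen2 backs up to xen0
```

The last server wraps back around to the first, so every host has exactly one backup target and one backup source.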

Monday, June 8, 2009

Copying Xen images with minimal downtime & Automating Xen Deployment

This blogger talks about pausing Xen VMs and using LVM snapshots to move Xen images between servers. I also found this white paper on: Automating Xen Virtual Machine Deployment.

Saturday, June 6, 2009

rhel tcptraceroute

RedHat's tcptraceroute is a useless symlink to regular old UDP traceroute. The symlink violates the principle of least surprise. This can get in your way as you quickly try to solve a problem.
$ ls -l `which tcptraceroute` 
lrwxrwxrwx 1 root root 10 May 29 05:35 /bin/tcptraceroute -> traceroute
$ 
They also have nothing in their repository that you can 'yum install'. Come on RedHat, it's GPL'd! To get past this you can update your repository or simply:
wget http://ftp.belnet.be/packages/dries.ulyssis.org/redhat/el5/en/x86_64/RPMS.dries/tcptraceroute-1.5-0.beta7.el5.rf.x86_64.rpm
rpm -iv tcptraceroute-1.5-0.beta7.el5.rf.x86_64.rpm
rm /bin/tcptraceroute
which tcptraceroute
The 'rm' removes the symlink and the 'which' should return /usr/bin/tcptraceroute.

Tuesday, June 2, 2009

Falconstor

Falconstor has a fibre to fibre appliance to stick between a fabric switch and a SAN head (e.g. EMC Clariion cx3-20) to abstract away a set of disks so that you can do live migration between different brands of SAN.

Slice Host

slicehost.com

Sunday, May 31, 2009

Xen, KVM and my use of Free Virtualization

I'm about to roll out some VMs on Xen with RHEL5.3. With RedHat's acquisition of Qumranet (the company behind KVM) there's a question of whether to wait before changing platforms. I don't think this is a good idea since I want to roll out these VMs next month, not next year. RedHat will also support Xen until 2014, but the main reason I don't think it will be a big deal is that, in theory, it's just a matter of changing kernels on the dom0 host as well as on its paravirtualized guests. This blogger claims success with this technique. I'm going to reproduce his results myself by migrating one of my Xen VMs on RHEL to a KVM VM on Fedora and I'll post an update with the results when I do.

Wednesday, May 27, 2009

ELisp in Macros

I've been writing macros in emacs and I've been modifying my .emacs with elisp to make emacs more useful but I never wrote elisp in macros on the fly until now. Glad I finally got around to it.

Monday, May 25, 2009

RedHat Virtualization Guide

I've read the RedHat Virtualization Guide. It gets you going but contains lots of redundant sections: point 1.x will be exactly the same as point 1.x+y (as if it was moved via a copy/paste, not a cut/paste). I found the following useful, so this post will contain my bookmarks:
  • yum install xen kernel-xen libvirt libvirt-python libvirt-python python-virtinst
  • CLI virt-install of guest (can accept kickstart URL)
  • Paravirtualized guests need a xen kernel too so during installation of a guest, select Customize Now and install the kernel-xen package in the System directory (I don't see this option on a new install on metal). If possible, kickstart files for guests should specify this kernel. If you try to boot from a non-xen kernel try this. Update: the menus during the text install did not offer this option. Fortunately the following CLI install method worked well for me and I got a xen kernel without having to select anything related to a Xen Kernel during the install:
    virt-install \
    --paravirt \
    --file=/var/lib/xen/images/webhost0.img  \
    --name webhost0 \
    --vcpus=2 \
    --ram 512 \
    --file-size=4 \
    --check-cpu \
    --nographics \
    --location=http://astromirror.uchicago.edu/fedora/linux/releases/10/Fedora/x86_64/os/
    
    A local http mirror containing a RHEL DVD mounted in /var/www/html worked well for the --location option for RedHat. Otherwise the Fedora example above should work too.
  • Adding Storage
  • Xen bridging (bridging overview)
  • Live migration
  • Remote management with Virtual Machine Manager: I wouldn't install GTK on a VM server but I might on a workstation for remote control.
  • Tips and Tricks
  • Troubleshooting
  • Commands to know:

RedHat's partially hidden manuals

Yes, RedHat does have useful manuals. Googling for them can result in running into marketing materials so I'm creating this entry.

Saturday, May 23, 2009

MySQL RAID10 Dell 2960

I have two Dell 2960 servers, each with six 73G SAS drives; each server will go in one of two data centers, provide MySQL service, and be configured for binary replication. I am configuring these servers with RAID 10 and looking at different options to optimize performance.

Dell's Perc 5/i hardware RAID controller's integrated BIOS configuration menu can be accessed by holding Ctrl-r at boot. Dell's Getting Started With RAID explains the difference between write-back (a write is reported complete once it is in the controller's cache) and write-through (only once it is on disk) caching. The BIOS gives you the following options (a * denotes the defaults):

  • Stripe Size: 8k, 16k, 32k, 64k*, 128k
  • Read Ahead: No Read Ahead*, Read Ahead, Adaptive Read Ahead
  • Write Cache: Write Through, Write Back*
It also offers options to enable things like write back with no battery (which doesn't seem like a good idea). Note that it's recommended that you recondition the battery every six months. I found a cheat sheet from a DBA who explicitly sets the same hardware to all of the defaults, but he doesn't explain his reasoning.

O'Reilly's High Performance MySQL discusses the RAID stripe chunk size and RAID cache in Chapter 7. The author contrasts the idea of maximizing the RAID stripe chunk size in theory vs. practice and doesn't leave me with a conclusion as to what would be appropriate for "general mysql data" since I can't be sure enough of the content. I have had good performance with Zimbra's large MySQL database using 128k stripes on RAID 10 with an EMC Clariion. When creating the file system that would host MySQL I followed a recommendation to set the stride to 32 since I had 4k blocks and 128k RAID stripes (32 × 4k blocks = one 128k stripe):

mke2fs -j -L Z_DB -O dir_index -m 2 -J size=400 -b 4096 -R stride=32 /dev/sde2
This might be specific to Zimbra's MySQL database, but because my more generic MySQL service will be running on ext3 it will be important to set the stride relative to how the RAID is striped. The author warns that many RAID controllers don't work well with large stripe sizes because it wastes cache and it is unlikely that the content will fit nicely into the chunks. Perhaps 64k stripes perform better on average and that is why it is the default unless you have a specific need. It seems that upping it to 128k might be too extreme of a direction, especially since the author does not seem to think it's a good idea. I'm going to stick with 64k. The author's general advice on caching is to not waste cache on reads (which are likely to not be as good as the DB's own reads) and to save the cache for writes. Thus, I'm sticking with the defaults here too.

The MySQL manual has a disk issues section which discusses mount and hdparm options (gulp... it [hdparm] can crash a computer and make data on its disk inaccessible if certain parameters are misused... gulp). It also mentions how delicate setting the stripe size can be: The speed difference for striping is very dependent on the parameters. Depending on how you set the striping parameters and number of disks, you may get differences measured in orders of magnitude. You have to choose to optimize for random or sequential access. Once again the stripe size will be difficult to guess for an any-DB-goes type of MySQL server.

When creating the file system on top of these default RAID 10 settings it will be a good idea to:

  • mke2fs -j -L VAR -O dir_index -m 2 -J size=X -b 4096 -R stride=16 /dev/sdY
  • hdparm -m 16 -d 1 /dev/sdY
  • mount -o async -o noatime /dev/sdY /var
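The stride above follows directly from dividing the controller's stripe size by the filesystem block size; a quick sanity check using the Perc 5/i's 64k default stripe and the 4k blocks from -b 4096:

```sh
# stride (in filesystem blocks) = RAID stripe size / fs block size
stripe_kb=64   # Perc 5/i default stripe element
block_kb=4     # ext3 block size from mke2fs -b 4096
echo $(( stripe_kb / block_kb ))
# prints 16, matching stride=16 above
```

The same arithmetic gives the stride=32 used for the 128k Zimbra stripes earlier in this post.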

The conclusion I come to is: use RAID 10 with all of the Perc 5/i's defaults. You might be able to find a better stripe size, but this will be difficult to determine, so sticking with the default of 64k is probably reasonable for your initial settings. Also, be aware of the RAID settings when creating your file system, tuning it with hdparm and mounting it.

Monday, May 18, 2009

Nmap Network Scanning

Nmap Network Scanning

Fisher Space Pen

This has nothing to do with *nix. I'm considering getting a more reliable pen: Chrome Plated Shuttle Space Pen.

Wednesday, May 13, 2009

iptables tip

After adding rules to iptables, save them with service iptables save, then verify them in /etc/sysconfig/iptables. On RHEL it is easy to yum install system-config-securitylevel, which gives you a TUI you can use to poke open arbitrary ports.
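For example (the source host and port below are illustrative, and this needs root):

```sh
# allow SSH from one management host, persist the ruleset, then verify
iptables -I INPUT -s 192.0.2.10 -p tcp --dport 22 -j ACCEPT
service iptables save
grep 192.0.2.10 /etc/sysconfig/iptables
```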

Saturday, May 2, 2009

RedHat Cluster Suite HowTo

I'm in the process of reading this howto.

Friday, April 24, 2009

Dell M610 without QLogic?

Dell's M610, unlike the 605 or 600, does not seem to be available with a QLogic HBA. I have had good results with the native qlaxxx module and I like saving time by using stock RHEL5 kernels. The 610 comes with an Emulex LPE1205-M and I'm trying to verify that the lpfc module will work with it and verify that it is native in the RHEL stock kernel. RedHat seems to imply that it is.

Update: The lpfc driver shipped with RHEL 5.3 will work with the Emulex LPe1205-M

Details:
According to RedHat, the lpfc driver shipped with RHEL5 U3 is: 8.2.0.33.3p. The Emulex site for the LPe1205-M links to a Linux kernel drivers section, which links to SuSE's lpfc download site and an EMC manual describing the lpfc. According to another Emulex site, the 8.2.x.x driver supports the LPe12xx and LPe1200x series adapters. They also go so far as to list Red Hat 5.2 (which ships with 8.2.0.22) as a distro which "include[s] a driver on the installation media supporting the LPe12xx and LPe1200x series adapters."

Update 2: A stock kernel from RHEL5.3 is working with the Emulex LPe1205-M. The correct module was loaded automatically and I'm able to see my LUNs:

# lsmod | grep lpfc
lpfc                  353933  0 
scsi_transport_fc      73801  1 lpfc
scsi_mod              196825  6 scsi_dh,sg,lpfc,scsi_transport_fc,megaraid_sas,sd_mod
# cat /etc/issue
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Kernel \r on an \m

# fdisk -l | tail -6
Disk /dev/sde: 697.9 GB, 697932185600 bytes
255 heads, 63 sectors/track, 84852 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1       84852   681573658+  8e  Linux LVM
# 

Tuesday, April 21, 2009

vpnc

I'm using the FOSS vpnc for ipsec VPN connections on my College's Cisco network.

 

Tuesday, April 7, 2009

Standard Server Debug Questions

I'm working on a set of standard questions to ask if a server has a problem:
  1. What do the logs say? (check dmesg or /var/log/messages if nothing else)
  2. Is any other process running during the same window?
  3. Is your server dropping packets? (ifconfig eth0, ifconfig eth1, ethtool eth0?)
  4. What type of storage is your server using?
  5. What filesystem and mount options are you using? (output of /bin/mount)
  6. Have you examined the filesystems with iostat?
  7. When was your last fsck?
  8. Have you strace'd the process with a problem?
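For question 3, the drop counters can also be read straight from /proc/net/dev without root; a small sketch:

```sh
# print RX/TX drop counters per interface from /proc/net/dev
# (after the colon: RX drop is the 4th receive field, TX drop the 12th field)
awk -F: 'NR > 2 {
    iface = $1; gsub(/ /, "", iface)
    split($2, f, " ")
    printf "%s rx_drop=%s tx_drop=%s\n", iface, f[4], f[12]
}' /proc/net/dev
```

Nonzero, growing drop counters point at the NIC or its ring buffers rather than the application.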

Monday, April 6, 2009

No Free Runner?

Open Moko cancels Free Runner. Is Android Free enough?

Wednesday, April 1, 2009

yum claims there are no updates

If a yum update reports there are no updates when it is clear that there are, try unregistering the system from the repository and re-registering. If that fails you might have a corrupt yum cache. Yum downloads items into its cache and then checks the cache going forward; if the cache is corrupt it will break updates. The fix is to clear out the cache by emptying the contents of /var/cache/yum and /var/spool/up2date.
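A sketch of that cleanup as commands (run as root; yum clean all empties /var/cache/yum for you):

```sh
# clear yum's cached metadata and packages, then look for updates again
yum clean all
rm -rf /var/spool/up2date/*
yum check-update
```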

Wednesday, March 18, 2009

Pulling Gnome off of RedHat

If X and Gnome are already installed on RedHat it can be tricky to remove them with yum. However, a friend of mine reports success using pirut to remove them. It warns that it is about to remove itself, but it does seem to work. Packages that are not dependencies of the X and Gnome package tree are not removed.

Tuesday, March 17, 2009

fscking

I've become very interested in fscking. I'm going to keep a link to various pages on fscking that I find interesting. We'll start with something friendly, like how even apples need fscking and then get into some more detail:
  1. Apple Knowledgebase: Resolve startup issues and perform disk maintenance with Disk Utility and fsck (TS1417)
  2. Andy's Unix FAQ: The In's and Out's of Fsck - Dealing with corrupt filesystems

Thursday, February 26, 2009

Red Hat's Virtualization Plan for 2009

RedHat is moving away from Xen and towards KVM. KVM will be available in a package called RHEV-H: a stateless hypervisor, with a tight footprint of under 128MB, which presents a libvirt interface to the management tier. Enterprise servers will no longer need to go through an installation process, and will instead be able to boot RHEV-H from flash or a network server, and be able to immediately begin servicing virtual guests. KVM will also be native to RHEL5.4. RedHat will also be working with Microsoft to certify hosting of Windows servers on top of their virtualization platform.

Update: Useful links:

Wednesday, February 4, 2009

od and dos2unix

$ head -1 foo.sh | od -c | head -1
0000000   #   !   /   b   i   n   /   s   h  \r  \n
$ 
                                             ^^
I've known about od and dos2unix so I suppose I should have put this together earlier: You can use od to see if running dos2unix is even necessary. If foo.sh were a Unix file it would not have the \r.
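A quick round trip demonstrating the check; tr stands in for dos2unix in case it isn't installed:

```sh
# create a file with DOS line endings, detect them, then strip them
printf '#!/bin/sh\r\n' > foo.sh
od -c foo.sh | head -1          # the \r before \n marks a DOS file
tr -d '\r' < foo.sh > foo-unix.sh
od -c foo-unix.sh | head -1     # no \r: nothing for dos2unix to do
```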

Wednesday, January 28, 2009

https redirect

My colleagues ran into minor issues when configuring mod_ssl. I helped out and wrote them some tips that I want to {b}log.

On a vanilla RHEL5 install with httpd already running all you need to do is:

yum install mod_ssl
/etc/init.d/httpd restart
yum creates an ssl.conf.rpmnew file so as to not touch an existing ssl.conf but it won't restart the service the way apt does.

If you wish to redirect all traffic to https add the following to your configuration file [0]:

    # note that Options FollowSymLinks had to be added above
    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI}
I found the above directives from a Google search. When I added just the above I got "permission denied". So I ran a tail -f error_log and saw something like the following:
[Wed Jan 28 05:19:27 2009] [error] [client 123.456.7.89] Options
FollowSymLinks or SymLinksIfOwnerMatch is off which implies that
RewriteRule directive is forbidden: /usr/lib/mailman/cgi-bin/listinfo
I then looked up FollowSymLinks in Apache's manual. I then went back and changed Options ExecCGI to Options ExecCGI FollowSymLinks and read the documentation on what this does. Since this seemed OK (the only accounts on this server are good ones) I left it.
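Putting it together, the relevant block ended up looking something like the following (the directory path is illustrative; per-directory rewrites are exactly the case that requires FollowSymLinks):

```
<Directory "/var/www/html">
    Options ExecCGI FollowSymLinks
    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI}
</Directory>
```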

If I hadn't looked at the logs there would have been little hope of figuring it out. I also find it's good to take the time to set up three terminals (edit conf, restart apache, tail logs) and a browser with three tabs (page being tested, manual, Google) when doing this.

Footnotes:
[0]
On RedHat the configuration file should be /etc/httpd/conf/httpd.conf. If you originally did a "yum install mailman" then it installs httpd as a dependency. If you're now securing that httpd after installing mailman the RedHat way, then add the rewrite directives to /etc/httpd/conf.d/mailman.conf.

Tuesday, January 27, 2009

Anatomy of a Program in Memory

Found an Anatomy of a Program in Memory.

Exchange Mail not Reaching you?

If you hear about a message like the one below ask if the "queue is frozen".

I had a case of mail not reaching my server. The sending server was an Exchange server. I asked the admin to check his queue. He had several messages that hadn't made it for two weeks and his users didn't get any warning (my servers send warnings after 12 hours and return the message after 2 days). He read me the message below. When I heard "the destination server" it made me think of my spam filter dropping connections, but I didn't have any logs of that. Turns out the other admin of the Exchange server had frozen the queue without him knowing. Upon unfreezing it, all of the mail went through. I think this queue-freezing feature should at least warn the server's users.

Your message did not reach some or all of the intended recipients.

Subject: RE: test message

Sent: 1/26/2009 2:54 PM

The following recipient(s) could not be reached:

me@my-domain.com on 1/27/2009 9:02 AM

This message was rejected due to the current administrative policy by the destination server. 
Please retry at a later time. If that fails, contact your system administrator.