Art,
Incredible, and useful information for now and for the future.
Luckily I have convinced everyone involved here of the
wisdom of going with Raw db chunks.
One Sys Admin did mention a new angle that I had not mentioned
previously.
I don't know if you do much with IDS in VMware instances, but
we will have ours on an ESX Server. The Admin's concern,
which regards scenarios of power failure or failure of some
sort with the ESX Server, is loss of data and/or data
integrity due to the buffering that takes place between the VM, containing
my dbspaces, and the physical ESX Server host.
Any info on this?
Thanks again for your help,
Jim
-----Original Message-----
From: ids-bounces@iiug.org [mailto:ids-bounces@iiug.org] On Behalf Of Art
Kagel
Sent: Wednesday, July 28, 2010 7:49 AM
To: ids@iiug.org
Subject: Re: Are there issues when using RAW dbspaces o.... [20668]
My last comments on journaling. On meta-data change journaling:
- This (logical metadata only journaling) is the method used by EXT3,
EXT4, and ZFS
- All three use block relocation instead of physical block journaling.
This means that on write a block is always written to a new location rather
than overwriting the existing block on disk. A properly designed JFS
(Journaled File System) will commit the new version of the disk block before
updating the metadata or the logical journal. That's the problem with EXT4,
and with EXT3 when write-back is enabled: they write the metadata first, then
the journal entry, before actually committing the physical change to disk.
Once the write and journal are completed, the FS metadata is updated. This
means that on a crash there are three possibilities:
- The new block version was partially or completely written but the
journal entry was not written.
- The new block version and journal entry were written and committed.
- The new block version, journal, and metadata were written and
committed.
In the first case, after recovery, the file remains unchanged. In the
second case, after recovery, the FS makes the missing metadata entries and
the file is modified during recovery and the original block version is freed
for reuse. In the third case all was well before the crash and the original
version of the block was released for reuse.
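The three cases above can be sketched as a toy model. This is only an illustration of the ordering argument, not how any real filesystem is implemented; the names CrashState and recover are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CrashState:
    """What had reached the disk at the moment of the crash (hypothetical model)."""
    block_committed: bool     # new block version fully written to its new location
    journal_committed: bool   # journal entry describing the relocation is on disk
    metadata_committed: bool  # FS metadata already points at the new location


def recover(state: CrashState, old: str, new: str) -> str:
    """Return which block version the file holds after recovery."""
    # A properly ordered JFS never journals a block it has not committed,
    # and never updates metadata before the journal entry exists.
    if state.journal_committed and not state.block_committed:
        raise ValueError("ordering violated: journal committed before block")
    if state.metadata_committed and not state.journal_committed:
        raise ValueError("ordering violated: metadata committed before journal")
    if state.metadata_committed:
        return new  # case 3: everything was on disk before the crash
    if state.journal_committed:
        return new  # case 2: recovery replays the journal and fixes the metadata
    return old      # case 1: no journal entry, so the file is left unchanged


if __name__ == "__main__":
    cases = {
        "case 1": CrashState(True, False, False),
        "case 2": CrashState(True, True, False),
        "case 3": CrashState(True, True, True),
    }
    for name, state in cases.items():
        print(name, "->", recover(state, "old block", "new block"))
```

In every legal state the file ends up holding either the complete old version or the complete new version of the block, which is the point of committing the block first.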
The problem with EXT4 (and EXT3 with write-back enabled) is that the
application (meaning in this case Informix) thinks everything is hunky dory
since the FS acknowledged the change as committed. However, immediately
after the acknowledgement the physically modified block is still ONLY in
cache and only the metadata and journal entry have been saved to disk. At
this point if there is a crash, the file is actually unrecoverable! The
metadata and the journal entry say the block has been moved to a new
location and rewritten, but the new location has garbage in it from some
previous block. This one made Linus Torvalds absolutely livid and he tore
the EXT4 designers a new one over the design. Last I heard you could not
disable the write-back behavior of EXT4 - Linus was pushing to have that
fixed, but I don't know if it ever was.
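A toy model of the write-back failure as described above, where the journal and metadata reach the disk before the data block does. It only illustrates the ordering problem under that description; the function name and the "garbage" placeholder are made up for the example.

```python
def recover_writeback(data_on_disk: bool, journal_on_disk: bool,
                      old: str, new: str, stale: str = "garbage") -> str:
    """Return what the file holds after a crash when the journal is written first."""
    if journal_on_disk:
        # Recovery trusts the journal: the block "moved" to its new location.
        # If the data was still only in cache, that location holds whatever
        # some previous block left there.
        return new if data_on_disk else stale
    # No journal entry: the file still points at the old block.
    return old


if __name__ == "__main__":
    # The write was acknowledged, then the machine crashed before the cache flushed.
    print(recover_writeback(data_on_disk=False, journal_on_disk=True,
                            old="old block", new="new block"))
```

The second call pattern, with the journal on disk and the data not, is the one a correctly ordered JFS can never reach.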
EXT3 in default mode and ZFS at least are safe, but the problem with them is
just the fact of the block relocations. There is the performance problem of
rewriting a whole block every time the database changes a single page within
the block, which negates much of the gain from caching, and there is the
bigger problem that the file is no longer even as contiguous as a
non-journaled filesystem would have it be. Standard UNIX filesystems
allocate blocks of contiguous space and try to leave free space that is
contiguous with those allocated blocks unused when allocating space for
other files so that as a file grows it remains mostly contiguous in
multi-block chunks. This fragments the free space in an FS, making it
difficult to write very large files (like Informix chunks) that are
contiguous, but if you keep the chunks on an FS that's dedicated to Informix
chunks that's not a real problem (at least currently), since Informix does
not currently extend existing chunks over time. JFSs break that rule,
keeping a file contiguous only at the level of individual blocks. Even if
a chunk were allocated as contiguous initially, over time the JFS will cause
the file to become fragmented. If you make the FS block size smaller to
alleviate the costs of multiple block rewrites, you make the file
fragmentation worse.
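The trade-off between rewrite cost and fragmentation can be put into rough numbers. This is back-of-the-envelope arithmetic only: the 2 KiB page size, the two FS block sizes, and the 2 GiB chunk are assumed values chosen for illustration, not measurements.

```python
def write_amplification(fs_block_bytes: int, page_bytes: int) -> float:
    """Bytes rewritten on disk per byte the database actually changed."""
    return fs_block_bytes / page_bytes


def worst_case_fragments(chunk_bytes: int, fs_block_bytes: int) -> int:
    """Upper bound on separately located pieces if every block gets relocated."""
    return -(-chunk_bytes // fs_block_bytes)  # ceiling division


if __name__ == "__main__":
    PAGE = 2 * 1024       # assumed database page size
    CHUNK = 2 * 1024**3   # assumed chunk size
    for block in (64 * 1024, 4 * 1024):  # assumed FS block sizes
        print(f"FS block {block // 1024:>2} KiB: "
              f"{write_amplification(block, PAGE):.0f}x rewrite cost per page, "
              f"up to {worst_case_fragments(CHUNK, block)} fragments")
```

With these assumed sizes, shrinking the block from 64 KiB to 4 KiB cuts the rewrite cost per page from 32x to 2x but raises the worst-case fragment count from 32768 to 524288.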
These problems don't affect filesystems and normal files as much as
databases because the nature of the IO to files is different than IO to
databases. When you write to a flat file, you write mostly sequentially,
you rarely rewrite a portion of the file (unless you rewrite the entire
file), and you never sync the file to disk before you close the file. That
means that the cache will coalesce all writes until an entire block has been
written out before the FS and OS cause a flush and sync of the cache to
disk. That means that the FS has the ability to try to keep the rewritten
blocks contiguous by allocating the replacement blocks contiguously.
Essentially the file is relocated whole if it is rewritten.
Databases don't work that way. Informix writes every block to a COOKED
device or file under either O_SYNC or O_DIRECT control, both of which force
the single write (and Informix only ever writes a single page or eight
contiguous pages at a time) to be physically written and committed before
the write() call returns. That means that the coalescing features of the FS
and OS cache management are bypassed in favor of data safety. It also means
that if the engine performs what it thinks is a sequential scan of a file on
a JFS, it is actually performing a random read of the file, swinging the
read/write heads back and forth across the disk. If the physical structure is
shared with other applications (can you say massive SAN?), the engine will
also be competing with those other applications for head positioning. In
normal sequential scanning (i.e. RAW or COOKED device or non-JFS files) the
read ahead reduces the performance impact of this head contention somewhat.
In a JFS, it cannot help at all. So I guess I have to change my mantra:
NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO
RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!!
NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO
RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!!
NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO
RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!!
NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO RAID5!!! NO JFS, NO
RAID5!!! NO JFS, NO RAID5!!!
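A minimal sketch of the synchronous single-page write pattern described above, as seen from the application side. It assumes a POSIX system and a 2 KiB page; the function name and file layout are made up for the example, and it does not claim to reproduce Informix's own I/O code. O_DIRECT is left out because it additionally requires aligned buffers.

```python
import os
import tempfile

PAGE_SIZE = 2048  # assumed page size, for illustration only


def write_page_sync(path: str, page_no: int, data: bytes) -> int:
    """Write one page at its offset; return only once the OS reports it committed."""
    if len(data) != PAGE_SIZE:
        raise ValueError("data must be exactly one page")
    # O_SYNC makes each write wait for the data to reach stable storage,
    # so the OS cache cannot hold it back and coalesce it with later writes.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        return os.pwrite(fd, data, page_no * PAGE_SIZE)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        chunk = os.path.join(tmp, "chunk")
        written = write_page_sync(chunk, 3, b"\x00" * PAGE_SIZE)
        print(written, "bytes committed; file size", os.path.getsize(chunk))
```

A real engine would keep the descriptor open rather than reopening the file for every page; the open call is inside the function here only to keep the sketch self-contained.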
Art
Art S. Kagel
Advanced DataTools (www.advancedatatools.com)
IIUG Board of Directors (art@iiug.org)
Disclaimer: Please keep in mind that my own opinions are my own opinions and
do not reflect on my employer, Advanced DataTools, the IIUG, nor any other
organization with which I am associated either explicitly, implicitly, or by
inference. Neither do those opinions reflect those of other individuals
affiliated with any entity with which I am affiliated nor those of the
entities themselves.
On Wed, Jul 28, 2010 at 4:18 AM, Fernando Nunes <domusonline@gmail.com> wrote:
> I think nobody explains it better than Art, but I'd like to stress two
> points:
>
> 1- Every sysadmin talks about the advantages of journaling, but it's
> amazing how these people forget about reality. Let's start with this
> Wikipedia article:
> http://en.wikipedia.org/wiki/Journaling_file_system
> They split journaling into two kinds: Physical and Logical. The first
> logs all blocks (data blocks also) and the second only file metadata.
> Let's dig into this: Physical has a lot of performance impact (which is
> obvious) and it's ABSOLUTELY useless for databases, since the databases
> MUST do this themselves.
> The logical journaling only stores metadata changes. PLEASE, can anybody
> ask the sysadmin what kind of metadata changes in a filesystem where only
> Informix chunks are stored?! We don't (currently at least) change the size
> of chunks...
>
> 2- The backup argument is alarming, because this is not a sysadmin
> task or responsibility, and the DBA must make sure it remains his
> responsibility. Surely no one is thinking about doing filesystem backups
> of database chunks with a live database...
>
> Regards.
>
>
> On Tue, Jul 27, 2010 at 10:59 PM, Art Kagel <art.kagel@gmail.com> wrote:
>
> > YOW! Cooked spaces, Journaled filesystems (bet they want to use EXT4 to
> > boot), AND VM images. You've got the triple crown there Jim! Lord God in
> > Heaven, tell me they're not also saddling you with RAID5, RAID6, or
> > RAIDZ on top of all that!
> >
> > 1) ESX doesn't care about RAW spaces AFAIK, and VMs and RAW disk work
> > about as well as VMs and any other form of storage, which is to say VERY
> > BADLY! My testing for a major developer of highly embedded systems shows
> > that IO under a VM performs 10x SLOWER than IO on the underlying
> > hardware/OS. I would seriously consider running your server on a
> > commodity Linux box instead.
> >
> > 2) You are correct, of course. Any backup of Informix chunks made at the
> > operating system level, especially if it is made, as seems to be the
> > intent of your SAs, at the level of the underlying host OS, will be
> > completely useless for restoring the database unless you put the engine
> > into external backup mode and block transactions for the duration of the
> > backup. Otherwise only an ontape or onbar backup will be usable to
> > restore the engine.
> >
> > 3) That "little bit" is a 10-20% performance increase of RAW device
> > versus COOKED device, and an additional 5-10% for non-journaled
> > filesystem chunks versus COOKED devices. That is without O_DIRECT
> > enabled; enabling it cuts the cost down to about 5-10% RAW over COOKED
> > and another 2-5% for filesystems, which, while it is MUCH better, is not
> > trivial. Add to that the extra cost of journaling (at least 5% but
> > usually more like 15%) and the cost of doing all of this on a VM versus
> > raw hardware (about 90% in my testing).
> >
> > 4) The journaling, as I've already stated, is redundant and therefore a
> > performance hit that buys you NOTHING! Informix's logical and physical
> > logging is FAR more efficient at recovering the database after a crash,
> > and adding the filesystem recovery to that will only delay the beginning
> > of the engine's fast recovery mechanism.
> >
> > 5) Run! Run fast and run far. You do NOT want to be associated with this
> > system once it's rolled out.
> >
> > Art
> >
> > Art S. Kagel
> > Advanced DataTools (www.advancedatatools.com)
> > IIUG Board of Directors (art@iiug.org)
> >
> > On Tue, Jul 27, 2010 at 3:14 PM, Jim Cramer <jim-cramer@uiowa.edu> wrote:
> >
> > > HELP!
> > >
> > > Here is another angle to the recent questions, and explanations
> > > by Art, et al., regarding using RAW instead of COOKED dbspaces
> > > on IDS 11.5 running on Linux.
> > >
> > > For reasons cited by Art and the others, I (the DBA) wish to use
> > > RAW dbspaces when I move my instances to Linux (SUSE 11 SLES)
> > > Virtual Machines on an ESX Server running VMware.
> > >
> > > But my Sys Admins here are refusing to allow RAW spaces and citing
> > > all kinds of vague, generalized reasons why, such as:
> > >
> > > 1) the "host administration utilities" will not fly right
> > > with Raw Spaces, but they will not be specific about what would
> > > go wrong. In general, my sense is that they feel that
> > > the Management Console and Tools for the ESX host might
> > > see the Raw Dbspace and, because it does not contain a formatted
> > > filesystem, allocate it to something on the box.
> > >
> > > 2) using raw space will not allow the VMs containing the IDS
> > > dbspaces to be backed up or fit into their backup strategy
> > > and backup utility.
> > >
> > > They cannot seem to understand that a normal backup utility
> > > would not know how to deal with the dbspaces even if they
> > > were Cooked.
> > >
> > > 3) that the "little bit" of performance that they claim I might
> > > get with Raw will not pay back for the increased Sys Admin
> > > overhead.
> > >
> > > 4) that they will use a Journaled File System (along with Cooked
> > > dbspaces) because it is more robust, fault-tolerant, comes
> > > back up quicker after a crash, etc.
> > >
> > > Can anybody provide me with any concrete information/experience
> > > that is related to the above points, particularly (1)?
> > >
> > > Does anyone know if raw dbspaces can cause problems in a Virtual
> > > Machine on a VMware ESX Server?
> > >
> > > If you have some info and have time to send it soon, that would
> > > be appreciated because I am about to do battle over this with
> > > our Sys Admins.
> > >
> > > Thanks much in advance,
> > >
> > > Jim Cramer
> > > Database Administrator and
> > > Applications Developer III
> > > University of Iowa
> > > College of Engineering
> > > Engineering Computer Systems Support
> > > 1256 SC
> > > Iowa City, Iowa 52245
> > > (319)-335-5757
> > > jim-cramer@uiowa.edu
> > > jcramer@engineering.uiowa.edu
> > > http://css.engineering.uiowa.edu
> > > http://www.engineering.uiowa.edu
> > > http://www.uiowa.edu
> > >
> --
> Fernando Nunes
> Portugal
>
> http://informix-technology.blogspot.com
> My email works... but I don't check it frequently...