torkell: (Default)
Today's discovery is that if the "mv" command can't simply rename a directory due to permissions, then it'll copy the directory structure instead, attempt to delete the old directory structure, fail because of the aforementioned permissions, and spew umpteen bajillion lines of error messages on the console.

Now, a sensible program would have just said that the file permissions didn't permit renaming and stopped there.

Sigh. Now to unpick the mess and check if it did actually copy everything or not.
torkell: (Default)
Today's discovery is that if you're foolish enough to enable DHCP on an alternate network interface (eth3, in my case) on Linux, then the DHCP client will overwrite your hand-configured default route that actually works with the one it received from the DHCP server.
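For anyone hitting the same thing: one commonly suggested workaround with the ISC dhclient is to trim the per-interface request list so the lease on that interface never carries a routers option, and so nothing overwrites the existing default route. This is a sketch, not something I've verified on every distribution, and "eth3" matches my setup in particular:

```
# /etc/dhcp/dhclient.conf
# Per-interface request list for eth3: ask for an address and
# mask, but deliberately omit "routers" so dhclient has no
# default route to install from this lease.
interface "eth3" {
    request subnet-mask, broadcast-address, time-offset;
}
```

Note that a DHCP server is free to send options you didn't request, so depending on the dhclient version and its hook scripts you may still need to guard the routing table elsewhere.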

Which is not particularly helpful when eth3 is connected to a different LAN with a router that actually checks the IP addresses of packets it forwards. At which point it eats the SSH connection I was using (because that was to an IP address on eth0) and I had to wander down to the lab and dig out a keyboard and monitor.

I am becoming more and more convinced that the Linux network stack just Does Not Work as soon as you plug it into more than one network.
torkell: (Default)
Today's discovery is that Debian isn't smart enough to include its own hostname when sending a DHCP request. The only workaround is to edit dhclient.conf and manually hardcode the computer's hostname. Then restart "networking" and hope the DHCP client actually does come back up (it didn't when I had to do this earlier today, leaving me with a box that didn't want to speak to anything).
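For reference, the hardcoding in question looks like this ("myhost" is a placeholder for the machine's real name):

```
# /etc/dhcp/dhclient.conf
# "myhost" is a placeholder - substitute the machine's actual
# hostname, since dhclient won't look it up for you here.
send host-name "myhost";
```

Later versions of the ISC dhclient do accept `send host-name = gethostname();`, which at least removes the hardcoding, but that wasn't an option on this box.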

For added hilarity, look at Debian bug 151820 (and yes, that was first raised nearly 10 years ago).

Interestingly, on that mini-network the only devices which didn't require funky non-obvious configuration to make DHCP and hostnames work (the setup is such that the DHCP server updates DNS as devices come and go) were some of the SIP phones and a Windows 2000 server.
torkell: (Default)
Linux has this glorious thing known as the out of memory killer. The documentation claims that when the system runs out of memory it carefully works out which process is responsible, and kills it. This is of course completely false. What it really does when the system runs out of memory is select the most mission-critical process on the system and kill that instead. It then kills a few more processes for good measure.

Linux also (by default) happily overcommits memory. This means that you've got no guarantee that any memory you've malloc'd is actually available for use until you try to write to it.

Interestingly there is a justification for this. On Linux, the only way to start a new process is to fork your existing process, creating a complete copy of it. You then replace the copy of your process with whatever you actually want to run (ok, there's vfork(), but the man page for that contains the wonderful gem "[don't use] vfork() since a following exec() might fail, and then what happens is undefined").

So back in ye olden days when you only had 8MB of memory, your 5MB emacs process would fork itself. Both emacs instances now come to a total of 10MB which is 2MB more than you have, but that's OK because the second process hasn't changed anything and so shares the physical memory of the existing process (via copy-on-write semantics). The second one then gets replaced by your 1MB shell or whatever, taking the total down to 6MB. But if the second process actually wants exclusive use of the entire 5MB, then you've got a problem. And that's "solved" by the out-of-memory killer.

It's a perfectly sensible way to work around the insanity of the fork()/exec() model, except computers today have crazy amounts of memory and so don't actually need this workaround. And this workaround would never have been needed if there was an actual "create new process" syscall. Remind me again why Linux is better?

This rant brought to you by three servers getting broken in various ways due to the out-of-memory killer nuking about a half-dozen processes per server. Hope you didn't actually need Tomcat. Or MySQL. Or cron.
torkell: (Default)
Today I discovered the following:

If the KVM isn't set to my Redhat Linux desktop when it starts, then it misdetects the monitor and decides that it is only capable of 800x600. Not unreasonable, but rather annoying.

KDE does not have an equivalent of the Windows "Hide modes that this monitor cannot display" checkbox. Instead KDE only shows modes that it thinks the monitor is capable of, and provides no way to override this in KDE's desktop properties.

Using the display properties to explicitly force the monitor type to "LCD, 1280x1024" requires logging out and back in.

Changing the resolution using the same program requires logging out and back in.


Yes, I have ranted about similar before. Except this time I wasn't running an unusual multi-monitor setup and I wasn't running from a LiveCD.
torkell: (Default)
Today's annoyance was discovering that Linux apparently cannot cope with a system where multiple logical volume groups have the same name. Instead of doing something useful like complaining about this (or even better, warning about this and continuing), it just hangs. Part-way through startup. Before the virtual consoles are up, or any daemons have started.

Fortunately someone else had seen this before and suggested removing a bunch of volume mappings from the SAN. It turns out all the virtual machines on that system (each configured with a separate volume and LUN mapping on the SAN) used the same default name for their volume group, and that caused lvm to sit there burning CPU without actually doing anything. For added annoyance, once the system booted I then had to add the mappings back so the virtual machines on the system could start.

Of course, at no point is there any debug pointing out that I have multiple volume groups with the same name, or for that matter anything beyond a "please wait" message and an apparently hung system. Why must Linux be so hard to use?
torkell: (Default)
How can changing the screen resolution be *so* hard?

All I want to do is set up two monitors, one running at 1280x960@85Hz and the other at 1152x864@75Hz (though I'll settle for 1024x768 on the other). Now, neither monitor advertises these resolutions as being supported, so in Windows I have to go to the Monitor properties and untick the box that says "Only display settings supported by this monitor". Fair enough, it stops people doing stupid things.

On Ubuntu, which is supposed to be nice and friendly and easy... I have to open a shell, use cvt to generate a modeline, use xrandr to tell X about that modeline, use xrandr *again* to tell X which outputs can use that modeline (this is getting too close to manually editing X config files for my liking, and that's never gone well for me). Then because I want to test the resolution first to make sure I remembered it right I then open the Display properties window and pick the resolution. It asks me some mumbo-jumbo about setting the virtual desktop size in a config file, then tells me I have to log out to apply the change!

Ok, so log out, wait for the Ubuntu LiveCD to auto-login... and it's still running at the old resolution. Open the Display properties... and the resolution I added has gone! At this point I gave up, because I'm only using the LiveCD to run a smart test on a hard disk (the Windows smartctl doesn't like the controller it's on).

Come on, Linux, join the 21st Century! I've been able to do this easily in other operating systems for over a decade.


Edit: Oh, and running the BBC's flash-based iPlayer in fullscreen mode has horrendous tearing, along with some clicks and pops from the sound. From a quick Google search it looks like if I spend quite some time easter egging various settings I *might* fix this... or I could fire up the Windows-based laptop which works perfectly.

(Credit where credit's due: Ubuntu detected and got at least partially working sound, graphics, Bluetooth, the SATA controller, the IDE controller, an ethernet card and the 802.11g dongle without needing any hacking or driver hunting)
torkell: (Default)
This set of commands, run in a fresh gdb instance and intended to set a conditional watchpoint, results in gdb segfaulting and leaving an orphaned debuggee running:

(gdb) watch function::variable if function::variable == 0xff
(gdb) run


This set of commands, again in a fresh gdb instance, works:

(gdb) info address variable
(gdb) info address function::variable
(gdb) whatis variable
(gdb) whatis function::variable
(gdb) watch function::variable if function::variable == 0xff
(gdb) run


Discuss.

gdb fail

Sep. 11th, 2009 08:07 pm
torkell: (Default)
It never ceases to amaze me just how backward the Linux development environment is.

Today I attempted to debug a test program that segfaults about 5 minutes after startup for no apparent reason. I managed to get a core dump of it, and loaded it into gdb in the hope of finding what was going on. Hahaha.

gdb could give me a valid stack trace showing the error, and could disassemble the program around the error to show me the actual instructions involved. However, gdb could not tell me the value of all the variables there (it claimed that some variables weren't even defined, nevermind that the program uses them all over the place!), nor could it actually match the disassembly up to the source.

Come on, folks, Visual C++ has been able to do this for decades! The Windows debugging tools are so far ahead it's embarrassing for Linux.

I did actually discover a patch to gdb to achieve this, submitted April last year. Unfortunately it's not in the latest released version of gdb (released March last year), and I really don't fancy building gdb from source myself.
torkell: (Default)
Today's discovery is that Linux gets rather offended when the SAN containing / vanishes.

Interestingly, one of the Linux variants involved was much less offended, and "worked" as long as you didn't try to actually touch the disc. Running stuff like "cat" and "ls" even continued to work, presumably because I'd already run them recently and so they were cached. The other Linux variant, however, was far more offended and limited me to just bash. Specifically, those portions of bash that were currently loaded into memory. And while the first remounted everything read-only when the SAN came back, the second refused to believe in the reappearance of the SAN and wouldn't even let me log in at a real console. For added fun, Ctrl+Alt+Del had no effect because, rather than being handled by the kernel as in Windows, under Linux it merely causes /bin/shutdown to be run.

To be fair I've no idea how Windows would hold up in a similar situation, although I did once uninstall the IDE controller drivers for the C: drive, and while that drive did vanish from the system nothing spectacular happened as a result. Even Internet Explorer continued to work, although stylesheets stopped being applied, and I could even do a controlled shutdown without anything being able to write to C:. Possibly enough of the driver remained to let the kernel speak to the disk, even if it wouldn't admit so.
torkell: (Default)
Today's discovery was a full /var/log/wtmp file.

Those of you who know what that is are probably staring at this going "WTF?". For those that don't know (i.e. non-die-hard-linux-geeks), this file tracks all logins and logouts. Every time someone (or something) logs in or out, an entry gets added to this. And, following the Unix philosophy, no program ever expects that this file might become full. Because, of course, such a thing could never possibly happen. Ever.

Ha. Ha. Ha.

It turned out that the ftpd variant we were using wrote to this file on login/logout (oh yes - on Linux it's the responsibility of each individual program to log account usage, not the operating system), and this particular system had a 2GB file size limit. Why, I don't know - even FAT could handle files larger than that. Anyway, given that this is a load box it was quite easy to hit the 2GB limit, and when that happened, rather than returning an error code, Linux's default behaviour is apparently to send a SIGXFSZ signal. And the default behaviour for *that* is to terminate the process.
torkell: (Default)
Memo to self: Linux doesn't like it when I change the IDE controller. It *really* doesn't like it.

In fact, it hates it so much that it's gone and not so much mangled as randomised small but critical bits of the filesystem, and now various parts of it fall over in interesting ways whenever I run them.

Well, I've been meaning to upgrade to Slackware 10.2 for a while.
torkell: (Default)
Today's linux fortune:

"You'll be sorry..."

Page generated Jun. 10th, 2025 12:23 pm
Powered by Dreamwidth Studios