20131124

Tracking down the problem...

OK. After fighting several gdb build issues, I gave up on earlier versions. I thought I'd go old school to try to find the problem. First step was to figure out where we were dying. I had only one clue, the crash location form /var/log/kernel:
Nov 24 19:12:02 (none) kernel: emulate_load_store_insn: sending signal 10 to fred(8249)
Nov 24 19:12:02 (none) kernel: $0 : 00000000 9001fc00 ffffffff 0000001f 9fff7b3c 7fff7b40 00000000 00000000
Nov 24 19:12:02 (none) kernel: $8 : 0000fc00 7fff7b44 00000000 ffffffff 80151b78 000067b2 000033d9 80184020
Nov 24 19:12:02 (none) kernel: $16: 0040bb00 7fff7c44 004015c4 00000003 00407ccc 1002340c ffffffff 100267ac
Nov 24 19:12:02 (none) kernel: $24: 00000000 020cfe48 1000b5f0 7fff7aa0 7fff7aa0 0040899c
Nov 24 19:12:02 (none) kernel: Hi : 00000000
Nov 24 19:12:02 (none) kernel: Lo : 0000004e
Nov 24 19:12:02 (none) kernel: epc : 00408b70 Tainted: P
Nov 24 19:12:02 (none) kernel: Status: 8001fc13
Nov 24 19:12:02 (none) kernel: Cause : 00000010
Nov 24 19:12:02 (none) kernel: 8001e9fc 8001eac0 80022bb4 800215b4 8001d6b8 00408a7c
Nov 24 19:12:02 (none) kernel: 00408a7c 0040899c
We know that epc is the address of the faulting PC in this message. So we died at 0x00408b70. Normally, gdb would tell us where this location is. Something is busted with stabs in gcc 3.0 and 3.3, so I couldn't get a good traceback when we died. So, I had to go old school. First, I disassembled the code:
mips-TiVo-linux-objdump -S -d tivovbi > tivovbi.dis
So I needed find 0x408a7c. Looking for it, we find:
408b64: 8c420000 lw v0,0(v0)
408b68: 00000000 nop
408b6c: 3043001f andi v1,v0,0x1f
408b70: 8c820000 lw v0,0(a0) ----- here ----
408b74: 00000000 nop
408b78: 00621007 srav v0,v0,v1
408b7c: 30420001 andi v0,v0,0x1
408b80: 1040007f beqz v0,408d80 main+0x1054
408b84: 00000000 nop
408b88: 8f99822c lw t9,-32212(gp)
408b8c: 00000000 nop
408b90: 0320f809 jalr t9
408b94: 00000000 nop
Sadly, this isn't as helpful as I'd like. There's no STABs interleaved, nor any source listed. This is less useful than I'd hoped, and likely the reason that gdb can't cope either. So what to do? Have gcc tell me the raw assembler that it is listing...
mips-TiVo-linux-gcc -g -DTIVO_S2 -c tivocc.c -S > tivocc.s
So, now we have to look for the 'lw v0,0(a0)' instruction and hope for the best. Trouble is, gcc outputs raw register numbers, so we have to lookup that in OABI v0 is $2 and a0 is $4. So, we have to look for lw\t$2,0($4) in the output:
$LM506:
.stabn 68,0,930,$LM506
lw $2,76($fp)
srl $2,$2,5
sll $3,$2,2
addu $2,$fp,160
addu $4,$2,$3
lw $2,76($fp)
andi $3,$2,0x1f
lw $2,0($4) ----------- here
sra $2,$2,$3
andi $2,$2,0x1
beq $2,$0,$L348
There's other places where this instruction is used, but this the only place where andi comes in the immediate instruction before the lw. The rest matches tolerably well. So we've found where we started. It turns out .stabn structure 930 is the line number. This leads me back to the following line:
if(FD_ISSET(remote_fd, &rfds))
How can this be wrong? It turns out it isn't wrong. Elsewhere, remote_fd is overwritten. But this turns out to be due to a bug in gcc 3.0. When I updated to gcc 3.3, the crash goes away. It turns out that
if (flags & TEXT_MODE && FD_ISSET(0, &rfds) && (ret = read(fileno(stdin), &c, 1)) == 1 && c == 'q')
is compiled such that it corrupts remote_fd. When parens are put around the first bit, the problem also seems to be fixed. gcc 3.3 compiles both of them identically. This has ended unsatisfyingly, since I don't see the bug in the assembler (so maybe it just moves the actual bug such that it doesn't bite this variable). Despite the odd end, I thought I'd share how I traced down what line this was at, in case others could benefit from the techniques.

No comments: