Me on the 'net
|Written by JLangbridge|
|Monday, 08 November 2010 10:17|
The patient arrived on my desk, and I had to perform an autopsy. I had to know the cause of death, what happened, and exactly how it was possible. What went wrong? How could this have happened? And what could I suggest that would make things safer? I looked at the "client" with a sense of sadness. Far too young to die. Hardly even a month old. The embedded system lay dead on my desk.
I took my time. Everyone wanted to know what happened, but they were told, politely but firmly, to go away. I had things to do, and they would take time. Everyone would be informed when it was over, but in the meantime, I needed to be alone. As far as embedded systems go, this one was big. From the exterior, a metal box. A screen on one end, and connections on the other. I spent my time looking at every angle, every curve and every little imperfection. I was looking for traces; did it fall? Did something hit it? Was there an electrical discharge? Nothing visible. Whatever happened, it wasn't external.
I cleared my desk. I made sure that every screw, every jumper and every cable was far, far away. I had space all around. I needed to perform an autopsy, and open it up. Almost like a surgeon, I had my tools lined up, ready for use. Little dishes were available, to hold the organs and other parts I needed to cut away. I selected a Torx screwdriver, and looked at the first screw. The paint was still on, with no signs of tampering; no one had taken it apart. The first screw came loose with a little force. It had been assembled correctly. One down, 11 to go. A few minutes later, all the screws were removed, and placed in a little dish.
I placed a screwdriver delicately under the plate and turned. It opened up just a bit. I had finally cut my client open. I removed the plate, inspecting it as I went. The cuts were clean and there were no metal scraps or shards on the plate. It was well cut and well prepared. That had been my first suspicion; I've seen too many systems die because of small or microscopic metal fragments dropping onto the circuitry. It was doubtful that anything had fallen into the machine and created a short circuit. The root cause had to be something else. I put the plate down on the desk, and grabbed the torch.
I took my time to look at every component, starting from the power supply. There was no visible damage; no burnt surfaces, no marks on the epoxy suggesting intense heat, and all the solder joints looked well done. This board was made professionally; it was soldered by machine, not by hand. Only one component was soldered by hand, the tool connector, with a total of 15 solder joints, but they were all done correctly. If a component had failed, it had done so silently.
I must have spent about 40 minutes looking at every single component, every connector, every solder joint, looking for any signs, but I couldn't find one. It was time to try a more direct approach. I plugged in the mains, and, carefully, turned it on. It didn't blow up. The screen backlight came on, and the power light, but that was it. Looking at the schematics confirmed what I thought; both the backlight and the power light were controlled directly from the power supply, not from the CPU. The power looked good, but there was no activity from the CPU. I connected the oscilloscope and had a look at the debug connections. The clocks looked good, power levels were well within their tolerances, but the bus was silent. No activity anywhere. Time to dig deeper.
I plugged a JTAG debugger into the debug port, and tried to connect. The auto-diagnostics took a second or two, and the interface came up. Connected. Awaiting orders. OK, so the CPU looks good. Run. Frozen. OK, something must be wrong in the flash. The only problem is, if the system cannot start (because the memory is corrupted, or for any other reason), then the boot-loader is supposed to come up. Maybe the flash circuit is dead? Let's dump it out into a file. The debugger started to flash, and a few seconds later, the contents of the AMD 29F800B were on file. No problems there; apparently the CPU managed to read everything in. Strange. If the flash worked, the boot-loader was supposed to kick in, so why didn't it? Hex editor, vector tables. Zero. Hang on... The first 64k of the flash was zeroed out; all the rest had data. This shouldn't be. I had my theory, but I needed to test it. On the shelf was another embedded system, identical to the one being taken apart, except for the "R&D" sticker. I booted it up into maintenance mode, and ran an internal program that can erase and re-flash the boot-loader. I erased the boot-loader, then crashed the system whilst it was loading the new one. The system remained on, the lights flickering, holding on. Then it went silent. I restarted it, and there it was. Silent. Exactly the same behaviour. Plugging in the JTAG debugger confirmed it; the first 64k of the flash was zeroed.
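The hex-editor check described above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the story: the function names and the synthetic images are made up here, the 64k size and the 0x00 fill byte simply follow the "zeroed" description in the text, and a real dump would be read from whatever file the debugger wrote.

```python
BOOT_REGION_SIZE = 64 * 1024  # "the first 64k", as described in the text


def filled_prefix_length(data: bytes, fill: int = 0x00) -> int:
    """Return how many leading bytes of `data` are equal to `fill`."""
    # bytes.lstrip removes every leading byte that appears in its argument.
    return len(data) - len(data.lstrip(bytes([fill])))


def boot_region_is_blank(data: bytes,
                         size: int = BOOT_REGION_SIZE,
                         fill: int = 0x00) -> bool:
    """True if the dump covers `size` bytes and all of them equal `fill`."""
    return len(data) >= size and data[:size] == bytes([fill]) * size


# Synthetic stand-ins for a flash dump (hypothetical data, not a real image):
# a "dead" image with a blank boot region followed by data, and a "healthy"
# one whose very first byte is already non-zero.
dead_image = bytes(BOOT_REGION_SIZE) + b"\xAA" * 1024
healthy_image = b"\x12" + bytes(BOOT_REGION_SIZE) + b"\xAA" * 1024

print(boot_region_is_blank(dead_image))
print(boot_region_is_blank(healthy_image))
```

For a real dump, the bytes could be loaded with `pathlib.Path(...).read_bytes()` and passed to the same two functions; `filled_prefix_length` then reports how far the blank region actually extends.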
It was murder. Cluedo. The professor, in the lab, with the flash tool. I now knew how it died. But was it really dead? If I could talk to the CPU via JTAG, then I could surely re-flash the beginning of the flash chip. I rooted around to find an entire distribution; boot-loader, stage 1, system, libraries and data. I re-flashed the entire chip on the victim, whilst reminding myself to do the same on the R&D test victim. Afterwards, I unplugged the debugger, and the system rebooted. Everything was silent for what seemed to be a dozen seconds, but in reality it must have been more like a single second. Then it flickered, and a version number appeared on the screen. A second later, the entire system booted up. There was the same emotion as in the film 2010, when Doctor Chandra reconnects HAL, and HAL says "Good morning Doctor Chandra, this is HAL. I'm ready for my first lesson". The system was alive. It was brand new, and had no memories at all. It didn't know what had happened to it, and it had no recollection of its death, or any detail before then. It had no idea it had been in a factory, doing an important job. For all it knew, it had just been reborn. It went off to intensive care, where it would be put through all the hardships of a new system; test cycles, wear and tear, and the usual burn-in, to make sure that everything was OK.
Time to file my report, and let everyone know what had really happened. How did a technician manage to get hold of an application to erase the boot sector? That wasn't my problem; someone else would look into it. My job ended here, but I still needed to revive the R&D machine...