Saturday, October 19, 2013

Hung Task Bug in Xenomai Kernel

Thanks to Ralf Roesch, some glitches I had been writing off as bad SD cards have turned out to be a significant bug in the Xenomai-patched kernel for the BeagleBone.  The symptom is that the kernel hangs and prints a message similar to the following:

[26160.894920] INFO: task mmcqd/0:74 blocked for more than 60 seconds.
[26160.901577] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26160.909811] mmcqd/0         D c0699e08     0    74      2 0x00000000
[26160.916607] [] (__schedule+0x5b8/0x774) from [] (schedule_timeout+0x1c/0x21c)
[26160.925943] [] (schedule_timeout+0x1c/0x21c) from [] (wait_for_common+0x130/0x170)
[26160.935718] [] (wait_for_common+0x130/0x170) from [] (mmc_wait_for_req_done+0x1c/0x74)
[26160.945851] [] (mmc_wait_for_req_done+0x1c/0x74) from [] (mmc_start_req+0x50/0x158)
[26160.955712] [] (mmc_start_req+0x50/0x158) from [] (mmc_blk_issue_rw_rq+0xa4/0x348)
[26160.965485] [] (mmc_blk_issue_rw_rq+0xa4/0x348) from [] (mmc_blk_issue_rq+0x3fc/0x450)
[26160.975624] [] (mmc_blk_issue_rq+0x3fc/0x450) from [] (mmc_queue_thread+0xa0/0x104)
[26160.985481] [] (mmc_queue_thread+0xa0/0x104) from [] (kthread+0xa0/0xb0)
[26160.994345] [] (kthread+0xa0/0xb0) from [] (ret_from_fork+0x18/0x38)
[26161.002875] Kernel panic - not syncing: hung_task: blocked tasks
[26161.009184] [] (unwind_backtrace+0x0/0xe0) from [] (panic+0x84/0x1e0)
[26161.017756] [] (panic+0x84/0x1e0) from [] (watchdog+0x1d4/0x234)
[26161.025866] [] (watchdog+0x1d4/0x234) from [] (kthread+0xa0/0xb0)
[26161.034065] [] (kthread+0xa0/0xb0) from [] (ret_from_fork+0x18/0x38)
[26161.042529] drm_kms_helper: panic occurred, switching back to text console
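
If you want to check whether glitches on your own board are this same hang, a small script can scan a saved kernel log (for example, captured from the serial console) for the hung-task report shown above. The following is only a minimal sketch: the names `HungTask` and `find_hung_tasks` are made up for illustration, and the pattern assumes the message format matches the log above.

```python
import re
from typing import List, NamedTuple


class HungTask(NamedTuple):
    """One hung-task report pulled out of a kernel log (hypothetical helper type)."""
    timestamp: float   # kernel timestamp in seconds since boot
    task: str          # task name, e.g. "mmcqd/0"
    pid: int           # process ID of the blocked task
    blocked_secs: int  # timeout that was exceeded


# Assumed format, taken from the log above:
# [26160.894920] INFO: task mmcqd/0:74 blocked for more than 60 seconds.
_HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text: str) -> List[HungTask]:
    """Return every hung-task report found in the given kernel log text."""
    reports = []
    for line in log_text.splitlines():
        match = _HUNG_RE.search(line)
        if match:
            reports.append(
                HungTask(
                    timestamp=float(match["ts"]),
                    task=match["task"],
                    pid=int(match["pid"]),
                    blocked_secs=int(match["secs"]),
                )
            )
    return reports


if __name__ == "__main__":
    sample = "[26160.894920] INFO: task mmcqd/0:74 blocked for more than 60 seconds."
    for report in find_hung_tasks(sample):
        print(report)
```

A report naming mmcqd (the MMC queue thread) is a good hint that you are seeing this bug rather than a failing SD card.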

To 'tickle' this bug, all you have to do is boot, get to a console prompt, and run:

grep TestConfig /usr -r

There's a thread on the Xenomai list about this issue, and it looks like a problem with the interrupt code for the AM335x MMC or DMA driver (the MMC code makes heavy use of DMA).  I'm not sure how long it will take to track down this bug, or whether I'll be able to recruit any help from TI or elsewhere.

NOTE:  You can still use LinuxCNC on the BeagleBone, as this issue doesn't show up very often (I've done overnight prints and had systems run fine for a couple of weeks straight), but it definitely needs to be fixed.

UPDATE: I initially figured this was probably an issue with the TI code for the AM335x MMC controller, but it looks like it could be an actual kernel bug, based on the following very similar issue:

UPDATE 2013.10.29: Thanks to tireless efforts by Ralf Roesch, it looks like this is probably fixed.  Ralf identified a 3.12 kernel commit, 7472bab236bdee1173412585591329e718f4d324, that seems to resolve this issue for both Xenomai-patched and 'plain' kernels.  I am still testing, but everything looks good so far.  I have checked in updates to my linux-dev project and expect to make new MachineKit images with updated kernels soon.
