Dlink-DIR600 rev b hangs intermittently #1

Open
john-peterson opened this issue Jan 11, 2013 · 2 comments

Comments

@john-peterson
Member

Error

The hang is caused by CPU overload from a rogue process that is given enough priority to leave only approximately 10^-5 of the cycles to other processes.

The process is seemingly a kernel process, both because of the large cycle allocation and because the top output shows high sirq directly before the hang, which might be related to it.

The hang lasts around 20 min; example durations are 12, 17, 27 and 33 min.

During the hang the router only services the LAN (ping and its dropbear, telnetd, httpd and dnsmasq daemons); the WAN is unreachable. When the WAN becomes unreachable, the ping time to the router instantly goes from ~1 ms to ~45 ms.

The hang begins and ends abruptly; there's no noticeable increase in latency, reduction in throughput or other problem directly before or directly after the hang.

There's no clear indication of its cause before or after the hang.

It's not possible to identify any client activity that correlates with the error (use of particular network software or a particular use of it, e.g. a utorrent.exe or qbittorrent.exe traffic increase, a skype.exe connection, a teamviewer.exe connection).

Request

A dir600b-revb-ddwrt-webflash.bin built without NO_LOG, so that a trace of the error might be written to /tmp/var/log/messages. Alternatively, the exact commands to build it from a default Debian installation.

Reports of similar occurrences, because searching for "hang" in the Google index of dd-wrt.com/phpBB2 doesn't give any meaningful suggestion.

System information

cat /tmp/loginprompt
DD-WRT v24-sp2 std (c) 2010 NewMedia-NET GmbH
Release: 08/07/10 (SVN revision: 14896)
nvram get DD_BOARD
Dlink-DIR600 rev b

Settings

These are the settings that I'm aware I've changed from the defaults

ip_conntrack_max=16384
static_leases=…
forward_port=…
wan_hwaddr=…
def_hwaddr=…
dhcp_start=100
dhcp_lease=60
sshd_enable=1
sshd_authorized_keys=…
refresh_time=0
wl0_net_mode=n-only
wl0_ssid=…
cron_jobs=…

However, I don't have the original settings. How do I retrieve the nvram show output for the default settings so that I can confirm this list?
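
Once a dump of the defaults is available, confirming the list could be done with a plain line comparison. This is only a sketch: the function name changed_settings and the two file names are hypothetical, and it assumes the defaults dump was saved from nvram show on a factory-default unit running the same build.

```shell
#!/bin/sh
# Sketch: list settings whose key=value line differs from the defaults.
# Assumption: both files hold plain "key=value" lines, one per line.

# Print every line of CURRENT that does not appear verbatim in DEFAULTS.
changed_settings() {  # usage: changed_settings DEFAULTS CURRENT
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -Fxv -f "$1" "$2"
}

# Example use on the router (file names are placeholders):
#   nvram show > /tmp/current.txt
#   changed_settings /tmp/defaults.txt /tmp/current.txt
```

Both new and modified keys show up in the output, so the result should be a superset of the hand-written list above.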

The hang occurred when ip_conntrack_max was the default 4096 too.

Please request the status of other settings that you believe can be relevant and discuss why they might be relevant.

Error tracing

Syslog

The syslog shows no messages for the period in which the hang occurs. In this example, minutes after a hang, the last message is 13 h old

date; cat /tmp/var/log/messages|tail -1
Fri Jan  4 13:01:38 UTC 2013    
Jan  3 23:59:48 DD-WRT user.debug syslog: ttraff: data for 3-1-2013 commited to nvram

The system has no other logs, and to my knowledge no further logging can be enabled. Please correct this statement if it's incorrect, because additional logging could be important for tracing this error.

Custom log

The custom log used to attempt to identify the problem is

cat log
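# append a timestamp and the current connection count
echo -e "$(date -Iseconds)\tip_conntrack_count: $(cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count)\n">>status.log
# append a process snapshot (>> so that the line above is not overwritten)
top -bn1>>status.log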

cat trim
scp -i /tmp/root/.ssh/ssh_host_rsa_key status.log "user@ubuntu:~/ddwrt/$(date -Iseconds|tr : -)-status.log"
rm status.log

cat /tmp/cron.d/cron_jobs
* * * * * root ~/log
0 */12 * * * root ~/trim

Please suggest additional commands that you believe are useful for this log.
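
One possible addition, given that the load shows up as sirq, is to snapshot the kernel's interrupt and interface counters alongside top, so that successive samples can be compared. The helper below is a hypothetical sketch (the name snapshot_files is not part of DD-WRT), and whether these counters actually reveal the source of the load is an assumption.

```shell
#!/bin/sh
# Sketch: append a timestamped copy of each readable file to a log.

snapshot_files() {  # usage: snapshot_files LOGFILE FILE...
    log=$1
    shift
    for f in "$@"; do
        # Silently skip files that do not exist on this kernel.
        [ -r "$f" ] || continue
        echo "== $(date -Iseconds) $f" >>"$log"
        cat "$f" >>"$log"
    done
}

# Example use inside ~/log:
#   snapshot_files status.log /proc/interrupts /proc/net/dev /proc/meminfo
```

Comparing /proc/interrupts and /proc/net/dev between the last sample before a hang and the first one after it might show which interface was busy.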

Identified patterns are

High sirq load for two minutes before the hang, e.g.

CPU:  0.3% usr  0.6% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq 98.9% sirq

compared to around 30% during normal operation, f.e.

CPU:  0.0% usr 66.6% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq 33.3% sirq

[events/0] sometimes has high load directly after the hang (and is otherwise always idle), apparently from an increased event queue size because of the rogue process

  4     2 root     RW<      0  0.0 98.5 [events/0]
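
Since high sirq is the only precursor found so far, the logger could extract that figure and sample more often while it is elevated. A minimal sketch, assuming the BusyBox top CPU line format shown above; the function name sirq_percent is hypothetical.

```shell
#!/bin/sh
# Sketch: print the sirq percentage from BusyBox top output on stdin.
# Assumption: the summary line looks like
#   CPU:  0.3% usr ... 0.0% irq 98.9% sirq

sirq_percent() {
    awk '/^CPU:/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "sirq") {
                v = $(i - 1)      # the field before the label, e.g. "98.9%"
                sub(/%/, "", v)   # drop the percent sign
                print v
            }
        }
    }'
}

# Example use:
#   top -bn1 | sirq_percent
```

The printed value could be compared against a threshold such as 90 to decide when to write extra samples.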

Other discussion

Directly after the hang the historical load average is high, e.g. (from the custom log)

Load average: 14.54 12.73 10.49 10/52 19407
Load average: 20.13 17.31 13.22 10/53 19458
Load average: 11.50 10.40 6.98 10/43 20090

Compared to normal operation with unchanged client circumstances

Load average: 0.00 0.00 0.00 1/29 21511
Load average: 0.00 0.00 0.00 2/32 21528
Load average: 0.00 0.00 0.00 1/29 21546

The fact that the top output freezes in the middle of an output illustrates how abruptly the rogue process begins to consume all cycles. These are two examples of hung top output (the only meaningful pattern seems to be the high sirq load)

Mem: 16256K used, 13344K free, 0K shrd, 1540K buff, 5060K cached
CPU:  0.1% usr  0.0% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq 99.8% sirq
Load average: 2.46 6.11 7.57 5/21 14745
  PID  PPID USER     STAT   VSZ %MEM %CPU COMMAND
  803     1 root     S     1340  4.5 36.3 telnetd
  857     1 root     R     1100  3.7 35.0 dnsmasq --conf-file=/tmp/dnsmasq.conf
 1115     1 root     R     4564 15.3  9.7 httpd -p 80
    3     2 root     RW<      0  0.0  5.7 [ksoftirqd/0]
14668 14663 root     R     1348  4.5  1.4 top
  692     1 root     S     2148  7.2  0.0 resetbutton
  850     1 root     S     1888  6.3  0.0 ttraff
 1208     1 root     S     1884  6.3  0.0 process_monitor
    1     0 root     S     1628  5.4  0.0 /sbin/init noinitrd
14663   803 root     S     1356  4.5  0.0 -sh
 2239     1 root     S      976  3.2  0.0 cron
 1527     1 root     S      976  3.2  0.0 udhcpc -i vlan2 -p /var/run/udhcpc.pi

Mem: 16336K used, 13264K free, 0K shrd, 1540K buff, 5060K cached
CPU:  0.2% usr  0.2% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq 99.5% sirq
Load average: 3.66 6.96 7.93 6/21 14878
  PID  PPID USER     STAT   VSZ %MEM %CPU COMMAND
  857     1 root     R     1100  3.7 47.0 dnsmasq --conf-file=/tmp/dnsmasq.conf
  803     1 root     S     1340  4.5 36.6 telnetd
 1115     1 root     R     4564 15.3  5.8 httpd -p 80

dnsmasq

dnsmasq does not appear to be involved: the following command was run during the hang (through an established ssh connection, taking around five minutes to visibly return) and killing dnsmasq did not affect the hang

root@DD-WRT:~# killall dnsmasq; ps|grep dnsmasq
 7043 root      1336 S    grep dnsmasq

Connection flood?

Is it possible to see the new connection rate to the system?

The router has a high connection count (/proc/sys/net/ipv4/netfilter/ip_conntrack_count) during normal operation (from BitTorrent traffic). Sometimes the connection count is higher after the hang, e.g.

2013-01-11T14:45:02+0000    ip_conntrack_count: 3172
2013-01-11T14:51:14+0000    ip_conntrack_count: 9250

and sometimes not

2013-01-17T13:33:04+0000    ip_conntrack_count: 4114
2013-01-17T13:45:26+0000    ip_conntrack_count: 3749

So it's not clear that an increase in the number of connections (or traffic) is correlated with the hang. Where the connection count does increase, that may be a result of the hang (because WAN connections time out or are queued during it) rather than a cause.

A high connection count by itself doesn't use many resources, and during normal operation (above) the load is often 0.00, so it would be useful to see the rate of new connections.
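
One way the new-connection rate might be obtained is from the kernel's conntrack statistics, if this build exposes them. The sketch below assumes /proc/net/stat/ip_conntrack exists, has a header on the first line, and carries the hexadecimal "new" counter in the fourth column; none of that has been verified on this firmware, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: estimate new conntrack entries per second from two samples.
# Assumption: /proc/net/stat/ip_conntrack has one header line, then one
# row of hex counters per CPU, with "new" in column 4. Check with:
#   head -1 /proc/net/stat/ip_conntrack

# Print the "new" counter, summed over all CPU rows, in decimal.
conntrack_new_total() {  # usage: conntrack_new_total STATFILE
    total=0
    for hex in $(awk 'NR > 1 { print $4 }' "$1"); do
        total=$((total + 0x$hex))
    done
    echo "$total"
}

# Print whole new connections per second between two samples.
conntrack_new_rate() {  # usage: conntrack_new_rate BEFORE AFTER SECONDS
    echo $(( ($2 - $1) / $3 ))
}

# Example use:
#   a=$(conntrack_new_total /proc/net/stat/ip_conntrack); sleep 60
#   b=$(conntrack_new_total /proc/net/stat/ip_conntrack)
#   conntrack_new_rate "$a" "$b" 60
```

Unlike the difference between two ip_conntrack_count readings, this counter only increases, so it is not masked by connections that expire during the interval.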

DNS lookup flood?

Is it possible to see the DNS lookup rate to dnsmasq?
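
If dnsmasq is started with its log-queries option, each lookup is logged together with the requesting client, and the log can then be counted per client. A minimal sketch, assuming those lines reach a readable log file in the usual "query[A] name from address" form; the function name dns_queries_per_client is hypothetical.

```shell
#!/bin/sh
# Sketch: count logged DNS queries per client address, busiest first.
# Assumption: dnsmasq runs with log-queries and its log lines end in
#   ... dnsmasq[PID]: query[TYPE] NAME from CLIENT

dns_queries_per_client() {  # usage: dns_queries_per_client LOGFILE
    grep 'dnsmasq.*query\[' "$1" |
        awk '{ print $NF }' |        # last field is the client address
        sort | uniq -c | sort -rn |  # count per client, highest first
        awk '{ print $1, $2 }'       # normalise to "COUNT ADDRESS"
}

# Example use:
#   dns_queries_per_client /tmp/var/log/messages
```

Running it over the minutes before a hang would show whether a single LAN client dominates the lookups.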

@BrainSlayer
Contributor

Looks like an out-of-memory condition. Limit the max conntrack to a sane value for the available memory. And yes, judging from the high CPU load on dnsmasq, it's possible that it's a DNS request flood. Since DNS is blocked from the WAN side by default, this flood must be caused by your inner network. If the connection tracking table is full, you won't get any new connections established, including ping. This will cause the router to be unresponsive, which is also some sort of feature since it just protects itself.
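
A rough way to check the table-full theory is to log how close ip_conntrack_count gets to ip_conntrack_max. The helper below is a hypothetical sketch; it only does the arithmetic and assumes both values are read from /proc/sys/net/ipv4/netfilter/.

```shell
#!/bin/sh
# Sketch: conntrack table usage as a whole-number percentage.

conntrack_usage_percent() {  # usage: conntrack_usage_percent COUNT MAX
    echo $(( $1 * 100 / $2 ))
}

# Example use on the router:
#   dir=/proc/sys/net/ipv4/netfilter
#   conntrack_usage_percent "$(cat $dir/ip_conntrack_count)" \
#                           "$(cat $dir/ip_conntrack_max)"
```

With the figures logged above, 9250 entries against a limit of 16384 is 56%, while 3172 entries against the default limit of 4096 would be 77%.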

@Faq

Faq commented Oct 31, 2020

@john-peterson does this still happen with newer builds?
