Bug #4060
TTCN3 test run failures on jenkins
Status: Closed
% Done: 60%
Description
We quite frequently see tests fail like this:
+ collect_logs
+ fix_perms
+ docker_images_require debian-stretch-build
+ local from_line
+ local pull_arg
+ [ -z ]
+ pull_arg=--pull
+ grep ^FROM ../debian-stretch-build/Dockerfile
+ from_line=FROM debian:stretch
+ echo FROM debian:stretch
+ grep -q $USER
+ echo Building image: debian-stretch-build (export NO_DOCKER_IMAGE_BUILD=1 to prevent this)
Building image: debian-stretch-build (export NO_DOCKER_IMAGE_BUILD=1 to prevent this)
+ PULL=--pull make -C ../debian-stretch-build
make: Entering directory '/home/osmocom-build/jenkins/workspace/ttcn3-bscnat-test/debian-stretch-build'
docker build --build-arg USER=osmocom-build --build-arg OSMO_TTCN3_BRANCH=master \
	--build-arg OSMO_BSC_BRANCH=master \
	--build-arg OSMO_BTS_BRANCH=master \
	--build-arg OSMO_GGSN_BRANCH=master \
	--build-arg OSMO_HLR_BRANCH=master \
	--build-arg OSMO_IUH_BRANCH=master \
	--build-arg OSMO_MGW_BRANCH=master \
	--build-arg OSMO_MSC_BRANCH=master \
	--build-arg OSMO_NITB_BRANCH=master \
	--build-arg OSMO_PCU_BRANCH=master \
	--build-arg OSMO_SGSN_BRANCH=master \
	--build-arg OSMO_SIP_BRANCH=master \
	--build-arg OSMO_STP_BRANCH=master \
	--pull -t docker.io/osmocom-build/debian-stretch-build:latest .
Sending build context to Docker daemon  4.608kB
Step 1/3 : FROM debian:stretch
Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 192.168.111.1:53: read udp 192.168.111.6:42724->192.168.111.1:53: i/o timeout
../make/Makefile:56: recipe for target 'docker-build' failed
make: *** [docker-build] Error 1
make: Leaving directory '/home/osmocom-build/jenkins/workspace/ttcn3-bscnat-test/debian-stretch-build'
+ exit 1
Build step 'Execute shell' marked build as failure
Recording test results
Sending e-mails to: laforge@gnumonks.org
Archiving artifacts
Finished: FAILURE

The odd parts about this are:
- why is "192.168.111.1:53" used as DNS server, despite the underlying operating system using "real" DNS server IP addresses (213.133.98.98, 213.133.98.99, 213.133.98.100)? Even if I manually start a container with "docker run --rm -it busybox", its /etc/resolv.conf is set "correctly", i.e. doesn't show any 192.168.111.1 IP
- why are we rebuilding that container image during the "collect logs" step? This means that we have invested significant time to execute an entire test suite, and then throw away all those results just because a random image used for collecting log files wasn't up to date.
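The first question boils down to where docker gets a container's DNS servers from: with default bridge networking, dockerd derives them from the host's /etc/resolv.conf, so a stale lxc-managed resolv.conf on the build host silently propagates into every container. A minimal sketch of how one might inspect this (the `resolv_nameservers` helper and the `/tmp` sample files are hypothetical, purely for illustration, not part of the Osmocom scripts):

```shell
#!/bin/sh
# Extract the nameserver entries from a resolv.conf-style file, the same
# way one would sanity-check a build host before blaming docker itself.
resolv_nameservers() {
	awk '/^nameserver/ { print $2 }' "$1"
}

# Simulate the broken build host (lxc bridge address as resolver)...
printf 'nameserver 192.168.111.1\n' > /tmp/resolv-broken.conf
# ...and a correct one, using the real resolvers from the ticket:
printf 'nameserver 213.133.98.98\nnameserver 213.133.98.99\n' > /tmp/resolv-ok.conf

resolv_nameservers /tmp/resolv-broken.conf
resolv_nameservers /tmp/resolv-ok.conf
```

On a host in the broken state, the first file is what docker would copy into containers, which matches the `192.168.111.1:53` lookup seen in the failure log.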
Updated by laforge almost 5 years ago
I also cannot find any iptables nat rules or the like which would explain this 192.168.111.1.
Updated by laforge almost 5 years ago
ok, 192.168.111.1 is the IP address of the host operating system on the lxcbr0 device:
3: lxcbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:00:00:01 brd ff:ff:ff:ff:ff:ff
    inet 192.168.112.1/24 scope global lxcbr0
       valid_lft forever preferred_lft forever
We're running docker inside a debian9 lxc. So it seems lxc might actually be to blame for this. I will dig further.
Updated by laforge almost 5 years ago
- Status changed from New to In Progress
Ok, so I was mistaken. I was looking at build slaves on admin2, whereas the failures were on build2. And indeed, the /etc/resolv.conf inside the lxc jails on build2 listed only 192.168.111.1 as DNS server. I'm fixing this now and checking other buildhosts.
Updated by laforge almost 5 years ago
- % Done changed from 0 to 30
https://gerrit.osmocom.org/c/docker-playground/+/14433 should at least not make this problem appear again in the final stage during fix_perms/collect_logs.
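A simplified sketch of the kind of guard such a change can introduce: skip the network-dependent image rebuild (and thus any registry/DNS lookup) when `NO_DOCKER_IMAGE_BUILD` is set, e.g. during log collection. This is an illustration under that assumption, not the actual `docker_images_require` from docker-playground:

```shell
#!/bin/sh
# Sketch: honor NO_DOCKER_IMAGE_BUILD so that late pipeline steps like
# collect_logs/fix_perms never attempt a `docker build --pull`, which
# would fail hard if DNS or the registry is unreachable.
docker_images_require() {
	local image="$1"
	if [ -n "$NO_DOCKER_IMAGE_BUILD" ]; then
		echo "Not building image: $image (NO_DOCKER_IMAGE_BUILD is set)"
		return 0
	fi
	echo "Building image: $image (export NO_DOCKER_IMAGE_BUILD=1 to prevent this)"
	# PULL=--pull make -C "../$image"   # real build elided in this sketch
}

NO_DOCKER_IMAGE_BUILD=1 docker_images_require debian-stretch-build
```

With the guard in place, an out-of-date image only matters at the start of a run, not after the test suite has already produced its results.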
Updated by laforge almost 5 years ago
- % Done changed from 30 to 60
I've now ensured that /etc/resolv.conf contains the "real" name server IP addresses on all our current build slaves.
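A hypothetical sanity check one could run on each build slave to catch a regression of this (the function name and sample paths are illustrative, not an existing Osmocom script): flag any resolv.conf that still points at a private lxcbr0-style address instead of the real resolvers.

```shell
#!/bin/sh
# Fail if a resolv.conf file lists a 192.168.x.x nameserver, i.e. the
# lxc bridge rather than the upstream resolvers.
check_resolv() {
	if grep -q '^nameserver 192\.168\.' "$1"; then
		echo "BAD: $1 points at a private DNS server"
		return 1
	fi
	echo "OK: $1"
}

printf 'nameserver 213.133.98.98\n' > /tmp/resolv-good.conf
printf 'nameserver 192.168.111.1\n' > /tmp/resolv-bad.conf
check_resolv /tmp/resolv-good.conf
check_resolv /tmp/resolv-bad.conf || true
```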
Updated by laforge over 4 years ago
- Status changed from In Progress to Resolved
We haven't seen any DNS-related failures for ~3 weeks now - yay!