prakashsurya

No description provided.

@prakashsurya prakashsurya requested review from pzakha and sdimitro July 8, 2020 20:51
@prakashsurya
Author

Here's an example of how this would work: https://github.com/prakashsurya/drgn/runs/851497401

delphix-devops-bot pushed a commit that referenced this pull request Dec 20, 2024
drgn currently provides limited control over how debugging information
is found. drgn has hardcoded logic for where to search for debugging
information. The most the user can do is provide a list of files for
drgn to try in addition to the default locations (with the -s CLI option
or the drgn.Program.load_debug_info() method).

The implementation is also a mess. We use libdwfl, but its data model is
slightly different from what we want, so we have to work around it or
reimplement its functionality in several places: see commits
e5874ad ("libdrgn: use libdwfl"), e6abfea ("libdrgn:
debug_info: report userspace core dump debug info ourselves"), and
1d4854a ("libdrgn: implement optimized x86-64 ELF relocations") for
some examples. The mismatched combination of libdwfl and our own code is
difficult to maintain, and the lack of control over the whole debug info
pipeline has made it difficult to fix several longstanding issues.

The solution is a major rework removing our libdwfl dependency and
replacing it with our own model. This (huge) commit is that rework
comprising the following components:

- drgn.Module/struct drgn_module, a representation of a binary used by a
  program.
- Automatic discovery of the modules loaded in a program.
- Interfaces for manually creating and overriding modules.
- Automatic discovery of debugging information from the standard
  locations and debuginfod.
- Interfaces for custom debug info finders and for manually overriding
  debugging information.
- Tons of test cases.

A lot of care was taken to make these interfaces extremely flexible yet
cohesive. The existing interfaces are also reimplemented on top of the
new functionality to maintain backwards compatibility, with one
exception: drgn.Program.load_debug_info()/-s would previously accept
files that it didn't find loaded in the program. This turned out to be a
big footgun for users, so now this must be done explicitly (with
drgn.ExtraModule/--extra-symbols).

The API and implementation both owe a lot to libdwfl:

- The concepts of modules, module address ranges/section addresses, and
  file biases are heavily inspired by the libdwfl interfaces.
- Ideas for determining modules in userspace processes and core dumps
  were taken from libdwfl.
- Our implementation of ELF symbol table address lookups is based on
  dwfl_module_addrinfo().

drgn has taken these concepts and fine-tuned them based on lessons
learned.

Credit is also due to Stephen Brennan for early testing and feedback.

Closes #16, closes #25, closes osandov#332.

Signed-off-by: Omar Sandoval <[email protected]>
delphix-devops-bot pushed a commit that referenced this pull request Sep 27, 2025
The CI has intermittently been hitting the following test failures on
Python 3.8 with Clang:

  ======================================================================
  ERROR: test_task_cpu (tests.linux_kernel.helpers.test_sched.TestSched)
  ----------------------------------------------------------------------
  Traceback (most recent call last):
    File "/home/runner/work/drgn/drgn/tests/linux_kernel/helpers/test_sched.py", line 40, in test_task_cpu
      with fork_and_stop(os.sched_setaffinity, 0, (cpu,)) as (pid, _):
    File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/contextlib.py", line 113, in __enter__
      return next(self.gen)
    File "/home/runner/work/drgn/drgn/tests/linux_kernel/__init__.py", line 203, in fork_and_stop
      ret = pickle.load(pipe_r)
  EOFError: Ran out of input
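
For context, `pickle.load()` raises this EOFError when the write end of a pipe is closed before anything was written to it. The following is a minimal sketch of that situation, not the actual `fork_and_stop()` helper from the test suite; the function name `read_result_from_child` and its `child_writes` parameter are hypothetical, and the child here exits cleanly rather than segfaulting.

```python
import os
import pickle


def read_result_from_child(child_writes):
    """Fork a child and unpickle one object from a pipe it may write to.

    Hypothetical helper for illustration only; returns the unpickled
    object, or the EOFError instance if the child wrote nothing.
    """
    pipe_r_fd, pipe_w_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: optionally report a result, then exit immediately
        # without running any Python-level cleanup.
        try:
            os.close(pipe_r_fd)
            if child_writes:
                with os.fdopen(pipe_w_fd, "wb") as pipe_w:
                    pickle.dump("ok", pipe_w)
        finally:
            os._exit(0)
    # Parent: close our copy of the write end so that EOF becomes
    # visible once the child is gone.
    os.close(pipe_w_fd)
    try:
        with os.fdopen(pipe_r_fd, "rb") as pipe_r:
            return pickle.load(pipe_r)
    except EOFError as e:
        return e
    finally:
        os.waitpid(pid, 0)
```

Calling `read_result_from_child(True)` returns the object the child pickled, while `read_result_from_child(False)` returns the EOFError, because the parent sees end-of-file on an empty pipe. A child that crashes before writing looks the same to the parent.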

The EOFError occurs because the forked process segfaults immediately:

  python[132]: segfault at 7f8f87085014 ip 00007f8f891e9774 sp 00007ffccf7acf00 error 4 in ld-linux-x86-64.so.2[16774,7f8f891d5000+2a000] likely on CPU 0 (core 0, socket 0)

The segfault is on dereferencing cache_new in _dl_load_cache_lookup()
in ld-linux here:
https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-cache.c;h=88bf78ad7c914b02109d6ddef7e08c0e8fd4574d;hb=f94f6d8a3572840d3ba42ab9ace3ea522c99c0c2#l489
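
As an aside, the kernel log line above already narrows down where the fault happened. The sketch below parses that line, computes how far the instruction pointer is from the start of the ld-linux mapping, and decodes the x86 page fault error code. It is not necessarily how the crash was located here. The regular expression and the function name `decode_segfault` are assumptions for illustration, and the pattern only targets the format shown in this message.

```python
import re

# The kernel log line quoted in this commit message.
LINE = (
    "python[132]: segfault at 7f8f87085014 ip 00007f8f891e9774 "
    "sp 00007ffccf7acf00 error 4 in "
    "ld-linux-x86-64.so.2[16774,7f8f891d5000+2a000] "
    "likely on CPU 0 (core 0, socket 0)"
)

# Hypothetical pattern: the leading field inside the brackets is
# optional because some kernels print only "base+size".
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) in "
    r"(?P<name>[^\s\[]+)\[(?:(?P<off>[0-9a-f]+),)?"
    r"(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Extract the faulting mapping and decode the x86 error code bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognized segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    error = int(m["error"], 16)
    return {
        "mapping": m["name"],
        # Distance of the faulting instruction from the mapping start.
        "ip_offset_in_mapping": ip - base,
        # x86 page fault error code: bit 0 = page was present,
        # bit 1 = write access, bit 2 = fault came from user mode.
        "page_present": bool(error & 1),
        "write": bool(error & 2),
        "user_mode": bool(error & 4),
    }


info = decode_segfault(LINE)
print(info["mapping"], hex(info["ip_offset_in_mapping"]))
```

For the line above, "error 4" decodes to a user-mode read of a page that was not present, which is consistent with dereferencing a pointer into memory that is no longer mapped.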

That lookup is reached from a libomp fork handler:

  #0  0x00007f5566f9d774 in _dl_load_cache_lookup (name=name@entry=0x7f55654afde6 "libmemkind.so")
      at ./elf/dl-cache.c:498
  #1  0x00007f5566f91982 in _dl_map_object (loader=loader@entry=0x55f8a170b670,
      name=name@entry=0x7f55654afde6 "libmemkind.so", type=type@entry=2, trace_mode=trace_mode@entry=0,
      mode=mode@entry=-1879048191, nsid=<optimized out>) at ./elf/dl-load.c:2193
  #2  0x00007f5566f959a9 in dl_open_worker_begin (a=a@entry=0x7fffcf5851f0) at ./elf/dl-open.c:534
  #3  0x00007f5566b4ab08 in __GI__dl_catch_exception (exception=exception@entry=0x7fffcf585050,
      operate=operate@entry=0x7f5566f95900 <dl_open_worker_begin>, args=args@entry=0x7fffcf5851f0)
      at ./elf/dl-error-skeleton.c:208
  #4  0x00007f5566f94f9a in dl_open_worker (a=a@entry=0x7fffcf5851f0) at ./elf/dl-open.c:782
  #5  0x00007f5566b4ab08 in __GI__dl_catch_exception (exception=exception@entry=0x7fffcf5851d0,
      operate=operate@entry=0x7f5566f94f60 <dl_open_worker>, args=args@entry=0x7fffcf5851f0)
      at ./elf/dl-error-skeleton.c:208
  #6  0x00007f5566f9534e in _dl_open (file=<optimized out>, mode=-2147483647, caller_dlopen=0x7f55653fa882, nsid=-2,
      argc=9, argv=<optimized out>, env=0x55f8a1477e10) at ./elf/dl-open.c:883
  #7  0x00007f5566a6663c in dlopen_doit (a=a@entry=0x7fffcf585460) at ./dlfcn/dlopen.c:56
  #8  0x00007f5566b4ab08 in __GI__dl_catch_exception (exception=exception@entry=0x7fffcf5853c0, operate=<optimized out>,
      args=<optimized out>) at ./elf/dl-error-skeleton.c:208
  #9  0x00007f5566b4abd3 in __GI__dl_catch_error (objname=0x7fffcf585418, errstring=0x7fffcf585420,
      mallocedp=0x7fffcf585417, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
  #10 0x00007f5566a6612e in _dlerror_run (operate=operate@entry=0x7f5566a665e0 <dlopen_doit>,
      args=args@entry=0x7fffcf585460) at ./dlfcn/dlerror.c:138
  #11 0x00007f5566a666c8 in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>, file=<optimized out>)
      at ./dlfcn/dlopen.c:71
  #12 ___dlopen (file=<optimized out>, mode=<optimized out>) at ./dlfcn/dlopen.c:81
  #13 0x00007f55653fa882 in ?? () from /usr/lib/llvm-14/lib/libomp.so.5
  #14 0x00007f5565413556 in ?? () from /usr/lib/llvm-14/lib/libomp.so.5
  #15 0x00007f5565421d1a in ?? () from /usr/lib/llvm-14/lib/libomp.so.5
  #16 0x00007f5566ac0fc1 in __run_fork_handlers (who=who@entry=atfork_run_child, do_locking=do_locking@entry=true)
      at ./posix/register-atfork.c:130
  #17 0x00007f5566ac08d3 in __libc_fork () at ./posix/fork.c:108
  #18 0x00007f5566e108ad in os_fork_impl (module=<optimized out>) at ./Modules/posixmodule.c:6250
  #19 os_fork (module=<optimized out>, _unused_ignored=<optimized out>) at ./Modules/clinic/posixmodule.c.h:2750

This doesn't happen in Python 3.9. I bisected the fix to CPython commit
45a78f906d2d ("bpo-44434: Don't call PyThread_exit_thread() explicitly
(GH-26758)") (in v3.11, backported to v3.9.6).

That commit describes a different symptom where the process aborts
because libgcc_s can't be loaded. I don't understand how that issue can
cause our crash, but the fix appears to be the same. The discussion also
suggests a workaround: linking to libgcc_s explicitly.

Apply the workaround, which appears to fix our problem. We only do this
for the CI and not for the general build for a few reasons:

1. I'm nervous about explicitly linking to this low-level library
   unconditionally, and the logic to decide when it's necessary (only
   for Python 3.8 and glibc) isn't worth the trouble.
2. The situation required to hit it (drgn + Python threading + fork) is
   unlikely outside of our test suite.
3. Python 3.8 is EOL.
4. Builds with libkdumpfile already pull in libgcc_s via libkdumpfile ->
   libsnappy -> libstdc++ -> libgcc_s.

Signed-off-by: Omar Sandoval <[email protected]>