Code Review: xz/liblzma Backdoor

Root Elements

C:language backdoor:malware

Some background for future visitors

On March 29th, the Linux community was alerted to a significant security breach within the xz/liblzma project, a prevalent library across many Linux distributions. This breach involved the introduction of a backdoor since version 5.6.0, published 4 weeks prior, specifically crafted to affect the distribution tarballs used by Debian and RPM package maintainers.

The backdoor behavior was initially documented by a postgres developer at Microsoft (Andres Freund) in this post. Andres’s discovery of the issue came through observation of anomalies in system behavior during benchmarking, and diligent scrutiny of the build process and distribution files. Andres describes his finding as coincidental, indicating that the community was lucky to have identified the backdoor so early in the package update lifecycle, as it could have easily gone unnoticed if a surprising set of coincidences hadn’t happened.

The method employed for this backdoor involved manipulating the build routines to include the compromised code only in the produced tarballs for x86_64 Debian and RPM systems by replacing compiled shared library files with backdoored ones, effectively evading detection in the source code repository. Much more detail on the initial timeline and summary findings is covered by Evan Boehs here if you would like a full recap.

The backdoor isn’t the whole story, however. The broader context of this breach indicates a pattern of modifications by actor(s) operating under potentially many names, but most notably as “Jia Tan” (@jiat75 on GitHub). The activity spans multiple packages over a few years, and many of these changes appear to be aimed at building trust with package maintainers initially, while introducing small iterative questionable changes as time goes on. These small, seemingly inoccuous changes pose a continued risk for software integrity to this and other open source projects they have been introduced into.

In light of the situation, I will be examining the commit history of Jia Tan’s contributions on the xz project repository (and perhaps others) to identify interesting code changes or other potential concerns they were responsible for introducing in addition to the backdoor.

A brief note

This will not be a line-by-line of each commit, but rather an analysis of the actors potential reasons for these changes and the harms that could come from them.

While I have a good bit of experience dealing with a number of languages, Clang is not my strongest skillset. I anticipate greater minds than mine are already working on fixing these issues quitely behind the scenes, and that I will not fully appreciate the implications of all the changes made over the last years this actor spent committing code. But that doesn’t mean we can’t try to derive some purpose or potentially interesting choices of theirs along the way.

Commit History Analysis

Before we dive too deep into specific commits, we should try to isolate the type of artifacts that this actor was typically involved in by eliminating commits that don’t speak to logic or feature changes. This will give us some perspective as to which commits to focus on, such as those more likely to include interesting code.

Couldn’t they commit logic changes and not mention it in the commit? - Yes, however we are working under the assumption that the package maintainers are not compromised, and thus the actor would have to at least make some effort to align the commit messages with their code changes.

To do this, we will want to review their commit history in the project using the repository on git.tukaani.org since the GitHub repo is now restricted.

While a timeline based approach from the actor’s first commit date seems like a more thorough approach, a popular belief at the moment (as noted by Kevin Beaumont) is that the actor was operating on a tight deadline due to upcoming patches in systemd that would render their work obsolete. The reasoning is mentioned in Kevin’s article so I won’t dive into it here, but it makes sense.

We may also assume that the actor would have had to gain trust with the package maintainers over some period of time. Since the beginning of their commit history was (probably) anticipated to be much more heavily scrutinized, they were likely not submitting brazen exploits or serious logic changes early on.

Working off these assumptions, it seems most reasonable to me that the more apparent issues are also more recent, and thus I will be working backwards in time from the most recent commits.

Top touched and added files

As one might expect, the actor has been very busy since coming into the project. There are quite a lot of files with modifications, but only a small portion have commit counts in the double digits. Notably, many of these files are ones that were involved in the backdoor.

Below are summaries of the number of non-documentation commits that files appear modified in.

While only a small set other than the test compressed files were introduced by them. The following are all files introduced by the actor over the course of their work.

Filtering out likely mundane commits

Latest Jia Tan commits

Screenshot of the git log in `lol` format for Jia Tan

Cursory review of the commit log indicates that we could probably focus on cases of Tests, liblzma, xz:, Build, CMake, and CI to start looking for interesting content.

This leaves us with the following commits:

Filtered Jia Tan commits part 1

Screenshot of first list of commits for Jia Tan

Filtered Jia Tan commits part 2

Screenshot of second list of commits for Jia Tan

Of these commits we can see immediately, based on other posts, that the actor has introduced a number of test files in the last few commits, the most recent of which are known to be a part of the backdoor. These are already undergoing significant analysis, and we will focus on earlier codebase changes instead. The other RISC-V test files have not been associated to this backdoor event yet, that I am aware of.

2024.3.9: The fix for that Valgrind error

commit 82ecc538

Latest Jia Tan commits

The Valgrind issue was noted by others as an actual (not fake) problem that the actor was trying to solve, but the fact that this change involves code pieces that the actor is responsible for introducing is meaningful, and it’s difficult to say if this problem was an unintended consequence of moving too fast, or a basis to make undoing their changes more difficult.

So this commit is slightly interesting for the fact that the code they’re replacing continues to include attribute((no_profile_instrument_function)) , which is the code that they had submitted in two prior commits (below) that disables instrumentation when ifunc is available. I speak more to why I think this might be relevant below.

2024.3.4: Disable instrumentation with ifunc for crc32_fast and crc64_fast

commit 72d2933b

These commits were also pointed out in Andres’ post. I’m not super well versed in these compile options for gcc, but this code appears to specifically disable compiler instrumenting of crc32_resolve and crc64_resolve functions when CRC_USE_IFUNC (ifunc available) is set in order to limit profiling of those functions. This is interesting because we know, also based on Andres’ post, that the crc functions are directly related to the exploit path (excerpt below):

The backdoor initially intercepts execution by replacing the ifunc resolvers crc32_resolve(), crc64_resolve() with different code, which calls _get_cpuid(), injected into the code (which previously would just be static inline functions). In xz 5.6.1 the backdoor was further obfuscated, removing symbol names.
[…] Below crc32_resolve() _get_cpuid() does not do much, it just sees that a ‘completed’ variable is 0 and increments it, returning the normal cpuid result (via a new _cpuid()). It gets to be more interesting during crc64_resolve().
[…] In the second invocation crc64_resolve() appears to find various information, like data from the dynamic linker, program arguments and environment. Then it perform various environment checks, including those above. There are other checks I have not fully traced.
[…] If the above decides to continue, the code appears to be parsing the symbol tables in memory. This is the quite slow step that made me look into the issue.
- Andres Freund

--- a/src/liblzma/check/crc32_fast.c
+++ b/src/liblzma/check/crc32_fast.c
@@ -135,6 +135,11 @@ typedef uint32_t (*crc32_func_type)(
 // This resolver is shared between all three dispatch methods. It serves as
 // the ifunc resolver if ifunc is supported, otherwise it is called as a
 // regular function by the constructor or first call resolution methods.
+// The __no_profile_instrument_function__ attribute support is checked when
+// determining if ifunc can be used, so it is safe to use here.
+#ifdef CRC_USE_IFUNC
+__attribute__((__no_profile_instrument_function__))
+#endif
 static crc32_func_type
 crc32_resolve(void)
 {

--- a/src/liblzma/check/crc64_fast.c
+++ b/src/liblzma/check/crc64_fast.c
@@ -98,6 +98,9 @@ typedef uint64_t (*crc64_func_type)(
 #      pragma GCC diagnostic ignored "-Wunused-function"
 #endif
 
+#ifdef CRC_USE_IFUNC
+__attribute__((__no_profile_instrument_function__))
+#endif
 static crc64_func_type
 crc64_resolve(void)
 {

My interpretation of these changes is that not only do they want to ‘fix’ the compilation issue (that is the “reason” for this change), but they want to also minimize visibility of these code segments. We know that the results of the replaced functions getting called is significant cpu cycles, so the less red flags the better.

2024.2.26: The period that disabled Linux Landlock

commit 328c52da

This change is quite sneaky. As mentioned by others, there is a very evil-looking period that is easy to miss on line 8, just above void my_sandbox.

index 7670059..d2b1af7 100644 (file)
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -901,10 +901,29 @@ endif()
 
 # Sandboxing: Landlock
 if(NOT SANDBOX_FOUND AND ENABLE_SANDBOX MATCHES "^ON$|^landlock$")
-    check_include_file(linux/landlock.h HAVE_LINUX_LANDLOCK_H)
+    # A compile check is done here because some systems have
+    # linux/landlock.h, but do not have the syscalls defined
+    # in order to actually use Linux Landlock.
+    check_c_source_compiles("
+        #include <linux/landlock.h>
+        #include <sys/syscall.h>
+        #include <sys/prctl.h>
+.
+        void my_sandbox(void)
+        {
+            (void)prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+            (void)SYS_landlock_create_ruleset;
+            (void)SYS_landlock_restrict_self;
+            (void)LANDLOCK_CREATE_RULESET_VERSION;
+            return;
+        }
+
+        int main(void) { return 0; }
+        "
+    HAVE_LINUX_LANDLOCK)
 
-    if(HAVE_LINUX_LANDLOCK_H)
-        set(SANDBOX_COMPILE_DEFINITION "HAVE_LINUX_LANDLOCK_H")
+    if(HAVE_LINUX_LANDLOCK)
+        set(SANDBOX_COMPILE_DEFINITION "HAVE_LINUX_LANDLOCK")
         set(SANDBOX_FOUND ON)
 
         # Of our three sandbox methods, only Landlock is incompatible

It’s well documented in other places, but the function check_c_source_compiles is intended to inform on whether you can import linux/landlock.h, a sandboxing capability that allows for unprivileged processes to create sandboxes, during build. The original code was replaced under the following premise:

author	Jia Tan <[email protected]>	
    Mon, 26 Feb 2024 11:02:06 -0400 (23:02 +0800)
committer	Jia Tan <[email protected]>	
    Mon, 26 Feb 2024 11:27:44 -0400 (23:27 +0800)
The previous Linux Landlock feature test assumed that having the
linux/landlock.h header file was enough. The new feature tests also
requires that prctl() and the required Landlock system calls are
supported.

However, the actor goes on to introduce the period on line 8, effectively disabling landlock use. This occurs because it forces the compile check to fail, which is the expected result in cases where landlock was not available on the system. But since they removed any other check for the presence of landlock.h this check silently fails in all cases, and it is not apparent that Landlock support has been disabled. (See section on Capsicum below as well)

2024.2.12: Interesting removal of bounds check in lz_decoder.h

commit de5c5e41 | commit f3872a59

This is an interesting change, that appears to implement a version of dict_put function within lz_decoder.h that doesn’t check bounds on dict before writing, while replacing its use with dict_put_safe in lzma_decoder.c in locations without preconditioned bounds checks. Checking pos against limit is pretty crucial, as it is the safety feature that protects against buffer overflows in these very simple functions.

--- a/src/liblzma/lz/lz_decoder.h
+++ b/src/liblzma/lz/lz_decoder.h
@@ -180,12 +180,22 @@ dict_repeat(lzma_dict *dict, uint32_t distance, uint32_t *len)
 }
 
 
+static inline void
+dict_put(lzma_dict *dict, uint8_t byte)
+{
+       dict->buf[dict->pos++] = byte;
+
+       if (dict->pos > dict->full)
+               dict->full = dict->pos;
+}
+
+
 /// Puts one byte into the dictionary. Returns true if the dictionary was
 /// already full and the byte couldn't be added.
 static inline bool
-dict_put(lzma_dict *dict, uint8_t byte)
+dict_put_safe(lzma_dict *dict, uint8_t byte)
 {
-       if (unlikely(dict->pos == dict->limit))
+       if (dict->pos == dict->limit)
                return true;
 
        dict->buf[dict->pos++] = byte;

While this isn’t malicious at face value, it does replace a singular implicit safe function with needing to explicitly select the safe/unsafe intent in other files then rely on checking bounds before use (which it was doing previously for all calls anyway). The changes in lzma_decoder.c do however, introduce some (likely) performance improvements in the form of a faster default mode when "… it is guaranteed there is enough input to decode the next symbol.".

I posit that this change was actually a weakening of interpretation and implicit handling for the decoder, intended to allow for easier misdirection or questionable use in the future with both modifications to bounds checking operations within decoding functions.

You could argue this is the classic difference between risk and speed that a lot of C programmers walk the line on, but since you end up checking the dict.pos before being able to use dict_put anyway, this change (specifically to lz_decoder.h) doesn’t appear like it would do much to improve performance. The change to lz_decoder.h could be a sniff test for how much attention is being paid in code review – and as it turns out, as shown in the 2nd commit above, Lasse does review, update, and simplify the functions before pushing them up.

2023.3.6-2023.3.7: Adding handling for Capsicum (BSD) missing, and then make it silently fail

commit f070722b | commit 61ee82cb | commit 5fb93678 | commit 01587dda | commit 916448d6 (revert 5fb93678)

The commits are interesting because they’re similar in nature to what we see happen with other files in later commits regarding silent failures. There are a handful of commits that initially iterate the sandbox failure logic, add a failure warning, then remove the warning altogether and default to a silent failure when enabling Capsicum, the sandbox in BSD.

--- a/src/xz/file_io.c
+++ b/src/xz/file_io.c
@@ -193,24 +193,24 @@ io_sandbox_enter(int src_fd)
        cap_rights_t rights;
 
        if (cap_enter())
-               goto error;
+               goto capsicum_error;
 
        if (cap_rights_limit(src_fd, cap_rights_init(&rights,
                        CAP_EVENT, CAP_FCNTL, CAP_LOOKUP, CAP_READ, CAP_SEEK)))
-               goto error;
+               goto capsicum_error;
 
        if (cap_rights_limit(STDOUT_FILENO, cap_rights_init(&rights,
                        CAP_EVENT, CAP_FCNTL, CAP_FSTAT, CAP_LOOKUP,
                        CAP_WRITE, CAP_SEEK)))
-               goto error;
+               goto capsicum_error;
 
        if (cap_rights_limit(user_abort_pipe[0], cap_rights_init(&rights,
                        CAP_EVENT)))
-               goto error;
+               goto capsicum_error;
 
        if (cap_rights_limit(user_abort_pipe[1], cap_rights_init(&rights,
                        CAP_WRITE)))
-               goto error;
+               goto capsicum_error;
 
 #elif defined(HAVE_PLEDGE)
        // pledge() was introduced in OpenBSD 5.9.
@@ -231,7 +231,19 @@ io_sandbox_enter(int src_fd)
        //message(V_DEBUG, _("Sandbox was successfully enabled"));
        return;
 
+#ifdef HAVE_CAPSICUM
+capsicum_error:
+       // If a kernel is configured without capability mode support or
+       // used in an emulator that does not implement the capability
+       // system calls, then the capsicum system calls will fail and set
+       // errno to ENOSYS.
+       if (errno == ENOSYS) {
+               sandbox_allowed = false;
+               return;
+       }
+#else
 error:
+#endif
        message_fatal(_("Failed to enable the sandbox"));
 }
 #endif // ENABLE_SANDBOX

This is the sum of all changes over the listed commits above, which is why you do not see the intermediate ones with warnings.

Again, these changes are not directly malicious, but they weaken the posture of the code and ultimately make it more convenient to inject small sabotage elements into the build or test processes. In order to abuse this segment, one would only need to identify a configuration state that would enable the failure of any one of the cap_enter or cap_rights_limit statements, and the sandboxing would silently fail.

… additional commits [Work in Progress]

I will continue reviewing additional commits as time permits, as updates to this article.