In my 1980s version of Empire, all the global variables were kept contiguously in one source file. To save/restore the game, it just took the address of the first one, the address of the last one, and blitted it to a disk file, and blitted it back.
Very fast & easy.
Of course, it broke when COMDATs were introduced.
I did a similar thing with my text editor. The colors were configurable. The usual way was to have a configuration file, which the editor would read upon startup. But floppy disk systems were unbearably slow. So what I did was take the address of the configuration data in the data segment. I'd work backwards to where those bytes were in the EXE file, and patch the EXE file. This worked great!
Until the advent of virus scanners, which broke that. Virus scanners hated self-modifying EXE files.
I love these little "factoid" comments that you hand out from time to time. What would be even better is a book with all the nitty-gritty details of the software you've written, along with all the misadventures in doing so.
You can name it: 'Bright Moments'
I'd be happy to place an advanced order if it leans you in that direction!
See also: Emacs's old unexec() function. How do you speed up a bunch of standard-library Lisp loading and initialization? Just do it once, then write a new exe out with all your state pre-computed. Genius.
I've pre-packed game asset data onto the end of EXE files using this trick, and it seems that newer neural-network virus scanners are the only type of scanner to pick it up. So we may lose the ability to do that soon, once these types of scanners become more popular/prevalent.
This reminded me of the old days working in Windows 3.1 and my first professional project was to write a SOCKS client that could be loaded up and intercept all calls to Winsock's connect() function. It needed to do this without modifying the other programs and it had to happen at the DLL level and not the VxD layer where our IP stack ran.
Turns out there was an undocumented Windows API function along the lines of "AliasCsToDsRegister" or something like that - I've tried to find a reference to it but I can't find it. It allowed me to write into the code segment (the CS was global and read-only, as it was shared among all processes) and replace the first few bytes of the connect function with a jump to my code, which would then put the original bytes back, make the call to the SOCKS server, do some other magic, put my jump hook back in, and return to the caller. Good times!
Kind of surprised I remember this and more so that it actually worked.
Yeah win16 had AllocCStoDS and AllocDStoCS, one was documented the other wasn't. They also had the fabulously named PrestoChangoSelector which toggled the code/data bit in the descriptor table
IIRC there was a documented ChangeSelector function that was actually not implemented, so you had to use the undocumented PrestoChangoSelector instead. Visual Basic used it to implement a direct threaded code interpreter.
Some other functions had similarly great names:
>>Anyway, I am not sure what type of person writes a function called "BozosLiveHere" and puts it into USER.EXE
>This started out life with a non-bozo name as an undocumented function in Windows 3.0.
>Windows 3.1 removed the undocumented function, but we found that some programs were using the undocumented function and started crashing.
>So we reluctantly put the function back, but changed its name to "BozosLiveHere" so that nobody else would use it in the future.
>A similar story exists for "TabTheTextOutForWimps".
I remember doing exactly this to get code injection working on GNU/Linux systems! I made a library injection library in college for some coursework, which involved copying a C function into a code cave in a remote process and getting the remote process to execute it and return. It only works because it's so bare-bones: it doesn't use try/catch, calls to other functions are possible because the function pointers are passed in through registers, and the compiled code is small enough to fit in a single page.
Technically the code being uploaded is compiled from C++ as the rest of the project is written in C++. Kind of similar to the article but in my case all I'm doing is calling a bunch of C library functions from the shellcode.
As you are well aware, calling setjmp and longjmp doesn't make code non-position-independent in the way that Raymond is talking about, because setjmp saves the return address with which it was actually called. It doesn't rely on PC tables in the executable the way that (a common implementation of) C++ exception handling does. I mention this not to inform you of it (you already understand it better than I do) but to keep anyone else reading the thread from being misled.
This was common back in the day with ezines. They would include codes to exploit security bugs in common software but they would have intentional errors so script kiddies couldn't compile them.
Exactly the idea.
If you're a skiddie, you'll copy/paste, look for a minute, say it doesn't work.
If you dive in, you gotta learn about the language, debugging it, the exploit itself and suddenly you're not a script kiddy anymore. It's like a leap of faith.
Really? Compiler error messages are downright friendly compared to networking tools and concepts. When I was 12, I certainly found them to be much easier, even though I completely failed to learn C, Perl, C++, or any of the other popular languages of the time.
Raymond Chen is a great writer. He writes about the driest topics on earth, but still manages to make it entertaining. Definitely one of my favorite blogs.
And he seems to be one of the few who bothers to keep up writing at MS devblogs.
I wonder how it looks internally. Like, can anyone write there? Apart from a university I worked at, management would never let me write any "blogs" about work-related stuff.
I once replied to a flippant comment about the inherent reliability of the cloud with a two-line PowerShell script that would end any 100% cloud-hosted business if executed in the context of an admin account. As in: pack up everything, turn the lights off, lock the doors, and go home because the game is over.
I deliberately obfuscated the script by replacing characters with various Unicode confusables, both the letters and the symbols.
I still felt bad and ended up deleting the post.
Hopefully nobody tried to type it in manually instead of cut & pasting just to see what would happen...
There are cloud-era equivalents that will do a heck of a lot more damage. Think bulk account/subscription deletion with a sprinkling of -force and -purge.
Delete locks and the like will stop this… unless you bulk delete them first.
The admins will get warning emails that they will finish reading in abject terror some time after their entire cloud tenant has gone to heaven.
The modern equivalent of “Operating system not found” is “Click the guides below to get started with your new cloud account.”
# this would wipe all EC2 instances, assuming you are authenticated and nothing else is needed to enumerate them
Get-EC2Instance | Remove-EC2Instance -Force
If there is a way to enumerate objects without explicitly specifying some properties then PS makes it very easy to pipe that output to remove or disable cmdlets.
I never worked with AWS, but if I needed to write such a killer script I would kill instances (as in the example above), remove all S3 objects (quite similar: Get-S3Object | Remove-S3Object), and finally wipe out all IAM roles and users (similar Get/Remove-IAMGroup, ..IAMUser, etc.). As for the subscription/account things @jiggawatts mentions, I'm just not familiar with the AWS lingo. And secrets/creds, of course.
By the time a human starts to investigate why the monitoring went mad (assuming it wasn't hosted with all the other infra on the same account, lol), there would be nothing to do: too many things are gone already, too many more are on the way to being purged, and even if they could be restored there would be too many pieces to tie back together into something resembling a functioning system.
> I pointed out to the customer liaison that what the customer is trying to do is very suspicious and looks like a virus. The customer liaison explained that it’s quite the opposite: The customer is a major anti-virus software vendor! The customer has important functionality in their product that they have built based on this technique of remote code injection, and they cannot afford to give it up at this point.
As an aside, whenever I set up a Windows PC for me or a family member, the first thing I do is uninstall any third-party antivirus that may have come with the computer. I have found that anti-virus software likely makes my computer more insecure by having a big attack surface, not to mention slowing it down.
Actual Chromium developers have similar opinions w.r.t. antivirus vendors.
It's almost like the extensibility necessary to make third-party security products work requires creating entirely new attack surface for those products to work.
You shouldn't trust that uninstalling the AV reliably gets rid of whatever kernel drivers and other detritus it came with. I don't know of any AV examples but plenty of video game anti-cheat software has that problem.
It's not as if the first party antivirus inspires any confidence, either. It still loses in performance to some 3rd party ones, and has had its own list of security issues.
For those who remember, its early versions (before the MS acquisition) required the Visual Basic runtime.
The current Windows Defender is not the crappy old MS AV it used to be. It's gotten drastically better, and for normal home users I always recommend it over installing any 3rd party AV.
I'm a pentester and red teamer, and yes, bypassing Defender isn't all that hard, but neither are most 3rd party AVs, and Defender does not bring in all the instability and additional attack surface. Also I think it can take advantage of newer kernel APIs that non-MS AV can't.
> The customer is a major anti-virus software vendor! The customer has important functionality in their product that they have built based on this technique of remote code injection, and they cannot afford to give it up at this point.
LoadLibrary is not guaranteed to have the same address across processes. As of Win7 you might also have a process link to kernelbase and not kernel32, so it's not even guaranteed to be in the same DLL.
However, it should be possible to use GetModuleHandleEx to find the DLL base address in a remote process, then ReadProcessMemory to implement your own remote GetProcAddress.
You do end up with modules having the same address in every process if untampered, though. This is due to the copy-on-write mechanism Windows implements internally for DLLs to save space. So while it's not guaranteed, on x64 you can be fairly certain that, because of COW, the module will have the same address.
That's not true. You're not considering different virtual addresses backed by the same pages.
Yes, the loader will create file-backed memory mappings and not redundantly store read-only parts. However, it is free to load it at a different address in each process. This can happen via ASLR, or if the mapping is already claimed by the time the module loads.
They may get the same base address repeatedly in multiple processes and work most of the time, but it's not guaranteed.
It's extremely likely for stuff from Kernel32.dll.
> That's not true. You're not considering different virtual addresses backed by the same pages.
technically I suppose, but PEs don't tend to be relocatable, so if it were mapped in at different virtual addresses it would be extremely unlikely to be backed by the same pages, as much of the just-mapped-in code would need relocs
No need to go that far. If you allow an untrusted process to write to your memory, you've already lost. One thing I haven't seen called out yet is that there is security associated with this API call - ordinary users can only call this on processes that they own.
(And as far as the terminology, these aren't "bugs" since they aren't defects in the software.)
I mean sure. If you can do this you can debug the process (how do you think debuggers are implemented?) and ofc once you're debugging a process you can introduce bugs and cause random havoc.
Linux can do this too with ptrace. And mac has something similar I'm sure, you gotta be able to debug.
Now, the target process can implement countermeasures against you, that's what anti-debug is, but it's impossible in the general case to defend yourself against a debugger with the same privilege level as you. (it is an arms race though, so sometimes it's easiest to resort to kernel-mode anti-anti-debug techniques even against pure user-mode anti-debug. If you do that you need to disable KPP, and you can't disable KPP in the supported way because the supported way is to attach a kernel debugger to the kernel, which will make it obvious that someone is debugging!)
You can allocate memory in another process on Unix too: use ptrace to make the other process call malloc (use PTRACE_SETREGS to set PC to malloc and the first argument register to the number of bytes, then intercept the return).
GDB will use this if you tell it something like `p foo("bar")`, as it needs to allocate memory for that string somewhere.
When I took a compiler class back in the early '90s, the project was to write the compiled machine code into an array, then cast the array into a function, and execute it. I and another student were doing it on 68040 NeXT workstations. One other student was doing it on a Mac, one on a VAX, and the rest on PCs (the PC students largely failed!). We were mystified why, when we tried to execute our code, it was as if it wasn't there. Took us a while to realize that the 68040 had separate instruction and data caches, and even more time (and emailing people at NeXT) to determine what the cache flush procedure was.
I wrote a "cd" replacement for cmd[1] a _long_ time ago (I only recently uploaded it to Github).
It uses exactly this technique to run a thread in cmd's process to actually change the directory. It's kept working from XP on up to Windows 11 now. I am always amazed it works, I fully expect it to go boom some day, probably with an error along the lines of "Don't do that, please".
Terrible hacks are an art form that is hugely underrated today, in the name of overengineered best-practice complexity monsters. Sometimes just doing it the stupid way is simpler than doing it properly. Especially when working with proprietary systems ...
I was competing in the Jump Trading programming competition and thought I had a pretty good implementation in AVX asm, but I was still behind one of their engineers, so I asked him after the competition. Turns out he was a Linux kernel committer and wrote a process to spawn multiple threads by copying itself, modifying the parameters, and then setting the offsets directly in the thread table, avoiding all mallocs and thread startup. So basically, his math code was just basic C loops, but his process was complete before my threads even finished allocation.
Forgive me if I got it wrong, I am definitely not a Linux kernel committer.
This reminds me of the time I wanted to run binaries compiled for SSE3 on a system that lacked SSE3. I started writing a tool to emulate this [0], and one thing it could do is rewrite the executable pages with replacement instructions if there was something that would fit (using memcpy(2), naturally).
This harkens back to the days when you could "download" a math coprocessor for your SX system, which was a TSR which likely did the same catching and handling of illegal instructions.
Memcpying and executing code can also surface micro-architectural realities of the underlying CPU and memory subsystem that may need attention from the programmer.
For example:
- On most RISCy Arm CPUs with Harvard-style split instruction and data caches, special architecture-specific actions need to be taken after the memcpy to ensure that any code still lingering in the data cache is cleaned/pushed out to the intended destination memory immediately (instead of at the next cache-line eviction).
- Any stale code that happened to be cached from the destination (either by design or coincidence) needs to be invalidated in the instruction cache.
- Depending on the CPU micro-architecture, speculative prefetching into caches (invisible to the programmer) as a result of the previous two actions may also need attention.
If using paging you may need to invalidate the TLB entry which contains execute permission for the page.
On x86 if using segments, after changing segment attributes you need to reload the segment selectors.
The execution pipeline may need to be flushed, using a serialising instruction.
When modifying code in place that may be being executed by another thread on another core at the same time, some modifications may trigger CPU errata.
On particular CPUs there may be other kinds of caches or state invalidation required, but hopefully the OS provides a "flush I-cache" function that covers all of them.
I've done stuff like this before; it works very well if you know the limitations, and I'd say that it even gives you a better understanding of how things actually work. Of course, don't bother MS or any other "official" vendor if it doesn't work, because you are on your own in debugging it.
Well, it should work in some well-controlled cases, and in those cases, if you're writing this code, you are probably an OS vendor. Examples of legit use cases: an executable loader or some bootstrapping code/bootloader.
When writing code like this, as Chen says, you are bound to the architectural rules regarding how to appropriately locate code and safely invalidate code caches etc.
So if you follow the rules and it doesn't work, then typically I figure you'd take it up with the CPU vendor.
While you could potentially do it and have a good reason to in userspace code, it should be heavily scrutinized because it's so unconventional.
When I read this quote from the customer in the story, I wasn't surprised.
I figured it was likely that code that aggressively scans and modifies other running executables would be written as a kludge, an unorthodox way of abusing the compiler-loader-runtime chain.
The client certainly should have made sure their code was truly position independent.
Also, the client should have embedded their code in the executable file name so they just have to jump to the appropriate offset in argv[0]. This way, future updates just require renaming the file!
At least one additional step which is required on some architectures is you must flush the data cache and invalidate the instruction cache at the location of the new code.
Dynamically loading code is indistinguishable from self-modifying code, and each architecture has special steps you must take in order for it to work.
The "such a bad idea" part of this is that the code being injected is written in C++ rather than assembly. The injector itself is perfectly reasonable.
really? I've not written much shellcode at all, but what I did write wasn't generically-compiled C++ - it was always either C or ASM, specifically because you get to avoid all the platform and position-dependent stuff (except in return-to-libc payloads).
You can definitely use C++, but you need to use specific compiler flags and avoid things like the STL or exceptions. Strings need to be created on the stack, a few other tricks. Then you can extract the .text section assembly of the resulting binary and inject and run it.
That makes sense, but then a C++ without exceptions, without vtables, without the STL - might as well be C, right? There's not much going on there beyond syntax sugar!
You want to use C anyhow as you want to make sure you have control over the code that is output.
For example, with the following code you know what the assembly is going to be:

int strcmp(const char* a, const char* b);

strcmp(str1, str2);

If you do the above as a template you can run into some weird issues that you may not be expecting. So, while tedious, you would need to write your own strcmp (and wcscmp for wide strings). You also have to be very careful not to pull in ANY libraries, since your code needs to be 100% independent and do the loading itself.
C++ exceptions are implemented at the OS level on Windows: C++ exceptions use SEH, and there are also VEH and unhandled-exception filters. You can use SEH from your shellcode; it's just not documented well. But sadly you have to set this up manually by having something like
SetExceptionHandler(curAddr, Handler) // where curAddr can be found by doing something like call $+5, so you remain position-independent
Yeah, it's not super helpful. Maybe slightly smaller code, stricter type checking, possibly a little faster compilation time. But not really a huge benefit over C. Sure beats writing it all in hand-coded assembly though!
Uhh yea, this is Running shellcode 101, works very well. My Red Team stuff at work all starts with a simple loader like this (with some encryption / obfuscation sprinkled in).
When I was first shown this I was like 'What non virus use case does this have!?!?'
Red teamer here too, and this was my exact thought. There's lots of legit uses for DLL injection, but straight up shellcode injection? Shady as hell. So of course it was an AV vendor...
Which is not an end-user advantage but rather a tool to better understand end-user software - equally useful for improving said software or attacking it. Assuming GP meant "virus" as in "malware" then actually this supports his point.
I actually did something like that on Windows x86, and it worked fine. Even I was surprised by that fact :)
I used it to copy out a (forgotten) password from a password input field in another program, which you cannot read remotely (for security reasons). Worked fine for that one use-case, and I haven't used this trick anywhere else since :)
I don't think I have that code around anymore, at least I can't find it now. And it's been a while. Here is what I remember:
Basically you use VirtualAllocEx() to allocate some memory in the remote process. The returned pointers are in the context of the target process.
You can access that remote memory with ReadProcessMemory() and WriteProcessMemory(), which uses those "remote" pointers to copy data to/from your process.
You can then use these memory areas to pass global handles and other stuff around.
For accessing the actual password field data, you use standard Window-Messages with SendMessage() etc.
Some Windows screen readers used, and maybe still use, the same technique to get data out of the SysListView32 common control, since the parameters and results of that control's window messages aren't marshaled by the OS.
This technique is a classic one. I remember learning about it in the late 90's after I got infected with malware that hid itself from Task Manager. That made me write my own Task Manager, too.
To this day (Win10/Win11) you can hide your program from Task Manager using this technique, and any malware that respects itself does it.
I don't know much about Windows GUI programming, but in other GUI toolkits I've used, there's usually some sort of text_field.get_current_value() function you can call. Presumably the parent injected some code that repurposes the callback of a button or something so that when you click it, it calls get_current_value() and then dumps it to console or a log file or something.
Of course all of this is very much undefined behavior in standard C and C++. Some programmers really need to learn that they program the "abstract machine" when they write C or C++.
That is probably the smallest and least interesting of the problems with this. Any method for injecting code into another process is inherently going to be platform-specific and outside the bounds of the C++ abstract machine.
The standard is pointless when what you’re trying to implement is non-portable or architecture-specific by nature. At that point, your compiler implementation and target architecture are what matters.
That's the thing, though. What is considered undefined by the standard is NOT necessarily undefined by the compiler or target architecture. Type punning via union is UB in the standard, but well defined in `gcc`. Signed overflow is UB in the standard, but well defined in `gcc` with `-fwrapv`. Casting a void pointer to a function pointer is UB in the standard, but well defined for pretty much every non-MCU target arch (dlsym relies on it, after all!). Dereferencing a runtime-known NULL pointer is UB in the standard, but is well defined to trigger a segfault in pretty much every arch with an MMU. Etcetera, etcetera, etcetera.
> Dereferencing a runtime-known NULL pointer is UB in the standard, but is well defined to trigger a segfault in pretty much every arch with an MMU.
This is incorrect for C/C++ though. Modern compilers definitely treat null dereferences as UB, with real consequences (e.g. eliminating redundant null pointer checks). The compiler is part of the architecture.
I don't think most of this is actually UB. The cast from a function pointer to `BYTE *` is, but the rest is "sound" (AFAICT) from the C abstract machine's perspective. The reason it fails is basically orthogonal to C.
From the C abstract machine's perspective it is undefined:
1. To subtract the two function pointers after casting to BYTE*.
2. To read the bytes through those pointers.
3. To cast the copied bytes back to function and invoke it.
The article only focuses on 3, and only about how it's undefined because the compiler is not required to generate position independent code. But an optimizing compiler in theory can just optimize the whole function away from looking at the very first undefined line.
The message is that yes, you need to step out of the standard to do stuff like this, but you may have to consult a lot more about your implementation defined behavior than you originally signed up for.
The reason it fails is due to UB. There's not much point in saying that most of it isn't UB since the part that is UB is the part that causes the failure.
The cast isn't the part that causes the failure (C says that function pointers don't have to be safely representable within `void *` or any other fundamental pointer type, but they are on both x86(-64) and Itanium[1]).
The failure happens because the programmer assumed that the code is "self contained" and position-independent, both of which are concepts outside of the C abstract machine.
[1]: Function pointers on Itanium are actually fat, but IIRC most compilers hide this by making the "function pointer" point to some kind of thunk instead.
This is the type of reasoning that leads people to write insecure code, by trying to outguess the compiler with implementation details that are entirely inapplicable. It suggests that if the code were self-contained, or position-independent, or satisfied any property whatsoever, then using memcpy would be a perfectly fine way to copy it. That is simply untrue: it would never be safe to use memcpy on pointers to functions under any circumstance. It's irrelevant whether you are on x86, Itanium, or any other platform; undefined behavior WILL result in incorrect and invalid program semantics and cannot be relied upon to produce consistent results even within the same execution of the same program.
There's literally decades worth of people trying to use undefined behavior in clever ways and ultimately failing and yet here we are...
I think you're reading too far into my reasoning. I'm not a huge fan of UB, and I mostly write Rust these days to get away from cheekiness like this.
My only point was that, for better or worse, UB is not the culprit in this code. C could have well-defined abstract semantics for copying functions or aliasing function pointers through datatype pointers, and this code would still be platform dependent and would still break on different hosts.
This is also untrue, if the standard did mandate that pointers to functions could participate in a memcpy and the behavior was as if the instructions of that function were treated as an array of chars, then C++ compilers would be forced to accommodate that implementation regardless of what platform it ran on, Itanium, x86, even a PDP-11 it wouldn't matter. For example it might mean that implementations must tag or otherwise store a dictionary to keep track of addresses to functions so as to differentiate pointers to functions from pointers to objects and produce whatever appropriate behavior is needed to copy said instructions. An implementation could keep an intermediate representation of any function that can potentially be memcpy'd and then translate that representation at runtime when it participates in a memcpy. Whatever the case may be, if the standard mandates certain behavior then an implementation is required to respect it. While not entirely the same, C++ does do something along these lines in certain situations involving pointers to virtual member functions involved in a diamond inheritance structure.
Saying that C++ (the code in the article is not C and the two languages differ about the treatment of pointers to functions) could have well defined semantics to handle this is entirely moot... it's about as relevant as saying that a C++ program could run on top of the JVM with a garbage collector [1] and hence eliminate all types of memory errors. Even if a C++ program ran on a platform that had guaranteed garbage collection the fact would still remain that C++ as a language does not have well defined semantics for what happens to a dangling reference and as such a C++ compiler is free to exploit that to make very strong assumptions about the runtime behavior of the program for the sake of generating efficient code.
The fact that this is undefined behavior gives a compiler the freedom to perform optimizations under the assumption that runtime behavior will never engender said behavior. Raymond makes use of this property when he discusses COMDAT folding which is a common optimization to elide multiple copies of the same function or to even produce multiple versions of a single function optimized for different scenarios. From the article:
"Even without Profile-Guided Optimization, compile-time optimization may inline some or all of a function, so a single function might have multiple copies in memory, each of which has been optimized for its specific call site."
This property has nothing to do with x86, or Itanium or PDP-11, it's a purely logical optimization permissible only because of the various forms of undefined behavior in C++ with respect to the treatment of function pointers.
> if the standard did mandate that pointers to functions could participate in a memcpy and the behavior was as if the instructions of that function were treated as an array of chars, then C++ compilers would be forced to accommodate that implementation
C/C++ compilers would then not be portable to ISAs with Harvard architecture.
I froze for a moment seeing this article, having worked at a major anti-virus company a long time back and used some low-level Win32 APIs.
Fortunately, I followed some of the techniques from “Programming Applications for Microsoft Windows” book and Detours project to intercept and execute custom code mostly based on loading custom DLL in target remote process and using DllMain() to execute.
Yet copy-on-write works well in the Unix fork/exec() model and helps reduce memory pressure. Presumably the kernel has a mechanism that presents a logically simple "copy" but takes care of the page/pointer/VM bookkeeping.
If you can allocate memory in a foreign process, I'd guess you could also change the permissions on that memory… So write first, then change to executable.
(VirtualProtectEx looks like it would do that. Never used winapi, not sure.)
But the code being memcpy'ed is using the parent process's specific symbol relocations. When a library with PIC is loaded the executable code is copied from the file into RAM to a random offset and all references to structures in the ASM are updated to match their now random offset in memory (simplifying). Say function XYZ in library libfoo is in offsetXYZ, parent process loads libfoo at offset 0xDEADBEEF, injectee process loads libfoo at offset 0xDEC0DE. In Windows the call to function XYZ in the parent process uses the address offsetXYZ+0xDEADBEEF, but the call in the injectee process uses offsetXYZ+0xDEC0DE, causing any reference to the parent process's function to fail. GNU/Linux is very similar but library symbols are found based on an offset to a structure in memory that contains the library symbol metadata, that changes every time the program is loaded.
So actually the opposite is true, if the code wasn't position-independent and was statically located, the assembly code offsets wouldn't need to be updated and you might be able to call a memcpy'd function. Position-independent code could only possibly work if you updated the reference to the symbol metadata structure in the ASM after the memcpy, but at that point you're re-implementing libdl and no longer just using memcpy.
This depends on the libraries/DLLs being used. Windows loads system DLLs at the same location in every process's address space, so you can use process-local offsets in a remote process. For custom libraries of course this wouldn't work. Or if the required system library hasn't been loaded in the remote process.