Self-modifying code

In computer science, self-modifying code is code that alters its own instructions while it is executing – usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, thus simplifying maintenance. The term is usually only applied to code where the self-modification is intentional, not in situations where code accidentally modifies itself due to an error such as a buffer overflow.

Application in low and high level languages
Self-modification can be accomplished in a variety of ways depending upon the programming language and its support for pointers and/or access to dynamic compiler or interpreter "engines":
• overlay of existing instructions (or parts of instructions such as opcode, register, flags or addresses)
• direct creation of whole instructions or sequences of instructions in memory
• creation or modification of source code statements followed by a "mini compile" or a dynamic interpretation (see eval statement)
• creating an entire program dynamically and then executing it

Assembly language
Self-modifying code is quite straightforward to implement when using assembly language. Instructions can be dynamically created in memory (or else overlaid over existing code in non-protected program storage), in a sequence equivalent to the ones that a standard compiler may generate as the object code. With modern processors, there can be unintended side effects on the CPU cache that must be considered.

The method was frequently used for testing "first time" conditions, as in this suitably commented IBM/360 assembler example. It uses instruction overlay to reduce the instruction path length by (N×1)−1, where N is the number of records on the file (−1 being the overhead to perform the overlay).

    SUBRTN   NOP   OPENED           FIRST TIME HERE?
    * The NOP is x'4700'<Address_of_opened>
             OI    SUBRTN+1,X'F0'   YES, CHANGE NOP TO UNCONDITIONAL BRANCH (47F0...)
             OPEN  INPUT            AND OPEN THE INPUT FILE SINCE IT'S THE FIRST TIME THRU
    OPENED   GET   INPUT            NORMAL PROCESSING RESUMES HERE
             ...

Alternative code might involve testing a "flag" each time through. The unconditional branch is slightly faster than a compare instruction, and it also reduces the overall path length. In later operating systems, for programs residing in protected storage, this technique could not be used, and so changing the pointer to the subroutine would be used instead.
The pointer would reside in dynamic storage and could be altered at will after the first pass to bypass the OPEN (having to load a pointer first instead of a direct branch and link to the subroutine would add N instructions to the path length, but there would be a corresponding reduction of N for the unconditional branch that would no longer be required).

Below is an example in Zilog Z80 assembly language. The code increments register B in range [0, 5]. The CP compare instruction is modified on each loop.

    ;==========
    ORG 0H
    CALL FUNC00
    HALT
    ;==========
    FUNC00:
    LD A,6              ; loop limit, compared against below
    LD HL,label01+1     ; HL -> operand byte of the CP instruction
    LD B,(HL)           ; B = current operand value (initially 0)
    label00:
    INC B
    LD (HL),B           ; overwrite the CP operand with B
    label01:
    CP $0               ; operand is rewritten on every pass
    JP NZ,label00       ; repeat until the operand equals A
    RET
    ;==========

Self-modifying code is sometimes used to overcome limitations in a machine's instruction set. For example, in the Intel 8080 instruction set, one cannot input a byte from an input port that is specified in a register. The input port is statically encoded in the instruction itself, as the second byte of a two-byte instruction. Using self-modifying code, it is possible to store a register's contents into the second byte of the instruction, then execute the modified instruction in order to achieve the desired effect.

High-level languages
Some compiled languages explicitly permit self-modifying code. For example, the ALTER verb in COBOL may be implemented as a branch instruction that is modified during execution. Some batch programming techniques involve the use of self-modifying code. Clipper and SPITBOL also provide facilities for explicit self-modification. The Algol compiler on B6700 systems offered an interface to the operating system whereby executing code could pass a text string or a named disc file to the Algol compiler and was then able to invoke the new version of a procedure.

With interpreted languages, the "machine code" is the source text and may be susceptible to editing on-the-fly: in SNOBOL the source statements being executed are elements of a text array.
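As a loose analogy to a program held as a text array, the following JavaScript sketch keeps a tiny program as an array of source statements, edits one element, and re-evaluates the result. The names `statements` and `run` are hypothetical, and `new Function` stands in for the "mini compile" step; this is only an illustration of the idea, not a description of how SNOBOL itself works.

```javascript
// Hypothetical sketch: a "program" held as an array of source-text statements.
const statements = [
  "let total = 0;",
  "for (let i = 1; i <= 3; i++) total += i;",
  "return total;",
];

// Compile the current text into a function body and run it.
function run(lines) {
  return new Function(lines.join("\n"))();
}

const before = run(statements); // sums 1..3

// Edit one statement in place, then recompile and run again.
statements[1] = statements[1].replace("i <= 3", "i <= 4");
const after = run(statements); // sums 1..4

console.log(before, after);
```

Here the "modification" is an ordinary string edit; the behavior changes only because the text is compiled again afterwards.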
Other languages, such as Perl and Python, allow programs to create new code at run-time and execute it using an eval function, but do not allow existing code to be mutated. The illusion of modification (even though no machine code is really being overwritten) is achieved by modifying function pointers, as in this JavaScript example:

    var f = function (x) {return x + 1};
    // assign a new definition to f:
    f = new Function('x', 'return x + 2');

Lisp macros also allow runtime code generation without parsing a string containing program code.

The Push programming language is a genetic programming system that is explicitly designed for creating self-modifying programs. While not a high-level language, it is not as low-level as assembly language.

Compound modification
Prior to the advent of multiple windows, command-line systems might offer a menu system involving the modification of a running command script. Suppose an MS-DOS batch file MENU.BAT contains the following:

    :start
    SHOWMENU.EXE

Upon initiation of MENU.BAT from the command line, SHOWMENU presents an on-screen menu, with possible help information, example usages and so forth. Eventually the user makes a selection that requires a command SOMENAME to be performed: SHOWMENU exits after rewriting the file MENU.BAT to contain

    :start
    SHOWMENU.EXE
    CALL SOMENAME.BAT
    GOTO start

Because the command interpreter does not compile a script file and then execute it, nor does it read the entire file into memory before starting execution, nor yet rely on the content of a record buffer, when SHOWMENU exits, the command interpreter finds a new command to execute (it is to invoke the script file SOMENAME, in a directory location and via a protocol known to SHOWMENU), and after that command completes, it goes back to the start of the script file and reactivates SHOWMENU ready for the next selection. Should the menu choice be to quit, the file would be rewritten back to its original state.
Although this starting state has no use for the label, it (or an equivalent amount of text) is required, because the command interpreter recalls the byte position of the next command when it is to start the next command; the rewritten file must therefore maintain alignment so that the next command start point is indeed the start of the next command. Aside from the convenience of a menu system (and possible auxiliary features), this scheme means that the SHOWMENU.EXE system is not in memory when the selected command is activated, a significant advantage when memory is limited.

Control tables
Control table interpreters can be considered to be, in one sense, "self-modified" by data values extracted from the table entries (rather than specifically hand coded in conditional statements of the form IF inputx = 'yyy').

Channel programs
Some IBM access methods traditionally used self-modifying channel programs, where a value, such as a disk address, is read into an area referenced by a channel program, where it is used by a later channel command to access the disk.
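The control-table idea mentioned above can be sketched in JavaScript as follows. This is a minimal illustration under assumed names: `controlTable`, `interpret`, and the one-letter codes are all hypothetical. The interpreter contains no hand-coded IF statements for individual inputs; its behavior is determined by the rows of the table.

```javascript
// Hypothetical control table: each row maps an input code to an action.
const controlTable = [
  { code: "A", action: (n) => n + 1 },
  { code: "S", action: (n) => n - 1 },
  { code: "D", action: (n) => n * 2 },
];

// Generic interpreter: behavior comes from table rows, not hard-coded IFs.
function interpret(table, input, start) {
  let value = start;
  for (const ch of input) {
    const row = table.find((r) => r.code === ch);
    if (row === undefined) throw new Error("unknown code: " + ch);
    value = row.action(value);
  }
  return value;
}

const first = interpret(controlTable, "AAD", 0); // (0 + 1 + 1) * 2

// Changing a table entry changes behavior without touching interpret().
controlTable[2].action = (n) => n * 3;
const second = interpret(controlTable, "AAD", 0); // (0 + 1 + 1) * 3

console.log(first, second);
```

Replacing a table entry at run time alters what the same interpreter code does on the next pass, which is the sense in which such a program can be regarded as "self-modified" by its data.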
History
The IBM SSEC, demonstrated in January 1948, had the ability to modify its instructions or otherwise treat them exactly like data. However, the capability was rarely used in practice.

In the early days of computers, self-modifying code was often used to reduce use of limited memory, or improve performance, or both. It was also sometimes used to implement subroutine calls and returns when the instruction set only provided simple branching or skipping instructions to vary the control flow. This use is still relevant in certain ultra-RISC architectures, at least theoretically; see for example one-instruction set computer. Donald Knuth's MIX architecture also used self-modifying code to implement subroutine calls.
Usage
Self-modifying code can be used for various purposes:
• Semi-automatic optimizing of a state-dependent loop.
• Dynamic in-place code optimization for speed depending on load environment. For example, the DR-DOS kernel optimizes speed-critical sections of itself at load time depending on the underlying processor generation.

At a meta-level, programs can also modify their own behavior by changing data stored elsewhere (see metaprogramming) or via use of polymorphism.

Massalin's Synthesis kernel
The Synthesis kernel presented in Alexia Massalin's Ph.D. thesis is a tiny Unix kernel that takes a structured, or even object-oriented, approach to self-modifying code, where code is created for individual quajects, like filehandles. Generating code for specific tasks allows the Synthesis kernel to (as a JIT compiler might) apply a number of optimizations such as constant folding or common subexpression elimination.

The Synthesis kernel was very fast, but was written entirely in assembly. The resulting lack of portability has prevented Massalin's optimization ideas from being adopted by any production kernel. However, the structure of the techniques suggests that they could be captured by a higher-level language, albeit one more complex than existing mid-level languages. Such a language and compiler could allow development of faster operating systems and applications.

Paul Haeberli and Bruce Karsh have objected to the "marginalization" of self-modifying code, and optimization in general, in favor of reduced development costs.
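The general idea of generating task-specific code with constants folded in can be illustrated with a small JavaScript sketch. It is only an analogy: the Synthesis kernel generated machine code from assembly templates, whereas this example emits JavaScript source text. The names `genericTransform` and `specialize` are hypothetical.

```javascript
// Generic routine: every call receives all parameters and recomputes
// the combined offset (base + shift).
function genericTransform(x, scale, base, shift) {
  return x * scale + (base + shift);
}

// Hypothetical specializer: emits source text with the constants baked in,
// folding (base + shift) into a single literal at generation time.
function specialize(scale, base, shift) {
  const folded = base + shift; // constant folding happens once, here
  const body = "return x * " + scale + " + " + folded + ";";
  return { source: body, fn: new Function("x", body) };
}

const fast = specialize(4, 100, 28);
console.log(fast.source); // prints "return x * 4 + 128;"

// The specialized function agrees with the generic one for these constants.
const viaGeneric = genericTransform(3, 4, 100, 28);
const viaSpecial = fast.fn(3);
console.log(viaGeneric, viaSpecial);
```

The specialized function takes one argument instead of four and performs one addition fewer per call; whether that matters in practice depends on the runtime.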
Interaction of cache and self-modifying code
On architectures without coupled data and instruction cache (for example, some SPARC, ARM, and MIPS cores) the cache synchronization must be explicitly performed by the modifying code (flush data cache and invalidate instruction cache for the modified memory area).

In some cases short sections of self-modifying code execute more slowly on modern processors. This is because a modern processor will usually try to keep blocks of code in its cache memory. Each time the program rewrites a part of itself, the rewritten part must be loaded into the cache again, which results in a slight delay if the modified codelet shares the same cache line with the modifying code, as is the case when the modified memory address is located within a few bytes of that of the modifying code.

The cache invalidation issue on modern processors usually means that self-modifying code would still be faster only when the modification occurs rarely, such as in the case of a state switch inside an inner loop.

Most modern processors load the machine code before they execute it, which means that if an instruction that is too near the instruction pointer is modified, the processor will not notice, but will instead execute the code as it was before it was modified. See prefetch input queue (PIQ). PC processors must handle self-modifying code correctly for backwards compatibility reasons, but they are far from efficient at doing so.
Security issues
Because of the security implications of self-modifying code, all of the major operating systems are careful to remove such vulnerabilities as they become known. The concern is typically not that programs will intentionally modify themselves, but that they could be maliciously changed by an exploit.

One mechanism for preventing malicious code modification is an operating system feature called W^X (for "write xor execute"). This mechanism prohibits a program from making any page of memory both writable and executable. Some systems prevent a writable page from ever being changed to be executable, even if write permission is removed. Other systems provide a "backdoor" of sorts, allowing multiple mappings of a page of memory to have different permissions. A relatively portable way to bypass W^X is to create a file with all permissions, then map the file into memory twice. On Linux, one may use an undocumented SysV shared-memory flag to get executable shared memory without needing to create a file.
Advantages
• Fast paths can be established for a program's execution, reducing some otherwise repetitive conditional branches.
• Self-modifying code can improve algorithmic efficiency.
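A fast path of the "first time" kind can be sketched in JavaScript by rebinding a function reference after the one-time setup has run. No machine code is overwritten here; only the reference `handle` changes. All names (`openInput`, `processRecord`, `handle`) are hypothetical, and `openInput` is a stand-in for real setup work such as opening a file.

```javascript
let openCount = 0; // counts how often the one-time setup runs
const records = [];

// Hypothetical stand-in for opening an input file.
function openInput() {
  openCount += 1;
}

// Normal per-record processing.
function processRecord(rec) {
  records.push(rec);
}

// First-call version: performs the setup, then replaces itself so that
// later calls go straight to processRecord with no "already open?" test.
let handle = function firstTime(rec) {
  openInput();
  handle = processRecord;
  processRecord(rec);
};

for (const rec of ["r1", "r2", "r3"]) {
  handle(rec);
}

console.log(openCount, records.length);
```

The alternative is to test a flag on every call; the rebinding version removes that test from the steady-state path at the cost of an indirect call through `handle`.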
Disadvantages
Self-modifying code is harder to read and maintain because the instructions in the source program listing are not necessarily the instructions that will be executed. Self-modification that consists of substitution of function pointers might not be as cryptic, if it is clear that the names of functions to be called are placeholders for functions to be identified later.

Self-modifying code can be rewritten as code that tests a flag and branches to alternative sequences based on the outcome of the test, but self-modifying code typically runs faster.

Self-modifying code conflicts with authentication of the code and may require exceptions to policies requiring that all code running on a system be signed.

Modified code must be stored separately from its original form, conflicting with memory management solutions that normally discard the code in RAM and reload it from the executable file as needed.

On modern processors with an instruction pipeline, code that modifies itself frequently may run more slowly if it modifies instructions that the processor has already read from memory into the pipeline. On some such processors, the only way to ensure that the modified instructions are executed correctly is to flush the pipeline and reread many instructions.

Self-modifying code cannot be used at all in some environments, such as the following:
• Application software running under an operating system with strict W^X security cannot execute instructions in pages it is allowed to write to; only the operating system is allowed to both write instructions to memory and later execute those instructions.
• Many Harvard architecture microcontrollers cannot execute instructions in read-write memory, but only instructions in memory that they cannot write to, such as ROM or non-self-programmable flash memory.
• A multithreaded application may have several threads executing the same section of self-modifying code, possibly resulting in computation errors and application failures.