Data Definition

Explore concepts of data in Assembly language.

Assembly language requires you to explicitly define the type and size of data.

Data types:

  • Byte: 8 bits

  • Word: 16 bits

  • Double Word: 32 bits

  • Quad Word: 64 bits
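
To get a feel for what these sizes mean in practice, here is a small C sketch that computes the largest unsigned value each size can hold. The function name max_unsigned is made up for this illustration and is not part of any assembler.

```c
#include <stdint.h>

/* Largest unsigned value that fits in `bits` bits.
   Hypothetical helper for illustration; valid for widths 1..64,
   returns 0 for anything else. */
uint64_t max_unsigned(unsigned bits) {
    if (bits == 64) return UINT64_MAX;      /* avoid shifting by the full width */
    if (bits == 0 || bits > 64) return 0;
    return (UINT64_C(1) << bits) - 1;       /* 2^bits - 1 */
}
```

For example, max_unsigned(8) gives 255 for a byte and max_unsigned(16) gives 65535 for a word.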

You define data using directives like:

  • DB - Define Byte

  • DW - Define Word

  • DD - Define Double Word

  • DQ - Define Quad Word

For example:

num DB 10    ;Define a byte with value 10
value DW 100 ;Define a word with value 100

Memory Allocation: Assembly language requires you to explicitly allocate memory for variables. In NASM, you reserve uninitialized space (usually in the .bss section) with these directives:

  • RESB - Reserve Byte

  • RESW - Reserve Word

  • RESD - Reserve Double Word

  • RESQ - Reserve Quad Word

For example:

array1 RESB 10 ; Reserve 10 bytes  
array2 RESW 20 ; Reserve 20 words

You then access the variables through their memory addresses. Offsets start at 0, so the 5th element of a byte array is at offset 4. For example, if array1 starts at location 100, the 5th element is at location 104:

MOV AL, [array1 + 4]  ; Move the 5th byte into the AL register (NASM syntax)
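
The address arithmetic behind an indexed access can be sketched in C. This is only an illustration of the base + index * size rule; element_address is a hypothetical helper name, and the index counts from 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Address of element `index` (counting from 0) in an array that starts at
   `base` and whose elements are `elem_size` bytes each. */
uintptr_t element_address(uintptr_t base, size_t index, size_t elem_size) {
    return base + index * elem_size;
}
```

With a byte array at location 100, the 5th element (index 4) is at element_address(100, 4, 1), which is 104. For a word array, the same index lands at 108 because each element takes 2 bytes.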

Other Data

Assembly has no native string or character types; it only has numeric data of various sizes. Some key points regarding data types in Assembly:

  • We have integer data types like byte, word, double word, etc.

  • We do NOT have a separate floating point data type. A float is stored as 4 or 8 bytes (DD or DQ) and only behaves as a float when you use floating point instructions on it.

  • We do NOT have a native boolean type. Booleans are typically represented using 0 for false and 1 (or any non-zero value) for true.

  • We do NOT have string or character types. Strings are represented as arrays of bytes or words, one character code per element.


Negative numbers

Here are the key things to know about negative numbers for Assembly development:

  1. Negative numbers are represented using 2's complement. This means the most significant bit (MSB) represents the sign:
  • 0 as the MSB represents a positive number (or zero)

  • 1 as the MSB represents a negative number

  2. To get the 2's complement of a number, you flip all the bits and then add 1. For example, using 8 bits:
  • 5 (00000101) -> flip (11111010) -> add 1 -> -5 (11111011)

  • 10 (00001010) -> flip (11110101) -> add 1 -> -10 (11110110)

  3. When performing arithmetic on negative numbers, the CPU handles it automatically using 2's complement representation. For example:
  • Adding two negative numbers gives a more negative result, as long as it still fits in the register

  • Subtracting a number is the same as adding its 2's complement

  4. You will need to use sign extension when performing operations on numbers of different sizes. This means filling the upper bits with the MSB to maintain the sign. On x86, MOVSX does this, while MOVZX fills the upper bits with zeros for unsigned values.

  5. Some instructions have a "signed" and an "unsigned" version. For example:

  • IMUL, IDIV - signed multiply and divide

  • MUL, DIV - unsigned multiply and divide

  • CMP is the same for both. The conditional jump after it decides how the result is read: JL and JG treat the operands as signed, JB and JA treat them as unsigned.

So in summary, negative numbers are represented using 2's complement, and the CPU handles arithmetic on them automatically. You just need to be aware of sign-extension and signed vs unsigned instructions.
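
The rules in this section can be checked with a short C sketch working on 8-bit values. The function names (negate8, sign_extend_8_to_16, as_signed8) are invented for this example; they only mirror what the CPU does with the bits and are not assembler features.

```c
#include <stdint.h>

/* Two's complement negation: flip all bits, then add 1 (wraps modulo 256). */
uint8_t negate8(uint8_t x) {
    return (uint8_t)(~x + 1u);
}

/* Sign extension from 8 to 16 bits: copy bit 7 into every upper bit. */
uint16_t sign_extend_8_to_16(uint8_t x) {
    return (x & 0x80u) ? (uint16_t)(0xFF00u | x) : (uint16_t)x;
}

/* Read the same 8 bits as a signed number: values with the MSB set
   stand for x - 256. */
int as_signed8(uint8_t x) {
    return (x & 0x80u) ? (int)x - 256 : (int)x;
}
```

Here negate8(5) yields the bit pattern 11111011 (0xFB). Read as unsigned, that pattern is 251 and is larger than 5; read through as_signed8 it is -5 and is smaller than 5. This is why the choice between signed and unsigned instructions matters even though the bits are identical.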


Data Literals

Data literals refer to numeric or string constants that are directly embedded in Assembly code. They are commonly used to initialize variables or assign values.

Some common uses of data literals in Assembly:

  1. Initializing registers - To initialize a register with a value, you can use a literal. For example:
MOV EAX, 10  ; Initialize EAX with decimal 10
  2. Initializing memory locations - To initialize a variable in memory, you use a literal:
MOV BYTE [variable], 'A' ; Initialize byte variable with character 'A'
  3. Assigning values - You can assign a literal written in another base, such as hexadecimal:
MOV AX, 1000h ; Assign hex 1000 (decimal 4096) to AX
  4. Passing arguments to functions - Function arguments can be literals. The arguments must be in place before the call (here we assume sum expects them in EAX and EBX):
MOV EAX, 5   ; Pass 5 as first argument
MOV EBX, 10  ; Pass 10 as second argument
CALL sum     ; Call sum function
  5. Performing calculations - Literals are used in calculations. MUL does not accept a literal operand, so we use the three-operand form of IMUL:
MOV EAX, 10
IMUL EAX, EAX, 100h ; EAX = 10 * 100h = 0A00h (decimal 2560)

There are different types of literals depending on the data type:

  • Binary - 01001101b

  • Octal - 071o

  • Decimal - 100

  • Hexadecimal - 1ACh

  • Character - 'A'

  • String - "Hello"
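
To see that these notations are just different spellings of ordinary numbers, here is a rough C sketch that converts a suffixed literal into its value. The function parse_suffixed_literal is a hypothetical helper written for this example; it handles only the b, o, d and h suffixes shown above and is not how any real assembler is implemented.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse a suffixed numeric literal such as "01001101b", "071o", "100" or
   "1ACh". Returns the value, or -1 if the text is not valid in that base. */
long parse_suffixed_literal(const char *text) {
    size_t len = strlen(text);
    if (len == 0 || len >= 64) return -1;

    int base = 10;
    size_t digits = len;                 /* number of characters without suffix */
    switch (tolower((unsigned char)text[len - 1])) {
        case 'b': base = 2;  digits = len - 1; break;
        case 'o': base = 8;  digits = len - 1; break;
        case 'd': base = 10; digits = len - 1; break;
        case 'h': base = 16; digits = len - 1; break;
        default: break;                  /* no suffix: plain decimal */
    }
    if (digits == 0) return -1;

    char buf[64];
    memcpy(buf, text, digits);
    buf[digits] = '\0';

    char *end = NULL;
    long value = strtol(buf, &end, base);
    return (*end == '\0') ? value : -1;  /* reject leftover characters */
}
```

For instance, "01001101b" is 77, "071o" is 57 and "1ACh" is 428. A character literal such as 'A' is also just a number: 65 in ASCII.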

Using literals keeps simple values visible right where they are used and reduces the need for extra variables. But a literal whose meaning is not obvious (a "magic number") is better given a name with EQU, so avoid scattering such values through the code.

Notice: we have used uppercase in these examples. In NASM and MASM, instruction mnemonics and register names are not case-sensitive, so "MOV" and "mov" assemble to the same instruction. Lowercase is the more common style today.

Names that you define yourself are different. In NASM, labels and variable names are case-sensitive; other assemblers, such as MASM, make this configurable. Some points regarding case sensitivity in Assembly:

  • Instruction mnemonics and register names are usually written in lowercase, like mov and eax. But they can be written in uppercase as well.

  • The case used should be consistent within the same program. Mixing case styles should be avoided.

  • Variables, labels and constants defined by the programmer are also case sensitive. For example:

variable1: 
    mov eax, 10
Variable1:  
    mov ebx, 20

Here variable1 and Variable1 are two different labels (in a case-sensitive assembler such as NASM).

  • Different assemblers and Assembly dialects may have different conventions regarding case. Some use all uppercase, some use all lowercase, and some allow both.

  • Not considering case sensitivity can cause bugs that are difficult to identify. Two variables that differ only by case will be treated as separate.

Summary: Instruction mnemonics and register names are generally not case-sensitive, but the names you define yourself often are. Keep the case consistent within a program, because names that differ only by case can introduce subtle bugs.


Best Practices

Here are some best practices to handle data in Assembly:

  1. Define variables clearly - Use data directives like DB, DW, DD, DQ to clearly define the type and size of variables. This makes the code self-documenting and easier to understand.

  2. Reserve memory for variables - Use directives like RESB, RESW, RESD, RESQ to reserve memory for variables. This ensures variables have the needed space and avoids conflicts.

  3. Use meaningful variable names - Use descriptive names for variables instead of generic names like var1, var2. This makes the code more readable.

  4. Use constants - Define constants using EQU directives. This makes the code more readable and easier to change.

  5. Minimize global variables - Try to define variables locally within procedures to avoid namespace issues.

  6. Use comments - Add comments to describe the purpose of variables and constants. This makes the code self-documenting.

  7. Use appropriate data types - Use the smallest data type that can store the needed values. This optimizes memory usage.

  8. Initialize variables - Initialize variables before use to avoid garbage values.

  9. Handle negative numbers carefully - Be aware of 2's complement representation and use signed/unsigned instructions appropriately.

  10. Check types yourself - The assembler does little or no type checking, so make sure variables are used with consistent sizes. Mixing sizes can cause bugs.


Possible Issues

Here are the main data conversion and compatibility issues in Assembly, followed by best practices to handle them:

  1. Different data types - Converting between integer, floating point, and string data types requires using the appropriate instructions to convert correctly.

  2. Different data sizes - Mixing data of different sizes, like bytes, words, and doublewords, requires sign-extension (for signed values) or zero-extension (for unsigned values) to maintain the correct value.

  3. Endianness - Converting between little-endian and big-endian formats requires reordering the bytes. Assembly code is typically written for a specific endianness.

  4. Overflow and underflow - When converting to a smaller data type, the value may overflow or underflow, producing incorrect results. You must check for this and handle it appropriately.

  5. Loss of precision - Converting from a higher precision type to a lower precision type, like float to integer, can lose precision. You may need to round the value.

  6. Unexpected behavior - Mixing sizes incorrectly, like using a word access on a byte variable, also reads or writes the neighboring byte. This can cause bugs that are hard to track down.

  7. Register size - General-purpose registers such as EAX can be accessed in parts (AX, AL, AH), so one register can hold values of different sizes. You must use the register name and instructions that match the intended size.

To deal with these issues:

  • Use the correct instructions for converting between types

  • Perform sign-extension when necessary

  • Check for and handle overflow and underflow

  • Round values when reducing precision

  • Swap bytes when changing endianness

  • Initialize variables before use to avoid garbage values

  • Define variables clearly with appropriate data directives

Following these best practices can help minimize data conversion and compatibility issues when handling data in Assembly language.
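
Two of the checks listed above, detecting overflow before narrowing and rounding when dropping precision, can be sketched in C as follows. The helper names are invented for this example, and the rounding rule chosen here (halves round away from zero) is just one possible choice.

```c
#include <stdint.h>

/* 1 if `value` can be stored in one byte without losing information. */
int fits_in_byte(uint32_t value) {
    return value <= 0xFFu;
}

/* Keep only the low 8 bits, like reading AL after loading EAX. */
uint8_t truncate_to_byte(uint32_t value) {
    return (uint8_t)(value & 0xFFu);
}

/* Convert to an integer by rounding to the nearest whole number
   (halves away from zero) instead of simply cutting off the fraction.
   Assumes the result fits in a long. */
long round_to_long(double x) {
    return (long)(x < 0 ? x - 0.5 : x + 0.5);
}
```

The value 300 does not fit in a byte, and truncating it silently leaves 44 (300 - 256), which is why the range check has to come first.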


Endianness

Endianness refers to the order in which bytes are stored in memory to represent larger data types, like words and doublewords. There are two main endianness formats:

Little-endian: The least significant byte is stored at the lowest memory address. Subsequent bytes are stored at higher addresses in increasing order of significance.

Big-endian: The most significant byte is stored at the lowest memory address. Subsequent bytes are stored at higher addresses in decreasing order of significance.

For example, the 32-bit (4 byte) hex value 0x12345678 stored in memory would be:

Little-endian: 78 56 34 12

Big-endian: 12 34 56 78
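
The two layouts can be reproduced with a small C sketch that writes a 32-bit value into a byte buffer in each order. The function names store_le32 and store_be32 are made up for this illustration.

```c
#include <stdint.h>

/* Write `value` into out[0..3], least significant byte first. */
void store_le32(uint32_t value, uint8_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(value >> (8 * i));
}

/* Write `value` into out[0..3], most significant byte first. */
void store_be32(uint32_t value, uint8_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(value >> (8 * (3 - i)));
}
```

Calling store_le32(0x12345678, buf) leaves 78 56 34 12 in buf, while store_be32 leaves 12 34 56 78, whatever the byte order of the machine running the code.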

Assembly language code is written for one specific processor, so its endianness is fixed: x86, for example, is little-endian. The endianness of the target processor must be considered.

For an Assembly programmer, endianness issues can arise when:

  • Converting data to/from big-endian and little-endian formats

  • Accessing multi-byte values in memory

  • Interfacing with external devices that use a different endianness

To deal with endianness, Assembly programmers must:

  • Be aware of the endianness of the target processor

  • Write code that accesses multi-byte values using the correct ordering

  • Use byte swapping instructions (such as BSWAP on x86) to convert between endianness formats

  • Check endianness when interfacing with external devices

Not considering endianness can lead to bugs that are difficult to track down. So it is important for Assembly language programmers to have a good understanding of endianness and how it impacts data storage and manipulation.
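
As a rough sketch of the byte swapping mentioned above, the C code below reverses the bytes of a 32-bit value, which is the same effect the x86 BSWAP instruction has on a register. It also includes a common trick for checking the byte order of the machine the code runs on. Both function names are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap_bytes32(uint32_t v) {
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

/* 1 if this machine stores the least significant byte at the lowest address. */
int host_is_little_endian(void) {
    uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);   /* look at the byte at the lowest address */
    return first == 1;
}
```

Swapping 0x12345678 gives 0x78563412, and swapping twice returns the original value.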


Convention Document

My advice is to establish a convention for your project and stay consistent. You can use uppercase or lowercase, depending also on the assembler you are using. Here is an example of a convention document.

There are a few common conventions regarding case in Assembly language:

  1. All lowercase: This is the most common convention, where all instructions, registers, variables, etc. are written in lowercase. For example:
mov eax, 10
add ebx, eax
  2. All uppercase: Some older codebases and assemblers use uppercase throughout. For example:
MOV EAX, 10
ADD EBX, EAX
  3. Mixed case: Most assemblers allow mixing uppercase and lowercase, but it is not recommended.

A good convention document for an Assembly program would specify:

  • Instructions should be written in lowercase

    • Examples: mov, add, cmp
  • Registers should be written in lowercase

    • Examples: eax, ebx, ecx
  • Variables defined by the programmer should begin with a lowercase letter

    • Examples: number, result
  • Labels should begin with a period followed by uppercase letters

    • Examples: .START, .LOOP

An example convention document:

Assembly Coding Conventions

Instructions:
   All assembly instructions should be written in lowercase.
   Examples: mov, add, cmp

Registers: 
   All register names should be in lowercase.
   Examples: eax, ebx, ecx

Variables:
   All variables defined by the programmer should begin with a lowercase letter.
   Examples: number, result

Labels:
   All labels should begin with a period followed by uppercase letters.
   Examples: .START, .LOOP

Sticking to a consistent convention helps keep your Assembly code clean and readable.


Arrays

Arrays in Assembly language are implemented using memory. They allow storing multiple variables of the same type.

To implement an array in Assembly, we need to:

  1. Reserve memory: In the .DATA section, define a block of memory for the array (MASM syntax is shown here).

For example, for an array of 10 integers:

.DATA
array DWORD 10 DUP(0)  ; Reserve 10 DWORDs (4 bytes each)

This reserves 40 bytes (10 * 4) for the array.

  2. Access array elements: We access array elements using the base address of the array plus an offset.

The offset is the index of the element multiplied by the size of each element. Indexes start at 0, so the 5th element has index 4.

For example, to access the 5th element of the integer array:

mov eax, OFFSET array ; Base address (OFFSET gives the address, not the contents)
add eax, 4*4 ; Offset - 4 (index) * 4 (size of each integer)
mov ebx, [eax] ; Load 5th element into EBX

Here we add an offset of 16 (4 * 4) to the base address to get the 5th element.

  3. Loop through the array: We use a loop and increment the index to iterate through the array.

For example:

mov esi, 0 ; Array index

loop1:
mov ebx, [array + esi*4] ; Load element at base + index * 4 into EBX

; Use EBX

inc esi ; Increment index
cmp esi, 10 ; Compare to array size
jne loop1 ; Loop until end

Here we use ESI as the index, and the addressing mode multiplies it by the element size (DWORD = 4 bytes) to get the offset.

  4. Initialize array elements: We can initialize array elements using a loop that stores a value at each offset, in the same way the loop above loads them.

So in summary, arrays in Assembly are implemented using reserved memory. We access elements using the base address and an offset, and iterate through the array using loops.
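
The same base + index * size pattern can be sketched in C on a raw byte buffer, which is close to how the array looks from Assembly. The helper names and the ELEM_SIZE constant are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ELEM_SIZE 4  /* one DWORD is 4 bytes */

/* Store `value` at element `index`: address = base + index * ELEM_SIZE. */
void store_dword(uint8_t *base, size_t index, uint32_t value) {
    memcpy(base + index * ELEM_SIZE, &value, ELEM_SIZE);
}

/* Load the element at `index` using the same address calculation. */
uint32_t load_dword(const uint8_t *base, size_t index) {
    uint32_t value;
    memcpy(&value, base + index * ELEM_SIZE, ELEM_SIZE);
    return value;
}

/* Initialize every element in one loop, then add them up in a second loop. */
uint32_t fill_and_sum(uint8_t *base, size_t count, uint32_t value) {
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++) store_dword(base, i, value);
    for (size_t i = 0; i < count; i++) sum += load_dword(base, i);
    return sum;
}
```

With a 40-byte buffer standing in for the 10-element array, fill_and_sum(buffer, 10, 7) sets every element to 7 and returns 70.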


Structs

Some assemblers, such as MASM, have native support for structures using STRUCT/ENDS. Even so, you still have to manually:

  • Allocate memory for structures

  • Calculate offsets to access structure fields

  • Pass structures by reference

So while structures are a native part of these assemblers, you still have to manage many of the low-level details yourself.

STRUCT and ENDS are used to define structures, similar to structs in C.

The syntax is:

struct_name STRUCT
    ; Structure fields    
struct_name ENDS

STRUCT indicates the start of a structure definition.

struct_name is the name you give to the structure.

Then you define the structure fields, which can be:

  • BYTE

  • WORD

  • DWORD

  • QWORD

  • Strings (BYTE arrays)

  • Nested STRUCTs

ENDS indicates the end of the structure definition.

For example:

person STRUCT
     name BYTE 30 DUP (?)
     age BYTE ?              ; "?" leaves the field uninitialized
     gender BYTE ?
     STRUCT address          ; Nested structure: keyword first, then the field name
         street BYTE 20 DUP (?)
         city BYTE 15 DUP (?)
         state BYTE 2 DUP (?)
         zip WORD ?
     ENDS
person ENDS

This defines a person structure with name, age, and gender fields, and an address nested struct.
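
Calculating field offsets by hand, as mentioned earlier in this section, comes down to adding up the sizes of the preceding fields. The C sketch below does this for the person layout, assuming the fields are packed one after another with no alignment padding. The constant and function names are invented for this example.

```c
#include <stdint.h>

/* Field offsets as running sums of the field sizes.
   Assumption: byte-packed layout, no padding between fields. */
enum {
    NAME_OFFSET    = 0,                   /* 30 bytes */
    AGE_OFFSET     = NAME_OFFSET + 30,    /* 1 byte */
    GENDER_OFFSET  = AGE_OFFSET + 1,      /* 1 byte */
    ADDRESS_OFFSET = GENDER_OFFSET + 1,   /* nested struct starts here */
    STREET_OFFSET  = ADDRESS_OFFSET,      /* 20 bytes */
    CITY_OFFSET    = STREET_OFFSET + 20,  /* 15 bytes */
    STATE_OFFSET   = CITY_OFFSET + 15,    /* 2 bytes */
    ZIP_OFFSET     = STATE_OFFSET + 2,    /* 2 bytes (one word) */
    PERSON_SIZE    = ZIP_OFFSET + 2
};

/* Access the one-byte age field of the person stored at `person`. */
void set_age(uint8_t *person, uint8_t age) { person[AGE_OFFSET] = age; }
uint8_t get_age(const uint8_t *person) { return person[AGE_OFFSET]; }

/* Access the two-byte zip field, stored low byte first (little-endian). */
void set_zip(uint8_t *person, uint16_t zip) {
    person[ZIP_OFFSET]     = (uint8_t)(zip & 0xFF);
    person[ZIP_OFFSET + 1] = (uint8_t)(zip >> 8);
}
uint16_t get_zip(const uint8_t *person) {
    return (uint16_t)(person[ZIP_OFFSET] | (person[ZIP_OFFSET + 1] << 8));
}
```

Under these assumptions the age field sits at offset 30, the zip field at offset 69, and one person takes 71 bytes. Passing the structure by reference simply means passing the address of its first byte, as the person pointer does here.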


Disclaimer: I'm a beginner learning Assembly. I use AI to clarify this language step by step. This is my research. Don't trust me. I do not have the required experience to teach Assembly, but if you read this document and have questions, post a comment. Learn and prosper. 🖖
