Saturday, February 19, 2022

Tomb Raider (3)

I've spent some days looking at the OpenLara source code to better understand how it works.

For now, it works like this: everything is executed on the 68000. Although all the boxes have the same size, they take different amounts of processing time.



Now I'm building a command list and using it to do all the transformations and rendering. Although this list has to be rebuilt for each frame, it shouldn't hurt performance.


You may think that this solution is a waste of time, but now I have each frame "chewed" into plain data for the GPU. As soon as the command list is built, I can tell the GPU to execute it.

The advantage is that the 68000 doesn't need to wait for the GPU: it can process the next frame while the GPU is doing all the 3D transformations and polygon drawing.

Running these steps in parallel means the frame time will be set by the XForm + Render part alone, instead of Logic + XForm + Render (as long as the logic is the shorter of the two).
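
As a rough sketch of why this helps, here is a tiny timing model in C. The struct, the function names, and any numbers you feed it are made up for illustration; they are not measurements from the port, and the model ignores bus contention between the 68000 and the GPU.

```c
/* Hypothetical per-frame costs in milliseconds (illustrative only). */
typedef struct {
    double logic_ms;   /* game logic + command list build on the 68000 */
    double xform_ms;   /* 3D transformations on the GPU */
    double render_ms;  /* polygon drawing */
} FrameCost;

/* Everything runs back to back, one step after the other. */
double serial_frame_ms(FrameCost c) {
    return c.logic_ms + c.xform_ms + c.render_ms;
}

/* The 68000 prepares frame N+1 while the GPU executes frame N,
   so the slower of the two sides sets the pace. */
double pipelined_frame_ms(FrameCost c) {
    double gpu_ms = c.xform_ms + c.render_ms;
    return (c.logic_ms > gpu_ms) ? c.logic_ms : gpu_ms;
}
```

With logic at 10 ms, XForm at 20 ms and render at 30 ms, the serial version needs 60 ms per frame and the pipelined one 50 ms. If the logic took longer than XForm + Render, the logic would become the limit instead.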

I hope that all the 3D transformation and rasterizer code fits into GPU RAM, so the 68000 doesn't need to help with code swapping.


Thursday, January 27, 2022

Tomb Raider (2)

Let's have a quick look at how (more or less) Tomb Raider works.

These are the steps of the render loop.

  1. Get camera room
  2. Set clipping area to the viewport
  3. For each room portal do:
    1. Project portal to screen
    2. Clip portal vertices to the current clipping area
    3. Is it visible?
      1. Mark portal destination (another room) as visible
      2. Set clipping area to portal's bounding box
      3. Go to step 3 with that room.
  4. Render Lara
  5. For each visible room
    1. Render room's room
    2. Render room's sprites
    3. Render room's meshes
  6. Flush
All Render calls do the 3D transformations and add the non-culled polygons to a list; the actual rendering (pixel drawing) is done in the flush pass.
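
The visibility part of the loop (steps 1 to 3) can be sketched in C like this. It is not OpenLara's code: the types and names are hypothetical, each portal is assumed to be already projected to a screen-space box, and a depth limit stands in for the back-facing portal checks a real engine would use to stop the recursion.

```c
#include <stdbool.h>

#define MAX_PORTALS 4

typedef struct { int x0, y0, x1, y1; } Rect;   /* screen-space box */

typedef struct {
    Rect box;        /* portal already projected to the screen (assumed) */
    int  dest_room;  /* room seen through this portal */
} Portal;

typedef struct {
    int    portal_count;
    Portal portals[MAX_PORTALS];
    bool   visible;
} Room;

/* Intersect two boxes; returns false when nothing is left to see. */
static bool clip_rect(Rect a, Rect b, Rect *out) {
    out->x0 = a.x0 > b.x0 ? a.x0 : b.x0;
    out->y0 = a.y0 > b.y0 ? a.y0 : b.y0;
    out->x1 = a.x1 < b.x1 ? a.x1 : b.x1;
    out->y1 = a.y1 < b.y1 ? a.y1 : b.y1;
    return out->x0 < out->x1 && out->y0 < out->y1;
}

/* Mark `room` visible, then follow every portal that survives clipping,
   using the clipped portal box as the new clipping area. */
void mark_visible(Room *rooms, int room, Rect clip, int depth) {
    rooms[room].visible = true;
    if (depth <= 0)
        return;  /* illustrative guard against endless A -> B -> A loops */
    for (int i = 0; i < rooms[room].portal_count; i++) {
        const Portal *p = &rooms[room].portals[i];
        Rect clipped;
        if (clip_rect(p->box, clip, &clipped))
            mark_visible(rooms, p->dest_room, clipped, depth - 1);
    }
}
```

You would call it once per frame with the camera room and the viewport as the initial clipping area, and then render only the rooms flagged as visible.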

The next step will be to measure how much time each part takes (calculating visibility, rendering, and flushing). Let's cross our fingers and hope that the first part (calculating visibility) is fast enough on the 68000. Keep in mind that the ARM in the 3DO or the Game Boy Advance is a lot faster than the 68000, so there you have more free time to do the rendering.


Saturday, January 22, 2022

Tomb Raider (1)

Yesterday (21-01-2022) I did a quick port of OpenLara to the Atari Jaguar. I'm using just the 68000; the other processors (DSP, GPU, and blitter) are stopped. Of course, I'm using the OP (Object Processor), or you wouldn't see anything.

You can see a small video here. Yes, it's running at 1FPS or maybe less.

I'm sure that the GPU is fast enough to do all the 3D transformations for this game and for most first-generation PS1/Saturn games, but there are some problems.

  • The 68000 is too slow. It's OK for a Genesis/Mega Drive game, but it's too slow for a modern shoot 'em up with dozens of sprites (have a look at the Sega Saturn games). For a 3D game... forget it.
  • The more you use the Jaguar's hardware the fewer bus cycles are left for the 68000.
  • The Jaguar has only 2MB of RAM, the code takes 285KB, and you still have to store all the textures, sounds, and 3D models. Just have a look at the file sizes: almost none of them fit in RAM.

Sunday, March 14, 2021

Jaguar Coding Nightmare

At last!!! Today I've put together all three games that I've been developing for the last... I don't know and I don't want to remember, it was too much time.

I think I spent more than 70% of the time fixing bugs. Some of them were my own mistakes (when you code everything by yourself it's normal to make mistakes), but others were caused by some kind of mismatch between the tools and the lack of an OS, and those were very hard to find.

I haven't used any library, it's 100% my own code. Well... Christmas Craze and Classic Kong are ports of the SNES versions, but I had to rewrite some parts to use the Jaguar hardware.

Here are a few of the errors that took me a long time to spot and fix.

Data alignment

The Jaguar is very picky about this. If you try to access a 32-bit value with the GPU (it must be aligned to a 4-byte boundary) but the data is only aligned to a 2-byte boundary, you won't read the correct value. That's fine, but when you code in C you can sometimes forget to align the data, or the compiler can do some nasty things (see -flto below).
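
One way to make the requirement explicit in C is to request the alignment in the declaration and check it at run time while debugging. This is only a sketch: gpu_params and is_aligned are hypothetical names, and it uses C11 alignas; with an older gcc the equivalent would be __attribute__((aligned(4))).

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical table read by the GPU; force a 4-byte boundary explicitly
   instead of relying on the compiler's default alignment. */
alignas(4) uint32_t gpu_params[4];

/* Returns 1 when ptr sits on an `align`-byte boundary.
   `align` must be a power of two. */
int is_aligned(const void *ptr, uintptr_t align) {
    return ((uintptr_t)ptr & (align - 1)) == 0;
}
```

A check like is_aligned(gpu_params, 4) in the init code is cheap, and it would catch the case where a tool or flag silently breaks the alignment.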

Object Processor

The same as above, but this time the data must be aligned to a 16-byte boundary (double phrase), or to a 32-byte boundary (quad phrase) for scaled objects.

Also, when the Object Processor reads the list it modifies the bitmapped objects, and if you make a mistake in the list you'll hang the Jaguar instead of just getting a wrong display, sometimes because it reads data outside the Object Processor list and can smash the code.

-flto

Link-time optimization. With this flag the compiler delays optimization to the link phase. I don't know if it's a bit buggy or if it should be used together with other flags, but when I was using it (I wanted the fastest code 😜) some data lost its alignment, which means big problems with the GPU and the Object Processor.


-fno-zero-initialized-in-bss

When I was coding the games (any of them) and the menu, sometimes the code didn't work (about 1 time in 20): I uploaded the code to the Skunkboard and nothing happened. I thought I had some wrong init code, so I looked at the init code in the Jaguar SDK and even disassembled a couple of games to look at theirs, but everything looked fine.

When I coded the menu to launch any of the three games I got the following issues with each game.

  1. Christmas Craze: it worked.
  2. Classic Kong: it always hung at the intro when you started to play (Kong climbing with Pauline).
  3. BurgerTom: sometimes it played the menu music, sometimes you could see the menu with graphics glitches, and sometimes it just hung at the very beginning.
It was really weird because all the games begin with the same code: upload the GPU code, init the sprite system, init the sound system. If I uploaded the code of any game to the Skunkboard it worked, but it failed if the game was launched from the menu. That was weird too, because the menu simply copied the game code to $4000 and did a jmp $4000, so the error had to be somewhere else, not in the menu...

After a few rounds of print debugging (I wish I had a proper debugger 😞), I realized that this part of the code behaved differently when the game was uploaded to the Skunkboard than when it was copied from the menu: _text_strip wasn't NULL 😮.

...
static SPRITE_STRIP *_text_strip = NULL;
...
void init_text()
{
    if ( _text_strip == NULL )
        _text_strip = new_strip(256, 224, 0);

    clear_strip(_text_strip);
    ...
}

Looking at the map file generated by the linker, _text_strip was located in the bss segment. By default, gcc puts all data initialized to zero into bss (-fzero-initialized-in-bss) because the OS will fill the bss section with zeros, BUT we don't have any OS on the Jaguar, so _text_strip can have any value. In practice it will hold some value left over from the menu code or data.
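
Besides the flag in this section's title, another common way to guard against this on a bare-metal target is to clear the bss yourself at the very start of the init code. The sketch below is not the code from these games: clear_bss is a hypothetical helper, and the two pointers would come from whatever symbols your linker script defines for the start and end of bss.

```c
#include <stdint.h>

/* Zero every byte in [start, end). On a real build the two pointers would
   be linker-script symbols marking the .bss limits; their names depend on
   the linker script, so none are assumed here. */
void clear_bss(uint8_t *start, uint8_t *end) {
    for (uint8_t *p = start; p < end; p++)
        *p = 0;
}
```

It has to run before any C code that relies on zero-initialized statics, which in the example above means before init_text() tests _text_strip.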


So after a lot of headaches I managed to finish all the games. Sometimes it was my own error, sometimes the tools didn't work well (-flto), and sometimes the lack of an OS made some features of the tools useless (-fzero-initialized-in-bss).


Let's hope that future projects take me a lot less time 😉.

Wednesday, June 24, 2020

ST-NICCC 33%

This is a small update: I realized that I had some bugs in my libraries when I tried to compile it. :(

The software render version (1) is the same code as the first version, but now I'm using the gcc compiler instead of vbcc.

For now, only options 1 & 2 are implemented.

Music updated with the original tune.

I've realized that if you draw a CLUT sprite without writing the CLUT first (that is, using uninitialized colors), you get some ugly vertical lines.

Download: st-niccc 33% (skunkboard only)
Download: st-niccc (first version)


Thursday, May 7, 2020

My dream Jaguar

After some time developing for the Jaguar, here are some ideas that I wish Atari had implemented in the Jaguar.

First of all, all the talk about bitness is complete bullshit. You don’t have a better device just because you have some 64-bit processor. Just have a look at the Intellivision (Mattel, 1979): it has a 16-bit CPU, so its games look just like the ones on a Sega Mega Drive (Genesis) or a SNES, don’t they?

With today's technology you could build an 8-bit console running at 1GHz, with a GPU with thousands of cores, each one drawing a single pixel. Everything would use 8-bit ALUs, and it would blow away any other 8, 16, or 32-bit console.

In the end, the most important thing is the memory bandwidth, not the bits. Note that for 3D games you also need computational power, because you’re going to do a lot of multiplications.

68000

It’s too slow to do anything interesting, and the lack of a cache leaves it starving for free bus cycles.
Ideally, it should be on its own bus with something like 256KB of RAM, perhaps with access to the other custom chips but not to the main RAM. A better option could be a 68020 or a 68030.

GPU/DSP

I would change the instruction set encoding to allow a few more opcodes: all single-operand instructions could share the same opcode and use the reg1 field to select the actual instruction. Also, bigger jumps are a must. And of course, include a cache (a real one) to run code from main RAM without the current headaches.

Some new opcodes that I would find useful.

- split: Takes a 32-bit register and writes the high word into a second register and the low word into the current one. With and without sign extension.

- join: The inverse of the split opcode, of course.

- pack/unpack with RGB pixels

- load/store with pre-decrement and post-increment

- loadp/storep should work with register pairs, instead of using a different register for the high word.

- 32-bit bus on the DSP. Well, actually it has a 32-bit bus but it’s not fully connected, maybe to make the MMU simpler?

- Include a real sound chip.
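
To make the split and join idea from the list above concrete, here is how I read their semantics, written as plain C functions. These opcodes don't exist on the real GPU/DSP; the function names and the exact behavior are only an interpretation of the description.

```c
#include <stdint.h>

/* Sign-extend a 16-bit value held in the low half of `half` (0..0xFFFF). */
static int32_t sext16(uint32_t half) {
    return (int32_t)(half ^ 0x8000u) - 0x8000;
}

/* split without sign extension: *hi gets bits 31..16, *lo gets bits 15..0. */
void split_u(uint32_t value, uint32_t *hi, uint32_t *lo) {
    *hi = value >> 16;
    *lo = value & 0xFFFFu;
}

/* split with sign extension: each half becomes a signed 32-bit value. */
void split_s(uint32_t value, int32_t *hi, int32_t *lo) {
    *hi = sext16(value >> 16);
    *lo = sext16(value & 0xFFFFu);
}

/* join: the inverse, built from the low 16 bits of each register. */
uint32_t join(uint32_t hi, uint32_t lo) {
    return (hi << 16) | (lo & 0xFFFFu);
}
```

A typical use would be unpacking two 16-bit coordinates stored in a single 32-bit word, working on them separately, and packing them back with join.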

Object Processor

Having to rebuild the Object Processor list on each frame is a waste of time, but anyway I think there are more important things to fix.

- Bigger CLUT; a 256-color palette is not enough. At least 1024 colors, that is, four 8-bit sprites with different palettes.

- Object to change CLUT

- It could be interesting to include an 8-bit direct RGB mode among the color depths.

- More transparency modes and they must also work in RGB.

- Include three color multipliers, one for each color channel, to make fade effects easy.

- Pixel precise collision detection. 

- Remove the link address from all objects except the branch object.

- The Image Width field must be a signed value to allow vertically mirrored sprites.

- GPU interrupt Object must have y coordinate and height field, and work without bugs…

- Rearrange the bitmap object and scaled bitmap object to have the same size. If you remove the link address, both objects fit into 16 bytes.

- Improve the write rate; it must write 4 pixels per cycle.

- Cache, it will be flushed on each VBL interrupt.

Blitter

I don’t know why they thought that the blitter was fast enough. If you try to do any interesting effect like scaling, rotation, or texture mapping, you must work in pixel mode, and that kills the performance. The blitter must be as fast as the Object Processor. It’s sad, but you can’t make a game like After Burner (1987) on the Jaguar without a lot of headaches.

- Allow pixel expansion, so you can use a 1, 2, 4, or 8-bit texture and write to a 16-bit destination bitmap.

- Optimize single color/Gouraud horizontal lines. If you are going to draw a horizontal line, always write the pixels in phrases.

- RGB lighting

- Command queue, why do you have to wait for the blitter to be idle before you set any register? This is a waste of time.

- Reorganize the registers. Why are the integer and fractional coordinates in different registers? What were they thinking?

- Cache, of course

RAM

Dual-port RAM would be nice but it’s expensive; maybe 4MB would be a better choice.

As an extra, I think it would be great to include a second GPU to drive the blitter, something like an RPU (Rasterizer Process Unit) that only runs code from its internal RAM. You’d write a polygon list (or sprites with scaling/rotation info), and this RPU would read it and send the corresponding blitter commands while you process the next frame with the CPU/GPU.


And of course some more MHz; a bus at 13MHz is a bit slow.

Friday, February 7, 2020

Disassembling Supercross 3D

I've been playing a bit with my disassembler, mostly fixing bugs... and I've been using Supercross 3D for testing. Looking at the disassembled code, I can understand why it runs so slowly. OK, the Jaguar is very slow at texture mapping, but the code could be better.

For now I've seen the following things.
  • The code is about 117KB (120,016 bytes to be precise) and it's stored at the end of the cartridge.
  • The game is locked to a minimum of 4 vbls per frame on PAL systems and 5 vbls on NTSC ones, which means it runs at a maximum of 12.5fps and 12fps respectively.
  • There is one block of DSP code; I suppose it's the sound engine.
  • There are eleven blocks of code for the GPU (maybe one or two more, I haven't finished the disassembly).
  • One of the GPU blocks is used just to set the Object Processor List Pointer. This one is never loaded into the GPU internal RAM; it runs from ROM.
  • There are about 20KB (21,184 bytes) of dead code or unused data. They are spread around the code and most of them end with a $4E75 (the rts opcode), but they are never referenced or called.
  • Short branches are almost never used.
  • It waits for the blitter to be idle in several places, but IMO if you are using the 68000 you don't need to wait because it has lower priority (68000 < blitter), so if the blitter is busy the 68000 will be stopped. The only advantage of not having a cache.
  • There are some link/unlk opcodes. Also, some routines push values onto the stack, jump somewhere, load the values from the stack into registers, and jump again to do the actual work. I think some parts are written in C and others in assembler, and this kind of routine is used to jump from C to ASM.
  • Some parts of the game depend on whether the system is PAL or NTSC, but it reads the hardware register each time it needs to know instead of using a flag.
  • The game runs in 8-bit mode with colors in CrY format (not 100% sure).
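
The frame rate cap in the list above is simple arithmetic: the display refresh rate divided by the number of vbls each frame is held for. A minimal sketch, assuming the nominal 50Hz (PAL) and 60Hz (NTSC) refresh rates; max_fps is a hypothetical helper name.

```c
/* Maximum frame rate when every game frame is held on screen
   for `vbls_per_frame` vertical blanks. */
double max_fps(double refresh_hz, int vbls_per_frame) {
    return refresh_hz / vbls_per_frame;
}
```

So max_fps(50.0, 4) gives 12.5 for PAL and max_fps(60.0, 5) gives 12.0 for NTSC, which matches the numbers above.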

And now some code snippets. All of them are actual code (it's full of them).
move.w (a0),d0
addq.w #1,d0
move.w d0,a3
move.w a3,-(sp)
jsr l01e3e0e
At least it uses quick add. I think this is used to increment the lap count and print it.

move.w #0,l01b72d8
move.w #0,l01b72da
move.w #0,l01b72dc
move.w #0,l01b72de
move.w #0,l01b72e0
...

What about using a data register and post-increment addressing?

move.l a1,-(sp)
move.l #l01ece80,d3
move.l d3,a1
jsr (a1)
Because jsr l01ece80 is too easy.


By the way, I've found two bugs in my assembler when I was looking at the disassembled code to write this post.