IPv6 ready (sort of)

I just got an IPv6 subnet from Hurricane Electric Tunneled to me.

2001:470:B02D::/48

TekSavvy is currently in the process of implementing native IPv6 routing over PPPoE. ARIN assigned us a subnet that should be large enough to “route the internet” altogether 2607:f2c0::/32. We already have IPv6 configured in our core but still need to figure out the distribution level. We also need to provide basic IPv6 services such as DNS and Mail.

If this could help anyone, I’ve spent at least an hour trying to figure out the simplest of problem in my Cisco router at home. Don’t forget to type in this command or else things won’t work.

(config)#ipv6 unicast-routing

FreeBSD/Linux Crypt() md5 on a PC

So after working on this project on my spare time I’ve been able to achieve close to 10,000 FreeBSD Crypt/s on a 2.2ghz Opteron (175). The code is running on both cores, so 5000 crypt/s per core. I could probably speed this up quite a bit by using SSE2 or 64bit registers but I’ll keep it that way for now.

At this point I’ll start porting this over to the PS3 so that the md5_transform function is performed by the SPUs. The challenging part will probably be to transfer the crypt blocks from the PPU to the SPus. I hope to be able to reach arround 100,000 Crypt/s which it think sounds reasonable.

FreeBSD/Linux Crypt() md5 on the PS3

The now commonly used Linux/FreeBSD passwd hashes being a modified version of md5. The orginal code can be found in the FreeBSD source tree under src/lib/libcrypt/crypt-md5.c or here.

The implementation is basically trying to slow down the md5 process as much as possible from what I can tell. It runs through the basic md5 function over a thousand times over and over using the salt and the password.

After doing more research (or in other words understanding more what I’m doing). I need to implement the RSA defined functions MD5Init, MD5Update and MD5Final. MD5Init being quite simple. It basically initializes the a,b,c,d values. The PS3 uses vectors so we will initialize it this way.

vec_uint4 a = (vec_uint4){0×67452301, 0×67452301, 0×67452301, 0×67452301};
vec_uint4 b = (vec_uint4){0xefcdab89, 0xefcdab89, 0xefcdab89, 0xefcdab89};
vec_uint4 c = (vec_uint4){0×98badcfe, 0×98badcfe, 0×98badcfe, 0×98badcfe};
vec_uint4 d = (vec_uint4){0×10325476, 0×10325476, 0×10325476, 0×10325476};

Now MD5Update is quite a bit trickier. It appears to perform the MD5 transformation on every 64 bytes chunk of data, which in this case, one chunk should be plenty enough for password storage. The MD5 Transformation itself performs the following operations. I hope wikipedia won’t mind this.

F(X,Y,Z) = (X\wedge{Y}) \vee (\neg{X} \wedge{Z})
G(X,Y,Z) = (X\wedge{Z}) \vee (Y \wedge \neg{Z})
H(X,Y,Z) = X \oplus Y \oplus Z
I(X,Y,Z) = Y \oplus (X \vee \neg{Z})

For which I’ve respectively implemented using these operations and of course, vectorized, so in other words it performs the operations 4 times.

F: vec_t = spu_xor(*vec_d, spu_and(*vec_b, spu_xor(*vec_c, *vec_d)));
G: vec_t = spu_xor(*vec_c, spu_and(*vec_d, spu_xor(*vec_b, *vec_c)));
H: vec_t = spu_xor ( spu_xor ( *vec_b, *vec_c ), *vec_d );
I: vec_t = spu_xor ( *vec_c ,spu_or(*vec_b, spu_eqv (*vec_d,vec_null)) );

Honestly, I’m still trying to figure out what MD5Update does exactly. While it may appear to be simple for some coders out there, my mind is still trying to grasp the concept for lack of documentation about it.

Finally MD5Final appears to be adding padding to the 512 bit buffer, adds the bit count and then does a final transformation.

So how fast can the MD5 algorithm really go on the PS3? 80 million hashes per second.Thats not too shabby I believe although a bit deceiving.

With the MD5crypt algorithm, the MD5 hashing function is executed ~1000 times per encryption so one can ball park the number of key/s second to about 75,000 key/s.

This is pretty good considering these benchmarks. On a E6750 overclocked at 3.6ghz it reaches 15500 keys/s per core or 31,000 on both cores. Considering the price of building a PC supporting this CPU not to mention overclocking it to those speeds, the ps3’s performance isn’t too bad.

-G

Star Wars: The Force Unleashed

I’ve been playing the game for the past few days. I haven’t heard anything about Star Wars for the past couple months if not years and out of nowhere all I see everywhere on TV and just in plain life always seems to come back to that lately. Weird. Thought I’d share.

Tags:

DMCA Settlement-o-matic

Just talk about plain wrong. It’s now possible to settle with the RIAA over P2P infringement auto-magically over the net.

Yeah thats right, see for yourselves here

You can conveniently pay by Credit Card. How much? I have no idea… but the concept is just wrong. Where does that money go in the end? Back to the artists? I don’t think so.

Intel 6 core Xeon chip is out

Codenamed “Dunnington” it seems like the “race for cores” has truly begun. More info here

Tags: ,

MacBook died today

Yeah thats right. I left it on for a while and I think the heat fried the hard drive. It simply won’t even get past the “Apple logo” on boot. Well that sucks….gotta fork out some more money for an expensive laptop hard drive. Good thing I still have a Mac desktop downstairs, which sorta also sucks since I hate being some computer geek basement-cave dude.

-G

Tags:

Cell Processor

The Cell Processors found in PS3s and the less common IBM Blade servers are quite simply what I believe to be the most cost efficient per flop CPU you can buy these days. It contains 1 PPE (Extended PowerPC) and 8 (Only 6 available with the PS3) Synergistic Processor Units, all of them running at 3.0ghz.

One can easily setup a development lab with an inexpensive (sort of) PS3 and install any linux flavor on it. It’s fully supported by Sony and instructions to do so can be found all over the net. I personally chose to use Gentoo and instructions can be found here http://wiki.ps2dev.org/ps3:linux:installing_gentoo.

The real development challenge with the Cell Processor is that it contains two different architectures. One being a slightly modified PowerPC Instruction set. The other running on a very basic (but fast) SIMD instruction set named Synergistic Processor Unit Instruction set quite simply. In other words, you need two different GCC compilers. You compile the code for each processor seperately and then link everything together. The GCC SPU compiler is still in beta and doesn’t optimize very well yet in my opinion.

For some reason my main interest in programming has always been about getting an algorithm to run faster and in a more efficient way especially when it comes to cryptography. I’ve always been a big fan of John the Ripper (Don’t ask) but I really hate the fact that its not multi-threaded. So I decided to code an MD5 function that would run on the processor and just see how fast I could get it going and of course, share the load between the SPUs.

So I started coding for a few days until I realized that someone had already done the same.. DOH! The Publication can be found here. The  code itself only does 80 millions MD5 hashes per second if I remember well while he claims that he optimized the code to run 1.9 Billion/s which can’t be found anywhere. So that really got me going… I simply started using his code and made a few adjustments in his algorithm. The main problem I found was the fact that the compiler was branching too much which the processor is absolutely awfull at executing. So I changed a few settings in the compiler to avoid branching as much as possible, re-ran the code …OMG… just couldn’t believe the results when I saw the number 31 billion on screen. I thought that something must have been wrong. I added verification code to ensure that it was actually hashing properly…and it was. Granted that the MD5 hashes are generated from a 8-Byte string, it’s still thousands if not millions times faster than a modern PC.

Now the problem with the Cell Processor is that each SPU can only access 256Kb of memory at a time. And the SPUs only do well if they run a task that can be vectorized and without branching. Transfers between the main PowerPC and the SPUs happen through DMA which are quite efficient but still not enough. Lets just say that 31 Billion * 128bit = 3.968 Terabits per second which is …hum…a little retarded. It brings forth the main problem with the Cell Processor, Bandwidth. I’m still working on code on my spare time but this obviously means that the processor cannot reach 31 billion hashes per second in a useful manner.

I just found documentation on WIkipedia and I quote

Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128B, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128Bx1.6 GHz = 204.8 GB/s”

So this means that at a theoretical speed of 204GB/s the PS3 could reach a transfer speed of 12.8 Billion 128bit MD5 hashes per second more or less. Of course I’m sure it’s probably less.

Anyway… more to come.

-G

Tags: , ,

Why Open a new Blog?

No clue. I simply like the idea of having my own space to share ideas. I like the idea of blogging my ideas and document all the projects that I tend to get into and never fully finish. Gee Did I say idea enough now? Gotta learn to re-read myself… Anyway, I’m thinking that this blog will probably help me track what I’ve done.

I currently work at TekSavvy Solutions as a Network Admin/Programmer/Everything else under the moon. I really do like my job as it seems that every day brings something new.

Anyway, I’m really bad at this self writing stuff so I’ll just leave it there.

Back to top