Verilog: Multiples of 3 and 5

Verilog: Multiples of 3 and 5

As part of my verilog exploration, I'm also going to look try to just use the open source tools available for the verilog design chain. Project IceStorm is a Verilog to bitstream flow for Lattice iCE40 FPGAs. For simulation I'm going to use Icarus Verilog. Like so many open source tools, documentation could be better. I've not had a good experience in the past using open source tools for FPGA design, but hopefully these tools work out well. 

VHDL: Even Fibonacci numbers

Project Euler problem 2 calculates the Fibonacci sequence

By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms

This is the sort of thing problem that I've seen people asked to do fairly often in hardware. We're going to use the same top level entity that we described here.  To calculate the Fibonacci sequence you need two registers. To get the next Fibonacci number you need an adder, and to get the sum of the even numbers, you need an accumulator. Two comparators tell you when you're number is even, and when you've hit the max input value. 

block diagram for a hardware implementation

block diagram for a hardware implementation

Translating this directly into VHDL works out pretty well like this: 

And running that through a simple simulation with a clk, enable, and reset and monitoring the results_valid will get us a simple test bench. Taking a quick look at the start of the simulation shows us if we're on the right track. 

We can see the Fibonacci sequence in the sum register, and the even signal toggling each times it's even and the results accumulator summing these even numbers. 

We can see the Fibonacci sequence in the sum register, and the even signal toggling each times it's even and the results accumulator summing these even numbers. 

And that's about it. We could change the generic we pass in for the max_value to get this to run up to any maximum. One of the things that surprised me is how fast we got to the solution brute forcing our way through each Fibonacci number. 

MyHDL: Multiples of 3 and 5

MyHDL is an open source python package for hardware design. It's biggest apparent benefit over other languages is that with it you can use python for the entire process of designing a chip, from hardware design, to unit test, to tool control (as a tcl replacement), to other glue logic to get the design through synthesis and place and route tools. The manual and tutorial the developers have written is pretty good, and worth reading through if you're interested. The designers of the language say that it might be an easier language to start off with than VHDL or Verilog, but that sounds so much like what the C - like language guys say that it's hard for me to believe that. Since it's an addition to python though, there are a lot of resources out there, probably more than VHDL or verilog though they won't be hardware design specific which can also be an issue. Python has modern methodologies in mind already, and the tools to deal with them which can be a nice change of pace from some of the older processes hardware design can seem mired in. Like other go between languages, the tools have to convert to VHDL or Verilog prior to synthesis and place and route. 

The package installs pretty easily with 

pip install myhdl

on my linux box, though I do remember it being a little more involved installing it in cygwin. We're going to start out with the same project euler problem set, starting off with problem 1

Find the sum of all the multiples of 3 or 5 below 1000

Since I've solved the problem before, I have access to the forums where people post their solutions to the problem, so I thought it'd be fun to try taking a pythonic solution and see if I can get it into hardware with the myHDL package (spoiler alert, I don't think it's going to work). Here's the function that I'm going to use in unittest that I'm going to try to massage into turning into compilable hardware


To model concurrency in python, MyHDL makes heavy use of generators in the same way VHDL uses processes or Verilog uses always blocks, with the yield statement in python acting as the sensitivity list. Here's another unittest function: this describes the generator that creates the clk. 

the use of these myHDL specific decorators (always(), always_comb(), always_seq()) seems pretty familiar coming from VHDL, though only dabbling in python I'm not sure how familiar they are to those coming at it solely from a software background. 

Because we're doing hardware design, we need some help with hardware oriented objects. MyHDL gives us Signals as the objects to connect different concurrently running generators, and intbv (integer bit vector) for support for indexing and slicing particular bits out of a single object and being more hardware cognizant with the sizes of our signals and registers. We're going to use the same top level structure as the other problems, with clk and reset for synchronous design going in, enable to start processing, and results and results_valid as outputs. 

myhdl_hardware_interface

So in our unit test, we'll have to define these inputs and outputs as Signal objects:

So we've got most of the unit test pasted in chunks above. Only a few things left to make it actually run a test, use the traceSignals function in myHDL to create a UUT instance so we can get a trace of the inputs and outputs to help with debugging, create an instance of the checkResult code, and pass that plus the clkgen to the myHDL simulation function. So next up we define the generator that runs this python code. We're going to do some bad practices and just copy the test function into a generator and give that a go. First pass will look like this: 

And then run the unit test, and voila! 

gtkwave_euler1_myhdl_hardwary

This is a little suspicious though, if we have the result in less than a clock cycle, we probably don't really generate any hardware that calculates the result. So we'll add a toVHDL function to see what the hardware output looks like. 

So good news, and bad news. The good news is the python is convertible to synthesizable hardware. The bad news is this is definitely not the sort of algorithm implementation I was looking for -- the algorithm isn't implemented in hardware, it's calculated ahead of time and the hardware spits out the result. This is one of the issues in choosing problems from a site focussing on the algorithm to implement the problem instead of focussing on problems that hardware would be good at. 

So we can modify the function to count by cycle instead of using the range function. And then we'll need to create our own summing registers to store those values. And finally, we run into a little bit of a hack figuring out when we're done counting -- knowing the problem we know that the threes counter will take the longest to get to the end. And we know it'll take an extra cycle for the threes accumulator to then sum up the end. Finally, because we are converting things to VHDL, MyHDL requires you to use convertable types with bitwidths defined. This isn't necessary for simulation only, but could certainly be a little confusing. Here's a pass trying to implement the 4 line pythonic solution in hardware convertable code. 

We end up with 25 lines of code, and a little bit of hacking to get timing to work out just right. We could turn the fives and fifteens accumulators into independent functions to keep from repeating how the work, but in the end that doesn't save us any code since there's only two of them, and in conversion to VHDL makes even more separate processes. 

One of the other take-aways from this first pass is the converted VHDL is surprisingly readable and editable. Other mechanized tools that do this sort of conversion have a pretty bad track record, but this uses the same signal and variable names used in python, uses function names as process names, and even will take comments that use the python DocString format and convert them over to VHDL comments. 


VHDL: Multiples of 3 and 5

One of the reasons this is such a great starting problem is that anyone can solve it very easily with pretty minimal brute force, which I imagine is the idea to rope people in to start seeing the joy of determining more refined algorithms. I don’t actually have that joy, but I don’t mind solving this problem in a variety of languages, even if that doesn’t end up being a variety of ways. From the site: 

Find the sum of all the multiples of 3 or 5 below 1000.

Since we’re doing synchronous hardware design, our module will have a clock and a reset going into it. It will have a results bus coming out, and a signal telling us when the results are valid. To make solving this multiple times slightly more interesting, we’re going to solve this for any number, n. In this case I made it a generic in VHDL, but it could (and maybe would) also be an input bus.

VHDL entity describing the inputs and outputs of a module. 

VHDL entity describing the inputs and outputs of a module. 


So in hardware we can design a counter to count from 1 to n. We can have an accumulator that will sum the current counter number with the previous sum based on some control logic. And the control logic has is the logical OR of two similar calculations, on whether the count value is divisible by 3 or by 5. Dividing is actually a very bad idea to do in hardware, but we could instead have parallel counters continuously counting from 1 to 3, and 1 to 5. Any time these control counters hit their max value, the main counter will be a multiple of 3 or 5. So from there, we get this block diagram:

We can take this block diagram and turn it pretty literally into a VHDL architecture as is which looks something like this: 

That's pretty long, and one of the major complaints about VHDL coding is that it is quite verbose. But we don't have to take the diagram that literally either. Each element that is sequential (the counters and accumulator) can be inside a single sequential process in VHDL instead of each having their own. Then we can put all the combinatorial logic together in a single section of code and save ourselves about 40 lines. Still not great, but getting better. 

Next up, we have to run this code through simulation. This is where I give a shout-out to Emacs, which while often being a pain in the ass, has one of the best VHDL modes out there that I know about. One of the great things it can do is allow you to is a lot of creation of objects based on a port definition. Select a port map and copy it using the VHDL port copy command, and then you can paste that port as an entity, instance, signals, initialization, or even the entire test bench. This is pretty helpful for such a verbose language. 

The test in this case can pretty easy, we just need to provide a clock, and set enable active and then wait for results_valid to go high. This is a pretty crappy test, but since we don't know what result we're expecting going into the problem solving that's all that we have available to us with the entity definition we used. A quick look at the waveform at the startup to make sure everything is working... 

we can see the modulus counters cycling here, the accumulate enable signals going active, and the results getting summed. 

we can see the modulus counters cycling here, the accumulate enable signals going active, and the results getting summed. 

and voila, we have some hardware that can solve the first problem from Project Euler, for any value of n that we provide in the generic parameter we pass in. 

How's this compare to software implementations? Here's what a friend wrote for me when I asked him to solve the same problem in C++

int sum(0);
for (int i = 0; i < 1000; ++i) {
  if ((!(i % 3)) || (!(i % 5))) sum += i;
} 
return sum;

So, 5 lines of code in what I took about 40 lines to do. That hurts some. You can also see he tried the same dumb brute force algorithm as I did, looping through all the numbers, checking to see if the number was a multiple of 3 or 5, and then summing it if so. And since he did that, when we timed his code being run in a processor, vs my code running in an FPGA my code ran about twice as fast as his, around 1.3 microseconds vs his taking 2.6 microseconds.