December 19th, 2017


Once we had enough memory that we no longer needed to commit atrocities in the name of space efficiency, state still bit us on the backside. Huge programs were written where many many functions used many many different bits of state. No part of the state could be changed without changing many parts of the program.
The enormous cost of such programs led to a backlash. Programs as state-machines were bad. State itself must therefore be bad. This led to the development of things such as atomic functional programming, where there is no state, only functions. Despite their conceptual and mathematical elegance, functional programming languages never caught on for commercial software. The problem is that software engineers think and model in terms of state.
State is a pretty darn good way to think about the world. Objects represent a middle ground. State is good, but only when properly managed. It is manageable if it is chopped into little pieces, some alike and some different.
Similarly, related programs are chopped into little pieces, some alike, some different. That way, changing part of the representation of the state of a program need lead to only a few, localized changes in the program.
- Refactoring, Martin Fowler and Kent Beck
Modeling cannot completely replace coding. Modeling deals with higher level changes in the program as a state machine, not with the tactical challenges of coding specific manipulations on particular types of data.
That said, the mind sets of those who think in terms of models and then write code where necessary, and those who write code and put together some sort of model where necessary, are intrinsically different.
The quote above assumes that “software engineer” means, specifically, a Smalltalk developer. Smalltalk is particularly suited to the modeling mind set. The main author, Martin Fowler, while using Java examples in the current edition of the book, complains throughout about the lack of decent refactoring tools in Java environments relative to Smalltalk environments.
Properly speaking, Smalltalk is not “object-oriented” in the sense that term has come to carry, but is a “topological entelechy” language. If you don’t understand the latter phrase, it should give you an inkling that you don’t understand in what way Smalltalk is unlike other languages, even those with similar syntax.
This difference, together with the tendency to prefer coding over modeling because coding is easier to understand and, from the project management perspective, easier to monitor, leads to inefficiencies that not only add to the cost of software projects but, in many cases, cause them to fail altogether.
Web developers, in particular, have a penchant for avoiding maintaining state, as the number of “new” technologies within that ecosystem that attempt to avoid it testifies. The most likely reason is the ecosystem’s history: beginning with web sites, then slowly adding some JavaScript code, initially only to manipulate page elements.
A major difference between the two styles of thinking is that one thinks in terms of relations primarily, and puts ‘things’ that satisfy the needs of the relation where they need to be, while the other thinks in terms of ‘things’ first, then tries to find ways to relate them.
Another thing to be clear on: modeling is no more about architecture than coding is. Architectural constraints may appear to affect modeling more directly, but in many cases modeling provides a means of meeting those constraints in several different ways, something that coding alone cannot accomplish without multiplying the work by the number of approaches implemented.
It occurred to me to write this while reading a book on design patterns in Smalltalk, which despite having ‘design’ in the name, is largely concerned with formalizing well known tactical solutions to common coding problems. Specifically, the following passage made me think about the mindset of a coder as opposed to a modeler.
I wrote the section on Temporary Variables before I wrote this section. I was pleased with how the section on temps came out. I expected this section to turn out to be the same sort of cut-and-dry, “Here’s how it goes” list of patterns. It didn’t.
The problem is that temporary variables really are all about coding. They are a tactical solution to a tactical problem. Thus, they fit very well in the scope of this book. Most uses of instance variables are not tactical.
Along with the distribution of computational responsibility, the choice of how to represent a model is at the core of modeling. The decision to create an instance variable usually comes from a much different mind-set and in a different context than the decision to create a temp.
I leave this section here because there are still important coding reasons to create instance variables, and there are some practical, tactical techniques to be learned when using instance variables.
- Design Patterns, GoF
Modeling is very well defined here: it concerns the distribution of computing responsibility and the choice of how to represent state (and therefore also manage state changes).
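A minimal sketch of that distinction, with hypothetical class and state names that come from neither book: the instance variable below is a decision about how a transfer job represents its state and which object is responsible for changing it, while the temp inside the method is a purely tactical convenience.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical example: one image-transfer job modeled as a small state machine.
final class TransferJob {
    enum State { QUEUED, COPYING, VERIFIED, FAILED }

    // Instance variable: a modeling decision. It says this object owns the
    // job's state, and that the state is one of four named values.
    private State state = State.QUEUED;

    State state() { return state; }

    // The job itself is responsible for deciding which transitions are legal.
    void moveTo(State next) {
        // Temp variable: a tactical convenience, gone when the method returns.
        Set<State> allowed = allowedFrom(state);
        if (!allowed.contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        state = next;
    }

    private static Set<State> allowedFrom(State current) {
        switch (current) {
            case QUEUED:  return EnumSet.of(State.COPYING);
            case COPYING: return EnumSet.of(State.VERIFIED, State.FAILED);
            case FAILED:  return EnumSet.of(State.QUEUED);    // retry
            default:      return EnumSet.noneOf(State.class); // VERIFIED is terminal
        }
    }
}
```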
Of course, every program does this in some manner, but from the tactical, coding perspective, state is a necessary evil, something to be avoided as far as possible, and so tends to be at best an afterthought, or better, left for the middle tier developers to deal with.
Distribution of responsibility, as well, is largely left up to the system; from the coder’s perspective everything could go in one big method and would be more efficient, and as long as the code is properly re-entrant, or even better, written in a single-threaded language like JavaScript, there are no problems. In Apex, a DSL written in Java and visually resembling it, but functionally more like JavaScript, I saw a 1200+ line method. Worse, it was the key method in a ‘god class’ and was called as part of virtually every system process.
I remember trying to explain REST a few years ago to a project manager, who luckily was a former programmer, because the ‘enterprise architects’ wanted us to use it between modules of a very state-intensive application. When the project manager finally realized that state would not be shared across modules by the live system, he just about exploded: “Then it’s not an application; an application is by definition a state f%%^^*$ machine!” After which he promptly hung up on said “enterprise architects” and had sufficient political pull at the company to have them barred from any input into the project.
Pointing out that REST uses precisely the same command set as HTTP, which implies that it creates a web site and not an application, didn’t mollify said project manager very much.
REST proponents may point out that a representation of state is transferred; after all, that’s what REST means. However, they’re falling into the same trap as psychologists who only think in terms of ‘mental representations’ without considering that an original presentation must have occurred in some manner for a re-presentation to be possible.
REST is also not necessarily, and in fact not usually, truly resource oriented, since by and large the representations in the payload are aggregations, which causes all kinds of issues if the client needs to change the state.
An analogy might help understand the problem: a payload of data from an aggregated REST call is a convenient and useful representation of complex data, much as a bank statement for a company with many accounts, lines of credit, company credit cards etc. is a convenient representation of that complex data.
Now imagine the recipient of said statement (who downloaded the statement in .csv format) imports it into Excel, makes various changes to the numbers, and uploads it back to the bank, expecting the bank to change the appropriate accounts that those numbers aggregated. The bank would be lost as to where to even start, particularly since the state that produced the representation no longer necessarily exists at the server.
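The analogy can be reduced to a few lines of Java. This is only a sketch with made-up account names and balances: the aggregation is a many-to-one function, so two different server states can produce the same representation, and an edited representation carries no information about which underlying resource should change.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the bank-statement analogy: the server aggregates
// several resources into one convenient representation.
final class StatementDemo {
    // Aggregate: many account balances (in cents) collapse into one total.
    static long total(Map<String, Long> balances) {
        long sum = 0;
        for (long balance : balances.values()) {
            sum += balance;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> stateA = new LinkedHashMap<>();
        stateA.put("operating", 70_000L);
        stateA.put("payroll", 30_000L);

        Map<String, Long> stateB = new LinkedHashMap<>();
        stateB.put("operating", 10_000L);
        stateB.put("payroll", 90_000L);

        // Two different server states produce the same representation...
        System.out.println(total(stateA) == total(stateB)); // prints "true"
        // ...so a client that sends back an edited total gives the server
        // no way to decide which account the change belongs to.
    }
}
```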
On the other hand, if the client makes REST calls to access every individual resource separately it will result in complex aggregations (and their inversions) being written in the browser in JavaScript, precisely by coders who find maintaining state in a single threaded language too difficult, as well as causing possibly thousands of network calls to get the data for one visual data view.
The popularity of REST, despite these various issues, is largely a consequence of coders’ tactical desire to avoid maintaining state and managing state transitions.
It is in such areas, as well as areas such as distributing parallel tasks while maintaining transactionality, that modeling is most useful as a complement to coding, but as I said, it takes a different mindset.
A modeler can code, but they’re unlikely to be all that interested in it other than as a means to complete functionality that can’t easily be expressed in a model; likewise, a coder can learn modeling, but they’re unlikely to be all that good at it.
A good way of understanding the difference between the two mind sets is as follows.
Coder Requirement

A facility with the symbolic manipulation of linear operators;

Modeler Requirements

An intuitive understanding of the logical structure of new models and the need for meta-models;
An intuitive understanding of the combinatoric superstructure of new models (i.e. understanding how all the models and meta-models in a system interact).

As you can probably guess, a software engineer needs to be good at all three.
Unfortunately, most people are only good at the first or second of the above requirements. Those who are good at the second two together are less common, and those who are good at all three are less common still.
The problem is that systems design, including understanding data state and how to model it and its transitions, is heavily dependent on the second two. The first is more of a tactical issue of implementation. Coders notoriously jump straight into implementation; it’s what they’re good at.
The issue is not that the second two are intrinsically difficult, but that they require an ability to project systems imaginatively, and software engineering hasn’t exactly portrayed itself as all that imaginative a career choice.
Example

In concrete terms, though, why would I want to begin a project with something as labour-intensive (and, for many people, intrinsically difficult and outside their area of comfort) as developing a model up front, when I could simply start coding and worry about state and responsibility distribution later?
Perhaps the following actual example will give some idea of why I would choose this approach, and also show where and why hand coding remains a necessity.
A product owner approaches me with a problem he needs to solve. In the data stores of the company there are hundreds of thousands of high-resolution images. These are kept on various types of media — whatever was current at the time.
Since many of the readers for such media are now unavailable and parts are even difficult or impossible to find, he needs to create a data warehouse of all these images on a big NAS system while the readers are still functional. So far so good, no software issues yet, maybe he’s just thinking out loud?
Nothing works out that easily though.
While transferring these images he wants to create an image-specific data store, and this is where it becomes my problem. These images are all in TIFF format but, as I already know, “TIFF format” can mean just about anything, since the tags in the “tagged image file format” determine the actual format of the image data that follows: it could be CMYK or RGB data; it could be interlaced or not; etc.
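To make “the tags determine the format” concrete, here is a rough sketch of reading the first image file directory (IFD) of a classic TIFF from a byte array. It is not the company library mentioned below, and it makes several assumptions: only single-valued SHORT entries are read, offsets are not validated, BigTIFF is ignored, and I am guessing that “interlaced or not” corresponds to the PlanarConfiguration tag (chunky versus planar sample layout). The tag numbers are the ones defined in TIFF 6.0.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: reads single-valued SHORT tags from the first IFD of a classic TIFF.
final class TiffTags {
    static final int PHOTOMETRIC_INTERPRETATION = 262; // 2 = RGB, 5 = separated (usually CMYK)
    static final int PLANAR_CONFIGURATION = 284;       // 1 = chunky, 2 = planar
    private static final int TYPE_SHORT = 3;

    /** Returns tag number -> value for every single-valued SHORT entry in the first IFD. */
    static Map<Integer, Integer> readShortTags(byte[] file) {
        if (file.length < 8) {
            throw new IllegalArgumentException("not a TIFF: header too short");
        }
        ByteBuffer buf = ByteBuffer.wrap(file);
        // Bytes 0-1: "II" = little-endian, "MM" = big-endian.
        if (file[0] == 'I' && file[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (file[0] == 'M' && file[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("not a TIFF: bad byte-order mark");
        }
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("not a TIFF: bad magic number");
        }

        int ifd = buf.getInt(4);                  // offset of the first IFD
        int entries = buf.getShort(ifd) & 0xFFFF; // number of 12-byte entries
        Map<Integer, Integer> tags = new LinkedHashMap<>();
        for (int i = 0; i < entries; i++) {
            int at = ifd + 2 + 12 * i;
            int tag = buf.getShort(at) & 0xFFFF;
            int type = buf.getShort(at + 2) & 0xFFFF;
            int count = buf.getInt(at + 4);
            if (type == TYPE_SHORT && count == 1) {
                // A lone SHORT is stored left-justified in the 4-byte value field.
                tags.put(tag, buf.getShort(at + 8) & 0xFFFF);
            }
        }
        return tags;
    }
}
```

A caller would then branch on `tags.get(TiffTags.PHOTOMETRIC_INTERPRETATION)` and `tags.get(TiffTags.PLANAR_CONFIGURATION)` before deciding how to interpret the pixel data.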
Luckily there are libraries available at the company that go some way beyond ImageMagick in terms of being able to extract data from TIFF images, including libraries to create smaller versions of these files, since most of them were created on Crosfield or Highwater scanners and as a result are at 2540 DPI resolution.
In terms of the actual formats of the images some are standard RGB Photoshop-type TIFFs, but the majority originated on either Scitex machines, which use CMYK interlaced TIFFs, or Quantel Graphic Paintboxes, which use CMYK non-interlaced TIFFs.
From what I can gather from the initial conversation I have the following basic requirements:
• a data store that is browsable by images, which I figure can be implemented as some sort of key / value map data store
• images must be searchable by any data that can be extracted from the original image or entered by human beings as tags
• images must also be searchable via some sort of image recognition system whose algorithms have probably not been written yet
• images must be editable via some sort of editor that can be accessed via a rich GUI app or via a web app
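The first two requirements can be sketched as a key/value map plus an inverted index. This is a hypothetical in-memory stand-in, not the eventual data store; the class name, field names and the `field=value` index key format are all my own inventions for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the key/value image store.
final class ImageCatalog {
    // image key -> (field name -> value), e.g. "photometric" -> "CMYK"
    private final Map<String, Map<String, String>> images = new HashMap<>();
    // inverted index: "field=value" -> keys of the images carrying it
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Adds or changes one field, whether extracted from the TIFF or typed in by a person. */
    void put(String imageKey, String field, String value) {
        Map<String, String> fields = images.computeIfAbsent(imageKey, k -> new HashMap<>());
        String old = fields.put(field, value);
        if (old != null) {
            index.get(field + "=" + old).remove(imageKey); // drop the stale index entry
        }
        index.computeIfAbsent(field + "=" + value, k -> new TreeSet<>()).add(imageKey);
    }

    /** Returns the keys of all images whose field currently has the given value. */
    Set<String> find(String field, String value) {
        return Collections.unmodifiableSet(
                index.getOrDefault(field + "=" + value, Collections.emptySet()));
    }
}
```

Even at this size the state question shows up: changing a tag means updating the record and the index together, which is exactly the kind of bookkeeping that is easy to get wrong by hand.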
As you might have noted, much of the difficult coding is already accomplished in libraries; I just have to plug them in. Other parts are in process, but they’re somebody else’s problem. My problem is largely creating and maintaining the state of a huge data set (and most likely figuring out a decent image recognition algorithm).
Of course, they’re going to need to give me some decent deployment hardware, given the analysis is done on streaming data at SCSI-III speeds, a mere hundreds of megabytes per second.
From what I wrote above, this sounds like a perfect problem for a modeling-based solution. And it is, of course, since I chose it.
Real life never pans out so neatly — i.e. in real life it’s not a project I would have particularly chosen.
Now I have to make some key decisions:

What data store should I use as a test bed, given I may have to use any in the finished product, or even multiple different ones?
What language should I write this in?
What environment should I develop in?

Most of these decisions though are already made, since companies have standard data stores they use depending on the required type, and standard languages and environments. In this case, we’ll say the data store is MongoDB and the language is Java. Environments can sometimes be a bit more at the discretion of the developer, but in this example, we can go along with the company’s preferred environment — Eclipse.
As it happens, Eclipse has a powerful modeling environment: EMF, the Eclipse Modeling Framework. EMF (and extensions such as the Graphical Modeling Framework and the Extended Editing Framework) provides many of the capabilities found natively in a language more suited to modeling than Java, such as Smalltalk, by creating a parallel object hierarchy, beginning with EObject, that takes care of the reflectivity necessary for adept modeling techniques.
So, to begin, I’m going to create a simple EMF model (or Ecore model) that contains the basic data I need to capture for each image. Along with the textual data in the TIFF header I’ll need the two low-resolution (thumbnail and viewer sized) images, and a reference to the location of the original. Since I don’t work in a company with myriads of images by total accident, I also figure I’ll need to derive a vector representation of the image so that the recognition and matching algorithms can do topography over topology image matching.
I have code I wrote years ago to generate EPS vector versions of TIFF bitmaps; it will be easy enough to convert that code to output SVG rather than EPS. I also need to create fields for editable tags that can be added to or changed by human beings.
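For readers who have not used EMF, the content of that first model is roughly what the following plain-Java class holds. This is a hypothetical hand-written counterpart, not EMF-generated code; the field names and the tag normalization (trimmed, lower-cased) are my assumptions.

```java
import java.net.URI;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical plain-Java counterpart of the per-image model described above.
final class ImageEntry {
    private final URI original;                       // location of the full-resolution TIFF
    private final Map<String, String> headerText;     // textual data from the TIFF header
    private final byte[] thumbnail;                   // small preview
    private final byte[] viewerImage;                 // viewer-sized rendition
    private final String vectorSvg;                   // derived vector representation
    private final Set<String> tags = new TreeSet<>(); // the only part people may edit

    ImageEntry(URI original, Map<String, String> headerText,
               byte[] thumbnail, byte[] viewerImage, String vectorSvg) {
        this.original = original;
        this.headerText = Collections.unmodifiableMap(new LinkedHashMap<>(headerText));
        this.thumbnail = thumbnail.clone();
        this.viewerImage = viewerImage.clone();
        this.vectorSvg = vectorSvg;
    }

    URI original() { return original; }
    Map<String, String> headerText() { return headerText; }
    byte[] thumbnail() { return thumbnail.clone(); }
    byte[] viewerImage() { return viewerImage.clone(); }
    String vectorSvg() { return vectorSvg; }

    // State changes go through the object that owns the state.
    boolean addTag(String tag) { return tags.add(normalize(tag)); }
    boolean removeTag(String tag) { return tags.remove(normalize(tag)); }
    Set<String> tags() { return Collections.unmodifiableSet(tags); }

    // Assumption: tags are compared case-insensitively and without padding.
    private static String normalize(String tag) {
        return tag.trim().toLowerCase(Locale.ROOT);
    }
}
```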
Once I have this model created I can try it out by generating code for the model in Java, a simple GUI editing framework for the model, test cases and a CDO model that will be useful for persistence and maintaining state changes later on.
By plugging in various libraries and writing a bit of glue code, I can actually start bringing in real image data and, for now, storing it in a local CDO repository managed inside Eclipse.
Given that I don’t know all the potential tags in the TIFF headers, I’m going to make that part of the model dynamic, where the model fields are determined reflectively from the data, then used to store, persist and edit those fields, whenever a new type of TIFF is encountered.
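The idea of a model whose fields are discovered from the data can be sketched without EMF. The class below is a hypothetical plain-Java stand-in for what a dynamic model provides: an attribute is registered the first time a tag is encountered, and later values are checked against the type first seen. The class name and the strict type check are my assumptions.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a dynamic model: attributes are not fixed at
// compile time but registered the first time a TIFF tag is encountered.
final class DynamicHeaderModel {
    // attribute name -> Java type first observed for it
    private final Map<String, Class<?>> attributes = new LinkedHashMap<>();

    /** Records one image's header values, extending the schema as needed. */
    Map<String, Object> ingest(Map<String, Object> rawTags) {
        Map<String, Object> instance = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : rawTags.entrySet()) {
            Class<?> seen = entry.getValue().getClass();
            // putIfAbsent returns the existing type, or null for a new attribute.
            Class<?> declared = attributes.putIfAbsent(entry.getKey(), seen);
            if (declared != null && !declared.equals(seen)) {
                throw new IllegalArgumentException(entry.getKey() + " was "
                        + declared.getSimpleName() + ", now " + seen.getSimpleName());
            }
            instance.put(entry.getKey(), entry.getValue());
        }
        return instance;
    }

    /** The schema discovered so far, in order of first appearance. */
    Map<String, Class<?>> schema() {
        return Collections.unmodifiableMap(attributes);
    }
}
```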
I can then generate the code (via an EMF library called Texo) to persist the CDO model to MongoDB (or any other JPA-supported data store).
A fairly complex model (including all the elements required for a reverse engineered topology of each image) might take a week or so to create and test, at which point I have a primitive version of software that meets the initial requirements.
Since the editor is created in Eclipse RCP (Rich Client Platform), I can use RCP remoting, also known as RAP, to make the editing facilities available via a web server to browser clients. Later when I optimize the finished program I can replace RAP with a more dynamic web client, and replace the top layer of the RCP application with Qt to get a more dynamic and attractive rich client.
The full value comes where, after seeing a demonstration of what it can do, the product owner starts to see more clearly what it needs to do.
The more detailed (and changing) requirements can be quickly implemented as a model change and the code for the model, persistence, and editing regenerated and automatically tested.
As requirements on the UI move beyond simple text fields, more advanced controls applied via EEF and GMF are also created once and regenerated whenever the base model changes. I can create any arbitrary views of the data via simple EMF transforms and the map of the object structure is maintained throughout the transforms, allowing transformed views of data to update the originals.
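What makes a transformed view editable is that each element of the view remembers which source object it came from. The sketch below illustrates only that idea in plain Java; it is not how EMF implements its transforms, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a derived view whose rows keep a trace back to their source objects.
final class TraceableView {
    // Source model object (hypothetical).
    static final class Image {
        final String key;
        String caption;
        Image(String key, String caption) {
            this.key = key;
            this.caption = caption;
        }
    }

    // One row of a derived view. The reference to its source is the trace
    // that lets an edit made in the view reach the original object.
    static final class Row {
        private final Image source;
        Row(Image source) { this.source = source; }
        String label() { return source.key + ": " + source.caption; }
        void editCaption(String caption) { source.caption = caption; }
    }

    // The transform: build the view, keeping one trace per row.
    static List<Row> captionView(List<Image> images) {
        List<Row> rows = new ArrayList<>();
        for (Image image : images) {
            rows.add(new Row(image));
        }
        return rows;
    }
}
```

A view that copied the caption strings instead of holding the trace would display the same thing, but an edit to it could not be applied to the original without extra matching logic.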
Finally, I can generate JPA code to persist each version of the model not only in CDO, but in Cassandra, MongoDB, Hadoop or an RDBMS without extra coding.
Since the model comprises both data and behavioural state changes, including specifying responsibility for the latter, the two most basic aspects of software design are captured in the model and can be quickly inspected, judged, corrected and optimized. I can use EMF Compare and EMF Diff/Merge to do full comparisons of my topological/topographical models, potentially saving the poor sods that have to write the matching algorithms a lot of work, maybe their sanity.
In this particular example, the rather primitive, initial version of the software comprises about 600,000 lines of Java code. Perhaps if I hand coded it that might drop by a quarter, but not much more than that. Any new views of the data require approximately 100,000 lines of code, and trying to ensure that a given representation can be successfully modified and persisted is near impossible if those representations are hand coded, whereas the change graph that is an inherent part of the generated code does that for me reliably and consistently.
The only difficulty is that much of the system, until it’s finished and working (and the dynamic parts largely even then), exists only in the imaginative projection I have. This requires not only that one has the imaginative ability, but also that one maintains the projection in one’s imagination until the model is sufficiently realized to generate a working system. That can be exhausting.
But, given the choice between that and hand writing hundreds of thousands of lines of (mostly rote) code, often multiple times, I’ll take it.
Nobody said development was supposed to be easy in every way.
This also helps answer the question of why Eclipse uses so much memory compared to a text editor like the appropriately named “Sublime Text”. (Appropriately named given Hegel’s definition of the Sublime, as “the night where all cows are black” i.e. nothing whatsoever can be distinguished or understood.) That the Pharo OSS Smalltalk environment has better tooling than Eclipse, yet uses fewer resources than Sublime Text, is a bit more difficult to justify.
Of course, neither converting bitmaps to vector files nor accurate image recognition is code that can be generated from a model, at least not at this point, since we don’t even have a theory regarding all the possible correlations between topography and topology.
Thus, to do it at all is going to involve detailed implementation of those aspects.
Those are areas where hand coding is necessary and will be for the foreseeable future, but even in those areas I would tend to trust someone who thinks in terms of relations that determine what things are necessary and where, over someone who thinks of things and tries to make them relate in some way to achieve a result.
