http://idcee.org/p/mark-zbikowski/
Mark Zbikowski has more than 35 years of experience in the technology industry, primarily leading the architecture, design and development of operating systems. From 1980 to 2006, he worked at Microsoft and was deeply involved with multiple products and technologies, including DOS, OS/2, Cairo, NT and Windows, in many roles, from individual contributor to development manager and architect. Since 2006, he has taught at the University of Washington and acted as an advisor to several startups.
LESSON 1
Don’t always do exactly what the customer wants.
What he asks for is his solution to a problem.
Find the problem and do the right fix.
But remember, the customer does pay the bills.
IDCEE ’13
LESSON 12
Use features that help the project, not the programmer.
Just because you use a higher-level language does not mean you shouldn’t care about what the CPU executes.
I came to MSFT…
As employee #55…
At the time, MSFT really was very much like a startup.
30 developers, in about 7 different projects
Over my time, I wore many hats
Writing, probably, a million lines of code in total
But over 25 years, that’s only roughly 200 lines of code per working day
But think about all the OTHER things you do: meetings, etc. 200 lines of code is probably what you do too.
I managed small to mid-size groups
Hated being non-technical, but technical management is a good way to leverage your expertise
Not only many jobs, many projects
Bits and bytes guy, liked the low level
But many lessons learned, mostly in how to make successful products. Keep “success” in mind.
In my second year…
At Microsoft, IBM came…
To talk about what was next for DOS
But not really, it took a while for us to realize this…
But IBM never really had a software-only release. Everything was tied to a hardware release
IBM was preparing to ship a new PC with lots of storage.
Enough for all of the sources to DOS and tools and still have more than 90% left over
It was a massive 10 MB disk…
But at the time that was better than 360 KB per disk. IBM wanted DOS to take advantage of this disk…
And they were telling us how to do it. They had lots of experience on their mainframes.
Neither of these items was going to give the programmer or the end user any benefits.
Oh, but they had lots of experience on their mainframes.
Now while the customer…
… is always right, he may not know what he’s right about.
He has a problem and wants it fixed
You get the chance to make a different fix, one that is more broadly applicable, or more flexible, or more performant.
And don’t get the customer mad…
Anyway, so the requirements were quite a bit off…
So we brainstormed…
And came up with a solution that required a fair bit of education of the customer.
Their OS did not have such a thing, but…
… it was something that we are all familiar with today.
Now IBM gave us only four months to do this
Except that we had to rewrite pretty much the entire file system.
And double the number of APIs…
Because the current API could only access the current directory.
And we had a bunch…
Of ideas about what we needed to do to make DOS more and more useful.
We held many brainstorming sessions trying to figure out the relative difficulties and priorities involved.
This helped…
… shape what we wanted to offer IBM in addition to the hardware support they required.
You need to know how you want your product to evolve. Your version 1 product CAN’T be the end of your ideas and innovation.
Online poll
Obviously, we looked to Unix for inspiration, but we really couldn’t use Unix code
Not because of any licensing; GNU wasn’t even thought about
But because…
DOS had some pretty severe memory limitations…
I managed (with several visits to Boca Raton, Florida) to convince them of the need for more than just hard disk support
They accepted a limited set of enhancements as long as the size constraints were observed.
This gave us a chance to…
Introduce extra value to the customer quickly, without endangering his schedules or your (joint) quality
We pushed the schedule to six months, which was not a pleasant experience
Shipped as …
DOS 2.0
Which was remarkably successful.
Surprised the heck out of me.
No sooner had we finished, when our customer came by again and said…
… We have some new hardware.
A 20 MB…
Hard disk which they acknowledged did not need any OS change…
… and a super-secret network card.
IBM said: now, for the network card, we want…
… you to modify your code in the following places and in just the following way. They gave us a list of locations in the code, which were inaccurate, and a number of interfaces to use.
We didn’t like how this was sounding …
…what the ultimate product was going to look like.
The interfaces were…
… shall we say… interesting. Really really narrow, almost like a paranoid OO developer had been there.
We asked about the details of what was behind the interfaces but IBM refused to tell us.
We were on a tight…
… deadline (again), expecting to ship in nine months since they had a big hardware announcement planned
At this point, the DOS team
was only two people and, as good as we were, doing development between coasts …
and across narrow interfaces (time zones, lack of email, and the code itself) meant things just didn’t work as expected and progress was very slow.
With two months to go including QA time, the customer gave us a 130-state
error recovery state machine for their network card which was driven partially by us and partially by them…
through another new interface.
This was not going to be pretty.
We went back to our code…
and realized that we really couldn’t do the work in the time remaining; in fact, we’d have to redo about half of what we’d written already and this would take us another six months to develop and stabilize.
Remember IBM’s planned hardware announcement? We had a conference call to let them know about the schedule
Yeah, it was really pretty bad. There was silence on their end of the conference call for about five minutes.
There was no way around either the estimate or …
… telling your customer the bad news. Adding people wouldn’t have helped
Communication with all stakeholders is key, from customers who are making business plans based on your deliveries to your marketing teams who are planning announcements and events.
As we went…
back to work, it became clear that their interfaces weren’t providing them what they or we needed. There were all sorts of hidden dependencies that weren’t captured in their copious documentation and it was clear that some bugs were simply intractable.
Locking yourself in a closet …
And doing textbook or academic software design really doesn’t work.
This is software ENGINEERING. Tradeoffs happen all the time. There is no oracle that tells when one is better than another. Worse, they were designing interfaces for which they had no clients. We’ll come back to this in a bit.
I flew to Boca so that we could debug side-by-side.
The IBM developers and I were tired of not being able to share the deeper designs and the actual code. When I got there, the IBM developers took me into their office, closed the door so that the managers and the IBM lawyers couldn’t see us and we looked at both sets of source code. Within one hour, we saw all the problems of the current design and managed within three days to resolve all outstanding bugs. Within two weeks we passed their system test QA…
and shipped the product.
At this point, I told Steve Ballmer that I was tired of working with IBM. They were a demanding customer who always wanted to do less than what I thought was possible or necessary.
He said to go to the DOS roadmap and start working on the really big important items. He’d worry about dealing with IBM.
Little did I know that he and Bill had already been having talks with them about doing joint development on
Something called “New DOS”
so here I was, again working with IBM.
Between the two companies, we assembled an architecture team to answer a number of questions:
This was an easy one to answer. We had a small list
of requirements. And we knew who our competitors were
The Macintosh and Windows 1.0.
Our second question
Was a lot more involved.
This was a year before the 386 appeared, so all the currently shipping machines were shipping with segmented 16-bit 286s. You have no idea how painful it was to write code for that chip. [Contortionist]
Microsoft knew that whatever was built would be finished after the 386 had shipped, so why not design for that 32-bit linear chip?
We had a meeting
with IBM where we pitched to their management doing this development on the new architecture. They replied that they couldn’t because they had promised 10,000 286-based machines to a particular customer and they needed to deliver just those machines. Remember, IBM was all about customers and legacy hardware.
Bill Gates, I love the guy, looked at the IBM Vice President and said: “We’re limiting ourselves for THAT? It’s what, $20 million? Here, I’ll write you a check.”
Jaws dropped, but no deal. We had to write in 286 assembler because of a
“business decision” that was actually a “revenue decision”.
It was a long-term tax on the development.
There was a further problem here,
however. None of our requirements addressed something that was extremely critical, namely, preserving our customer’s assets.
We were going to come up with a new operating system. We were designing APIs for this operating system that would expose the powerful kernel and window managers.
What’s the issue?
We had no requirements to be compatible with any existing programming model
We didn’t even have requirements to be even close to an existing programming model
So, we were coming up with a new operating system that was incompatible at the binary layer and incompatible at the wetware layer.
This means that no existing program would run and that any existing knowledge about how to create effective and efficient programs was of limited use.
To quote Nathan Myhrvold, “OS/2 was just like Windows…
except in every detail”.
We couldn’t even make a
decent emulation layer because the semantics were just that different.
So, this operating system would come out with no applications and a developer community without tools that would need to be educated.
This is not a viable long term strategy.
In OS/2’s case, we broke customers by forcing them to purchase new (and expensive) applications
Worse, we broke the programmer’s models meaning old applications would need to be rewritten at a deep level.
But a bigger view of this is:
That you need to investigate all the limitations that can be applied to your product. You don’t have to live entirely within them, but you need to know which ones you are violating and make a business decision on whether you should.
Now, with flawed assumptions firmly in our hands and tied by business and revenue constraints…
IBM wanted to know the process for how we were going to do the architecture.
They had lots of experience doing architecture so they had a “process”.
Then once we had the architecture, they wanted a process for the design. They had one of those too.
Lastly, they wanted a process for the development. Which, of course, they had.
Once we got into the development, however, IBM discovered they sometimes needed to revisit designs
and they needed a process for that. But they didn’t have one and had to invent one on the fly.
We ended up with a ten-page form and flowchart for every design change.
Different teams, different projects have different demands
Process to handle 90% of all decisions
Save the brain for important stuff
It’s a guideline, not a fixed rule
Processes change too (this is the great learning of Agile)
Despite process being a guideline, IBM had a “Process”
[enforcer] team that enforced each step and prevented commits that did not have the correct paperwork.
The whole purpose was to make sure that less-talented developers didn’t put crap into the product.
The result was a process that slowed down 95% of the developers just so the 5% wouldn’t do bad things.
My opinion was that they should fire those 5% and get on with development.
We had a similar experience with Vista where MSFT hired program managers to “improve the process” by adding more “gates”. Feature branch merges now took a week only because we didn’t want the less-capable 5% to do something bad. I wanted to fire any program manager who added anything.
It’s much
better to not let them code at all. Or hire them, for that matter.
Run lean if you have to
Getting rid of bad code is costly
Getting rid of bad programmers is costly
Use contractors if you have to
And, eventually, we
Shipped OS/2
What we produced was technically sophisticated and of high quality. Its window-manager paradigm was much more rational than today’s window managers, and it had some real file system innovations. That’s the good news.
It also had no applications that ran out of the box,
no applications that could be easily ported
and a big learning curve for the developer
So, it ended up a market failure.
The relationship between Microsoft and IBM broke down and Microsoft’s involvement in OS/2 ceased in 1991.
I took some time off from mainline development and worked
with some other architects at MSFT to figure out where OS development needed to go to support Bill Gates’ “Information at your fingertips” but augmented with Jim Allchin’s familiarity with corporate needs.
This was to become Windows Cairo.
For me, this was the greatest technical and management challenge I had ever had and I felt incredibly good about what we had produced… but there were lessons to be learned.
Cairo was intended to be a set of object-oriented extensions to Windows 95 and Windows NT.
It had a number of big pieces: a flexible and customizable shell, a directory service, Kerberos distributed security, a distributed namespace file system, and an object-based file store.
That’s what the intention was. At some point, the program managers and vice presidents started talking about Cairo, which was a code name, as a product. This was a big mistake.
A product needs a theme, a central focus.
Everything you do for the product needs to be guided by that theme.
It needs to have concrete usage scenarios that map to the experiences you want for the customer.
The product will solve a user’s problem. A feature is one part of a solution. A bag of features without comprehensive scenarios or a roadmap is simply incoherent.
This makes it hard to sell to customers. And to manage.
With marketing and
senior management full of self-delusion, we launched into the development.
With the emphasis on objects, we decided to implement all of these extensions in C++.
As Nathan Myhrvold said:
“OO programming just lets you get further into the woods before getting stuck, and you WILL get stuck.”
Once we got going, however, a number of C++ features made life more difficult rather than easier. Things that seemed like clever tricks for a programmer, like operator and function overloading, just made it more difficult to maintain when that programmer left. Worse, when you are writing system software, every cycle and every byte counts so we were concerned about code quality.
Here’s a question: what is the most expensive syntactic construct in C++? Does anyone know?
The hidden code that gets generated when all the destructors are called can be pretty significant. We ended up disabling many features of C++ for our development work (we owned the compiler so we could do it) because
All the hidden code was just too big and slow.
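A minimal sketch of the kind of hidden work described above, not code from the Cairo project. The `Tracked` type and the function name are hypothetical, introduced only to count destructor calls. The point it illustrates: a scope with no explicit cleanup still runs one destructor for every local object and for every element owned by a container when the closing brace is reached.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for illustration: counts how often its destructor runs.
struct Tracked {
    static int destroyed;               // total destructor calls so far
    std::string name;
    explicit Tracked(std::string n) : name(std::move(n)) {}
    ~Tracked() { ++destroyed; }
};
int Tracked::destroyed = 0;

// Returns how many Tracked destructors the inner scope triggered.
int destructors_run_by_scope_exit() {
    const int before = Tracked::destroyed;
    {
        Tracked a("a");
        Tracked b("b");
        std::vector<Tracked> items;
        items.reserve(3);               // no reallocation, so no extra copies
        items.emplace_back("x");
        items.emplace_back("y");
        items.emplace_back("z");
        // No cleanup is written here by the programmer.
    }   // The compiler emits the calls here: 3 vector elements, then b, then a.
    return Tracked::destroyed - before;
}
```

Calling `destructors_run_by_scope_exit()` reports five destructor calls for a block whose source contains none. Each of those also destroys a `std::string` member, which is further code the source never shows.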
Computers are supposed to do things quickly and if you start noticing lags, that will impact your customers’ perception of the product.
And customer perception is critically important.
This is completely true for interpreted as well as compiled languages.
So we began development in earnest.
We had a plan for maximal efficiency and maximal flexibility for the file system, Kerberos, DFS, and directory service.
Each was extensible and had well-designed APIs, but there was one problem that was puzzlingly difficult:
Finding the Kerberos ticketing service required talking to the directory service
Talking to the directory service required Kerberos for authentication and authorization
Starting the DFS service required loading configuration information from the directory service in a secure manner
The directory service stored its object database in the object file store, making heavy use of its query capabilities
Kerberos stored its data in the object file store but using DFS names
Opening files or getting access to objects in the object file store required mapping account names to security IDs using the directory service
Booting the entire system across multiple nodes was painful.
Debugging was even more so.
And all of this added greatly to the system complexity and to our development problems.
The worst part was that each component was being developed not only by a different group, but one that was quite separated in the chain-of-management.
There were many turf battles that needed to be fought.
You need to control your own destiny. Find dependencies and eliminate them.
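One way to "find dependencies" is to write them down as a graph and check it for cycles. The sketch below is an illustration, not a tool the Cairo team is known to have used. The component names are shorthand, and the edges are taken from the dependencies listed in the talk ("X needs Y"). It uses a standard depth-first search that reports a cycle when it meets a node that is still being explored.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each component to the components it needs in order to start.
using Graph = std::map<std::string, std::vector<std::string>>;

// Depth-first search: true if a cycle is reachable from `node`.
static bool has_cycle_from(const Graph& g, const std::string& node,
                           std::set<std::string>& in_progress,
                           std::set<std::string>& done) {
    if (done.count(node)) return false;                 // fully explored earlier
    if (!in_progress.insert(node).second) return true;  // revisited: cycle
    auto it = g.find(node);
    if (it != g.end()) {
        for (const auto& dep : it->second) {
            if (has_cycle_from(g, dep, in_progress, done)) return true;
        }
    }
    in_progress.erase(node);
    done.insert(node);
    return false;
}

// True if any component (transitively) depends on itself.
bool has_cycle(const Graph& g) {
    std::set<std::string> in_progress, done;
    for (const auto& entry : g) {
        if (has_cycle_from(g, entry.first, in_progress, done)) return true;
    }
    return false;
}

// Shorthand names (assumed here) for the services described in the talk.
Graph cairo_dependencies() {
    return {
        {"kerberos",  {"directory", "ofs", "dfs"}},
        {"directory", {"kerberos", "ofs"}},
        {"dfs",       {"directory"}},
        {"ofs",       {"directory"}},
    };
}
```

`has_cycle(cairo_dependencies())` returns true: Kerberos needs the directory service and the directory service needs Kerberos, so no boot order satisfies both. A strictly layered graph, where each component only needs things below it, returns false.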
The last two lessons from Cairo
Have to do with the Object File Store.
Now, most of the OFS development had preceded the Cairo Shell and the Cairo Directory Service and there was a “planned” Mail to take advantage of OFS’s efficient storage and query.
In fact we had designed and spec’d the API for these new features with the various teams and were far down the implementation path by the time they had started their coding.
One point of irony here, most of the OFS team had also worked on OS/2. And we made the same mistakes.
We designed an API backed by a pile of code without the direct involvement of a customer.
These “customers” really weren’t because they weren’t writing code yet!
When the shell and the directory service started using our API, they
realized it was pretty much 180 degrees from what they needed. But OFS was too far down the development path to make changes without a large schedule impact.
This led to the shell abandoning
the new API and using the bare Win32 API instead.
At the same time, the directory service decided
to use a prebuilt database internal to Microsoft (it was Exchange’s DS) and stop using the object file store.
The mail team decided
to use their own store for tactical reasons.
So OFS was down to zero new clients because its programming model didn’t meet their needs.
There was no reason to do OFS any more
and I killed it.
But I learned AGAIN
This applies to UI, to API, to languages.
If you are not actively working with a client who is actively working on the latest code, you will have a zero percent chance of getting it right.
Now, most of these technologies and work for Cairo were not discarded.
The extensible shell
Was released in Windows 95
The Directory Service
hosted on their private database became the scalable and powerful Active Directory
The distributed file system
Was named many things but was released in Windows 2000 and has been supported since as Windows DFS
And the essential part of OFS
the query and content indexing part of the object file store was released as MSSearch in Windows XP.
This reinforces my statement
before. Cairo was a technology source and not a product.
So much for Microsoft Archaeology.
You’ve heard some stories and seen some lessons from the glory days, but
What is your job really about?
Success. But fundamentally all the success you hear about is rooted
In a simple notion. Yeah, it’s obvious, but everything you do and your team does every day must be in service of this. When you have a decision to make, ask: “Does it lead me to success?”
All of the lessons I’ve just talked about are directly tied into this view.
From an architect, a development manager, and a coder, this is all that counts.
Thank you.