Monday, August 20, 2012

Flipping my Classroom

I'll be experimenting with a "Flipped Classroom" model in my graduate computer architecture course this Fall.

I'll be recording key lectures beforehand and uploading them to YouTube.  I'll ask students to watch these lectures before class and use class time to solve problems and clarify doubts.

I buy the argument that a video (which can be paused, rewound, and revisited) is a more effective learning tool than listening to the same person speak in class.  I hope that the classroom experience will continue to be fun and interactive.  I also hope that students have the discipline to watch the videos before class.  If done right, this should lead to more effective and efficient learning -- more time spent outside class watching lectures, but a lot less time spent outside class on assignments.

The videos will be in a screencast format: they show only the prepared slides and my handwriting on them as I speak (no hands or face).  It won't be glitzy; I will fumble around at times.  If this model works for Khan Academy, it can't be a terrible idea :-).

If you're curious about the process: I'm using PowerPoint on my tablet PC, combined with the Camtasia Studio add-in.  So far, I've been willing to "accept" my videos after 4-6 takes.  I do very minor editing (cropping out the occasional cough).  Including prep time, I'm averaging about 3-4 hours of effort per video (note that I've taught this material many times).  In addition, I'll have to create new problems that I can use for discussion in class.

The videos can be found here (only the first six are ready as of today). 
The class website is here.

Tuesday, June 19, 2012

Memory Scheduling Championship Wrap-Up

The MSC Workshop was last week at ISCA, featuring a strong set of presentations.  The MSC webpage has all the final results and papers.  Congrats to the winners of each track!  Thanks to Zeshan Chishti and Intel for sponsoring the trophies and award certificates (pictured below with the winners of each track).  Numerically speaking, the workshop had about 30 attendees, we received 11 submissions, and the USIMM code has received 1254 downloads to date (261 downloads for version 1.3).

Ikeda et al., Winners of the Performance Track.
Fang et al., Winners of the Energy-Delay Track.
Ishii et al., Winners of the Performance-Fairness Track.

Some important take-home messages from the competition:
  1. It is nearly impossible to win a scheduler competition with a scheduler that implements a single new idea.  A good scheduler combines multiple optimizations.  All of the following must be carefully handled: read/write interference, row buffer locality, early precharge, early activates, refresh, power-up/down, fairness, etc.  The talk by Ishii et al. did an especially good job of combining multiple techniques and breaking down the contribution of each one.  During the workshop break, some in the audience suggested building a Franken-scheduler (along the lines of Gabe Loh's Frankenpredictor) that combines the best of all submitted schedulers.  I think that idea has a lot of merit.
  2. If I had to pick a single scheduling artifact that seemed to play a strong role in many submitted schedulers, it would be the smart handling of read/write interference.  Because baseline memory write handling increases execution time by about 36%, it offers the biggest room for improvement.  A minimal sketch of a scheduler that combines write-drain handling with row-hit priority appears after this list.
  3. Our initial experiments seemed to reveal that all three metrics (performance, energy-delay, and performance-fairness) were strongly correlated, i.e., we expected that a single scheduler would win on all three metrics.  It was a surprise that we had different winners in each track.  Apparently, each winning scheduler had a wrinkle that gave it the edge for one metric.
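To make points 1 and 2 concrete, here is a minimal sketch of a policy that combines just two of the optimizations above: draining writes in bursts to limit read/write interference, and prioritizing row-buffer hits among the remaining requests.  The types and fields are illustrative stand-ins of my own (the real USIMM interface in scheduler.c/scheduler.h defines its own queues and command-issue routines), so treat this as a sketch of the idea rather than drop-in contest code.

```c
/* Hedged sketch of a combined scheduling policy -- NOT the USIMM API.
 * All types and fields below are illustrative stand-ins, defined here so
 * the sketch is self-contained; the real infrastructure exposes its own
 * request queues and command-issue routines in scheduler.c/scheduler.h.  */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool      is_write;
    bool      row_buffer_hit;  /* request targets the currently open row     */
    bool      can_issue;       /* all DRAM timing constraints met this cycle */
    long long arrival_time;    /* cycle when the request entered the queue   */
} request_t;

#define HI_WM 40               /* start draining writes at this queue depth  */
#define LO_WM 20               /* stop draining once the queue falls to this */

static bool draining_writes = false;

/* Pick one request to issue this cycle; returns NULL if none can issue.    */
request_t *schedule(request_t **queue, size_t n, size_t write_queue_depth)
{
    /* 1. Hysteresis on write drains: service writes in long bursts so
     *    that bus turnarounds (read/write interference) are amortized.     */
    if (write_queue_depth >= HI_WM) draining_writes = true;
    if (write_queue_depth <= LO_WM) draining_writes = false;

    request_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        request_t *r = queue[i];
        if (!r->can_issue || r->is_write != draining_writes)
            continue;
        /* 2. Among eligible requests, prefer row-buffer hits; break ties
         *    by age (oldest first) to bound starvation.                    */
        if (best == NULL ||
            (r->row_buffer_hit && !best->row_buffer_hit) ||
            (r->row_buffer_hit == best->row_buffer_hit &&
             r->arrival_time < best->arrival_time))
            best = r;
    }
    return best;
}
```

A real submission would layer several more of the optimizations listed above (early precharge/activate, refresh and power-down management, fairness) on top of this skeleton.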
The USIMM simulation infrastructure seemed to work well for the competition, but there are a few things we'd like to improve upon in the coming months:
  1. The current processor model is simple; it does not model instruction dependences and assumes that all memory operations within a reorder buffer window can be issued as soon as they are fetched.  Adding support to model dependences would make the traces larger and slow down the simulator (hence, it wasn't done for the competition).
  2. We have already integrated USIMM with our version of Simics.  This automatically takes care of the dependency problem in the previous bullet, but the simulations are much slower.  In general, reviewers at top conferences will prefer the rigor of execution-driven simulation over the speed of trace-driven simulation.  It would be worthwhile to understand how conclusions differ with the two simulation styles.
  3. The DRAMSim2 tool has an excellent validation methodology.  We'd like to possibly re-create something similar for USIMM.
  4. We'd like to augment the infrastructure to support prefetched data streams.
  5. Any new scheduler would have to compare itself against other state-of-the-art schedulers.  The USIMM infrastructure already includes a simple FCFS and an opportunistic close page policy, among others.  All the code submitted to the competition is on-line as well.  It would be good to also release a version of the TCM algorithm (MICRO'10) in the coming months.
If you have an idea for a future research competition, please email the JWAC organizers, Alaa Alameldeen (Intel) and Eric Rotenberg (NCSU).

Friday, June 15, 2012

Introducing simtrax

The Hardware Ray Tracing group has just released our architectural simulator, simtrax. The simulator, compilation tools, and a number of example codes can all be checked out from the Google Code repository. We have a couple of tutorial pages on the simtrax wiki that will hopefully get you started.

Now you may be wondering what kind of architecture simtrax simulates. The answer is a number of different configurations of an architecture we call TRaX (Threaded Ray eXecution), which is designed specifically for throughput processing of ray tracing workloads. As such, the simulator is not well suited to a large number of application spaces, but it shows quite good results for ray tracing and should perform well on other similar applications.

In particular, the limitations built into the TRaX architecture, and assumed by the simulator, include the following:

  • No cache coherence – at present the memory is kept coherent, but no simulation penalty is added to maintain coherence. We depend on code being written so that it does not rely on coherence.
  • No write caching – writes are assumed to write around the cache, and the lines are invalidated if they were cached, but again, no coherence messages are sent to the other caches in the system.
  • Small stack sizes – each thread has a small local store for the program call stack. Recursion and deep call stacks can cause overflow or excessive thread-context size (see the traversal sketch after this list).
  • A single global scene data structure – a thread should load only the data it needs, when it needs it.
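The small-stack limitation in particular shapes how code is written for TRaX.  A common workaround is to traverse the acceleration structure iteratively with a small, fixed-size node stack instead of recursing.  The sketch below is generic C with illustrative types and assumed helper routines (intersect_box, intersect_prim); it is not the simtrax API.

```c
/* Hedged sketch: iterative BVH traversal with a small fixed-size stack,
 * the usual way to avoid recursion on a core with a tiny per-thread
 * call-stack store.  Types and helpers are illustrative, not TRaX's.    */
#define STACK_SIZE 32          /* shallow, statically sized traversal stack */

typedef struct { float org[3], dir[3]; }                      ray_t;
typedef struct { int left, right; int is_leaf; int prim_id; } bvh_node_t;

/* Assumed helpers (not shown): box and primitive intersection tests.    */
int intersect_box (const bvh_node_t *node, const ray_t *ray);
int intersect_prim(int prim_id,            const ray_t *ray, float *t_hit);

int trace(const bvh_node_t *nodes, const ray_t *ray, float *t_hit)
{
    int stack[STACK_SIZE];
    int sp  = 0;
    int hit = 0;
    stack[sp++] = 0;                      /* push the root node           */

    while (sp > 0) {
        const bvh_node_t *n = &nodes[stack[--sp]];
        if (!intersect_box(n, ray))
            continue;                     /* prune this subtree           */
        if (n->is_leaf) {
            hit |= intersect_prim(n->prim_id, ray, t_hit);
        } else if (sp + 2 <= STACK_SIZE) {
            stack[sp++] = n->left;        /* push children instead of     */
            stack[sp++] = n->right;       /* making recursive calls       */
        }
    }
    return hit;
}
```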
The architecture we simulate is hierarchical in nature. At the lowest level are a number of independent, in-order, scalar thread processors. These are grouped into sets (usually of 32) that compose what we call a Threaded Multiprocessor (TM). While each thread processor has a few independent execution units, some of the larger, less frequently used units are shared among the threads in the TM. In addition, the threads in a TM share a single L1 data cache and a number of instruction caches, both of which have multiple independent banks, allowing distinct words to be fetched independently.

A number of TMs are grouped together to share access to an L2 data cache, which then manages the connection to the off-chip global memory. In a large chip there may be as many as 4 separate L2 data caches, while a smaller chip might do away with the L2 data cache altogether. The final piece of the puzzle is a small global register file with support for an atomic increment operation, which is used to give out thread work assignments.
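To illustrate how that atomic increment typically gets used, here is a hedged sketch of a thread's top-level loop pulling pixel indices from the shared counter until the frame is exhausted.  The function names (atomic_inc, image_width, image_height, shade_pixel) are placeholders of my own, not necessarily the actual simtrax/TRaX intrinsics.

```c
/* Hedged sketch of TRaX-style work distribution: each thread repeatedly
 * claims the next pixel index from a global counter via atomic increment.
 * All four routines below are illustrative placeholders.                 */
int  atomic_inc(int counter_id);      /* returns the pre-increment value  */
int  image_width(void);
int  image_height(void);
void shade_pixel(int x, int y);       /* trace a ray, write the result    */

void thread_main(void)
{
    const int w          = image_width();
    const int h          = image_height();
    const int num_pixels = w * h;

    for (;;) {
        /* Counter 0 is assumed to be the shared work-assignment register. */
        int pixel = atomic_inc(0);
        if (pixel >= num_pixels)
            break;                    /* frame done; no more work to claim */
        shade_pixel(pixel % w, pixel / w);
    }
}
```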

This simulator has been in development for a number of years and has had many contributors along the way. In addition, we have taught a course using the simulation and compiler tools, and have published four conference and journal articles, with more in progress.

We hope this simulation tool will be useful, and we intend to do what we can to support it. Feel free to email us and to post bug reports on the Google Code page.

Wednesday, February 22, 2012

USIMM

We've just released the simulation infrastructure for the Memory Scheduling Championship (MSC), to be held at ISCA this year.

The central piece in the infrastructure is USIMM, the Utah SImulated Memory Module.  It reads in application traces, models the progression of application instructions through the reorder buffers of a multi-core processor, and manages the memory controller read/write queues.  Every memory cycle, USIMM checks various DRAM timing parameters to figure out the set of memory commands that can be issued in that cycle.  It then hands control to a scheduler function that picks a command from this candidate set.  An MSC contestant only has to modify the scheduler function, i.e., restrict all changes to scheduler.c and scheduler.h.  This clean interface makes it very easy to produce basic schedulers.  Each of my students produced a simple scheduler in a matter of hours; these have been included in the code distribution as examples to help get one started.
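To give a feel for how little code a basic scheduler needs, here is a hedged FCFS-style sketch.  The struct and field names are illustrative stand-ins of my own, not the actual scheduler.c interface (which defines its own queues and command-issue routines).

```c
/* Hedged FCFS sketch: issue the oldest request whose DRAM timing
 * constraints are satisfied this cycle.  Illustrative types only; the
 * real USIMM interface lives in scheduler.c / scheduler.h.              */
#include <stddef.h>

typedef struct {
    int       can_issue;      /* 1 if all timing parameters are met now  */
    long long arrival_time;   /* cycle at which the request was enqueued */
} request_t;

/* Return the index of the request to issue, or -1 if none can issue.   */
int schedule_fcfs(const request_t *queue, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (queue[i].can_issue &&
            (best < 0 || queue[i].arrival_time < queue[best].arrival_time))
            best = (int)i;
    }
    return best;
}
```

Most of a contest entry's effort goes into the policy itself, not this plumbing.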

In the coming weeks, we'll release a number of traces that will be used for the competition.  The initial distribution includes five short single-thread traces from PARSEC that people can use for initial testing.

The competition will be judged in three different tracks: performance, energy-delay-product (EDP), and performance-fairness-product (PFP).  The final results will be based on the most current version of USIMM, as of June 1st 2012.

We request that contestants focus on scheduling algorithms that are easily implementable, i.e., doable within a few processor cycles and within a 68 KB storage budget.  A program committee will evaluate the implementability of your algorithm, among other things.

We'll post updates and bug fixes in the comments section of this blog post as well as to the usimm-users@cs.utah.edu mailing list (sign up here).  Users are welcome to use the blog or mailing list to post their own suggestions, questions, or bug reports.  Email usimm@cs.utah.edu if you have a question for just the code developers.

Code Updated on 04/17/2012:

Code download: http://www.cs.utah.edu/~rajeev/usimm-v1.3.tar.gz

Changes in Version 1.1: http://www.cs.utah.edu/~rajeev/pubs/usimm-appA.pdf
Changes in Version 1.2: http://www.cs.utah.edu/~rajeev/pubs/usimm-appB.pdf
Changes in Version 1.3: http://www.cs.utah.edu/~rajeev/pubs/usimm-appC.pdf

USIMM Tech Report: http://www.cs.utah.edu/~rajeev/pubs/usimm.pdf

The contest website: http://www.cs.utah.edu/~rajeev/jwac12/

Users mailing list sign-up:  http://mailman.cs.utah.edu/mailman/listinfo/usimm-users

Wednesday, February 8, 2012

Trip Report -- NSF Workshop -- WETI

I was at an NSF-sponsored Workshop on Emerging Technologies for Interconnects (WETI) last week that was attempting to frame important interconnect research directions.  I encourage everyone to check out the talk slides; talk videos will also soon be posted.  In the coming months, a detailed report will be written to capture the discussion.  This post summarizes some personal take-home messages.

1. Applications of Photonics: An important conclusion in my view was that photonics offers little latency, energy, and bandwidth advantage for on-chip communication.  Its primary advantage is for off-chip communication.  It is also worthwhile to look at limited long-distance on-chip communication with photonics.  For example, if a photonic signal has entered a chip, you might as well take the signal to a point near the destination, thus reducing the cost of global wire traversal.  Nearly half the workshop focused on photonics; many of the challenges appeared to be at the device level.

2. Processing in Memory: Our group has some initial work on processing-in-memory (PIM) with 3D chip stacks.  It was re-assuring to see that many people believe in PIM.  Because it reduces communication distance, it is viewed as a vital ingredient in the march towards energy-efficient exascale computing.  However, to distinguish such ideas from those in the 1990s, it is best to market them as "processing near memory". :-)

3. Micron HMC: The talk by Gurtej Sandhu of Micron had some great details on the Hybrid Memory Cube (HMC).  An HMC-based system sees significant energy contributions from the DRAM arrays, the logic layer on the 3D stack, and the host interface (the memory controller on the processor).  SerDes circuits account for 66% of the power in the logic layer.

4. Electrical Interconnect Scaling: Shekhar Borkar's talk was interesting as always.  He reiterated that mesh NoCs are overkill and hierarchical buses are the way forward.  The wire energy for a 16 mm traversal matches the energy cost per bit for a router; frequent routers therefore get in the way of energy efficiency.  He pointed out that the NoC in the Intel 80-core Polaris contributed 28% of chip power because the computational units were so simple.  The NoC in Intel's SCC chip consumes more power than the NoC in Polaris, but its overall contribution is lower (10%) because the cores are beefier and more realistic.  In moving from 45 nm to 7 nm, compute energy will shrink by 6x; by comparison, the electrical interconnect energy to travel a fixed on-chip length shrinks by only 1.6x, and the energy for off-chip interconnect shrinks by less than 2x.  So the communication energy bottleneck will grow unless we can reduce communication and communication distances.
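A quick back-of-the-envelope calculation shows why those scaling factors matter.  Suppose communication is 20% of chip energy at 45 nm (an assumed, illustrative split, not a number from the talk); applying the 6x and 1.6x factors above, its share grows to roughly half at 7 nm:

```c
/* Back-of-the-envelope sketch of the scaling argument above.  The 20%
 * starting share is an assumed illustrative figure, not from the talk.  */
#include <stdio.h>

int main(void)
{
    double compute_share = 0.80, comm_share = 0.20;  /* assumed 45 nm split */
    double compute_7nm = compute_share / 6.0;        /* compute shrinks ~6x */
    double comm_7nm    = comm_share    / 1.6;        /* on-chip wires ~1.6x */
    printf("communication share at 7 nm: %.0f%%\n",
           100.0 * comm_7nm / (compute_7nm + comm_7nm));   /* prints ~48%  */
    return 0;
}
```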

5. Miscellaneous: There was a buzz about near-threshold computing (NTC).  It appears to be one of the few big arrows left in the quiver for processor energy efficiency.  It was also one of many techniques that Patrick Chiang mentioned for energy-efficient communication.  He also talked about low-swing signaling, transmission lines, and wireless interconnects.  Pradip Bose's talk had lots of interesting power breakdowns, including trends for the IBM Power series.

Sunday, January 15, 2012

Waking Up to Bottleneck Realities!

NSF is organizing a workshop on Cross-Layer Power Optimization and Management (CPOM) next month.  The workshop will identify important future research areas to help guide funding agencies and the research community at large.  Below is my position statement for the workshop.  While the DRAM main memory system is a primary bottleneck in most platforms, the computer architecture community has been slow to react with innovations that look beyond the processor chip.


SPOT QUIZ:     What is an SMB?
Hint:  It consumes 14 W of power and there could be 32 of these in an 8-socket system.

FACTS:
Bottlenecks:
  • The memory system accounts for 20-40% of total system power.  Significant power is dissipated in DRAM chips, on-board buffer chips, and the memory controller.
  • Single DRAM chip power (Micron power calculator): 0.5 W.  On-board buffer chip power (Intel SMB datasheet): 14 W.  Memory controller power (Intel SCC prototype): 19-69% of chip power.
  • Future memory systems: 3D stacks, more on-board buffering, higher channel frequencies, higher refresh overheads.
  • And ... we have an off-chip memory bandwidth problem!  Pin counts have stagnated.
You Cannot Be Serious!!
(Exaggerated) State of Current Research: 
  • Number of papers on volatile memory systems:  ~1 per conference.
  • Number of papers on the processor-memory interconnect:  ~1 per year.
  • Number of papers that have to define the terms "rank" and "bank":  all of them.
  • Year of first processor paper to leverage DVFS: 2000.  Year of first memory paper to leverage DVFS: 2011.
  • Percentage of readers that have to look up the term "SMB":  > 89%.  (ok, I made up that fact :-) ... but I bet I'm right)

For every 1,000 papers written on the processor, 20 papers are written on the memory system, and 1 paper is written on the processor-memory interconnect.  This is absurd given that the processor and memory are the two fundamental elements of any computer system and memory energy can exceed processor energy.  While the routers in an NoC have been heavily optimized, the community understands very little about the off-chip memory channel.  The memory system is a very obvious fertile area for future research.

QUIZ 1:  
Most ISCA attendees know what a virtual channel is, but most would be hard-pressed to answer 2 of the following 5 basic memory channel questions:
  1. What is FB-DIMM?
  2. What is an SMI?
  3. Why are buffer chips placed between the memory controller and DRAM chips?
  4. What is SERDES and why is it important?
  5. Why do the downstream and upstream SMI channels have asymmetric widths?
QUIZ 2: 
Many ISCA attendees know the difference between PAp and GAg branch predictor configurations, but most will struggle to answer the following basic memory system questions:
  1. How many DRAM sub-arrays are activated to service one cache line request?
  2. What circuit implements the DRAM row buffer?
  3. Where is a row buffer placed?
  4. Why do DRAM chips not implement row buffer caches?
  5. What is overfetch?
  6. What is tFAW?
  7. Describe a basic algorithm to implement chipkill.  (What is chipkill?)
  8. What is scrubbing?

In early 2009 (before my foray into memory systems), I would have scored zero on both quizzes.  Such a level of ignorance is perhaps OK for esoteric topics... but unforgivable for a component that accounts for 25% of server power.

ACTION ITEMS FOR THE RESEARCH COMMUNITY:
  • Identify a priority list of bottlenecks.  Step outside our comfort zone to learn about new system components.  Increase memory system coverage in computer architecture classes.
  • Find ways to address obvious sources of energy inefficiencies in the memory system:  reduce overfetch, improve row buffer hit rates, reduce refresh power.
  • Find ways to leverage 3D stacking of memory and logic.  Exploit 3D to take our first steps in the area of DRAM chip modifications (an area that has traditionally been off-limits).
  • Understand the considerations in designing memory channels and on-board buffer chips.  Propose new channel architectures and new microarchitectures for buffer chips.
  • Understand memory controller microarchitectures and design complexity-effective memory controllers.
  • Design new architectures that integrate photonics and NVMs in the memory system.

Friday, November 25, 2011

ISCA Deadline

I'm guessing everyone's digging themselves out after the ISCA deadline.  I had promised not to talk about energy drinks and deadlines, so let's just say that our deadline week looked a bit like this :-)... without the funky math of course.

I'm constantly hounding my students to get their writing done early.  Good writing doesn't magically show up 24 hours before the deadline.  While a Results section will undoubtedly see revisions in the last few days, there's no valid excuse for not finishing most other sections well before the deadline.

The graph below shows two examples of how our writing evolved this year... better than in previous years, but still not quite early enough!  We also labored more than usual to meet the 22-page budget... I guess we got a little spoilt after the 26-page HPCA'12 format.  I am amazed that producing a refined draft one week before the deadline is an impossible task, but producing a refined 22-page document 3 minutes before the deadline is a virtual certainty.  I suppose I can blame Parkinson's Law, not my angelic students :-), for accelerating my aging process...


Tuesday, November 8, 2011

Memory Has an Eye on Disk's Space

We recently read a couple of SOSP papers in our reading group: RAMCloud and FAWN.  These are terrific papers with significant implications for architects and application developers.  Both papers target the design of energy-efficient and low-latency datacenter platforms for a new breed of data-intensive workloads.  FAWN uses many wimpy nodes and a Flash storage system; RAMCloud replaces disk with DRAM.  While the two papers share many arguments, I'll focus the rest of this post on RAMCloud because its conclusion is more surprising.

In RAMCloud, each individual server is effectively disk-less (disks are used only for backup, not to service application reads and writes).  All data is placed in DRAM main memory.  Each server is configured with high memory capacity, and every processor has access via the network to the memory space of all servers in the RAMCloud.  It is easy to see that such a system should offer high performance because high-latency disk accesses (a few milliseconds) are replaced by low-latency DRAM+network accesses (microseconds).

An immediate architectural concern that comes to mind is cost.  DRAM has a dollar/GB purchase price that is 50-100X higher than that of disk.  A server with 10 TB of disk space costs $2K, while a server with 64 GB of DRAM and no disk costs $3K (2009 data from the FAWN paper).  It's a little trickier to compare the power consumption of DRAM and disk.  An individual access to DRAM consumes much less energy than an individual access to disk, but DRAM has a higher static energy overhead (the cost of refresh).  If the access rate is high enough, DRAM is more energy-efficient than disk.  For the same server example as above, the server with the 10 TB high-capacity disk has a power rating of 250 W, whereas the server with the 64 GB high-capacity DRAM memory has a power rating of 280 W (2009 data from the FAWN paper).  This is not quite an apples-to-apples comparison because the DRAM-bound server services many more requests at 280 W than the disk-bound server does at 250 W.  But it is clear that in terms of operating (energy) cost per GB, DRAM is again much more expensive than disk.  Note that total cost of ownership (TCO) is the sum of capital expenditure (capex) and operational expenditure (opex).  The above data points make it appear that RAMCloud incurs a huge penalty in terms of TCO.

However, at least to my initial surprise, the opposite is true for a certain large class of workloads.  Assume that an application has a fixed high data bandwidth demand and this is the key determinant of overall performance.  Each disk offers very low bandwidth because of the low rotational speed of the spindle, especially for random access.  In order to meet the high bandwidth demands of the application, you would need several disks and several of the 250 W, 10 TB servers.  If data was instead placed in DRAM (as in RAMCloud), that same high rate of data demand can be fulfilled with just a few 280 W, 64 GB servers.  The difference in data bandwidth rates for DRAM and disk is over 600X.  So even though each DRAM server in the example above is more expensive in terms of capex and opex, you'll need 600 times fewer servers with RAMCloud.  This allows overall TCO for RAMCloud to be lower than that of a traditional disk-based platform.
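To make that arithmetic concrete, here is a hedged back-of-the-envelope sketch using the 2009 numbers above.  The three-year lifetime and $0.10/kWh electricity price are my own illustrative assumptions, and the sketch assumes bandwidth (not capacity) is the binding constraint, as argued above.

```c
/* Back-of-the-envelope TCO comparison for a bandwidth-bound workload.
 * Server prices and powers are the 2009 FAWN numbers quoted above; the
 * 3-year lifetime, $0.10/kWh rate, and 600x server ratio are assumptions
 * stated in the surrounding text.                                        */
#include <stdio.h>

int main(void)
{
    const double hours_3yr = 3 * 365 * 24;        /* ~26,280 hours         */
    const double price_kwh = 0.10;                /* assumed electricity   */

    /* Per-server TCO = capex + 3-year energy cost (power in kW * hours). */
    double disk_tco = 2000.0 + (250.0 / 1000.0) * hours_3yr * price_kwh;
    double dram_tco = 3000.0 + (280.0 / 1000.0) * hours_3yr * price_kwh;

    /* If the workload needs the bandwidth of 600 disk-based servers, one
     * RAMCloud server (with ~600x the bandwidth) can stand in for them.   */
    printf("600 disk servers: ~$%.0f\n", 600 * disk_tco);   /* ~$1.59M     */
    printf("1 DRAM server:    ~$%.0f\n",   1 * dram_tco);   /* ~$3.7K      */
    return 0;
}
```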

I really like Figure 2 in the RAMCloud CACM paper (derived from the FAWN paper and reproduced below).  It shows that in terms of TCO, for a given capacity requirement, DRAM is a compelling design point at high access rates.  In short, if data bandwidth is the bottleneck, it is cheaper to use technology (DRAM) that has high bandwidth, even if it incurs a much higher energy and purchase price per byte.

Source: RAMCloud CACM paper

If architectures or arguments like RAMCloud become popular in the coming years, it opens up a slew of interesting problems for architects:

1.  Already, the DRAM main memory system is a huge energy bottleneck.  RAMCloud amplifies the contribution of the memory system to overall datacenter energy, making memory energy efficiency a top priority.

2.  Queuing delays at the memory controller are a major concern.  With RAMCloud, a single memory controller will service many requests from many servers, increasing the importance of the memory scheduling algorithm.

3.  With each new DDR generation, fewer main memory DIMMs can be attached to a single high-speed electrical channel.  To support high memory capacity per server, innovative channel designs are required.

4.  If the Achilles' heel of disks is their low bandwidth, are there ways to design disk and server architectures that prioritize disk bandwidth/dollar over other metrics?

Monday, October 31, 2011

Dissing the Dissertation

In my view, a Ph.D. dissertation is an over-rated document.  I'll formalize my position here so I don't get dragged into endless debates during every thesis defense I attend  :-).

A classic Ph.D. generally has a hypothesis that is formulated at an early stage (say, year two) and the dissertation describes 3-4 years of research that tests that hypothesis.

A modern Ph.D. (at least in computer architecture) rarely follows this classic recipe.  There are many reasons to adapt your research focus every year:  (1) After each project (paper), you realize the capabilities and limitations of your ideas;  that wisdom will often steer you in a different direction for your next project.  (2) Each project fixes some bottleneck and opens up other new bottlenecks that deserve attention.  (3) Once an initial idea has targeted the low-hanging fruit, incremental extensions to that idea are often unattractive to top program committees.  (4) If your initial thesis topic is no longer "hot", you may have to change direction to be better prepared for the upcoming job hunt.  (5) New technologies emerge every few years that change the playing field.

I'll use my own Ph.D. as an example.  My first project was on an adaptive cache structure.  After that effort, I felt that long latencies could not be avoided; perhaps latency tolerance would have more impact than latency reduction.  That led to my second effort, designing a runahead thread that could jump ahead and correctly prefetch data.  The wisdom I gathered from that project was that it was essential to have many registers so the processor could look far enough into the future.  If you had enough registers, you wouldn't need fancy runahead; hence, my third project was a design of a large two-level register file.  During that project, I realized that "clustering" was an effective way to support large processor structures at high clock speeds.  For my fourth project, I designed mechanisms to dynamically allocate cluster resources to threads.  So I had papers on caching, runahead threads, register file design, and clustering.  It was obviously difficult to weave them together into a coherent dissertation.  In fact, each project used very different simulators and workloads.  But I picked up skills in a variety of topics.  I learnt how to pick problems and encountered a diverse set of literature, challenges, and reviews.  By switching topics, I don't think I compromised on "depth"; I was using insight from one project to influence my approach in the next.  I felt I graduated with a deep understanding of how to both reduce and tolerate long latencies in a processor.

My key point is this: it is better to be adaptive and focus on high-impact topics than to flog a dead horse for the sake of a coherent dissertation.  Research is unpredictable, and a four-year research agenda often cannot be captured by a single hypothesis formulated in year two.  The dissertation is therefore a contrived concept (or at least appears contrived today, given that the nature of research has evolved over time).  The dissertation is a poor proxy when evaluating a student's readiness to graduate.  A student's ability to tie papers into a neat package tells me nothing about his/her research skills.  If a Ph.D. is meant to be conferred on someone who has depth/wisdom in a topic and is capable of performing independent research, then a student's publication record is a better metric for that evaluation.

In every thesis proposal and defense I attend, there is a prolonged discussion on what constitutes a coherent thesis.  Students are steered in a direction that leads to a coherent thesis, not necessarily in a direction of high impact.  If one works in a field where citation counts for papers far out-number citation counts for dissertations, I see little value in producing a polished dissertation that will never be read.  Use that time to instead produce another high-quality paper!

There are exceptions, of course.  One in every fifty dissertations has "bible" value... it ends up being the authoritative document on some topic and helps brand the student as the leading expert in that area.  For example, see Ron Ho's dissertation on wires.  If your Ph.D. work naturally lends itself to bible creation and you expect to be an ACM Doctoral Dissertation Award nominee, by all means, spend a few months to distill your insight into a coherent dissertation that will have high impact.  Else, staple your papers into a dissertation, and don't be ashamed about it!

My short wish-list:  (1) A thesis committee should focus on whether a student has produced sufficient high-quality peer-reviewed work and not worry about dissertation coherence.  (2) A dissertation can be as simple as a collection of the candidate's papers along with an introductory chapter that conveys the conclusions and insights that guided the choice of problems.

Wednesday, October 5, 2011

The right yellow gradient

Vic Gundotra, SVP of Engineering at Google, shared this inspiring story on Google+:

One Sunday morning, January 6th, 2008 I was attending religious services when my cell phone vibrated. As discreetly as possible, I checked the phone and noticed that my phone said "Caller ID unknown". I choose to ignore.

After services, as I was walking to my car with my family, I checked my cell phone messages. The message left was from Steve Jobs. "Vic, can you call me at home? I have something urgent to discuss" it said.

Before I even reached my car, I called Steve Jobs back. I was responsible for all mobile applications at Google, and in that role, had regular dealings with Steve. It was one of the perks of the job.

"Hey Steve - this is Vic", I said. "I'm sorry I didn't answer your call earlier. I was in religious services, and the caller ID said unknown, so I didn't pick up".

Steve laughed. He said, "Vic, unless the Caller ID said 'GOD', you should never pick up during services".

I laughed nervously. After all, while it was customary for Steve to call during the week upset about something, it was unusual for him to call me on Sunday and ask me to call his home. I wondered what was so important?

"So Vic, we have an urgent issue, one that I need addressed right away. I've already assigned someone from my team to help you, and I hope you can fix this tomorrow" said Steve.

"I've been looking at the Google logo on the iPhone and I'm not happy with the icon. The second O in Google doesn't have the right yellow gradient. It's just wrong and I'm going to have Greg fix it tomorrow. Is that okay with you?"

Of course this was okay with me. A few minutes later on that Sunday I received an email from Steve with the subject "Icon Ambulance". The email directed me to work with Greg Christie to fix the icon.

Since I was 11 years old and fell in love with an Apple II, I have dozens of stories to tell about Apple products. They have been a part of my life for decades. Even when I worked for 15 years for Bill Gates at Microsoft, I had a huge admiration for Steve and what Apple had produced.

But in the end, when I think about leadership, passion and attention to detail, I think back to the call I received from Steve Jobs on a Sunday morning in January. It was a lesson I'll never forget. CEOs should care about details. Even shades of yellow. On a Sunday.


If there's one thing to learn from Steve Jobs' life, it is the importance of being a perfectionist. All the time. Right down to the perfect shade of yellow. Respect, Mr. Jobs.