The Mythical Man Month Summary – Chapter 2 Sample

Mythical Man Month Summary

People experienced with software development and the management of software projects can tell of projects gone wrong: missed deadlines, blown budget forecasts, poorly received deliverables, and so on.  This is not to say that every software project goes badly.  Good software projects do happen.  Your chances of success increase when you are aware of the following signals.

The following is an excerpt from The Mythical Man Month Summary.

Why do software projects go awry?

There are five common elements to the answer of this question:

    1. Techniques for estimating are poorly developed.
    2. Effort does not always equal progress.
    3. We do not defend our estimates stubbornly enough.
    4. The overall progress of the schedule is often poorly monitored.
    5. The traditional response of adding more manpower to late projects is flawed.


All programmers are optimists, and this optimism leads to thoughts such as “all will go well” and “each task will take only as long as it ought to.”  This is not only untrue, it is reckless.  The implementation phase of software development makes this plain: design ideas turn out to have flaws such as bugs and incomplete requirements.  The presence of bugs alone shows that such optimism is unjustified.  Bugs happen.

The Man-Month

Cost varies with the number of developers and the number of months; progress does not.  Men and months are interchangeable commodities only when a task can be partitioned among many workers who need no communication with one another.  Reaping wheat is a possible example of such a task.  Software development is not reaping wheat.

When a task cannot be partitioned because of sequential constraints, applying more effort has no effect on the schedule.  And when a task can be partitioned but the workers must coordinate, the extra communication effort, in the form of training and intercommunication among team members, eats into the gains.

Because this communication effort is great, adding more men lengthens, rather than shortens, the schedule.

Systems Test

Sequential constraints dominate the debugging and system-testing portions of the project.  These are often the most mis-scheduled portions of the overall deliverables.

Gutless Estimating

While the project sponsor may govern the urgency of a date, the sponsor cannot govern the actual completion.  Even so, false scheduling to match the sponsor’s desired date is more common in software than in any other engineering discipline.  Individual managers need to “stiffen their backbones” and defend their estimates.

Regenerative Schedule Disaster

What happens when a software project is behind schedule?  There are four possible approaches:

    1. Assume that only the first part of the task was estimated incorrectly and add enough manpower to make up for the misstep in that first portion.
    2. Assume the entire project estimate is off and add more manpower for all phases of the project.
    3. Carefully reschedule.
    4. Trim the tasks.

The first two approaches are disastrous.  For example, adding two new men, no matter how competent, requires a month of training by an experienced project member.  Thus three man-months (two for the new men, plus one for the experienced member) are invested without any progress on the project itself.

The bottom line is adding manpower to a late software project makes it later.  One cannot conform to a previously estimated project schedule by using more men and fewer months. 
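Brooks quantifies this intercommunication burden: if each part of the task must be separately coordinated with each other part, the number of communication paths grows as n(n-1)/2 with the number of workers n. A quick sketch:

```python
def communication_channels(n):
    """Number of pairwise communication paths among n workers: n(n-1)/2."""
    return n * (n - 1) // 2

# Adding people grows coordination overhead quadratically, not linearly.
for team_size in (3, 5, 10):
    print(team_size, communication_channels(team_size))
```

Three workers need 3 channels; ten need 45. This is why the added communication can swamp the added labor.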


Data Science From Scratch Summary

Machine Learning
This post is an excerpt from the book Data Science From Scratch Summary.

Machine Learning Chapter 11

Many people believe data science is machine learning and that data scientists mostly build, train, and tweak machine-learning models. In reality, data science is mostly about addressing business problems by collecting, understanding, cleaning, and formatting data.  Once the data is prepared, though, you may have a chance to apply machine-learning techniques.


What is a model? It is a specification of a mathematical (or probabilistic) relationship that exists between different variables. For example, televised poker estimates each player’s “win probability” in real time based on a model that takes into account the cards revealed so far and the distribution of cards in the deck.

What is Machine Learning?

Simply put, machine learning is creating and using models that are learned from data.  In other contexts, it might be called predictive modeling or data mining.


Examples include:

Predicting whether an email message is spam or not

Predicting whether a credit card transaction is fraudulent

Predicting which advertisement a shopper is most likely to click on

Predicting which football team is going to win the Super Bowl

This book looks at both supervised and unsupervised models.  Supervised models contain a set of data labeled with the correct answers to learn from, while unsupervised models do not contain such labels.

Overfitting and Underfitting

A common danger in machine learning is overfitting: producing a model that performs well on the data you train it on but generalizes poorly to new data.  The other side of this is underfitting: producing a model that doesn’t perform well even on the training data.


Usually, the choice of a model involves a trade-off between precision and recall.  Precision measures the accuracy of positive predictions while recall measures what fraction of the positives our model identified.
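Both measures can be computed directly from a model’s predictions. A minimal sketch, with illustrative data:

```python
def precision_recall(predictions, labels):
    """Compute precision and recall from boolean predictions and true labels."""
    tp = sum(p and l for p, l in zip(predictions, labels))       # true positives
    fp = sum(p and not l for p, l in zip(predictions, labels))   # false positives
    fn = sum(not p and l for p, l in zip(predictions, labels))   # false negatives
    precision = tp / (tp + fp)  # accuracy of positive predictions
    recall = tp / (tp + fn)     # fraction of actual positives identified
    return precision, recall

preds = [True, True, False, True, False]
truth = [True, False, False, True, True]
print(precision_recall(preds, truth))  # precision 2/3, recall 2/3
```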

The Bias-Variance Trade-off

Overfitting may be viewed as a trade-off between bias and variance.  Both are measures of what would happen if you were to retrain your model many times on different sets of training data (drawn from the same larger population).  Roughly, high bias with low variance corresponds to underfitting, while low bias with high variance corresponds to overfitting.

Feature Extraction and Selection

When your data doesn’t have enough features, your model is likely to underfit. And when your data has too many features, it’s easy to overfit.

For example, imagine trying to build a spam filter to predict whether an email is junk or not. Most models won’t know what to do with raw text, so you’ll have to extract features such as:

Does the email contain the word “Viagra”?

How many times does the letter d appear?

What was the domain of the sender?

The first question yields a yes-or-no answer, a boolean that can be encoded as 1 or 0. The second yields a number. The third yields a choice from a discrete set of options.
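A minimal sketch of such feature extraction (the function name and encoding below are illustrative, not the book’s code):

```python
def extract_features(email_text, sender_domain):
    """Turn a raw email into the three feature types described above."""
    return {
        "contains_viagra": 1 if "viagra" in email_text.lower() else 0,  # boolean as 0/1
        "count_d": email_text.lower().count("d"),                       # numeric
        "sender_domain": sender_domain,                                 # categorical
    }

features = extract_features("Discount viagra deals", "example.com")
```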

Usually, we’ll extract features from our data that fall into one of these three categories. The type of features we have constrains the type of models we can use:

The Naive Bayes classifier is suited to yes-or-no features.

Regression models require numeric features (which could include dummy variables that are 0s and 1s).

Decision trees can deal with numeric or categorical data.

All these models, and more, will be covered in the following chapters.

This post was an excerpt from our new book Data Science From Scratch Summary.


Spark Streaming from Learning Spark Chapter 10

Spark Streaming

The following is a sample chapter from Learning Spark Summary.  For more information and to purchase see Learning Spark Summary.


Applications built on Spark Streaming can track statistics about page views in real time, train machine-learning models, or automatically detect anomalies.

The core abstraction in Spark Streaming is the DStream, or discretized stream. A DStream is a sequence of data that arrives over time. Internally, each DStream is represented as a sequence of RDDs arriving at each time step. DStreams can be created from various input sources, such as Flume, Kafka, or HDFS.

DStreams offer two types of operations: transformations, which yield a new DStream, and outputs, which write data to an external system.

Note to Python devs: As of Spark 1.1, Spark Streaming is available only in Java and Scala. Experimental Python support was added in Spark 1.2, though it supports only text data.

Example Spark Streaming: Streaming filter for printing lines containing “error” in Scala

// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream using data received after connecting
// to port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)
// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))
// Print out the lines with errors
errorLines.print()

This sets up only the computation that will be done when the system receives data.  To start receiving data, we must explicitly call start() on the StreamingContext, and then call awaitTermination() to wait for the streaming job to finish.

// Start our streaming context and wait for it to "finish"
ssc.start()
// Wait for the job to finish
ssc.awaitTermination()

Architecture and Abstraction

Spark Streaming uses a “micro-batch” architecture.  This means computation is a continuous series of batch operations on small batches of data.  New batches are created at regular time intervals typically configured between 500 milliseconds and several seconds.
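The micro-batch idea can be illustrated in plain Python, independent of the Spark API (this sketch groups records by count rather than by time interval, purely for illustration):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an (in principle unbounded) iterator of records into small
    fixed-size batches, mimicking how Spark Streaming slices input into
    per-interval batches that are then processed as ordinary batch jobs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = ["pageview:/home", "pageview:/docs", "error:500", "pageview:/home"]
for batch in micro_batches(events, 2):
    print(batch)  # each batch is processed as one small job
```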

For each streaming input source, a receiver task is launched within the application’s executors.  The receiver collects the input data and replicates it (by default) to another executor for fault tolerance.  The data is stored in the memory of the executors in the same way as cached RDDs.

DStreams provide the same fault-tolerance properties that Spark provides for RDDs.


Transformations on DStreams are either stateless or stateful.

Stateless Transformations

Stateless transformations, as the name implies, are simple RDD transformations applied separately to every batch, without depending on data from earlier batches in the stream.

Stateful Transformations

Stateful transformations use data or intermediate results from previous batches to compute the results of the current batch.

Windowed transformations

Windowed operations compute results across a longer time period than the StreamingContext’s batch interval.  Windowed operations combine results from multiple batches.
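The effect of a windowed operation can be sketched in plain Python (an illustration of the idea, not the DStream API; here a sliding window sums per-batch counts, analogous to a windowed reduce):

```python
def windowed_counts(batches, window_length, slide_interval):
    """Aggregate per-batch counts over a sliding window of several batch
    intervals, emitting one result each time the window slides."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        results.append(sum(window))
    return results

per_batch_events = [3, 1, 4, 1, 5, 9]  # events counted in each 1-second batch
print(windowed_counts(per_batch_events, window_length=3, slide_interval=1))
```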

UpdateStateByKey transformation

To maintain state across the batches in a DStream, such as tracking clicks as a user visits a site, updateStateByKey() maintains a state variable for DStreams of key/value pairs.
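The idea behind updateStateByKey() can be sketched in plain Python (an illustrative analogue, not the DStream API):

```python
def update_state_by_key(state, batch):
    """Fold one batch of (key, value) pairs into a running per-key state,
    here a count of clicks per user, the way updateStateByKey() carries
    state forward from batch to batch."""
    new_state = dict(state)
    for user, clicks in batch:
        new_state[user] = new_state.get(user, 0) + clicks
    return new_state

state = {}
for batch in [[("alice", 1), ("bob", 2)], [("alice", 3)]]:
    state = update_state_by_key(state, batch)
print(state)  # {'alice': 4, 'bob': 2}
```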

Output Operations

Output operations use the final transformed data in a stream and push it to an external database or print it to the screen.

Input Sources

Spark Streaming has built-in support for a number of different data sources. Some “core” sources are built into Spark Streaming itself, while others are available through additional libraries.

Core sources include streams of files and Akka actor streams.

Additional input sources include Apache Kafka and Apache Flume.

Finally, developers are able to create their own input source receivers.

Multiple Sources and Cluster Sizing

DStreams may be combined using operations such as union(). Through these operators, data can be combined from multiple input DStreams.

24/7 Operation

An advantage of Spark Streaming is providing fault tolerance guarantees. As long as the input data is stored reliably, Spark Streaming can always compute the correct result from it.


To provide fault tolerance in Spark Streaming, checkpointing must be set up. Checkpointing periodically saves data from the application to a reliable storage system, such as HDFS or Amazon S3, and this saved data can be used in recovery.

Checkpointing serves two purposes: 1) limit the state that must be recomputed on failure and 2) provide fault tolerance for the driver.

Driver Fault Tolerance

If the driver program in a streaming application crashes, it can be relaunched and made to recover from a checkpoint.  Tolerating failures of the driver node requires a special way of creating our StreamingContext, via the StreamingContext.getOrCreate() function.

Worker Fault Tolerance

Spark Streaming uses the same techniques as Spark for its fault tolerance of streaming worker nodes.

Receiver Fault Tolerance

Whether a receiver loses data during a failure depends on the nature of the source, specifically whether the source can resend data, and on whether the receiver acknowledges received data back to the source.

Processing Guarantees

Spark Streaming’s worker fault-tolerance guarantees exactly-once semantics for all transformations.

Streaming UI

Spark Streaming provides a special UI page that displays what streaming applications are doing. It is available in a Streaming tab on the normal Spark UI.

Performance Considerations

Spark Streaming applications have specialized tuning options, including batch and window sizes, level of parallelism, garbage collection, and memory usage.

This post is an excerpt from the book Learning Spark Summary.


Clean Code Summary – Sample Chapter

Clean Code

The following is chapter 3 from Clean Code Summary book available from Amazon. It was written and designed for experienced software engineers and managers looking to save time and learn key concepts from the critically acclaimed software engineering book, Clean Code: A Handbook of Agile Software Craftsmanship.

Functions Chapter 3

A function is a type of procedure or routine in a computer program.  What makes a good function?


Lines should not be 150 characters long.

Functions should not be 100 lines long.

Functions should hardly ever be 20 lines long.

Blocks and Indenting

Blocks within if statements, else statements, while statements, and so on should be one line long.

The indent level of a function should not be greater than one or two.

Do One Thing

The following advice has appeared in one form or another for over 30 years:

Functions should do only one thing and do it well.

One Level of Abstraction per Function

To make sure our functions are doing “one thing,” we need to make sure that the statements within our function are all at the same level of abstraction.

To determine the abstraction level, use the “Step Down” rule: read the program as though it were a set of TO paragraphs, each describing the current level of abstraction and referencing subsequent TO paragraphs at the next level down.

Making the code read like a top-down set of TO paragraphs is a useful technique for keeping the abstraction level consistent.
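A sketch of how code reads under the Step Down rule (a hypothetical example loosely modeled on the book’s page-rendering illustration): the top-level function reads as a list of steps, each delegating to a function one abstraction level lower.

```python
def render_page(page):
    # TO render the page, we include the setups, the page content, and the teardowns.
    return include_setups(page) + page["body"] + include_teardowns(page)

def include_setups(page):
    # TO include the setups, we include the setup text if this is a test page.
    return "SETUP\n" if page.get("is_test") else ""

def include_teardowns(page):
    # TO include the teardowns, we include the teardown text if this is a test page.
    return "\nTEARDOWN" if page.get("is_test") else ""

html = render_page({"body": "content", "is_test": True})
```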

Switch Statements

Switch statements may be tolerated when they:

appear only once

are used to create polymorphic objects

are hidden behind an inheritance relationship so that the rest of the system can’t see them

Use Descriptive Names

Choosing good names relies on small functions that do one thing.  With this kind of function in place, consider the following:

Don’t be afraid of using long names

Don’t be afraid of spending time to choose a descriptive name

Be consistent in naming

Function Arguments

How many arguments should functions allow?  The ideal number of arguments is zero (niladic), followed by one (monadic), followed closely by two (dyadic). Avoid three arguments where possible and do not use more than three (polyadic).

Other argument considerations:

Do not use flag arguments such as booleans, because a flag implies the function does more than one thing

Two arguments make the function more complicated to understand and three arguments even more so

Wrap multiple arguments into a class of their own
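Clean Code illustrates this with a makeCircle function; a rough Python rendering of the same idea (the dict return value is illustrative):

```python
from dataclasses import dataclass

# Instead of make_circle(x, y, radius) with three arguments,
# group the related x and y into a class of their own.
@dataclass
class Point:
    x: float
    y: float

def make_circle(center: Point, radius: float):
    """Two arguments instead of three; 'center' is now a named concept."""
    return {"center": (center.x, center.y), "radius": radius}

circle = make_circle(Point(0.0, 1.0), 2.5)
```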

Have No Side Effects

Ensure functions have no side effects and especially side effects that include temporal coupling.  Temporal coupling creates dependencies between code and timing.

Command Query Separation

Functions should either perform an action or answer a question, but not both.
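Clean Code’s example is a set(attribute, value) function that both sets a value and returns whether it succeeded; a sketch of the separated version (class and method names here are illustrative):

```python
class Settings:
    def __init__(self):
        self._attrs = {}

    def attribute_exists(self, name):      # query: answers a question
        return name in self._attrs

    def set_attribute(self, name, value):  # command: performs an action
        self._attrs[name] = value

# The caller now reads unambiguously: first ask, then act.
s = Settings()
if not s.attribute_exists("username"):
    s.set_attribute("username", "unclebob")
```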

Prefer Exceptions to Returning Error Codes

When using exceptions, it’s often preferable to extract the bodies of try/catch blocks into functions of their own.

Don’t Repeat Yourself

Avoid duplication.  (For a summary version of this principle, check out The Pragmatic Programmer Summary on Amazon)


How do you write functions like those described above?  Writing functions within these guidelines doesn’t happen right away.  It requires iterating over versions: get the function working first, then refine it.



For more, see the Clean Code Summary book.



The Pragmatic Programmer Summary – A Pragmatic Philosophy

The Pragmatic Programmer Summary

The following is chapter 1 from The Pragmatic Programmer Summary book.  It was written and designed for people looking to save time and learn key concepts from the classic software engineering book, The Pragmatic Programmer: From Journeyman to Master.

Chapter 1 A Pragmatic Philosophy

Pragmatic programmers are distinguished by their attitude, their style, and their philosophy toward problems and their solutions.  Pragmatic programmers think about the larger context as well as the particular challenge in front of them.

In this book, there are a total of 46 sections spread across eight chapters.

This chapter explores a pragmatic programmer’s philosophical approach in six sections:

1. The Cat Ate My Source Code

Take responsibility and don’t blame someone or something else.  Don’t make up excuses.

2. Software Entropy

Entropy is a term from physics that refers to the amount of “disorder” in a system.

“Broken Window Theory” states that damage left unrepaired for any substantial length of time instills a sense of abandonment. When this happens, entropy increases rapidly.

Clean up all the broken glass of a project. Don’t make excuses for not cleaning up.

3. Stone Soup and Boiled Frogs

Hungry soldiers return to a village where everyone is locked in their homes, unwilling to share food.  The soldiers boil water with stones, as if making a stew.  Using stones as stew ingredients captures the villagers’ attention, and they come out to investigate.  The soldiers then collect a variety of ingredients from the previously stingy villagers and make a tasty, hearty stew.

A moral of the story is that the soldiers act as a catalyst, producing something the villagers could not have produced by themselves. Eventually, everyone wins.  Occasionally, try to emulate these soldiers.

The villagers’ perspective is told through an analogy to boiling frogs.  If you place a frog in a pan of boiling water, its reaction will be dramatic.  But if you place a frog in cold water and then gradually heat it, the frog won’t notice until it’s too late.  This is similar to how the villagers were gradually, almost imperceptibly, drawn into contributing to the stew.

Don’t be like the frog. Keep an eye on the big picture.

4. Good-Enough Software

Try to discipline yourself to write software that’s good enough for your users, for future maintainers, and for your own peace of mind. You may well find your programs are better because of the shorter incubation time.

Involve your users in the trade-offs because great software today is often preferable to perfect software tomorrow.

Know when to stop coding, because over-embellishment and over-refinement can ruin a program. Let your code stand in its own right for a while.

5. Your Knowledge Portfolio

Your knowledge and experience are your most valuable professional assets. But, they’re expiring assets.  Keep your knowledge portfolio diversified and up-to-date.  Invest regularly, diversify and attempt to manage risk.  Take time to review and rebalance.

6. Communicate!

Developers have to communicate constantly: in meetings, and in listening and talking with managers, other engineers, and end users.

Communication Key Points:

Know What You Want to Say

Plan what you want to say and the style appropriate to your audience.  Style examples include short and to the point or very descriptive and detailed.

Know Your Audience

You need to understand the needs, interests, and capabilities of your audience.

Choose Your Moment

Timing matters.

Choose a Style

Tailor your style to suit your audience.

Make It Look Good

Your ideas are important, but presentation matters.

Involve Your Audience

If possible, share early versions of your documents with readers.

Be a Listener

If you want people to listen to you, listen to them.

Get Back to People

Keeping people informed makes them far more forgiving for the occasional time you slip up.


This was chapter 1 from The Pragmatic Programmer Summary.