Advanced Excel – Pivot Tables

This is not about programming but just my reference for some advanced spreadsheet skills for Excel 2007. To be updated for later Excel versions.

Steps to create pivot tables are provided here. This is just an elaboration on my part, with input from other websites like

  1. Preselect the data range. If the data to be used in the pivot table is in one of the worksheets, clicking on any one of the cells inside that range of data will automatically make the wizard select that data range. Make sure that data range does not have blank rows.
  2. Open the Create Pivot Table Wizard to Configure the pivot table. Navigate to the ribbon and follow this menu selection: Insert > Pivot Table > Pivot Table. When the wizard comes up, there are two choices you have to make:
    • Where to get the data? The data source and range, which can either be:
      1. From the same worksheet. If the default data range is incorrect, manually edit the values in this input field with the following syntax: [workbookname]sheetname!range
      2. From an external data source. If you choose this, a list of existing connections will pop up and you can create a new connection as well. image

        To create a new connection, click on the  the “Browse for more” button and the Select Data Source dialog box will pop up.image

        If the data source is MS SQL Server choose “NewSQLServerConnection”, otherwise, choose “Connect to New Data Source” and click Open. image

        The first two options will open a new connection dialog box with SQL Server while the fourth will connect to an Oracle database. image

        Choosing “ODBC DSN” opens up another dialog box: image

        Choosing “Other/Advanced” will allow you to fine tune the connection for other types of data sources:


    • Where to place the pivot table. You can place it in a new worksheet or in an existing worksheet. To make it easier placing it in an existing worksheet, you can collapse the dialog box, chose the first cell, and expand the dialog box afterwards. image
  3. Choose which fields are placed where on the pivot table. Upon clicking the OK button in the Create PivotTable dialog box, an empty pivot table will be created and a PivotTable Field list will be displayed. Configure the pivot table by either dragging and dropping them on the empty table or on the PivotTable Field.


    The four places to place them are:

    • Column Labels = Column Fields;
    • Row Labels = Row Fields,
    • Values  = Data Items ;
    • Report Filter = Page Fields.  
  4. Some of the things that are configurable in the pivot table are:
    • The row fields and the column fields can be sorted ascending, descending or custom.image
    • In the Field Table Field List, you can configure the field settings (move, remove, rename). The Values field settings data can be summarized in different ways as well:  imageimageimageimageimage

      Some explanation for the values in the Summarize by tab and explanation for Show value as tab.The p in VARP and STDVARP means “of population”, those without are “of a sample”. Some explanations here and here, here. VAR and VARP have been renamed in later Excel as VAR.S, VAR.P, respectively. Old function names may be phased out so its better to use the new names. Those with “A” in the statistical function will not ignore text and logical values, as explained here. Use the STDDEV to show the more intuitive distribution of data. Click here for a discussion why VAR and STDDEV are not the best way to measure dispersion since it amplifies larger deviations. It is better to use MAD mean (or AVEDEV in Excel) as distribution measure plus IQR. AVE and STDDEV are non-robust measures of dispersion.

    • Its possible to use the same underlying data but presented differently, say as another column but with different name and different summary or display value format. Or transpose the fields between rows and columns.
    • You can also modify how it is displayed by right clicking on the pivot table by formatting, sorting and other pivot table options . image
    • The context menu on the row and column fields are different, with Filter, Expand/Collapse, Group (by months, quarters, years, etc.) or Ungroup. Some other tips on grouping. imageimage
    • You can choose to display the summary or details of each higher level field if there is a lower level row/column fields by clicking on the icon to the left of the field names.image
    • You can choose to filter data display by unticking on which row values to not display.  image
    • Even the data values can be filtered using the “Value Filters” image
    • You can also filter report by like which “Regions” to display in this specific report.image
    • There are some pre-defined table formats that can be used, just click on any cell within the pivot table and click on  Home > Format as Table and choose any format. Conditional formatting can be done as well using HomeConditional Formatting or styles on the cell using Home > Cell Styles. Some example of zebra style formatting and increase/decrease icons and creating/editing formats.
    • There is also on the ribbon the Pivot Table Tools you can field header filter icons and other options. From this you can also add Calculated fields (Formula > Calculated Field), change data source and generate additional reports using show report filter pages. For Show Report Filter Pages to work, there must be a field in the Report Filter. image
  5. Other things you can do:

Object Oriented Programming, Part 1

Caveat: I am not a graduate of a 4-year computer course, so this post is just a summary of my understanding of OOP in general and not a definitive guide or an expert opinion.


Object-oriented programming paradigm is essentially the division of a system into modular objects (run-time entities) responsible for their own data and behaviour, with such objects being a combination of structured programming and abstract data types. It encourages hiding of data which is only accessible through methods bundled with it. OOP languages are said to have the following minimum fundamental feature set:

  1. Encapsulation, the bundling together of data structure and method collection that operate on these data structure into a single component called an object (similar to abstract data type or module but with inheritable and polymorphic properties and support dynamic dispatch) possessing state (data) and behaviour (methods) and used to abstract real world entities.
  2. Inheritance, basing of an object/class from another object/class (a) to re-use the same implementation as that of the base class or prototype object, or (b) to specify that the same methods and behaviours is maintained in the derived class.
  3. Polymorphism, the provision of a single interface to entities of different types.
  4. Open recursion, the invocation of a method inside another method of the same object using  a special keyword-like variable (this, self or Me), The invoking method may be defined in the superclass and the invoked method in the subclass thru overriding.


This is an example of a combined abstraction of data and code. Some classes though are just wrappers of methods, that’s why there are rants about this in here or here. There are two ways an object can be created:

  • Class-based, where an object is created out of a class which serves as its template or blueprint. This includes Simula67, Smalltalk, Eiffel, C++, Java, Objective-C, C# and PHP. Click here for a list of some class-based languages.
  • Prototype-based, where an object is created by cloning an existing object, which serves as its prototype. This includes Self, NewtonScript, Lua, Javascript, Perl, Python and Ruby and others.

Information hiding is a related but independent notion to encapsulation; that of restricting access to some of the object components, such that only thru an interface are object data or methods exposed to its clients. Information hiding is not a part of OOP features. The purposes of information hiding in objects are:

  • To act as a barrier to prevent unauthorized direct inspection or manipulation of the object’s internal state that could result in them becoming invalid, inconsistent or corrupted; and vice versa, to prevent internal change affecting or propagating to the other parts of the program outside of the object, like collision of identically named variables.
  • To limit inter-dependencies between software components and limit extensive modification if there’s a design change in the code, and group together related components for easier understanding and maintainability. Changes are local rather than global.
  • To reduce development risk by deciding which components will be stable (the interface) and which parts are allowed to change and improve (the implementation). By having a stable interface, the clients can rely on the interface that it will not change much like a contract.

With regards to information hiding in C#, although properties look like member fields by not having “( )”, they are in fact methods, access methods to be precise, which may get or set hidden internal data fields: “a property is a pair of getter and setter access methods that provide access to a field or calculate the needed values.”. Not all internal fields have to have a corresponding getter/setter access method.

Composition or the combination of simpler component objects without independent existence  into more complex composite objects resulting in a “has-a relation, is also one manifestation of encapsulation. Here, the composite object owns the component, being responsible for their creation and destruction. The component object gets destroyed together with its composite object and can never be transferred to another composite object, it may only be part of one composite. Multiple has-a relations combine to form a possessive hierarchy. One way of implementing this in C# is thru interfaces. Advantages of composition over inheritance include: (a) more stable business domain objects and (b) higher design flexibility since it can accommodate future requirement changes that would otherwise require complete restructuring of the class hierarchy or inheritance model. One disadvantage is that all of the methods being provided by the composed class must be implemented in the composite class, unlike inheritance where only classes with different behaviour from base class need to override the base methods, but this is avoided if the language supports mixins or traits.


  • Interface Inheritance, or specifying that the names of the methods and attributes in the base interface is found in the derived interface.
  • Subtyping, or specification inheritance, where all the methods (implementations) and attributes of a base class are found in the derived class resulting from is-a relationship. This is found only in statically-typed class-based OO languages, such as Java, C#, C++ and Scala.
  • Implementation Inheritance establishes syntactic relationship and not necessarily semantic relationship, especially those that allow mutable objects.

Types of inheritance by the number of base classes or objects its inheriting from:

  • Single inheritance, where a class/object may inherit only from a single other class/object but protocols in Objective-C and Swift, interfaces in Java and C#, traits in Scala, modules in Ruby) provide a functionality to have multiple inheritance. Languages that use this and how they resolve diamond problems are: C# , Objective-C, Smalltalk, and Swift. (none, since base implementations are either overridden or hidden), Java (in Java 8 must re-implement the method or result in compile error), Ruby (method redefinition obscures any previously existing definition at the time of execution) and Scala (right-first depth-first search of extended ‘traits’, then retain only the last occurrence of each module in the resulting list). 
  • Multiple Inheritance, where a class/object may inherit only from multiple classes/objects. Additionally, Eiffel has repeated inheritance from one superclass apart from multiple inheritance. Languages that uses this and how they resolve ambiguity like the diamond problems are C++ (virtual inheritance or explicitly by properly qualifying from which it is inheriting), Common Lisp Object System (method that matched the most specific argument, order in the declaration of containing class, overridden thru method combination or reflection thru metaobject protocol), Curl (secondary constructors on shared classes), Dylan (MRO/C3 linearization), Eiffel (automatic joining if same name and implementation; explicitly thru select and rename directives), Logtalk (masked out by default, renaming thru method alias), OCaml (precedence by the last in the class inheritance list, override by qualifying with class definition), Perl (left-first depth first searching on an inheritance ordered list, or overridden with MRO/C3 linearization or other algorithms), Perl 6 (as as Perl), Python (MRO/C3 linearization on an inheritance ordered list), and Tcl (MRO/C3 linearization).
  • Is there a language where
    • the default way to handle the diamond problem has this precedence: (1) automatic joining if same name and implementation (Eiffel), (2) method that matched the most specific argument (CLOS) and (3) right-first depth-first search of methods, then retain only the last occurrence of each module in the resulting list (Scala), then
    • can be overridden by any of these: (1) explicit qualification of ancestor class (Eiffel), (2) reflection thru metaobject protocol (CLOS), (3) method combination (CLOS), with
    • the ability to rename methods thru alias (Logtalk)?

Types of inheritance according to manner of inheritance:

  • Classical Inheritance or class inheritance, where attributes/variables and implementation/methods (and contracts in Eiffel) are automatically inherited from one or more base classes, parent classes or superclasses and retained by derived classes, child classes or subclasses creating a class hierarchy. Used mainly by class-based languages but can be simulated by some prototype-based OO languages. Derived classes may modify or supplement the inherited functionality thru overriding (marked as virtual, abstract or override in the base and override in derived C#) or hiding (marked in C# as new in the derived member) but must have the same method signature and visibility. Some classes or methods may be marked as uninheritable thru class modifiers (final in Java, sealed in C#, frozen in Eiffel). Private methods can’t be overridden due to its inaccessibility or invisibility outside of the class it is a member method. Static methods can be overridden as well.
  • Differential inheritance or prototypal inheritance,  where only  attributes that differentiates an object from the its prototypes need to be added to the derived object. Method lookup rules are used for delegation, the mechanism to share code between the two objects. A pointer referencing its prototype objects (called delegation link) is kept in the derived object. Used by prototype-based OO languages like Self, NewtonScript, JavaScript and Io.
  • Concatenative Inheritance, where properties from one object are copied to another, without retaining a reference between the two objects. Found in Javascript.
  • Parasitic Inheritance. Found in Javascript.

Common constraints in inheritance using classes are:

  • Some languages imposes restrictions on inheritance. Multiple inheritance partially solves single inheritance but not all languages allows repeated inheritance from a single superclass (only Eiffel supports this). Similar mechanism can be achieved thru mixins (a class that combines implemented methods and attributes from other classes and is included but not inherited by another class) and traits (collection of methods used to extend another class and with mechanisms for name conflict resolution but without attributes or state and can be composed apart from inheritance unlike a mixin). An interface is different from a mixin and trait in that it can not include implementation of the methods in it. Mixins is used in CLOS, Javascript, Perl, PHP, Python, Ruby, Scala and Tcl. C# 3.0 can mimic mixins thru marker interface pattern, generic programming, and extension methods (sample here) or interfaces combined with aspect-oriented programming. Traits are used in Self, Perl6, Javascript, Squeak, PHP, Ruby, Scala, Python,  Javascript support mixins and traits natively.
  • Fixed inheritance hierarchy and data type at instantiation and cannot change in time, locking the developer within their original designs. (Can be partially solved with the decorator pattern.) “Role based design should be used when it is conceivable that the same object participates in different roles at different times, and inheritance based design should be used when the common aspects of multiple classes are factored as superclasses, and do not change with time.
  • Client code have access to the base class data, either because the base class is public or by casting. This can be mitigated by making the base class members protected in most of these languages (C++, Java, C#, etc.)
  • The yoyo problem, where the programmer has to keep flipping between many different class definitions to follow the control flow of the program with a long and complicated inheritance graph. To avoid this problem, keep the inheritance as shallow as possible.


  • Ad hoc polymorphism, where several functions with an identical function name accept a number of parameters of different but limited types in distinct combinations. This allows the functions to provide potentially completely heterogeneous implementations depending on distinctness of names, order and types of parameters supplied and may involve type promotion, or type coercion. Advantages are (a) less verbose function names and reduced code length, (b) uniform interface or notation across different objects, (c) specialization of each function to only 1 set of distinct parameter combination, (d) increased program efficiency, and (e) enforce mandatory data members if constructors. Two examples:
    • function overloading, creating several functions that happen to have the same function name differing by method signature: ( (a) return type, and the input parameter’s (b) arity, (c) data types and (d) order), and is resolved at compile-time using “best match technique” in statically-typed OO languages. Class constructors may also be overloaded. Coder comprehension might be affected if parameter types involve type coercion or matches the base “object”. In C++, there is no overloading across scope, such that a function in outer scope must be imported with the “using” keyword. This is different from subtype polymorphism.
    • operator overloading, creating different implementations of operators depending on their arguments so as to resemble some notation in the domain the programmer is working on.
  • Parametric polymorphism or generics, where a functions (or data type) don’t depend on the type of the parameters it accepts (the code does not mention any specific type), and so is implemented generically to handle the identical data type. ML first introduced this in 1975 and has other forms of parametric polymorphism.
  • Subtype polymorphism or subsumption polymorphism or inclusion polymorphism or subtyping, where a method written to operate on elements of one type can work on elements of another type thru the notion of substitutability (either because (a) one is derived from the other thru inheritance from supertype to subtype [nominal subtyping] or (b) both types define all the same methods thru duck typing [structural subtyping]) with the correct method determined at runtime thru virtual functions. For this reason, this may also be called interface inheritance. Read more about covariance and contravariance as this relates to subtyping.
  • Bounded polymorphism, or the interaction between parametric polymorphism and subtype polymorphism, or parametric polymorphism within the bounds of a range of subtypes of a particular type.

Polymorphism is resolved thru either (a) Dynamic Dispatch, (b) Multiple dispatch, (c) Double dispatch or (d) Single dispatch.


Only contents of comments showing that what I’ve written is incorrect will be retained. All other comments will be deleted or edited.

Understanding KMP (Knuth-Morris-Pratt) Algorithm

I was trying to look for a C# function yesterday that I can use to locate a substring inside a StringBuilder object, but it seems there is no native support for that, and I came across this StackOverflow accepted answer which uses a KMP table, which I’ve never known before. I struggled to understand it from the start. Some of the detailed explanations here, here and here are too technical for me. Luckily, I have come across a clearer explanation at the Computer Science Source blog. But even in that blog, the “table”, which I would call the “FailedCharBackshift” table, has not been discussed in detail, and even with Wikipedia pseudo-code, some parts of the function are hard to understand, specially the part where a sub-pattern is broken inside the pattern (the second if condition inside the while clause), so I had a step through the function itself how the table is generated, and had a few patterns experimentation. I think I know now the idea behind the table, and since I can’t possibly remember what I know now until forever, I have to write it down. Looking at the blog post I mentioned, it should have discussed the table at the start, not after discussing the KMP algorithm. So here is another explanation of the KMP table algorithm for those who are reading about this algorithm for the first time.

A little Background

So the KMP algorithm is an improvement over the Naive Approach in searching a pattern inside a text. The improvement in KMP is that, while matching each character of the pattern and the text, if there is any mismatch along the character sequence, the KMP algorithm uses a table to determine how many character positions to the right it has to shift, and does not automatically shift by 1 character as the Naive Approach does.

Before we look at the formula KMP uses to determine how many positions to move the pattern from its current location in the text, we need a few symbols, such as the following:

t = the text from which the pattern is to be searched

p = the pattern to be searched

j =  the character position within t starting from which we match for p

i =  the character position within p that we are currently matching

o = the number of characters of the onset that is repeated to the immediate left of the failed character. The onset is the sub-pattern found at the beginning of the pattern and repeated to the immediate left of the failed character.

T = the table containing how many characters to the left we need to backtrack inside p starting from i if i fails to match, for all the characters in p.

So the formula used by KMP to determine how many characters to shift from its current position is:

 j = j + iT[i] :   We can add parentheses in this formula to group the terms, such as:

 j = (j + i )T[i]  :  By adding j and i first, we are determining the number of character shift for the pattern using the failed character index on the text t as base.

j = j + (iT[i])    : By adding i and T[i], we are determining the number of character shift for the pattern using the current position of the pattern p as base.

The FailedCharBackshift table

So we must understand this table T first before looking at how KMP algorithm proceeds. You can always look at the mentioned blog’s table for “abracadabra”. I experimented with a few patterns and I will be using the pattern “aaacaaacaaaaabra” here to illustrate better the workings of the table.


To read the above table, the blue box represents the onset from the start of the pattern, which is repeated right next to the left of the character that failed, which I highlighted using a red box. The arrows indicate the number and direction of the shift from the index of the failed character. The dot means no shift.

As you can see on the above, the table’s main purpose is to indicate for each character position, how many character position it has to shift left, right, or no shift if the character in that position fails to match. A negative number indicates a shift to the right, 0 means no shift, and positive means shift to the left, starting from the index of the failed character.

At the pattern’s character position of 0, the value is “-1”, which means that if the 1st character fails to match, move 1 position to the right the whole pattern, which is arrived at by using the formula j = j + (iT[i]) , and by substitution j = j + (0T[0])  becoming j = j + (0(-1)) and j = j + 1.

If the pattern fails to match at the 2nd character, the whole pattern is moved 1 to the right as well: j = j + (iT[i]) , and by substitution j = j + (1T[1])  becoming j = j + (1(0)) and j = j + 1. These two positions’ values are fixed for the algorithm to work properly.

For all the other positions on the pattern, the values are determined by checking how many characters to its immediate left matches that many characters at the start of the pattern (the onset, which applies for the indexes 2,3,5,6,7,8,9,10 and 11 ), or a possible match within the reduced onset if the match is broken (for indexes 12 and 13). If the character on the immediate left of the failed character does not exist in the onset, then the shift is zero (for indexes 4, 14, 15).

To determine the onset, we need to keep track of the value of o, which is simply the number of characters of the onset, which is initially set to zero, incremented as each character is matched along inside the pattern. If the sub-pattern is broken, the o is changed to the backshift value of the last onset character that matched. So o’s value is not always zero at each index position and can go down and up, or just down if the character to the left can not be matched against an onset.The handling  of the broken matched pattern is a bit tricky to understand, especially when I was using the pattern “abracadabra”, so I had to play around with different patterns.

The algorithm for the FailedCharBackShift table

So here is the algorithm in C# with comments:

public static int[] KMPTable(string pattern)
           int patternCharNumber = pattern.Length;

           //Create an array with elements that corresponds to each of the chars of the pattern
           int[] FailedCharBackshift = new int[patternCharNumber];
           //This is the value that the index must have so that if the
           //matching fails at this index, the pattern slides to the next char.           
           FailedCharBackshift[0] = -1;            
           FailedCharBackshift[1] = 0;
           // nucleus is the index inside the pattern under consideration,
           //which now starts at 2 after setting 0 and 1 above.
           int nucleus = 2;

           //onset is the char(s) at the start of the pattern repeated inside the pattern
           //onsetLastChar is the index of the last char of the onset to the
           //immediate left of nucleus, initially set at zero
           int onsetLastChar = 0; 

           //until the last char of the pattern
           while (nucleus < patternCharNumber)
               //compare the char to the immediate left of the nucleus
               //with the last char of the onset.
               if (pattern[nucleus - 1] == pattern[onsetLastChar])
                   //if the same, increment the number of onset chars
                   //assign it as the value of the nucleus
                   //and increment the nucleus
                   FailedCharBackshift[nucleus++] = ++onsetLastChar;
                   //if not the same (the onset match is broken on the left of the nucleus),                    
               else if (onsetLastChar > 0)
                   //it recursively reduces the onset by resetting it
                   //to the backshift value of the last char of the revised onset.                   
                   //this is the tricky part to understand, so need to step through      
                   //the code to really understand it and experiment with different patterns
                   onsetLastChar = FailedCharBackshift[onsetLastChar];
                   //if not the same and at the start of the loop
                   //or it reached the start of the onset (onsetLastChar = 0)                   
                   //then set it at zero
                   FailedCharBackshift[nucleus++] = 0;

           return FailedCharBackshift;

As for the KMP algorithm itself, the blog Computer Science Source blog already discussed it clearly, I think.

Ranking Functions in MS SQL Server 2012

Ranking functions return a ranking value for each row in the result set that may or may not specify a partition. The following are common characteristics of ranking functions:

  • Scope of Partition. If the PARTITION BY clause is specified, the ranking function is applied to cover rows within that partition, otherwise if unspecified, the partition is assumed to be the whole result set.
  • Sorting within Partition and Result Set. The row’s rank number is determined by the order_by_clause of the OVER clause. The ORDER BY clause of the SELECT statement orders the entire result set, including the rank order column. These ORDER BY clauses do not have to coincide.
  • Sorting Key for Result Set. If the PARTITION BY clause is specified, it is best to use this for the ORDER BY clause of the SELECT statement. If not specified, the ORDER BY clause of the OVER clause is probably best as the same as the ORDER BY clause of the SELECT statement.
  • Filtering Rows. To filter by the ranking function, you need to use either a) a subquery, b) a CTE or c) a table variable or temp table to generate the initial record set. As explained in this StackOver Q&A, “Windowing functions are not performed until the data is actually selected which is after the WHERE [and HAVING] clause [are dealt with]. So if you try to use a row_number in a WHERE clause the value is not yet assigned.”.

In MS SQL Server 2012, there are four ranking functions and their differences are tabulated below:

Syntax ROW_NUMBER ( ) OVER ( [ PARTITION BY value_expression , … [ n ] ] order_by_clause )
  • Number Sequence. Returns the sequential number of a row within the scope specified in the PARTITION BY clause, starting at 1 for the first row in each partition. There are no duplicates and skips in the order number.
  • Ties. ROW_NUMBER() order ignores ties and just assign them consecutively that is why there is no guarantee they will be the same unless order by columns and partitioned column are unique separately and in combination.
  • Usage. Generally used to generate a serial reference number for each row, like used in page processing.
Syntax RANK ( ) OVER ( [ partition_by_clause ] order_by_clause)
  • Number Sequence. Returns a non-sequential rank number of each row within the scope specified in the PARTITION BY clause, starting at 1 for the first row in each partition.
  • Ties. The rank number of rows with ties above it are counted separately (tied ranks above are not treated as one) but the rank number of rows it has ties with are counted collectively per rank (tied ranks are treated as one ), thus the RANK function may return several duplicate rank numbers and skips in the rank numbers.
Syntax DENSE_RANK ( ) OVER ( [ partition_by_clause ]  order_by_clause  )
  • Number Sequence. Returns the sequential rank number of each row within the scope specified in the PARTITION BY clause, starting at 1 for the first row in each partition.
  • Ties. The rank number of rows with ties are counted collectively per rank as one rank, thus the DENSE_RANK function may return several duplicate rank numbers but without skips in the rank numbers.
Syntax NTILE (integer_expression) OVER ( [ partition_by_clause ] order_by_clause )
  • Grouping. The rows in the scope are separated into “X” (the integer_expression) number of groups . If the rows are not evenly distributed, the first few groups will have an extra row than the last or succeeding rows.
  • Number Sequence. Returns the sequential order number of groups of rows within the scope specified in the PARTITION BY clause, starting at 1 for the first row in each partition. There may or may not have duplicates in ntile numbers and there are no skips.
  • Ties. Ties are ignored and not considered in the order/ntile numbers. Rows with identical values may fall on different groupings or ntile. See StackOverflow Q&A.

According to MS, all ranking functions are non-deterministic.