Thursday, 12 July 2018

awk match and return smallest matching part of a string

function smatch(str, re)
{
    start=1;
    len=length(str)+1;
    print "string length:" len
    i=0
    while (match(substr(str, 1, start+len-10), re)){
        i++;
        if (i>15) exit;
        start=RSTART;
        len=RLENGTH;
    }
    return substr(str, start, len);
}

Thursday, 27 October 2011

Cursor Sharing

The init.ora parameter "CURSOR_SHARING" can have the values of FORCE, SIMILAR or EXACT. When set to FORCE, all string/numeric literals will be replaced by bind variables. This will improve the performance of code that should use bind variables but doesn't (by making it possible to use the same plan for all the versions of the statements). Disadvantages are the binding of real constants (things that never change), which will give the optimiser less data to work with. The explain plan will see different query to the actual optimiser, and hence the results will be unreliable. See: http://asktom.oracle.com/pls/asktom/f?p=100:11:1328478115348624::::P11_QUESTION_ID:5180609822543 When set to SIMILAR, it will share the cursor if any of the different bind variable values don't change the plan (e.g. due to a difference in "selectiveness" which means that a Index Range scan should be used instead of Full Table scan). If bind peeking is enabled, it used to peek once and then use the same plan then onwards. Since 11g it keeps checking if the new values have increased the physical reads etc and if so it discards the original plan and creates new ones for each new value in that bind variable (bind sensitive and bind aware columns in the child cursor records) EXACT will never share a cursor, and will always create a new one for each statement with different values.

Useful links

Oracle optimisation blog: http://optimizermagic.blogspot.com

Tuesday, 14 December 2010

Medians and Quartiles

Problem statement:

From a subset of records (identified by a where clause) find the student with the mean salary, the student at the 25 percentile and the student at 75 percentile. Also get the "outliers", which are the students with the highest and the lowest salaries.

Added complication: Each student record also has a weighting associated with it, which goes from 1 to 1000, and this needs to be considered when counting the percentiles. In effect, this means that each student records represents a thousand "mili-students" and we need to find the student at the correct percentile.

All the following solutions need a supporting function which will calculate the required rank (e.g. for median) when given the total number of results. E.g. if the total is 11, the mean position is 6.

Solution 1 (the most basic and slowest):

We created another large table with each student repeated a thousand times. Then ordered this expanded result set by salary, looped through this ordered list until we were at the correct rank.

Obviously given that the result set was expanded a thousand times, this solution wasn't great for performance.

Solution 2 (Improvement, but still quite cumbersome):

We don't repeat each student a thousand times. Instead, we know the mean position in terms of the weight (e.g. the student at position 6.334 is the mean) and then while looping through the ordered list of students we keep a cumulative total of weights. As soon as the cumulative total goes beyond this position we exit the loop and return that student's salary as the mean value.

Solution 3 (Fastest and fits in a single SELECT statement):

Use Oracle Analytic functions to get the cumulative sum mentioned above, and the SIGN function to find whether it is more than or equal to the mean position. Then get the minimum where the sign value is on the right side. The query looks like this:


SELECT country,
        MAX(DECODE(ranking, 1, salary, NULL)) outlier1,
        (MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'Q11')), -1, NULL, salary)) +
        MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'Q12')), -1, NULL, salary)))/2 Q1,
        (MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'M1')), -1, NULL, salary)) +
        MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'M2')), -1, NULL, salary)))/2 mean,
        (MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'Q21')), -1, NULL, salary)) +
        MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'Q22')), -1, NULL, salary)))/2 Q3,
        MAX(DECODE(ranking, country_count, salary, NULL)) outlier2
FROM
    (SELECT country, student_id, salary, weight,
             SUM(weight) OVER (PARTITION BY country ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) weights_cumm_sum,
            SUM(weight) OVER (PARTITION BY country) weights_sum,
            RANK() OVER (PARTITION BY country ORDER BY salary) ranking,
            COUNT(*) OVER (PARTITION BY country) country_count
    FROM tqi_salary_test) s
GROUP BY country

The inner query orders the students in each country by their salary and calculates a cumulative sum of student_weighting. The outer query finds the minimum cumulative value where it equals or exceeds the required quartile value (supplied by the quartile index function) and returns the corresponding salary. The outliers are accessed simply based on their ranking by salary.

The above mentioned function to return the required quartile index looks like this:

CREATE OR REPLACE FUNCTION TQI.get_quartile_index(
    pn_count IN NUMBER,
    pv_index_required IN VARCHAR2
) RETURN NUMBER IS
    ln_m_1 NUMBER;
    ln_m_2 NUMBER;
    ln_q_1_1 NUMBER;
    ln_q_1_2 NUMBER;
    ln_q_2_1 NUMBER;
    ln_q_2_2 NUMBER;
    ln_output NUMBER;
BEGIN
    IF (MOD(pn_count, 2) = 1) THEN --Odd number of rows, e.g. 11 or 13.
        ln_m_1 := (pn_count+1)/2; --median row will be 6 or 7.
        ln_m_2 := ln_m_1;
        IF (MOD(ln_m_1, 2) = 1) THEN --median is odd number, say 7 
            ln_q_1_1 := (ln_m_1-1)/2; --first quartile is 3 and 4.
            ln_q_1_2 := (ln_m_1+1)/2; --first quartile is 3 and 4.
            ln_q_2_1 := ln_m_1 + ln_q_1_1; --second quartile is 9 and 10.
            ln_q_2_2 := ln_m_1 + ln_q_1_2; --second quartile is 9 and 10.
        ELSE --median 1 is even, say 6.
            ln_q_1_1 := ln_m_1/2; --first quartile is 3.
            ln_q_1_2 := ln_q_1_1; --first quartile is 3.
            ln_q_2_1 := ln_m_1 + ln_q_1_1; --second quartile is 9.
            ln_q_2_2 := ln_q_2_1; --second quartile is 9.
        END IF; 
    ELSE--Even number of rows, e.g. 12 or 14.
        ln_m_1 := pn_count/2; --median 1 will be 6 or 7  
        ln_m_2 := ln_m_1 + 1; --median 2 will be 7 or 8 
        IF (MOD(ln_m_1, 2) = 1) THEN --median 1 is odd number, say 7 
            ln_q_1_1 := (ln_m_1+1)/2; --first quartile is 4.
            ln_q_1_2 := ln_q_1_1; --first quartile is 4.
            ln_q_2_1 := ln_m_1 + ln_q_1_1; --second quartile is 11
            ln_q_2_2 := ln_q_2_1; --second quartile is 11.
        ELSE --median 1 is even, say 6.
            ln_q_1_1 := ln_m_1/2; --first quartile is 3 and 4.
            ln_q_1_2 := ln_q_1_1+1; --first quartile is 3 and 4.
            ln_q_2_1 := ln_m_1 + ln_q_1_1; --second quartile is 9 and 10.
            ln_q_2_2 := ln_q_2_1 + 1; --second quartile is 9 and 10.
        END IF; 
    END IF;
    
    CASE pv_index_required
    WHEN 'M1' THEN ln_output := ln_m_1;
    WHEN 'M2' THEN ln_output := ln_m_2;
    WHEN 'Q11' THEN ln_output := ln_q_1_1;
    WHEN 'Q12' THEN ln_output := ln_q_1_2;
    WHEN 'Q21' THEN ln_output := ln_q_2_1;
    WHEN 'Q22' THEN ln_output := ln_q_2_2;
    END CASE;
    
    RETURN ln_output;
END get_quartile_index;
/

Tuesday, 23 November 2010

Scrum

Compiled based on the talk by Ken Schwaber at Google (http://www.youtube.com/watch?v=IyNPeTn8fpo)

Basic features of scrum:

Involves short sprints of durations ranging from 3 to 30 days.

At the end of each sprint the team has to deliver agreed functionality that is complete (coding standards followed, documentation produced, unit testing performed) and demonstrable (not abstract).

There are daily scrum meetings where the scrum master and product owners are observers only. The team members each say what they have done since the last meeting, what issues they are facing and what they plan to do next.

Roles involved:

Product Owner: Represents the customer, prioritises the requirements and gives feedback on delivered items.

Scrum Master (aka the Prick) who stands between the marketing and management people who believe that the deadlines can be met by working harder, and the dev team who tend to cut quality in order to meet the deadline. The Scrum Master's role is to ensure that the quality is maintained in all sprints, and not pass the sprint if the quality is not there (lack of documentation or unit testing or coding standards).

Team: Ideally cross-functional developers who are able to do each other's jobs.

Documents involved:

Product Backlog: List of requirements
Sprint Backlog: List of tasks being delivered in a given sprint, each task being worth one or two day's work.

Burn Down Chart: A chart with tasks on the Y axis and time on the X, showing how the outstanding tasks go down in time. The slope indicates the Sprint Velocity.

One big issue with old code is that such corner cutting has gone on for years and years and it slows down all development that depends on this core functionality that is not properly documented, tested and developed to standards.

QA people tend to make good scrum masters.

Thursday, 18 November 2010

External Tables

This is an alternative to using sql*Loader, .ctl files and batch files to load flat files into Oracle tables.

You define a directory on the database server:

create or replace directory external_dir as '/home/dev/tqi/data'

Give access to that folder to the oracle user, and then specify the structure of the flat file as follows, which then makes it look like a table you can select from!

create table x_emp
( empl_id varchar2(3),
last_name varchar2(50),
first_name varchar2(50)
)
organization external
( default directory external_dir
access parameters
( records delimited by newline
fields terminated by ','
)
location ('x_emp.csv')
);

Just put the file into this location and do this:

select * from x_emp