Wednesday, October 12, 2016

Flow, FlowConnector & Cascades : 101


  1. Flow
    1. Is a logical unit of work
    2. Combines multiple MapReduce jobs
    3. Is a combination of Source + pipes + Sink #52
  2. FlowConnector
    1. Is the class that builds a Flow for a specific platform
    2. Available FlowConnectors #54
      1. LocalFlowConnector
      2. HadoopFlowConnector
      3. ...
  3. 'Cascade' #55
    1. Is a connected series of Flows
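
To make "Source + pipes + Sink" and "connected series of Flows" concrete, here is a minimal plain-Java sketch. It is an analogy only: `MiniFlow`, `runCascade`, `trimAll` and `upperAll` are hypothetical names invented for illustration and are not Cascading classes. It assumes that flows are connected by one flow's sink being another flow's source, and it runs each flow once its source data exists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Plain-Java analogy only: MiniFlow and runCascade are hypothetical, not Cascading classes.
final class FlowSketch {

    // A flow here is: named source dataset -> transformation (the "pipes") -> named sink dataset.
    static final class MiniFlow {
        final String source;
        final String sink;
        final UnaryOperator<List<String>> pipes;

        MiniFlow(String source, String sink, UnaryOperator<List<String>> pipes) {
            this.source = source;
            this.sink = sink;
            this.pipes = pipes;
        }
    }

    // Example transformation: trims every record.
    static List<String> trimAll(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String s : in) {
            out.add(s.trim());
        }
        return out;
    }

    // Example transformation: upper-cases every record.
    static List<String> upperAll(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String s : in) {
            out.add(s.toUpperCase());
        }
        return out;
    }

    // Runs every flow whose source already exists in the store, repeating until
    // all flows have run. Returns the sink names in the order they were written.
    static List<String> runCascade(List<MiniFlow> flows, Map<String, List<String>> store) {
        List<MiniFlow> pending = new ArrayList<>(flows);
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            boolean progressed = false;
            Iterator<MiniFlow> it = pending.iterator();
            while (it.hasNext()) {
                MiniFlow flow = it.next();
                if (store.containsKey(flow.source)) {
                    store.put(flow.sink, flow.pipes.apply(store.get(flow.source)));
                    order.add(flow.sink);
                    it.remove();
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("no runnable flow: a source dataset is missing");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> store = new HashMap<>();
        store.put("raw", Arrays.asList(" a ", "b "));
        // Declared out of order on purpose: the second flow must run first.
        List<MiniFlow> flows = Arrays.asList(
                new MiniFlow("clean", "upper", FlowSketch::upperAll),
                new MiniFlow("raw", "clean", FlowSketch::trimAll));
        System.out.println(runCascade(flows, store));
        System.out.println(store.get("upper"));
    }
}
```

The point of the sketch is only the dependency idea: the order in which flows are declared does not matter, because each one waits for its source.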

Tap : 101


  1. Source & Sink are specified in terms of Tap #51
  2. Available Taps
    1. Hfs
    2. Lfs
    3. Dfs
    4. FileTap
  3. Example
    1. ----- Example 1
    2. Scheme sourceScheme = new TextLine( new Fields( "line" ) ); 
    3. Tap source = new Hfs( sourceScheme, inputPath ); 
    4. Scheme sinkScheme = new TextLine(new Fields("department","salary" )); 
    5. Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE);

Pipe : 101


  1. Transformation is done through Pipe #40
  2. Branch : A chain of pipes without a merge or split is called a branch #40
  3. Pipe Operations #40
    1. Filter
    2. Function
    3. Aggregator
    4. Count (a built-in Aggregator)
    5. Buffer
    6. Assertion
  4. Examples
    1. ----- Example 1
    2. Pipe firstPipe = new Pipe("main"); 
    3. Pipe eachPipe = new Each(firstPipe, new MyFunction()); 
    4. eachPipe = new Each(eachPipe, new MyFilter());  // chain the filter after the function
  5. 'Each' (given above in example) #41
    1. Passes tuples one at a time through the processing chain
    2. Allows 'Function' or 'Filter' operation
    3. ----- Example 1
    4. Pipe payroll = new Pipe("payroll"); 
    5. payroll = new Each(payroll, new calc_raise(), new Fields("name","division","salary","raise")); 
  6. Splitting a Pipe #42
    1. ----- Example 1
    2. Pipe hrdata = new Each("hrdata", new Fields("name","address","phone"), new Identity()); 
    3. Pipe developers = new Each(hrdata, new GetDevelopers());  // This is a filter
    4. Pipe managers = new Each(hrdata, new GetManagers());  // This is a filter
  7. 'GroupBy' Pipe (GroupBy & sorting) #44
    1. ----- Example 1
    2. Pipe payroll = new Each("payroll", new Fields("division", "name", "salary", "raise"), new Identity()); 
    3. // Group by division, sort by salary 
    4. Fields groupFields = new Fields("division"); 
    5. Fields sortFields = new Fields("salary"); 
    6. Pipe assembly = new GroupBy(payroll, groupFields, sortFields); 
  8. 'Every' Pipe #44
    1. Operates on Groups of records (from GroupBy or CoGroup pipes)
  9. 'Merge' Pipe #46
    1. Merges multiple streams into a single stream
  10. Join pipes #46
    1. CoGroup Pipe
    2. HashJoin Pipe 
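
The difference between 'Each' (one tuple at a time), 'GroupBy' (grouping plus sorting) and 'Every' (one result per group) can be sketched in plain Java. This is an analogy under stated assumptions, not Cascading code: `PipeAnalogy`, `Row`, `eachFilter`, `groupBy` and `everyCount` are hypothetical names, and the payroll rows are made-up sample data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogy only: none of these names are Cascading classes.
final class PipeAnalogy {

    // Hypothetical payroll record, standing in for a Tuple.
    static final class Row {
        final String division;
        final String name;
        final long salary;

        Row(String division, String name, long salary) {
            this.division = division;
            this.name = name;
            this.salary = salary;
        }
    }

    // Like an Each with a Filter: looks at one record at a time and keeps or drops it.
    static List<Row> eachFilter(List<Row> in, long minSalary) {
        List<Row> out = new ArrayList<>();
        for (Row r : in) {
            if (r.salary >= minSalary) {
                out.add(r);
            }
        }
        return out;
    }

    // Like GroupBy with group field "division" and sort field "salary".
    static Map<String, List<Row>> groupBy(List<Row> in) {
        Map<String, List<Row>> groups = new TreeMap<>();
        for (Row r : in) {
            groups.computeIfAbsent(r.division, k -> new ArrayList<>()).add(r);
        }
        for (List<Row> group : groups.values()) {
            group.sort(Comparator.comparingLong((Row r) -> r.salary));
        }
        return groups;
    }

    // Like an Every with a counting Aggregator: emits one value per group.
    static Map<String, Integer> everyCount(Map<String, List<Row>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Row>> e : groups.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("eng", "ann", 120));
        rows.add(new Row("eng", "bob", 90));
        rows.add(new Row("ops", "cat", 70));
        rows.add(new Row("ops", "dan", 40));
        Map<String, List<Row>> groups = groupBy(eachFilter(rows, 50));
        System.out.println(everyCount(groups));
    }
}
```

Note that `everyCount` only makes sense after `groupBy`, which mirrors the rule that 'Every' operates on the output of GroupBy or CoGroup.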

TupleEntry : 101


  1. Is used as a wrapper for Tuple & Fields #39
  2. ----- Example 1
    1. Fields selector = new Fields("name","ssn"); 
    2. Tuple tuple = Tuple.size(2); 
    3. TupleEntry tupleEntry = new TupleEntry(selector, tuple); 
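
What "a wrapper for Tuple & Fields" buys you is access by name instead of by position. A minimal plain-Java sketch of that idea follows. `NamedTuple` is a hypothetical stand-in written for illustration; it is not `cascading.tuple.TupleEntry` and does not reproduce its behaviour beyond the name-to-position lookup.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only: NamedTuple is hypothetical, not cascading.tuple.TupleEntry.
final class NamedTuple {
    private final List<String> fields; // plays the role of Fields (names only)
    private final Object[] values;     // plays the role of Tuple (positional values)

    NamedTuple(List<String> fields, Object[] values) {
        if (fields.size() != values.length) {
            throw new IllegalArgumentException("field count and value count differ");
        }
        this.fields = fields;
        this.values = values;
    }

    // Positional access, as on a bare tuple.
    Object getObject(int pos) {
        return values[pos];
    }

    // Named access: resolve the name to a position, then read the value.
    Object getObject(String name) {
        int pos = fields.indexOf(name);
        if (pos < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values[pos];
    }

    // Convenience getter that renders the value as a String.
    String getString(String name) {
        Object value = getObject(name);
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        NamedTuple entry = new NamedTuple(Arrays.asList("name", "ssn"), new Object[] { "ann", 1234 });
        System.out.println(entry.getString("name"));
        System.out.println(entry.getString("ssn"));
    }
}
```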

Defining Scheme : 101



  1. A Scheme contains the field definitions #34
  2. Is also used to parse & transform data #36
  3. ----- Examples (Scheme, Fields & Coercion) 
    1. Scheme personScheme = new TextLine( new Fields( "first_name","last_name", "age") ); #35
    2. Fields longFields = new Fields("age", "height").applyTypes(Long.class, Long.class); 
    3. ---
    4. String[] nameFields = new String[] { "name", "age" }; #35
    5. Type[] typeFields = new Type[] { String.class, Integer.class }; 
    6. Scheme personScheme = new TextLine(new Fields(nameFields, typeFields)); 
    7. ---
    8. ----- Fluent Interface
    9. Fields inFields = new Fields("name", "address", "phone", "age"); 
    10. inFields = inFields.applyType("age", long.class); 
  4. Preexisting schemes #35
    1. NullScheme
    2. TextLine
      1. Used to read & write text files (on read, each record is offset<TAB>data) #36
    3. TextDelimited
      1. Used to read delimited text files #37
    4. SequenceFile #38
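
As a rough illustration of what "parse & transform" means for the two text schemes, the plain-Java sketch below turns raw text into (offset, line) records and splits one delimited line into named fields. It is not Cascading code: `SchemeSketch`, `textLine` and `textDelimited` are hypothetical names, and offsets are counted in characters, which equals bytes only under the assumption of ASCII input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java analogy only: these methods are hypothetical, not Cascading schemes.
final class SchemeSketch {

    // Like reading with a line-oriented scheme: each line becomes {offset, line},
    // where offset is the position at which the line starts.
    static List<Object[]> textLine(String content) {
        List<Object[]> records = new ArrayList<>();
        int start = 0;
        while (start < content.length()) {
            int end = content.indexOf('\n', start);
            if (end < 0) {
                end = content.length();
            }
            records.add(new Object[] { (long) start, content.substring(start, end) });
            start = end + 1;
        }
        return records;
    }

    // Like reading with a delimited scheme: splits a line and binds each part to a field name.
    static Map<String, String> textDelimited(String line, String[] fieldNames, String delimiter) {
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        if (parts.length != fieldNames.length) {
            throw new IllegalArgumentException(
                    "expected " + fieldNames.length + " fields but found " + parts.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            record.put(fieldNames[i], parts[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        for (Object[] record : textLine("ann,30\nbob,41\n")) {
            String line = (String) record[1];
            System.out.println(record[0] + "\t" + textDelimited(line, new String[] { "name", "age" }, ","));
        }
    }
}
```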

Tuple, Fields & Data type coercion...


  1. Tuple #31
    1. cascading.tuple.Tuple
    2. Data is represented as Tuple in Cascading
    3. Elements in a Tuple are described by Fields #32
    4. ----- Methods
    5. size(int): Tuple
      1. Static method used to create a Tuple of the given size #31
    6. getObject(int): Object
      1. Returns the element at the given position #32
      2. ----- Related
      3. getBoolean(), getString(), getFloat(), getDouble(), getInteger()
  2. Fields #32
    1. cascading.tuple.Fields
    2. Provides storage for field metadata : names, types, comparators & type coercion
    3. ----- Example
    4. Fields people = new Fields("first_name", "last_name");
    5. ----- Useful Field sets
      1. Fields.ALL #34
        1. A wildcard that represents all currently available fields
      2. Fields.UNKNOWN
      3. Fields.ARGS
      4. .... 
    6. Can be used as both 'declarators' and 'selectors' #34
  3. Data typing & coercion
    1. For implicit conversion of one type to another 
    2. ----- Example
      1. pipe = new Coerce(pipe, new Fields("timestamp"), Long.class); 
      2. Fields simple = new Fields("age", Long.class); 
    3. ----- Methods
      1. TupleEntry.getString()
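
The idea behind coercion, asking for a value "as" some type regardless of how it is stored, can be sketched in plain Java. This is a simplified illustration, not Cascading's actual coercion rules: `CoerceSketch.coerce` is a hypothetical helper, it supports only a handful of target types, and it returns null for null input as a simplifying assumption.

```java
// Plain-Java analogy only: CoerceSketch is hypothetical and far simpler than real coercion rules.
final class CoerceSketch {

    // Converts a value to the requested type, parsing from text when needed.
    static Object coerce(Object value, Class<?> type) {
        if (value == null) {
            return null; // simplifying assumption for this sketch
        }
        if (type == String.class) {
            return value.toString();
        }
        String text = value.toString().trim();
        if (type == Long.class || type == long.class) {
            if (value instanceof Number) {
                return Long.valueOf(((Number) value).longValue());
            }
            return Long.valueOf(text);
        }
        if (type == Integer.class || type == int.class) {
            if (value instanceof Number) {
                return Integer.valueOf(((Number) value).intValue());
            }
            return Integer.valueOf(text);
        }
        if (type == Double.class || type == double.class) {
            if (value instanceof Number) {
                return Double.valueOf(((Number) value).doubleValue());
            }
            return Double.valueOf(text);
        }
        if (type == Boolean.class || type == boolean.class) {
            return Boolean.valueOf(text);
        }
        throw new IllegalArgumentException("unsupported target type: " + type.getName());
    }

    public static void main(String[] args) {
        System.out.println(coerce("1476230400", Long.class));
        System.out.println(coerce(42, String.class));
    }
}
```

Typed getters such as getString() or getInteger() can be read as shorthand for this kind of call with a fixed target type.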