Cascading: October 2016

Flow, FlowConnector & Cascades : 101

Flow

Is a logical unit of work
Combines multiple MapReduce jobs
Is a combination of Source + pipes + Sink #52

FlowConnector

Is a class for supporting various platforms
Available FlowConnector's #54

LocalFlowConnector
HadoopFlowConnector
...

'Cascade' #55

Is a connected series of Flow's

Tap : 101

Source & Sink are specified in terms of Tap #51
Available Taps

Hfs
Lfs
Dfs
FileTap

Example

----- Example 1
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine(new Fields("department","salary" ));
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE);

Pipe : 101

Transformation is done through Pipe #40
branch : Chain of pipes without merge or split is called as branch #40
Pipe Operations #40

Filter
Function
Aggregator
Count
Buffer
Assertion

Examples

----- Example 1
Pipe firstPipe = new Pipe("main");
Pipe eachPipe = new Each(firstPipe, new MyFunction());
eachPipe = new Each(firstPipe, new MyFilter());

'Each' (given above in example) #41

Flows tuples one at a time through processing chain
Allows 'Function' or 'Filter' operation
----- Example 1
Pipe payroll =Pipe("payroll");
payroll = new Each(payroll, new calc_raise(), new Fields("name","division","salary","raise"));

Splitting a Pipe #42

----- Example 1
Pipe hrdata = new Each("hrdata", new Fields("name","address","phone"));
Pipe developers = new Each(hrdata, new GetDevelopers()); //This is a filter
Pipe managers = new Each(hrdata, new GetManagers()); // This is a filter

'GroupBy' Pipe (GroupBy & sorting) #44

----- Example 1
Pipe payroll = new Each("payroll", new Fields("division", "name", "salary", "rise"), new Identity());
// Group by division, sort by salary

Fields groupFields = new Fields( "division");

Fields sortFields = new Fields( "salary" );

Pipe assembly = new GroupBy( payroll, groupFields, sortFields );

'Every' Pipe #44

Operates on Groups of records (from GroupBy or CoGroup pipes)

'Merge' Pipe #46

To join multiple stream into a single stream

Join pipes #46

CoGroup Pipe
HashJoin Pipe

TupleEntry : 101

Is used as a wrapper for Tuple & Fields #39
----- Example 1

Fields selector = new Fields("name","ssn");
Tuple tuple = Tuple.size(2);
TupleEntry tupleEntry = new TupleEntry(selector, tuple);

Defining Scheme : 101

A Scheme contain field definition #34
Is also used to parse & transform data #36
----- Examples (Scheme, Fields & Coercion)

Scheme personScheme = new TextLine( new Fields( "first_name","last_name", "age") ); #35
Fields longFields = new Fields("age", "height", Long.class);
---
Fields[] nameFields = new Fields[] {new Fields("name"), new Fields("age")}; #35
Type[]typeFields = new Type[]{ String.class, Integer.class };
Scheme personScheme = new TextLine(new Fields(nameFields, typeFields));
---
----- Fluent Interface
Fields inFields = new Fields("name", "address", "phone", "age");
inFields.appplyType("age", long.class);

Preexisting schemes #35

NullScheme
TextLine

Used to read & write text files(gets offset<TAB>data) #36

TextDelimited

Used to read delimited text files #37

SequenceFile #38

Tuple, Fields & Data type coercion...

Tuple #31

cascading.tuple.Tuple
Data is represented as Tuple in Cascading
Elements in a Tuple is represented by Fields #32
----- Methods
size(): Tuple

Used to create a Tuple of given size #31

getObject(): Object

Returns the element at a given position #32
----- Related
getBoolean(), getString(), getFloat(), getDouble(), getInteger()

Fields #32

cascading.tuple.Fields
Provides storage for field metadata : names, types, comparators & type coercion
----- Example
Fields people = new Fields("first_name", "last_name")
----- Useful Field sets

Fields.ALL #34

A wild card that represents all the current available fields

Fields.UNKNOWN
Fields.ARGS
....

Can be used as both 'declarators' and 'selectors' #34

Data typing & coercion

For implicit conversion of one type to another
----- Example

pipe = new Coerce(new Fields("timestamp"), Long.class)
Fields simple = new Fields ("age", Long.class);

----- Methods

TupleEntry.getString()

Wednesday, October 12, 2016

Flow, FlowConnector & Cascades : 101

Tap : 101

Pipe : 101

TupleEntry : 101

Defining Scheme : 101

Tuple, Fields & Data type coercion...