Welcome to the HadoopExam Spark Professional Training Courses.

CCA175 : Cloudera Hadoop and Spark Developer Certification Preparation


About Us and the Book :  This book is published by www.HadoopExam.com (HadoopExam Learning Resources), where you can find material and trainings for preparing for Big Data, Cloud Computing, Analytics, Data Science, and popular programming languages. This book contains 8 practice questions for preparing for the CCA175 certification exam. www.HadoopExam.com currently has a total of 95 solved problem scenarios, which you can get directly from the site. This book not only shows how to prepare for the CCA175 exam, but also gives you the platform details needed to practice the material. We currently provide, or are in the process of developing, the following material for Hadoop Big Data certification.
Training Material
Acknowledgement : We wish to thank our learners, who helped us reach this level in a short span (3.5 years). We have thousands of learners who have subscribed to our products and who continuously provide feedback on the material we create. Based on their feedback, we upgrade and update our material so that other learners can benefit from it. Without their feedback, we could not provide this material. Since there are thousands of learners providing feedback, we cannot list all their names here, but their combined effort has led to quality, low-cost material. We expect the same help from our network in the future. Please subscribe here if you want to keep in touch with HadoopExam Learning Resources.


Table of Contents

Chapter 1 : Installation of Cloudera QuickStart VM (Step By Step)

Chapter 2 : Syllabus of CCA175 (Hadoop and Spark Developer Certification)

Chapter 3 : Tips and Tricks to Clear CCA175 exam

Chapter 4 : Practice Scenario  : Copying Data from RDBMS(MySQL) to HDFS

Chapter 5 : Practice Scenario  : Playing with HDFS Commands for Various File Operations

Chapter 6 : Practice Scenario  : Selective Data Import using Sqoop from MySQL to HDFS

Chapter 7 : Practice Scenario  : Importing Data using Predicate

Chapter 8 : Practice Scenario  : Writing Hive DDL with Complex Data types

Chapter 9 : Practice Scenario  : Create Hive table with Given Data (Scenario 8)

Chapter 10 : Practice Scenario  : Joining DataSets (De-normalization)

Chapter 11 : Practice Scenario  : Apache Flume Introductions

Chapter 12 : Where to go from Here

Appendix : All material provided by www.HadoopExam.com

Appendix  : We are looking for Authors/Trainers


Chapter 1 :  
Installation of Cloudera QuickStart VM (Step By Step)
(Installation steps: as per the PDF given.)


Chapter 2 : Syllabus of CCA175 (Hadoop and Spark Developer Certification)

Required Skills for CCA175

Data Ingest

The skills to transfer data between external systems and your cluster. This includes the following:

  • Import data from a MySQL database into HDFS using Sqoop
  • Export data to a MySQL database from HDFS using Sqoop
  • Change the delimiter and file format of data during import using Sqoop
  • Ingest real-time and near-real time (NRT) streaming data into HDFS using Flume
  • Load data into and out of HDFS using the Hadoop File System (FS) commands

Transform, Stage, Store

Convert a set of data values in a given format stored in HDFS into new data values and/or a new data format, and write them into HDFS. This includes writing Spark applications in both Scala and Python (a minimal PySpark sketch follows this list):

  • Load data from HDFS and store results back to HDFS using Spark
  • Join disparate datasets together using Spark
  • Calculate aggregate statistics (e.g., average or sum) using Spark
  • Filter data into a smaller dataset using Spark
  • Write a query that produces ranked or sorted data using Spark
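
To make these skills concrete, here is a minimal PySpark sketch (a sketch only, assuming the pyspark shell with its built-in SparkContext sc on the QuickStart VM, a hypothetical tab-delimited file /user/cloudera/orders with columns order_id, order_date, customer_id, status, and a hypothetical output path):

# Load data from HDFS (hypothetical input path)
orders = sc.textFile("/user/cloudera/orders")

# Split each tab-delimited line into fields
parsed = orders.map(lambda line: line.split("\t"))

# Filter data into a smaller dataset: keep only completed orders
completed = parsed.filter(lambda fields: fields[3] == "COMPLETE")

# Calculate an aggregate statistic: number of completed orders per customer
counts = completed.map(lambda fields: (fields[2], 1)).reduceByKey(lambda a, b: a + b)

# Produce ranked/sorted data: highest count first
ranked = counts.sortBy(lambda pair: pair[1], ascending=False)

# Store results back to HDFS as tab-delimited text (hypothetical output path)
ranked.map(lambda pair: pair[0] + "\t" + str(pair[1])).saveAsTextFile("/user/cloudera/orders_per_customer")

Joining disparate datasets follows the same pattern: build (key, value) pair RDDs from each input and call join() on them.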

Data Analysis

Use Data Definition Language (DDL) to create tables in the Hive metastore for use by Hive and Impala.

  • Read and/or create a table in the Hive metastore in a given schema
  • Extract an Avro schema from a set of datafiles using avro-tools
  • Create a table in the Hive metastore using the Avro file format and an external schema file
  • Improve query performance by creating partitioned tables in the Hive metastore
  • Evolve an Avro schema by changing JSON files

Chapter 3 : Tips and Tricks to Clear CCA175 exam

1. Preparation: Please go through all the CCA175 questions and practice the code provided by http://www.HadoopExam.com. The content is based on the recent syllabus. (I have also gone through the entire Spark Professional training module.)

2. No. of Questions: Usually you will get 10 questions in the real exam. The topics covered are Sqoop, Hive, PySpark, Scala, and avro-tools for extracting schemas. (All question types are covered in the CCA175 Certification Simulator.)
3. Code Snippets: Snippets will be provided for PySpark and Scala, and you have to edit them as per the problem statement. You can write your answer in the language of your choice; however, the half-solved scenario is provided in one particular language. If you choose the same language it is faster to solve the problem; otherwise you have to write the program from scratch.

4. Real Exam Environment: A gateway node will be accessible for executing the problems during the exam. Keep in mind that there is no on-screen timer; you have to keep asking the proctor for the time left. Each problem has three sections:

·         Instructions

·         Data Set

·         Output Requirements.

·         Please go through all three sections carefully before you start developing the code.

·         Note: If you start developing code right after reading only the Instructions part of the question, you will realize later that exact details such as the table name and the HDFS directory are mentioned in the other sections. This can waste your time if you have to redo the code, and might even cost you a question.

5.       Editor: nano and gedit are not available, so if you have to edit any code snippets you must use vi. Please familiarize yourself with the vi editor if you are not already.

6.       Fill in the blanks: You don't have to write the entire Apache Spark code in Python or Scala; generally they ask you to fill in the blanks (template-based questions; see the sketch after this list).

7.       Apache Flume: There are very few questions on Flume, and you might not get one at all, but prepare Flume as well.

8.       Difficulty Level: If you have enough knowledge, you will find the exam quite easy. The questions are logically simple and can be answered on the first attempt if you read each question carefully (all three sections).

9.       Common mistake in Sqoop: People use localhost in the connection string, which is wrong; you have to use the full hostname given in the question instead of localhost (avoid wasting your time).

10.   Hive: Have basic knowledge of Hive as well.

11.   Spark: Use basic transformation functions to get the desired output, for instance filtering according to a particular scenario, sorting, ranking, etc.

12.   avro-tools: Use avro-tools to get the schema of an Avro file. (Very nicely covered in the CCA175 HadoopExam.com Simulator.)

13.   Big Mistake: Avoid accidentally deleting your data; good practice is necessary to avoid such mistakes. (Once you delete data or drop a Hive table, you have to create it entirely again.) The same is emphasized in the www.HadoopExam.com video sessions provided at http://cca175cloudera.training4exam.com/ (please go through the sample sessions).

14.   Spark SQL: They will not ask questions based on Spark SQL; focus instead on aggregate, reduce, and sort operations.

15.   Time management: This is very important. (That's the reason you need a lot of practice; use the CCA175 simulator to practice all the questions at least a week or two before your real exam.)

16.   Data sets: The data sets in the real exam are quite large, so execution can take 2 to 5 minutes.

17.   Attempts: Try to attempt all the questions, or at least 9 out of 10, so that you are able to score 70%.

18.   File format: In most questions there is a tab-delimited file to process.

19.   Python or Scala: You will get a preloaded Python or Scala file to work with. Earlier you did not have a choice of attempting a question via Scala or PySpark, but now you can choose. (I have gone through all the video sessions provided by www.HadoopExam.com.)

20.   Connection Issue: If you get disconnected during the exam, contact the proctor immediately. If he/she is not available, log back into examslocal.com and use their online help.

21.   Shell scripts: Have good hands-on experience with shell scripts.

22.   Question types as mentioned in the syllabus: Questions come from Sqoop (import and export), Hive (table creation and dynamic partitioning), PySpark and Scala (joining, sorting and filtering data), and avro-tools. Code snippets are provided for PySpark and Scala; you have to edit the snippets as per the problem statement and then run the script file (which is a separate file from the snippet) to get the results.

23.   Difficulty Level: Overall the exam is easy, but it requires a lot of practice to complete on time with accurate solutions. Hence go through all the material below for CCA175. (It will not take more than a month if you are new; if you already know Spark and Hadoop, 2-3 weeks are good enough.)
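
To give a feel for the template-based, fill-in-the-blanks format mentioned in the tips above, here is a hedged, purely hypothetical example of the kind of partially written PySpark snippet you may have to complete (the path, delimiter, and column positions below are assumptions for illustration, not taken from a real exam question):

# Provided template (hypothetical): print the 10 most expensive products.
# The parts a candidate would normally fill in are marked with comments.
products = sc.textFile("/user/cloudera/products")        # fill in: input path from the question
fields = products.map(lambda line: line.split("\t"))     # fill in: delimiter from the question
priced = fields.map(lambda f: (float(f[4]), f[2]))       # fill in: (price, product_name) column positions
top10 = priced.sortByKey(ascending=False).take(10)
for price, name in top10:
    print(name + "\t" + str(price))

In the real exam, read the Output Requirements section first so you know the exact directory, delimiter, and ordering expected before you start editing such a snippet.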





Chapter 4 : Practice Scenario  : Copying Data from RDBMS(MySQL) to HDFS

Problem Scenario : You have been given MySQL DB with following details. 

user=retail_dba 
password=cloudera 
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish following activities.

1. Connect to the MySQL DB and check the content of the tables.
2. Copy the "retail_db.categories" table to HDFS, without specifying a directory name.
3. Copy the "retail_db.categories" table to HDFS, into a directory named "categories_target".
4. Copy the "retail_db.categories" table to HDFS, into a warehouse directory named "categories_warehouse".
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Solution

Step 1 : Connecting to existing MySQL Database
mysql --user=retail_dba --password=cloudera retail_db

Step 2 : Show all the available tables
show tables;

Step 3 : View/Count data from a table in MySQL
select count(1) from categories;

Step 4 : Check the currently available data in HDFS directory
hdfs dfs -ls

Step 5 : Import Single table (Without specifying directory).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories

Note : Please make sure you do not have a space before or after the '=' sign.
Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.

Step 6 : Read the data from one of the partition, created using above command.
hdfs dfs -cat categories/part-m-00000

Step 7 : Specify the target directory in the import command (we are using number of mappers = 1; you can change it accordingly).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --target-dir=categories_target --m 1

Step 8 : Check the content in one of the partition file.
hdfs dfs -cat categories_target/part-m-00000

Step 9 : Specify a parent (warehouse) directory so that you can copy more than one table into it. Command to specify the warehouse directory:
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_warehouse --m 1

Step 10 : See the content in one of the files (partitions).
hdfs dfs -cat categories_warehouse/categories/part-m-00000



Chapter 5 : Practice Scenario  : Playing with HDFS Commands for Various File Operations


Problem Scenario  : There is a parent organization called "Acmeshell Group Inc", which has two child companies named QuickTechie Inc and HadoopExam Inc. Both companies' employee information is given in two separate text files, as below. Please do the following activities with the employee details.

quicktechie.txt
1,Alok,Hyderabad
2,Krish,Hongkong
3,Jyoti,Mumbai
4,Atul,Banglore
5,Ishan,Gurgaon

hadoopexam.txt
6,John,Newyork
7,alp2004,California
8,tellme,Mumbai
9,Gagan21,Pune
10,Mukesh,Chennai

1. Which command will you use to check all the available HDFS command-line options, and how will you get help for an individual command?
2. Create a new empty directory named Employee using the command line, and also create an empty file named quicktechie.txt in it.
3. Load both companies' employee data into the Employee directory (how do you overwrite an existing file in HDFS?).
4. Merge both employee files into a single file called MergedEmployee.txt; the merged file should have a newline character at the end of each file's content.
5. Upload the merged file to HDFS and change its permissions so that the owner and group members can read and write, and other users can read the file.
6. Write a command to export an individual file as well as the entire directory from HDFS to the local file system.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Solution

Step 1 : Check all available commands
hdfs dfs

Step 2 : Get help on Individual command
hdfs dfs -help get

Step 3 : Create a directory in HDFS named Employee, and create an empty file in it called quicktechie.txt.
hdfs dfs -mkdir Employee

Now create an empty file in the Employee directory using Hue (as shown in the video).

Step 4 : Create a directory on the local file system and then create the two files with the data given in the problem.

Step 5 : Now that we have an existing directory with content in it, overwrite this existing Employee directory in HDFS while copying the files from the local file system.
cd /home/cloudera/Desktop/
hdfs dfs -put -f Employee

Step 6 : Check that all files in the directory were copied successfully.
hdfs dfs -ls Employee

Step 7 : Now merge all the files in Employee directory.
hdfs dfs -getmerge -nl Employee MergedEmployee.txt

Step 8 :  Check the content of the file.
cat MergedEmployee.txt

Step 9 : Copy the merged file from the local file system into the Employee directory on HDFS.
hdfs dfs -put MergedEmployee.txt Employee/

Step 10 : Check whether the file was copied or not.
hdfs dfs -ls Employee

Step 11 : Change the permission of the merged file on HDFS
hdfs dfs -chmod 664 Employee/MergedEmployee.txt

Step 12  : Get the file from HDFS to local file system. 
hdfs dfs -get Employee Employee_hdfs



Chapter 6 : Practice Scenario  : Selective Data Import using Sqoop from MySQL to HDFS


Problem Scenario : You have been given MySQL DB with following details. 

user=retail_dba 
password=cloudera 
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish following activities.

1. Import data from the categories table, where category_id=22 (data should be stored in categories_subset).
2. Import data from the categories table, where category_id>22 (data should be stored in categories_subset_2).
3. Import data from the categories table, where category_id is between 1 and 22 (data should be stored in categories_subset_3).
4. While importing the categories data, change the delimiter to '|' (data should be stored in categories_subset_6).
5. Import data from the categories table and restrict the import to the category_name and category_id columns only, with '|' as the delimiter.
6. Add null values in the table using the SQL statements below:
ALTER TABLE categories modify category_department_id int(11);
INSERT INTO categories values (60,NULL,'TESTING');
7. Import data from the categories table (into the categories_subset_17 directory) using the '|' delimiter, with category_id between 1 and 61, and encode null values for both string and non-string columns.
8. Import the entire retail_db schema into a directory categories_subset_all_tables.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Solution : 

Step 1 : Import a single table (subset of data). Note: here the ` character (backtick) is the same key as ~.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset --where \`category_id\`=22 --m 1

Step 2 : Check the output partition
hdfs dfs -cat categories_subset/categories/part-m-00000

Step 3 : Change the selection criteria (Subset data)
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db  --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_2 --where \`category_id\`\>22 --m 1

Step 4 : Check the output partition
hdfs dfs -cat categories_subset_2/categories/part-m-00000

Step 5 : Use between clause (Subset data)
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_3 --where "\`category_id\` between 1 and 22" --m 1

Step 6 : Check the output partition
hdfs dfs -cat categories_subset_3/categories/part-m-00000

Step 7 : Changing the delimiter during import.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_6 --where "\`category_id\` between 1 and 22" --fields-terminated-by='|' --m 1

Step 8 : Check the output partition
hdfs dfs -cat categories_subset_6/categories/part-m-00000

Step 9 : Selecting subset columns
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_col --where "\`category_id\` between 1 and 22" --fields-terminated-by='|' --columns=category_name,category_id --m 1

Step 10 : Check the output partition
hdfs dfs -cat categories_subset_col/categories/part-m-00000

Step 11 : Inserting record with null values (Using mysql)
ALTER TABLE categories modify category_department_id int(11);
INSERT INTO categories values (60,NULL,'TESTING');
select * from categories;

Step 12 : Encode non string null column
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_17 --where "\`category_id\` between 1 and 61" --fields-terminated-by='|' --null-string='N' --null-non-string='N' --m 1

Step 13 : View the content
hdfs dfs -cat categories_subset_17/categories/part-m-00000

Step 14 : Import all the tables from the schema (this step will take a little time).
sqoop import-all-tables --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --warehouse-dir=categories_subset_all_tables --m 1

Step 15 : View the contents
hdfs dfs -ls categories_subset_all_tables

Step 16 : Clean up, i.e. revert back to the original data.
delete from categories where category_id in (59,60);
ALTER TABLE categories modify category_department_id int(11) NOT NULL;
ALTER TABLE categories modify category_name varchar(45) NOT NULL;
desc categories;


Chapter 7 : Practice Scenario : Importing Data using Predicate


Problem Scenario : You have been given MySQL DB with following details. 

user=retail_dba 
password=cloudera 
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish following activities.

Import a single table, categories (subset of data), into a Hive managed table, where category_id is between 1 and 22.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Solution

Step 1 : Import Single table (Subset data)

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --where "\`category_id\` between 1 and 22" --hive-import --m 1

Note: here the ` character (backtick) is the same key as ~.
This command will create a managed table, and its content will be created in the following directory:
/user/hive/warehouse/categories

Step 2 : Check whether table is created or not (In Hive)
show tables;
select * from categories;


Chapter 8 : Practice Scenario  : Writing Hive DDL with Complex Data types


Problem Scenario : You have been given a file in the following data format. Each field is separated by '|'.

Name|Sex|Age|Father_Name

Example Record

Anupam|Male|45|Daulat

Create a Hive database named "Family" with the following details. You must take care that if the database already exists, it should not be created again.

Comment : "This database will be used for collecting various family data and their daily habits"
Data File Location : '/hdfs/family'
Store other properties : "'Database creator'='Vedic'" , "'Database_Created_On'='2016-01-01'"

Also write a command to check whether the database has been created with the new properties or not.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Solution :  

Step 1 : Create database

CREATE DATABASE IF NOT EXISTS Family
COMMENT 'This database will be used for collecting various family data and their daily habits'
LOCATION '/hdfs/family'
WITH DBPROPERTIES ('Database creator'='Vedic','Database_Created_On'='2016-01-01');

Step 2 :  Check the database
SHOW DATABASES;
DESCRIBE DATABASE extended  Family;

Chapter 9 : Practice Scenario  : Create Hive table with Given Data (Scenario 8)

Problem Scenario  : You have been given a file in the following data format. Each field is separated by '|'.

Name|Sex|Age|Father_Name

Example Record

Anupam|Male|45|Daulat

Create a Hive table named "Family_Head" with the following details.

- The table must be created in the existing database named "Family".
- You must take care that if the table already exists, it should not be created again.
- The table must be created inside the Hive warehouse directory and should not be an external table.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Solution :

Step 1. Use the existing database 
USE Family;

Step 2. Create table 

CREATE TABLE IF NOT EXISTS Family_Head
 (
name string,
business_places ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
fatherName_NuOfChild MAP<string,int>
 )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

Step 3: Describe the created table

DESC Family_Head;

Chapter 10 : Practice Scenario  : Joining DataSets (De-normalization)

Problem Scenario :  You have been given following mysql database details as well as other info.

user=retail_dba 
password=cloudera 
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish following.

1. Import the joined result of the orders and order_items tables, joined on orders.order_id = order_items.order_item_order_id.
2. Also make sure the imported data is partitioned into 2 files, e.g. part-00000 and part-00001.
3. Also make sure you use the order_id column for Sqoop's boundary conditions (split-by).

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Solutions

Step 1 : Clean the HDFS file system; if these directories exist, remove them.

hadoop fs -rm -R departments
hadoop fs -rm -R categories
hadoop fs -rm -R products
hadoop fs -rm -R orders 
hadoop fs -rm -R order_items
hadoop fs -rm -R customers

Step 2 : Now import the joined data as per the requirement.
sqoop import \
  --connect  jdbc:mysql://quickstart:3306/retail_db \
  --username=retail_dba \
  --password=cloudera \
  --query="select * from orders join order_items on orders.order_id = order_items.order_item_order_id where \$CONDITIONS" \
  --target-dir /user/cloudera/order_join \
  --split-by order_id \
  --num-mappers 2

 Step 3 : Check imported data.

hdfs dfs -ls order_join
hdfs dfs -cat order_join/part-m-00000
hdfs dfs -cat order_join/part-m-00001

Chapter 11 : Practice Scenario : Data evaluation using Sqoop

Problem Scenario  : You have been given following mysql database details.

user=retail_dba 
password=cloudera 
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish following activities.

1. List all the tables in retail_db using a sqoop command.
2. Write a simple sqoop eval command to check whether you have permission to read the database tables or not.
3. Import all the tables as Avro files into /user/hive/warehouse/retail_stage.db.
4. Import the departments table as a text file into /user/cloudera/departments.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Solution

Step 1 : List tables using sqoop

sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db  --username retail_dba --password cloudera 

Step 2 : Eval command; just run a count query on one of the tables.

sqoop eval \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username retail_dba \
  --password cloudera \
  --query "select count(1) from order_items"

Step 3 : Import all the tables as Avro files.

sqoop import-all-tables \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username=retail_dba \
  --password=cloudera \
  --as-avrodatafile \
  --warehouse-dir=/user/hive/warehouse/retail_stage.db \
  -m 1

Step 4 : Import departments table as a text file in /user/cloudera/departments
sqoop import \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --as-textfile \
  --target-dir=/user/cloudera/departments

Step 5 : Verify the imported data.

hdfs dfs -ls /user/cloudera/departments
hdfs dfs -ls /user/hive/warehouse/retail_stage.db
hdfs dfs -ls  /user/hive/warehouse/retail_stage.db/products



Installation of Python : The best way to learn Python is to install the Anaconda distribution.

Chapter 1
=================

-- Functions : Python has lots of built-in functions that you can use right away, simply by 'calling' them.
print ('Hello Data Scientist')
print 'Hello Data Scientist'  # Python 2 syntax

-- Interactive arithmetic operations
100+200
102/2
20.0/2

-- Exponentiation using **
3**3 

-- Find the type of value as below.
type(100)
type('Hello Data Scientist')
type("Hello Data Scientist")
type(80.0/2) #float
type('2') #String

-- What is a value in Python?
All of the below are values.
2, 'Hello Data Scientist', "Hello Data Scientist", '2'

-- Important : Comma-separated digits are treated as a sequence (tuple) of integers, not as a single number.
1,00,000
(1, 0, 0)


Chapter 2
===================

-- Variable : A variable is a name that refers to a value. For example:

name='John'

>>> name
'John'

-- Valid Variable Names and assignment Statement : Assign values to a variable
name='John'
message="Welcome to Python Learning for Data Scientist"

Notes : Variable names can be as long as you like. They can contain both letters and numbers, but they can't begin with a number. It is legal to use uppercase letters, but it is conventional to use only lower case for variable names.

-- Examples of invalid variable names
1name='john'
name@="john"
try='hello' #Keyword used

-- What is an expression in Python?
Ans: An expression is a combination of values, variables, and operators.

name+' Welcome to Python Learning for Data Scientist' # Variable + value

welcome=name+' Welcome to Python Learning for Data Scientist'  #Assignment statement
print(welcome) #Print statement

Notes : When you type an expression at the prompt, the interpreter evaluates it, which means that it finds the value of the expression. 

-- String Operations
- Use + operator for String concatenation
 
'Hello' + ' John' 

Another way

name='John'
message='Hello '
message+name + ' Welcome to Python Learning for Data Scientist'

-  Use the * operator for string repetition
 
'John '*7 

-- Use '#' for Comment

'John '*7  # repeat name 7 times

Chapter 3
=====================
-- Functions : A function is a collection of statements, and you can find functions in almost all programming languages. You declare or define a function once and use it in future code.

Examples of calling functions:
print(name)   # name was defined above; print itself needs no declaration
type(2)

Arguments and return values : The name of the function is type or print. The expression in parentheses is called the argument of the function. The result, for this function, is the type of the argument. It is common to say that a function takes one or more arguments and returns a result. The result is also called the return value. A function always returns a single value.

Some more examples of functions are:
int(1.11111)
int(-1.1)
float(3**3)  # Converting an integer to a float
float('7')   # Converting a string to a float
str(2)       # Converting an integer to a string

-- Mathematical functions : Python has a math module that provides most of the familiar mathematical functions. A module is a file that contains a collection of related functions.

Before we can use the functions in a module, we have to import it with an import statement:

import math #Importing math module

Module : A module has functions and variables defined in it. If you want to access a function from a module, you use the module name and the function name separated by '.'. This format is called dot notation.

math.log10(100)  # produces 2.0
math.pi
math.sqrt(100) 

-- Composition
One of the most useful features of programming languages is their ability to take small building blocks and compose them. Almost anywhere you can put a value, you can put an arbitrary expression, with one exception: the left side of an assignment statement has to be a variable name.

print(10+2**2-100+20)

using function call as an argument : 
print(math.log10(100)*2)

-- Defining your own function : A function definition specifies the name of a new function and the sequence of statements that run when the function is called

def printMessage():
    print('Welcome ')
    print('Python ')
    print('Learning ')
    print('For ')
    print('Data Scientist ')

Now call our function:
printMessage()

def: is a keyword that indicates that this is a function definition 
printMessage: the name of the function is printMessage.
Arguments : The empty parentheses after the name indicate that this function doesn’t take any arguments.
header : The first line of the function definition is called the header.
Body : Other than header all other part is called body.

- The header has to end with a colon and the body has to be indented. By convention, indentation is always four spaces. The body can contain any number of statements.

- In the interactive interpreter, you enter an empty line to end the function definition.

-- Entire program


def printMessage():
    print('Welcome ')
    print('Python ')
    print('Learning ')
    print('For ')
    print('Data Scientist ')

def printMessageTwice():
    printMessage()
    print()
    printMessage()

printMessageTwice()


- You have to define a function before you can call it.

-- Functions with arguments : 

#Single argument function
def doItTenTimes(value):
    print(10*value)

doItTenTimes(7)
doItTenTimes('John')
doItTenTimes(math.log10(100))

The argument is evaluated before the function is called. In above function first math.log10(100) will be evaluated.

#Two argument function

def doAdd(a,b):
    print(a+b)

doAdd(2,3)
doAdd('Welcome', ' John')
doAdd(3, ' John')  # TypeError: cannot add an int and a string

#Another example of multiple arguments.

def printName(u1, u2):
    print(u1 + 'and ' + u2)

printName(' John ' ,' Mike')

Scope of arguments and parameters : Any variables you create inside a function are local to that function and cannot be used once you exit the function.

def printName(user1, user2):
    localMessage = user1 + ' and ' + user2
    print(localMessage)

The local variable 'localMessage' will not be available once the function completes.

Function return values : None of the functions we have created so far (doAdd, doItTenTimes, printName, etc.) return a value; they only print messages.

Example :

x = doAdd(2, 3)

You can see that x has no useful value assigned (it is None), because doAdd only prints a message and does not return a value.

y = math.sqrt(9)

Now check that y has the value 3.0 assigned, because the math.sqrt() function returns a value. Functions which do not return a value are said to have a void return type (they return nothing, i.e. None).
Generally you create functions because you define them once and then use them in many places.


-- 
function object: A value created by a function definition. The name of the function is a variable that refers to a function object.
parameter: A name used inside a function to refer to the value passed as an argument.
argument: A value provided to a function when the function is called. This value is assigned
to the corresponding parameter in the function.
return value: The result of a function. If a function call is used as an expression, the return value is the value of the expression.
fruitful function: A function that returns a value.
void function: A function that always returns None.
module: A file that contains a collection of related functions and other definitions
module object: A value created by an import statement that provides access to the values defined in a module.

==================================
Chapter 4 : 
===================================

Turtle : It's a graphics module of Python.

import turtle
bob = turtle.Turtle()

When you run this code, it should create a new window with small arrow that represents the turtle.

Create a file named mypolygon.py and type in the following code:
import turtle
bob = turtle.Turtle()
print(bob)
turtle.mainloop()

-- mainloop tells the window to wait for the user to do something.
-- Once you create a Turtle, you can call a method to move it around the window. A method is similar to a function, but it uses slightly different syntax. For example, to move the turtle
forward:

bob.fd(100)

-- you are asking bob to move forward (The argument of fd is a distance in pixels, so the actual size depends on your display.)
-- Other methods you can call on a Turtle are bk to move backward, lt for left turn, and rt for right turn. The argument for lt and rt is an angle in degrees.

To draw a right angle, add these lines to the program (after creating bob and before calling
mainloop):
bob.fd(100)
bob.lt(90)
bob.fd(100)

-- For loop
for i in range(4):
    print('Hello!')

-- Create a Square using for loop
for i in range(4):
    bob.fd(100)
    bob.lt(90)

The syntax of a for statement is similar to a function definition. It has a header that ends with a colon and an indented body. The body can contain any number of statements.

-- Encapsulation : put your square-drawing code into a function definition and then call the function, passing the turtle as a parameter.
def square(t):
    for i in range(4):
        t.fd(100)
        t.lt(90)

square(bob)

Inside the function, t refers to the same turtle bob, so t.lt(90) has the same effect as bob.lt(90). In that case, why not call the parameter bob? The idea is that t can be any
turtle, not just bob, so you could create a second turtle and pass it as an argument to square:

alice = turtle.Turtle()
square(alice)

Wrapping a piece of code up in a function is called encapsulation. One of the benefits of encapsulation is that it attaches a name to the code, which serves as a kind of documentation.
Another advantage is that if you re-use the code, it is more concise to call a function twice than to copy and paste the body!

-- Generalization : The next step is to add a length parameter to square. Here is a solution:

def square(t, length):
    for i in range(4):
        t.fd(length)
        t.lt(90)

square(bob, 100)

Adding a parameter to a function is called generalization because it makes the function more general: in the previous version, the square is always the same size; in this version it
can be any size.

-- The next step is also a generalization. Instead of drawing squares, polygon draws regular polygons with any number of sides. Here is a solution:

def polygon(t, n, length):
    angle = 360 / n
    for i in range(n):
        t.fd(length)
        t.lt(angle)

polygon(bob, 7, 70)

keyword arguments : 
When a function has more than a few numeric arguments, it is easy to forget what they are, or what order they should be in. In that case it is often a good idea to include the names of
the parameters in the argument list:

polygon(bob, n=7, length=70)

when you call a function, the arguments are assigned to the parameters.

-- Interface design

The next step is to write circle, which takes a radius, r, as a parameter. Here is a simple solution that uses polygon to draw a 50-sided polygon:

import math

def circle(t, r):
    circumference = 2 * math.pi * r
    n = 50
    length = circumference / n
    polygon(t, n, length)

n is the number of line segments in our approximation of a circle, so length is the length of each segment. Thus, polygon draws a 50-sided polygon that approximates a circle with radius r.

One limitation of this solution is that n is a constant, which means that for very big circles, the line segments are too long, and for small circles, we waste time drawing very small
segments. One solution would be to generalize the function by taking n as a parameter. This would give the user (whoever calls circle) more control, but the interface would be less clean.

The interface of a function is a summary of how it is used: what are the parameters? What does the function do? And what is the return value? An interface is “clean” if it allows the
caller to do what they want without dealing with unnecessary details.

In this example, r belongs in the interface because it specifies the circle to be drawn. n is less appropriate because it pertains to the details of how the circle should be rendered.

Rather than clutter up the interface, it is better to choose an appropriate value of n depending on circumference:

def circle(t, r):
    circumference = 2 * math.pi * r
    n = int(circumference / 3) + 1
    length = circumference / n
    polygon(t, n, length)

Now the number of segments is an integer near circumference/3, so the length of each segment is approximately 3, which is small enough that the circles look good, but big enough to be efficient, and acceptable for any size circle.

-- Refactoring : rearranging a working program to improve function interfaces and code reuse without changing what it does (see the glossary below); a hedged sketch of refactoring the circle/polygon code follows.
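
As one illustration (a sketch based on the polygon and circle functions above, not code from the original text), the drawing functions can be refactored so that a single general helper, polyline, does all the drawing and the other functions just call it:

import math
import turtle

def polyline(t, n, length, angle):
    # Draw n line segments of the given length, turning angle degrees between them.
    for i in range(n):
        t.fd(length)
        t.lt(angle)

def polygon(t, n, length):
    polyline(t, n, length, 360 / n)

def arc(t, r, angle):
    # Approximate an arc of the given radius and angle with short segments.
    arc_length = 2 * math.pi * r * angle / 360
    n = int(arc_length / 3) + 1
    polyline(t, n, arc_length / n, angle / n)

def circle(t, r):
    arc(t, r, 360)

bob = turtle.Turtle()
circle(bob, 75)
turtle.mainloop()

The behavior is unchanged, but each function now has a cleaner interface and the segment-drawing loop exists in only one place.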
-- docstring : 

A docstring is a string at the beginning of a function that explains the interface (“doc” is short for “documentation”). Here is an example:

def polyline(t, n, length, angle):
    """Draws n line segments with the given length and
    angle (in degrees) between them. t is a turtle.
    """
    for i in range(n):
        t.fd(length)
        t.lt(angle)

-- An interface is like a contract between a function and a caller. The caller agrees to provide certain parameters and the function agrees to do certain work.
-- For example, polyline requires four arguments: t has to be a Turtle; n has to be an integer; length should be a positive number; and angle has to be a number, which is understood
to be in degrees.

method: A function that is associated with an object and called using dot notation.
encapsulation: The process of transforming a sequence of statements into a function definition.
generalization: The process of replacing something unnecessarily specific (like a number) with something appropriately general (like a variable or parameter).
interface: A description of how to use a function, including the name and descriptions of the arguments and return value.
refactoring: The process of modifying a working program to improve function interfaces and other qualities of the code.

==========================
Chapter 5
==========================

-- The floor division operator, //, divides two numbers and rounds down to an integer.
minutes = 105
minutes / 60
1.75

hours = minutes // 60
hours
1

remainder = minutes - hours * 60

-- The modulus operator, %, gives the remainder:
remainder = minutes % 60

-- In Python 3 the division operator, /, always performs floating-point (true) division; use the floor division operator, //, when you want the result rounded down to an integer.

-- Boolean expressions
5 == 5
True
5 == 6
False

-- True and False are special values that belong to the type bool; they are not strings:
type(True)

-- Relational Operator
======================
The == operator is one of the relational operators; the others are:
x != y # x is not equal to y
x > y # x is greater than y
x < y # x is less than y
x >= y # x is greater than or equal to y
x <= y # x is less than or equal to y

-- Remember that = is an assignment

Logical operators
======================
-- x > 0 and x < 10 is true only if x is greater than 0 and less than 10.
-- n%2 == 0 or n%3 == 0 is true if either or both of the conditions is true, that is, if the number is divisible by 2 or 3.
-- Any nonzero number is interpreted as True: 42 and True => True

Conditional execution
=====================
if x > 0:
    print('x is positive')
Alternative execution
=====================
if x % 2 == 0:
    print('x is even')
else:
    print('x is odd')
Chained conditionals
====================
if x < y:
    print('x is less than y')
elif x > y:
    print('x is greater than y')
else:
    print('x and y are equal')
-- elif is an abbreviation of “else if”.
-- If there is an else clause, it has to be at the end, but there doesn’t have to be one.

if choice == 'a':
    draw_a()
elif choice == 'b':
    draw_b()
elif choice == 'c':
    draw_c()
-- Each condition is checked in order. If the first is false, the next is checked, and so on.
-- Even if more than one condition is true, only the first true branch runs.
-- For conditions like 0 < x and x < 10, Python provides a more concise option:
if 0 < x < 10:
    print('x is a positive single-digit number.')
Recursion : a function can call itself
=====================================
def countdown(n):
    if n <= 0:
        print('Blastoff!')
    else:
        print(n)
        countdown(n-1)

Infinite Recursion
===================
def recurse():
    recurse()
Input from Keyboard : input  function
===================
text = input()
text

name = input('What...is your name?\n')

Chapter 6 : Fruitful functions : functions which return a value
======================================================

Example of return values

from numpy import add
a=add(1,2) #Return a value 

-- If a function does not return a value, it is said to return void (in Python it actually returns None).

-- Define a function which returns a value.

def calculateTotalSalary(salary,bonus):
    total = salary + (salary*bonus)/100
    return total

-- Calling a function which returns a value
totalSalary = calculateTotalSalary(100000,12)

=== Another way of returning values from a function

def calculateTotalSalary(salary,bonus):
    return salary + (salary*bonus)/100

== Using conditions
def calculateTotalSalary(salary,bonus):
    if salary < 100000:
        total = salary + (salary*bonus)/100 + 5000
    else:
        total = salary + (salary*bonus)/100
    return total

== Wrong way of using conditions (if salary is exactly 100000, neither condition is true and the function returns 0)
def calculateTotalSalary(salary,bonus):
    total=0
    if salary < 100000:
        total = salary + (salary*bonus)/100 + 5000
    if salary > 100000:
        total = salary + (salary*bonus)/100
    return total

== call the function
print(calculateTotalSalary(100000,12))

=== Quadratic equation

ax^2 + bx + c = 0

There are two solutions for given values of a, b, c:
x = (-b +/- sqrt(b^2 - 4ac)) / (2a)

# import complex math module
import cmath

a = 1
b = 3
c = 4

distance = (b**2) - (4*a*c)

x1 = (-b-cmath.sqrt(distance))/(2*a)
x2 = (-b+cmath.sqrt(distance))/(2*a)

print(x1)
print(x2)

=======================
Boolean functions : Functions which return either True or False

str1='Hello and Welcome'
str1.__contains__('Hello')

def myContain(str1, str2):
    if str1 != '' and str2 != '':
        return str1.__contains__(str2)
    else:
        return False

-- Call function
myContain('Hello and Welcome','Hello')
myContain('Hello and Welcome','')
myContain('','')
==================
Chapter 7 : Looping

def printSequenceUpto100(n):
    while n <= 100:
        print(n)
        n = n + 1
    print("Thanks we are done")

printSequenceUpto100(91)

-- break statement

def printSequenceUpto100(n):
    while n <= 100:
        print(n)
        n = n + 1
        break
    print("Thanks we are done")

printSequenceUpto100(91)

def printSequenceUpto100(n):
    while n <= 100:
        print(n)
        n = n + 1
        if n % 2 == 0:
            break
    print("Thanks we are done")

printSequenceUpto100(91)

-- Continue statement

def printOddValuesUpto100(n):
    while n <= 100:
        n = n + 1
        if n % 2 == 0:
            continue
        print(n)
    print("Thanks we are done")

printOddValuesUpto100(91)

-- Infinite loop

def doSomeInsignificant(n):
    while True:
        n = n + 1
        print(n)

doSomeInsignificant(91)

===============
Chapter 8 : String

-- string is a sequence, which means it is an ordered collection of other values.
-- A string is a sequence of characters.

name = 'Work Done !'
name[0]
name[1]

-- Finding length of the string
len(name)
-- Error
name[11]
-- Correct way
name[len(name)-1] 

-- String traversal

for c in name:
    print(c)

i = 0
while i < len(name):
    print(name[i])
    i = i + 1

-- String Slicing
name[3:7]
name[:3]
name[3:]
name[3:3]
name[:]

=== Immutable String ==========

name[2]='e'   # TypeError: strings are immutable
name = 'Yes that is possible'   # but you can rebind the name to a new string object

===========================
Methods of String

name.upper()
name.find('o')
name.find('E')
name.find('is')

=============================
String In Operator

'a' in name
'ye' in name
'No' in name

=========================
String Comparison

name='john'
if name == 'john':
    print('All right, ' + name)

-- Corrected comparison example
name='Hello'
if name < 'Welcome':
    print(name + ' comes before Welcome.')
elif name > 'Welcome':
    print(name + ' comes after Welcome.')
else:
    print('All right, ' + name)


sequence: An ordered collection of values where each value is identified by an integer index.
item: One of the values in a sequence.

=============
Chapter 9 

Read file

fin = open('c:/test/mywords.txt','r')
fin = open('c:\\test\\mywords.txt','r') #This syntax should also be correct
for line in fin:
    word = line.strip()
    print(word)
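
As a hedged aside (not part of the original text), the same file can also be read using a with statement, which closes the file automatically even if an error occurs while reading:

# 'with' closes the file automatically when the block ends
with open('c:/test/mywords.txt', 'r') as fin:
    for line in fin:
        word = line.strip()
        print(word)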

====================

=====================
Chapter 13 : Random numbers (update the code as per your understanding)
=====================

-- This chapter is best taken as part of statistics.
-- The random function from the random module returns a random float between 0.0 and 1.0 (including 0.0 but not 1.0).

import random
for i in range(10):
    x = random.random()
    print(x)

-- The function randint takes parameters low and high and returns an integer between low and high (including both).

-- Selecting a random element from a sequence:
t = [1, 2, 3]
random.choice(t)

-- The random module also provides functions to generate random values from continuous distributions including Gaussian, exponential, gamma, and a few more.

=====================
Chapter 14 : Files
=====================

1. File in write mode

fout = open('C:/test/myoutdata.txt', 'w')

Note : If the file already exists, opening it in write mode clears out the old data and starts fresh, so be careful! If the file doesn't exist, a new one is created.

-- open returns a file object that provides methods for working with the file. The write method puts data into the file.

line1 = "This here's the wattle,\n"
fout.write(line1)
line2 = "the emblem of our land.\n"
fout.write(line2)

-- When you are done writing, you should close the file.
fout.close()
-- If you don't close the file, it gets closed for you when the program ends.

2. Format operator

-- The argument of write has to be a string, so if we want to put other values in a file, we have to convert them to strings. The easiest way to do that is with str:
x = 52
fout.write(str(x))

-- An alternative is to use the format operator, %. When applied to integers, % is the modulus operator. But when the first operand is a string, % is the format operator.
-- The first operand is the format string, which contains one or more format sequences, which specify how the second operand is formatted. The result is a string.
-- For example, the format sequence '%d' means that the second operand should be formatted as a decimal integer:
camels = 42
'%d' % camels
-- The result is the string '42', which is not to be confused with the integer value 42.
-- A format sequence can appear anywhere in the string, so you can embed a value in a sentence:
'I have spotted %d camels.' % camels
-- If there is more than one format sequence in the string, the second argument has to be a tuple. Each format sequence is matched with an element of the tuple, in order.
-- The following example uses '%d' to format an integer, '%g' to format a floating-point number, and '%s' to format a string:
'In %d years I have spotted %g %s.' % (3, 0.1, 'camels')
'In 3 years I have spotted 0.1 camels.'
-- The number of elements in the tuple has to match the number of format sequences in the string. Also, the types of the elements have to match the format sequences (both of the following raise errors):
'%d %d %d' % (1, 2)
'%d' % 'dollars'

3. The os module

-- The os module provides functions for working with files and directories.
-- os.getcwd returns the name of the current directory:
import os
cwd = os.getcwd()
cwd
-- Absolute path:
os.path.abspath('myoutdata.txt')
-- os.path provides other functions for working with filenames and paths.
os.path.exists('memo.txt')
-- If it exists, os.path.isdir checks whether it's a directory:
os.path.isdir('memo.txt')
-- Similarly, os.path.isfile checks whether it's a file.
-- os.listdir returns a list of the files (and other directories) in the given directory:
os.listdir(cwd)
-- os.path.join takes a directory and a file name and joins them into a complete path.
def walk(dirname):
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        if os.path.isfile(path):
            print(path)
        else:
            walk(path)

walk(cwd)

Exception Handling
======================

fin = open('bad_file')
fout = open('/etc/passwd', 'w')   # No permission to write to the file
fin = open('/home')               # And if you try to open a directory for reading

It is better to go ahead and try, and deal with problems if they happen, which is exactly what the try statement does. The syntax is similar to an if...else statement:

try:
    fin = open('bad_file')
except:
    print('Something went wrong.')

Python starts by executing the try clause. If all goes well, it skips the except clause and proceeds. If an exception occurs, it jumps out of the try clause and runs the except clause. Handling an exception with a try statement is called catching an exception. In this example, the except clause prints an error message that is not very helpful. In general, catching an exception gives you a chance to fix the problem, or try again, or at least end the program gracefully.

Database
======================

A database is a file that is organized for storing data. Many databases are organized like a dictionary in the sense that they map from keys to values. The biggest difference between a database and a dictionary is that the database is on disk (or other permanent storage), so it persists after the program ends.

The module dbm provides an interface for creating and updating database files. As an example, I'll create a database that contains captions for image files. Opening a database is similar to opening other files:

import dbm
db = dbm.open('captions', 'c')

-- The mode 'c' means that the database should be created if it doesn't already exist. The result is a database object that can be used (for most operations) like a dictionary. When you create a new item, dbm updates the database file.

db['cleese.png'] = 'Photo of John Cleese.'
db['cleese.png']

==============
Pickle
==============

-- The pickle module implements binary protocols for serializing and de-serializing a Python object structure. "Pickling" is the process whereby a Python object hierarchy is converted into a byte stream, and "unpickling" is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as "serialization", "marshalling", or "flattening"; however, to avoid confusion, the terms used here are "pickling" and "unpickling".
-- It is used for serializing and de-serializing a Python object structure. Any object in Python can be pickled so that it can be saved on disk. What pickle does is that it "serializes" the object before writing it to a file. Pickling is a way to convert a Python object (list, dict, etc.) into a character stream. The idea is that this character stream contains all the information necessary to reconstruct the object in another Python script.
-- The pickle module translates almost any type of object into a string suitable for storage in a database, and then translates strings back into objects.

import pickle
t = [1, 2, 3]
pickle.dumps(t)

-- The format isn't obvious to human readers; it is meant to be easy for pickle to interpret.
pickle.loads ("load string") reconstitutes the object:

t1 = [1, 2, 3]
s = pickle.dumps(t1)
t2 = pickle.loads(s)
t2

-- Although the new object has the same value as the old, it is not (in general) the same object:
t1 == t2
t1 is t2

You can use pickle to store non-strings in a database. In fact, this combination is so common that it has been encapsulated in a module called shelve.

===
a = ['test value', 'test value 2', 'test value 3']
a

file_Name = "testfile"

# open the file for writing
fileObject = open(file_Name, 'wb')

# this writes the object a to the file named 'testfile'
pickle.dump(a, fileObject)

# here we close the fileObject
fileObject.close()

# we open the file for reading (in binary mode)
fileObject = open(file_Name, 'rb')

# load the object from the file into var b
b = pickle.load(fileObject)
b
['test value', 'test value 2', 'test value 3']

a == b
True
=========

Typical uses of pickle:
1) Saving a program's state data to disk so that it can carry on where it left off when restarted (persistence).
2) Sending Python data over a TCP connection in a multi-core or distributed system (marshalling).
3) Storing Python objects in a database.
4) Converting an arbitrary Python object to a string so that it can be used as a dictionary key (e.g. for caching and memoization).

One thing to note is that pickle has a sibling named cPickle. As the name suggests, it is written in C, which makes it much faster than pickle. So why should we ever use pickle instead of cPickle? Here is the reason:
>> Because pickle handles unicode objects.
>> Because pickle is written in pure Python, it's easier to debug.
For further reading, see the official pickle documentation.

=====================================
Pipes : Any program that you can launch from the shell can also be launched from Python using a pipe object, which represents a running program.
============================================================

For example, the Unix command ls -l normally displays the contents of the current directory in long format. You can launch ls with os.popen:

cmd = 'ls -l'
fp = os.popen(cmd)

os.popen() is considered deprecated; use the subprocess module instead:

from subprocess import Popen, PIPE
output = Popen(['command-to-run', 'some-argument'], stdout=PIPE)
print(output.stdout.read())

=============
Now that we've discussed the sort of functionality offered on the command line, let's experiment with the subprocess module. Here's a simple command you can run on the command line:

echo "Hello world!"

In the past, process management in Python was dealt with by a large range of different Python functions from all over the standard library. Since Python 2.4 all this functionality has been carefully and neatly packaged up into the subprocess module, which provides one class called Popen, which is all you need to use.

====================
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several older modules and functions:
os.system
os.spawn*
os.popen*
popen2.*
commands.*
============

subprocess.call(["ls", "-l"])

-- The underlying process creation and management in this module is handled by the Popen class.
It offers a lot of flexibility so that developers are able to handle the less common cases not covered by the convenience functions.

import subprocess
subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)

==
from subprocess import Popen, PIPE
proc = Popen(["python", "test.py"], stdout=PIPE)
output = proc.communicate()[0]
==

Here's a function to find the actual location of a program:

import os

def whereis(program):
    for path in os.environ.get('PATH', '').split(';'):
        # print(path)
        # print(os.path.join(path, program))
        if os.path.exists(os.path.join(path, program)) and \
           not os.path.isdir(os.path.join(path, program)):
            print(os.path.join(path, program))
            return os.path.join(path, program)
    return None

location = whereis('python.exe')
if location is not None:
    print(location)

Windows command: whereis python.exe

===
Writing Modules (use this for further tutorials : https://www.learnpython.org/)
===========

-- Any file that contains Python code can be imported as a module.
-- For example, suppose you have a file named wc.py with the code shown further below.
-- Modules in Python are simply Python files with the .py extension, which implement a set of functions. Modules are imported from other modules using the import command.
-- The first time a module is loaded into a running Python script, it is initialized by executing the code in the module once. If another module in your code imports the same module again, it will not be loaded twice but once only - so local variables inside the module act as a "singleton" - they are initialized only once.
-- Two very important functions come in handy when exploring modules in Python - the dir and help functions.
-- We can look for which functions are implemented in each module by using the dir function.
-- Packages are namespaces which contain multiple packages and modules themselves. They are simply directories, but with a twist.
-- Each package in Python is a directory which MUST contain a special file called __init__.py. This file can be empty, and it indicates that the directory it contains is a Python package, so it can be imported the same way a module can be imported.
-- If we create a directory called foo, which marks the package name, we can then create a module inside that package called bar. We also must not forget to add the __init__.py file inside the foo directory.
-- To use the module bar, we can import it in two ways:
import foo.bar
from foo import bar

The contents of wc.py:

def linecount(filename):
    count = 0
    for line in open(filename):
        count += 1
    return count

print(linecount('wc.py'))

import wc
wc.linecount('wc.py')

The only problem with this example is that when you import the module it runs the test code at the bottom. Normally when you import a module, it defines new functions but it doesn't run them. Programs that will be imported as modules often use the following idiom:

if __name__ == '__main__':
    print(linecount('wc.py'))

__name__ is a built-in variable that is set when the program starts. If the program is running as a script, __name__ has the value '__main__'; in that case, the test code runs. Otherwise, if the module is being imported, the test code is skipped.

import com.mycode.custom as cus
print(cus.linecount('custom.py'))
dir(cus)

=============
Chapter 15 : Classes and objects
=============

-- We have used functions to organize our code till now.
-- Now we define our own custom data types.
-- Creating a new type is more complicated than the other options, but it has advantages that will be apparent soon.
-- A programmer-defined type is also called a class.
A class definition looks like this:

class Point:
    """Represents a point in 2-D space."""

-- The header indicates that the new class is called Point.
-- The body is a docstring that explains what the class is for. You can define variables and methods inside a class definition.
-- Because Point is defined at the top level, its "full name" is __main__.Point.
-- The class object is like a factory for creating objects. To create a Point, you call Point as if it were a function.

blank = Point()

-- The return value is a reference to a Point object, which we assign to blank.
-- Creating a new object is called instantiation, and the object is an instance of the class.
-- When you print an instance, Python tells you what class it belongs to and where it is stored in memory (the prefix 0x means that the following number is in hexadecimal).
-- Every object is an instance of some class, so "object" and "instance" are interchangeable.

Attributes
==========
-- You can assign values to an instance using dot notation:
blank.x = 3.0
blank.y = 4.0

-- In this case, though, we are assigning values to named elements of an object. These elements are called attributes.

x = blank.x
'(%g, %g)' % (blank.x, blank.y)
distance = math.sqrt(blank.x**2 + blank.y**2)
distance

def print_point(p):
    print('(%g, %g)' % (p.x, p.y))

print_point(blank)

-- Inside the function, p is an alias for blank, so if the function modifies p, blank changes.
-- Objects are an encapsulation of variables and functions into a single entity. Objects get their variables and functions from classes. Classes are essentially a template to create your objects.

class MyClass:
    variable = "blah"

    def function(self):
        print("This is a message inside the class.")

myobjectx = MyClass()

-- Now the variable "myobjectx" holds an object of the class "MyClass" that contains the variable and the function defined within the class called "MyClass".
-- To access the variable inside of the newly created object "myobjectx" you would do the following:

print(myobjectx.variable)

-- You can create multiple different objects that are of the same class (have the same variables and functions defined). However, each object contains independent copies of the variables defined in the class. For instance, if we define another object of the "MyClass" class and then change the string in the variable above:

myobjectx = MyClass()
myobjecty = MyClass()
myobjecty.variable = "yackity"

# Then print out both values
print(myobjectx.variable)
print(myobjecty.variable)

-- Rectangle class:

class Rectangle:
    """Represents a rectangle.

    attributes: width, height, corner.
    """

box = Rectangle()
box.width = 100.0
box.height = 200.0
box.corner = Point()
box.corner.x = 0.0
box.corner.y = 0.0

-- The expression box.corner.x means, "Go to the object box refers to and select the attribute named corner; then go to that object and select the attribute named x."

Instance as return value
========================
-- Functions can return instances.
For example, find_center takes a Rectangle as an argument and returns a Point that contains the coordinates of the center of the Rectangle:

def find_center(rect):
    p = Point()
    p.x = rect.corner.x + rect.width/2
    p.y = rect.corner.y + rect.height/2
    return p

center = find_center(box)
print_point(center)

==================
Mutable Objects
==================
-- You can change the state of an object by making an assignment to one of its attributes.

box.width = box.width + 50
box.height = box.height + 100

-- You can also write functions that modify objects. For example, grow_rectangle takes a Rectangle object and two numbers, dwidth and dheight, and adds the numbers to the width and height of the rectangle:

def grow_rectangle(rect, dwidth, dheight):
    rect.width += dwidth
    rect.height += dheight

box.width, box.height
grow_rectangle(box, 50, 100)
box.width, box.height

==================
Object Copying
==================
-- Aliasing can make a program difficult to read because changes in one place might have unexpected effects in another place. It is hard to keep track of all the variables that might refer to a given object.
-- Copying an object is often an alternative to aliasing. The copy module contains a function called copy that can duplicate any object:

p1 = Point()
p1.x = 3.0
p1.y = 4.0

import copy
p2 = copy.copy(p1)

-- p1 and p2 contain the same data, but they are not the same Point:
print_point(p1)
print_point(p2)
p1 is p2
p1 == p2

-- The is operator indicates that p1 and p2 are not the same object, which is what we expected.
-- But you might have expected == to yield True because these points contain the same data. In that case, you will be disappointed to learn that for instances, the default behavior of the == operator is the same as the is operator; it checks object identity, not object equivalence. That's because for programmer-defined types, Python doesn't know what should be considered equivalent. At least, not yet.
-- If you use copy.copy to duplicate a Rectangle, you will find that it copies the Rectangle object but not the embedded Point:

box2 = copy.copy(box)
box2 is box
box2.corner is box.corner

-- This operation is called a shallow copy because it copies the object and any references it contains, but not the embedded objects.
-- Fortunately, the copy module provides a function named deepcopy that copies not only the object but also the objects it refers to, and the objects they refer to, and so on. You will not be surprised to learn that this operation is called a deep copy.

box3 = copy.deepcopy(box)
box3 is box
box3.corner is box.corner

box3 and box are completely separate objects.

-- You can also use isinstance to check whether an object is an instance of a class:
isinstance(p, Point)

-- If you are not sure whether an object has a particular attribute, you can use the built-in function hasattr:
hasattr(p, 'x')
hasattr(p, 'z')

try:
    x = p.x
except AttributeError:
    x = 0

class: A programmer-defined type. A class definition creates a new class object.
class object: An object that contains information about a programmer-defined type. The class object can be used to create instances of the type.
instance: An object that belongs to a class.
instantiate: To create a new object.
attribute: One of the named values associated with an object.
embedded object: An object that is stored as an attribute of another object.
shallow copy: To copy the contents of an object, including any references to embedded objects; implemented by the copy function in the copy module.
deep copy: To copy the contents of an object as well as any embedded objects, and any objects embedded in them, and so on; implemented by the deepcopy function in the copy module.
object diagram: A diagram that shows objects, their attributes, and the values of the attributes.
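
The text above notes that == falls back to identity for programmer-defined types "at least, not yet". As a forward peek (my own sketch, not from the book), you can tell Python what equivalence means by defining __eq__:

class Point:
    """Represents a point in 2-D space."""
    def __eq__(self, other):
        # two Points are equivalent if their coordinates match
        return self.x == other.x and self.y == other.y

p1 = Point()
p1.x, p1.y = 3.0, 4.0
p2 = Point()
p2.x, p2.y = 3.0, 4.0
p1 == p2    # True now, because __eq__ compares the data
p1 is p2    # still False, they are different objects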

=====End =========


Use : U:\Python

For this, we need to learn the following topics:

1. Core Python.
2. Pandas Data Structure : http://pandas.pydata.org/pandas-docs/stable/dsintro.html
3. Anaconda
4. 



Exploratory Data Analysis
===========================
We will use Anaconda to work with Python.
========================

1. Anecdotal analysis : based on data that is unpublished and usually personal.

These usually fail because of:
- Small number of observations.
- Selection Bias : 
- Confirmation Bias :
- Inaccuracy : often misremembered, misrepresented, repeated inaccurately.
2. To avoid these limitations, we will use the tools of statistics, which include:
- Data collection : Surveys, Social networking etc.
- Descriptive statistics (varnan in Hindi) : Generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
- Exploratory data analysis (khoj in Hindi) : Look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
- Estimation : We will use data from a sample to estimate characteristics of the general population.
- Hypothesis testing (anuman in Hindi) : Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.
3. By performing the above steps, we can reach conclusions that are more justifiable and more likely to be correct.

4. Cross-sectional study : which means that it captures a snapshot of a group at a point in time. 

5. Longitudinal study : Which observes a group repeatedly over a period of time.

6. Ideally surveys would collect data from every member of the population, but that's seldom possible.

7. Sample : we collect data from a subset of the population, called a sample.

8. People who participate in a survey are called respondents.

9. Oversampling : The drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey.

10. Data File (github url : https://github.com/AllenDowney/ThinkStats2)
- 2002FemPreg.dat.gz
- 2002FemPreg.dct (Stata dictionary file)
11. Stata : a statistical software system. A Stata dictionary is a list of variable names, their types, and the indices that identify where in each line to find each variable.

Example : 

infile dictionary {
_column(1) str12 caseid %12s "RESPONDENT ID NUMBER"
_column(13) byte pregordr %2f "PREGNANCY ORDER (NUMBER)"
}

This dictionary describes two variables: caseid is a 12-character string that
represents the respondent ID; pregordr is a one-byte integer that indicates
which pregnancy this record describes for this respondent.

12. thinkstats2.py : a Python module that contains many classes and functions used in this book, including functions that read the Stata dictionary and the NSFG data file.

Example : 

def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    CleanFemPreg(df)
    return df


ReadStataDct : takes the name of the dictionary file and returns dct, a FixedWidthVariables object that contains the information from the dictionary file.
dct provides ReadFixedWidth, which reads the data file.

13. Data frames : The result of ReadFixedWidth is a DataFrame, which is the fundamental data structure provided by pandas, which is a Python data and statistics package
we'll use throughout this book.

A DataFrame contains a row for each record, in this case one row per pregnancy, and a column for each variable.

- In addition to the data, a DataFrame also contains the variable names and
their types, and it provides methods for accessing and modifying the data.

14. If you print df you get a truncated view of the rows and columns,
and the shape of the DataFrame, which is 13593 rows/records and 244
columns/variables.
>>> import nsfg
>>> df = nsfg.ReadFemPreg()
>>> df
...
[13593 rows x 244 columns]

15. DataFrame attribute df.columns : a sequence of column names as Unicode strings.

>>> df.columns
Index([u'caseid', u'pregordr', u'howpreg_n', u'howpreg_p', ...])

The result is an Index, which is another pandas data structure, similar to a list.

16. 

Python List Data Structure
======================

1. List : Like a string, a list is a sequence of values. In a string, the values are characters; in a list, they can be any type. The values in a list are called elements or sometimes items.
examples : 
numbers=[10, 20, 30, 40]
animals=['crunchy frog', 'ram bladder', 'lark vomit']
mixed=['spam', 2.0, 5, [10, 20]] #A list within another list is nested
empty=[] #Empty list

print(numbers, animals, mixed,empty )

2. Lists are mutable

numbers = [42, 123]
numbers[1] = 5
numbers

-- If you try to read or write an element that does not exist, you get an IndexError.
-- If an index has a negative value, it counts backward from the end of the list.
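
A quick illustration of both points (my own two-liner, using a fresh list):

numbers = [42, 5]
numbers[-1]    # 5 : a negative index counts backward from the end
numbers[2]     # IndexError: list index out of range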

3. List traversal
-- For loop (reading the elements):
for animal in animals:
    print(animal)

-- Updating elements in a list:
for i in range(len(animals)):
    animals[i] = animals[i] * 2

-- A for loop over an empty list never runs the body:
for x in []:
    print('This never happens.')
-- Although a list can contain another list, the nested list still counts as a single element. The length of this list is four:
['spam', 1, ['Brie', 'Roquefort', 'Pol le Veq'], [1, 2, 3]]

4. List operations

-- The + operator concatenates lists
a = [1, 2, 3]
b = [4, 5, 6]
c = a + b

-- The * operator repeats a list a given number of times
[0] * 4
[1, 2, 3] * 3


5. List Slicing
t = ['a', 'b', 'c', 'd', 'e', 'f']
t[1:3]
t[:4]
t[3:]

t[1:3] = ['x', 'y']

t = ['a', 'b', 'c', 'd', 'e', 'f']
t[1:5] = ['x', 'y']

6. List Methods

-- append method : append adds a new element to the end of a list

t = ['a', 'b', 'c']
t.append('d')

-- extend : extend takes a list as an argument and appends all of the elements:
t1 = ['a', 'b', 'c']
t2 = ['d', 'e']

t1.extend(t2)

-- sort : 
t = ['d', 'c', 'e', 'b', 'a']
t.sort()

-- Most list methods are void; they modify the list and return None.
t = t.sort() #very interesting


7. Map, filter and reduce

-- To add up all the numbers in a list,

def add_all(t):
    total = 0
    for x in t:
        total += x
    return total

# Here total is known as an accumulator.

-- Built in function
t = [1, 2, 3]
sum(t)
add_all(t)

-- reduce : An operation like this that combines a sequence of elements into a single value is sometimes called reduce.

-- Create another list : 
t = ['d', 'c', 'e', 'b', 'a']

def capitalize_all(t):
    res = []
    for s in t:
        res.append(s.capitalize())
    return res

capitalize_all(t)

-- Map Function : An operation like capitalize_all is sometimes called a map because it “maps” a function (in this case the method capitalize) onto each of the elements in a sequence.

-- Getting sublist : 

def only_upper(t):
    res = []
    for s in t:
        if s.isupper():
            res.append(s)
    return res

t = ['d', 'C', 'E', 'b', 'A']
 only_upper(t)
 
-- Filter : An operation like only_upper is called a filter because it selects some of the elements and filters out the others (see the sketch after this list).

-- Most common list operations can be expressed as a combination of map, filter and reduce.
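
For reference, Python ships with built-in map and filter functions and functools.reduce. A minimal sketch of the three patterns above using them (my phrasing, not from the book):

from functools import reduce

t = [1, 2, 3]
reduce(lambda total, x: total + x, t)                  # like add_all(t) -> 6
list(map(str.capitalize, ['d', 'c', 'e']))             # like capitalize_all -> ['D', 'C', 'E']
list(filter(str.isupper, ['d', 'C', 'E', 'b', 'A']))   # like only_upper -> ['C', 'E', 'A']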

8. Deleting elements : 
t = ['a', 'b', 'c']
x = t.pop(1)

-- pop modifies the list and returns the element that was removed. If you don’t provide an index, it deletes and returns the last element.

-- Another way : Using del operator
t = ['a', 'b', 'c']
del t[1]
t

-- removing elements
t = ['a', 'b', 'c']
t.remove('b')

-- The return value from remove is None.
t = ['a', 'b', 'c', 'd', 'e', 'f']
del t[1:5]

9. List and Strings : A string is a sequence of characters and a list is a sequence of values, but a list of characters is not the same as a string. To convert from a string to a list of characters, you can use list

s = 'spam'
t = list(s)

-- Because list is the name of a built-in function, you should avoid using it as a variable name.

10. Splitting string
s = 'pining for the fjords'
t = s.split()

s = 'spam-spam-spam'
t = s.split('-')

11. Joining strings
t = ['pining', 'for', 'the', 'fjords']
delimiter = ' '
s = delimiter.join(t)

Objects and Values
======================

1. Assignment
a = 'banana'
b = 'banana'

a is b => True (Means both a and b are pointing to same object). Python only created one string object

Now in case of list
a = [1, 2, 3]
b = [1, 2, 3]

a is b => False
-- In this case we would say that the two lists are equivalent, because they have the same elements, but not identical, because they are not the same object.

List as Arguments
========================

-- When you pass a list to a function, the function gets a reference to the list. If the function modifies the list, the caller sees the change

def delete_head(t):
    del t[0]

letters = ['a', 'b', 'c']
delete_head(letters)

-- It is important to distinguish between operations that modify lists and operations that create new lists. For example, the append method modifies a list, but the + operator creates a new list.

t1 = [1, 2]
t2 = t1.append(3)

t3 = t1 + [4]

print (t1,t2,t3)


This difference is important when you write functions that are supposed to modify lists. For example, this function does not delete the head of a list:
def bad_delete_head(t):
    t = t[1:] # WRONG!

The slice operator creates a new list and the assignment makes t refer to it, but that doesn’t affect the caller.
>>> t4 = [1, 2, 3]
>>> bad_delete_head(t4)
>>> t4
[1, 2, 3]
At the beginning of bad_delete_head, t and t4 refer to the same list. At the end, t refers to a new list, but t4 still refers to the original, unmodified list. An alternative is to write a function that creates and returns a new list. For example, tail returns all but the first element of a list:

def tail(t):
    return t[1:]

Important points about lists
========================

-- Most list methods modify the argument and return None. This is the opposite of the string methods, which return a new string and leave the original alone.
-- Part of the problem with lists is that there are too many ways to do things. For example, to remove an element from a list, you can use pop, remove, del, or even a slice assignment.
-- To add an element, you can use the append method or the + operator.
-- Make copies to avoid aliasing : If you want to use a method like sort that modifies the argument, but you need to keep the original list as well, you can make a copy (see the sketch below).
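
A short sketch of making such a copy, using a slice copy or the built-in sorted (my own example):

t = ['d', 'c', 'e', 'b', 'a']
t2 = t[:]        # a copy of the list
t2.sort()        # t2 is sorted, t is unchanged
t3 = sorted(t)   # same idea: sorted returns a new sorted list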

Python Dictionaries Data Structure 
==============



We must be able to do list comprehensions (as shown in the video); a short sketch follows below.
=============
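
Since the video itself is not reproduced here, a minimal list comprehension sketch (my own examples):

squares = [x**2 for x in range(5)]             # [0, 1, 4, 9, 16]
evens = [x for x in range(10) if x % 2 == 0]   # [0, 2, 4, 6, 8]
caps = [s.capitalize() for s in ['d', 'c']]    # comprehension form of capitalize_all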

1. Empty dictionary creation
eng2sp = dict()

-- Add item to dictionary
eng2sp['one'] = 'uno'

-- Multiple items
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}

-- In older Python versions the order of items in a dictionary is unpredictable (from Python 3.7 onward, dictionaries preserve insertion order).
-- Access values using keys
eng2sp['two']

-- Accessing a missing key raises a KeyError:
eng2sp['four']

len(eng2sp)

-- Check elements in dict
'one' in eng2sp
'uno' in eng2sp

2. Check for values in dict
vals = eng2sp.values()
'uno' in vals

3. Counting the number of characters in string
def histogram(s):
    d = dict()
    for c in s:
        if c not in d:
            d[c] = 1
        else:
            d[c] += 1
    return d

h = histogram('brontosaurus')

4. looping and dictionaries
def print_hist(h):
    for c in h:
        print(c, h[c])

h = histogram('parrot')
print_hist(h)

Sorting a dictionary (iterating over its keys in sorted order):
for key in sorted(h):
    print(key, h[key])


5. Reverse lookup
Lookup : Given a dictionary d and a key k, it is easy to find the corresponding value v = d[k]. This operation is called a lookup.

Finding the first matching key:

def reverse_lookup(d, v):
    for k in d:
        if d[k] == v:
            return k
    raise LookupError()
h = histogram('parrot')
key = reverse_lookup(h, 2)
key

key = reverse_lookup(h, 3)

-- A reverse lookup is much slower than a forward lookup; if you have to do it often, or if the dictionary gets big, the performance of your program will suffer.

6. Dictionary and Lists

def invert_dict(d):
    inverse = dict()
    for key in d:
        val = d[key]
        if val not in inverse:
            inverse[val] = [key]
        else:
            inverse[val].append(key)
    return inverse
hist = histogram('parrot')
hist
inverse = invert_dict(hist)
inverse

-- Lists can be values in a dictionary, as this example shows, but they cannot be keys.
-- A dictionary is implemented using a hashtable and that means that the keys have to be hashable.
-- Since dictionaries are mutable, they can’t be used as keys, but they can be used as values.
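
A tiny sketch of the hashable-key rule (my own example): a tuple key works, a list key does not.

d = dict()
d[('a', 'b')] = 1    # tuples are hashable, so this works
d[['a', 'b']] = 1    # raises TypeError: unhashable type: 'list'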

7. Memos : 

known = {0:0, 1:1}

def fibonacci(n):
    if n in known:
        return known[n]
    res = fibonacci(n-1) + fibonacci(n-2)
    known[n] = res
    return res
known is a dictionary that keeps track of the Fibonacci numbers we already know. It starts with two items: 0 maps to 0 and 1 maps to 1.

8. Global variables (see page 111 for more detail):

-- known is created outside the function, so it belongs to the special frame called __main__. Variables in __main__ are sometimes called global because they can be accessed from any function.
-- To reassign a global variable inside a function you have to declare the global variable before you use it
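
A minimal sketch of the global declaration (hypothetical counter example, not from the book):

count = 0

def increment_count():
    global count      # without this line, count = count + 1 raises UnboundLocalError
    count = count + 1

increment_count()
print(count)          # 1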

Python Data Structure Tuples (Built-In)
======================

1. Tuples are immutable : A tuple is a sequence of values. The values can be any type, and they are indexed by integers, so in that respect tuples are a lot like lists. The important difference is that tuples are immutable.

t = 'a', 'b', 'c', 'd', 'e'

-- To create a tuple with a single element, you have to include a final comma:
t1 = 'a',
type(t1)

-- A value in parentheses is not a tuple:
t2 = ('a')
type(t2)

-- Another way to create a tuple is the built-in function tuple. With no argument, it creates an empty tuple
t = tuple()
t

-- If the argument is a sequence (string, list or tuple), the result is a tuple with the elements of the sequence
t = tuple('lupins')

-- Most list operators also work on tuples. The bracket operator indexes an element:
t[0]
t[1:3]

-- if you try to modify one of the elements of the tuple, you get an error
t[0] = 'A'

-- Because tuples are immutable, you can’t modify the elements. But you can replace one tuple with another:
t = ('A',) + t[1:]

This statement makes a new tuple and then makes t refer to it.

-- The relational operators work with tuples and other sequences; Python starts by comparing the first element from each sequence. If they are equal, it goes on to the next elements, and
so on, until it finds elements that differ. Subsequent elements are not considered (even if they are really big).

(0, 1, 2) < (0, 3, 4)
(0, 1, 2000000) < (0, 3, 4)


2. Tuple Assignment : The return value from split is a list with two elements; the first element is assigned to uname, the second to domain.

addr = 'monty@python.org'
uname, domain = addr.split('@')

Try below
a, b = 1, 2, 3

3. Tuple as return value

-- Strictly speaking, a function can only return one value, but if the value is a tuple, the effect is the same as returning multiple values.
-- The built-in function divmod takes two arguments and returns a tuple of two values, the quotient and remainder:

t = divmod(7, 3)
quot, rem = divmod(7, 3)

def min_max(t):
    return min(t), max(t)
4. Variable-length argument tuples :

-- Functions can take a variable number of arguments. A parameter name that begins with * gathers arguments into a tuple.

def printall(*args):
    print(args)

printall(1, 2.0, '3')


t = (7, 3)
divmod(t)

Error : divmod expected 2 arguments, got 1

divmod(*t)

-- Many of the built-in functions use variable-length argument tuples. For example, max and min can take any number of arguments

5. List and tuples : 

-- zip is a built-in function that takes two or more sequences and pairs up their elements, producing one tuple per position (in Python 3 it returns a zip object rather than a list). The name of the function refers to a zipper, which joins and interleaves two rows of teeth.

s = 'abc'
t = [0, 1, 2]

zip(s, t)
[('a', 0), ('b', 1), ('c', 2)]

-- The result is a zip object that knows how to iterate through the pairs. The most common use of zip is in a for loop.

for pair in zip(s, t):
    print(pair)

-- A zip object is a kind of iterator, which is any object that iterates through a sequence. Iterators are similar to lists in some ways, but unlike lists, you can’t use an index to select an element from an iterator. 

-- If you want to use list operators and methods, you can use a zip object to make a list.
list(zip('Anne', 'Elk'))

t = [('a', 0), ('b', 1), ('c', 2)]
for letter, number in t:
    print(number, letter)
-- If you combine zip, for and tuple assignment, you get a useful idiom for traversing two (or more) sequences at the same time. For example, has_match takes two sequences, t1 and t2, and returns True if there is an index i such that t1[i] == t2[i]:

def has_match(t1, t2):
    for x, y in zip(t1, t2):
        if x == y:
            return True
    return False

-- If you need to traverse the elements of a sequence and their indices, you can use the built-in function enumerate:
for index, element in enumerate('abc'):
    print(index, element)
-- The result from enumerate is an enumerate object, which iterates a sequence of pairs; each pair contains an index (starting from 0) and an element from the given sequence.

6. Dictionaries and tuples : 

-- Dictionaries have a method called items that returns a sequence of tuples, where each tuple is a key-value pair
d = {'a':0, 'b':1, 'c':2}
t = d.items()
Now you can use it in loop
for key, value in d.items():
    print(key, value)
-- As you should expect from a dictionary, the items are in no particular order.
-- creating dict from tuple

t = [('a', 0), ('c', 2), ('b', 1)]
d = dict(t)

or
d = dict(zip('abc', range(3)))

-- The dictionary method update also takes a list of tuples and adds them, as key-value pairs, to an existing dictionary.

-- It is common to use tuples as keys in dictionaries (primarily because you can’t use lists). For example, a telephone directory might map from last-name, first-name pairs to telephone
numbers. Assuming that we have defined last, first and number, we could write: 

directory[last, first] = number

for last, first in directory:
    print(first, last, directory[last, first])

7. Sequences of sequences

-- strings are more limited than other sequences because the elements have to be characters. They are also immutable. If you need the ability to change the characters in a string (as opposed to creating a new string), you might want to use a list of characters instead

-- Lists are more common than tuples, mostly because they are mutable.

-- Prefer tuples
- In some contexts, like a return statement, it is syntactically simpler to create a tuple than a list.
- If you want to use a sequence as a dictionary key, you have to use an immutable type like a tuple or string.
- If you are passing a sequence as an argument to a function, using tuples reduces the potential for unexpected behavior due to aliasing.
-- Because tuples are immutable, they don’t provide methods like sort and reverse, which modify existing lists. But Python provides the built-in function sorted, which takes any sequence and returns a new list with the same elements in sorted order, and reversed, which takes a sequence and returns an iterator that traverses the list in reverse order.
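
A quick sketch of sorted and reversed applied to a tuple, leaving the original alone (my own example):

t = ('d', 'c', 'e', 'b', 'a')
sorted(t)            # ['a', 'b', 'c', 'd', 'e'] - a new list
list(reversed(t))    # ['a', 'b', 'e', 'c', 'd'] - reversed returns an iterator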


===============================
Selecting Proper Data Structure
================================

import string
string.punctuation
string.whitespace

Random Numbers : 

-- Deterministic : Given the same inputs, most computer programs generate the same outputs every time, so they are said to be deterministic.

-- The random module provides functions that generate pseudorandom numbers
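A few random-module calls as a sketch (these functions are part of the standard library):

import random

random.random()                  # float in [0.0, 1.0)
random.randint(1, 6)             # integer between 1 and 6, inclusive
random.choice(['a', 'b', 'c'])   # one element picked at random
random.seed(17)                  # make the pseudorandom sequence repeatable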



Python NumPy Data Structure : 
=========================

1. ndarray : An ndarray is a (usually fixed-size) multidimensional container of items of the same type and size.

Example : A 2-dimensional array of size 2 x 3, composed of 4-byte integer elements

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
x
type(x)
x.shape
x.dtype
-- fetch the element
x[1, 2]


2. Slicing : For example slicing can produce views of the array:
y = x[:,1]
y
y[0] = 9 # this also changes the corresponding element in x

3. axis in array :  if a 2D array a has shape (5,6)
- then you can access a[0,0] up to a[4,5]
- Axis 0 is thus the first dimension (the "rows")
- and axis 1 is the second dimension (the "columns").
- In higher dimensions, where "row" and "column" stop really making sense, try to think of the axes in terms of the shapes and indices involved.
- if you do .sum(axis=n), for example, then dimension n is collapsed and deleted, with all values in the new matrix equal to the sum of the corresponding collapsed values. 
- For example, if b has shape (5,6,7,8), and you do c = b.sum(axis=2), then axis 2 (dimension with size 7) is collapsed, and the result has shape (5,6,8)
- Furthermore, c[x,y,z] is equal to the sum of all elements b[x,y,:,z]

In general:
- axis=0 means the operation runs down the first dimension, collapsing it while the remaining dimensions are kept; higher axis numbers collapse the corresponding later dimension in the same way.

Simply: think of it as nested for loops, where the axis you pass is the loop that gets summed away (see the 2-D sketch below).
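
A small 2-D sketch of how axis changes what sum collapses (assuming numpy is imported as np):

x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
x.sum(axis=0)   # array([5, 7, 9])  - collapses the rows, one total per column
x.sum(axis=1)   # array([ 6, 15])   - collapses the columns, one total per row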

4. Array construction : a 3x3x3 array
x=np.array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],
       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],
       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
        
x
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
 
 
-- The array can be summed over each of its three axes; for example, x.sum(axis=0) collapses the first axis, adding the three 3x3 blocks together element-wise:
x.sum(axis=0)        
==============
Chapter 16 (best place to learn Python : https://www.datacamp.com/courses/tech:python)
=============
Classes and Functions

Defining a class named Time:

class Time:
    """Represents the time of day.

    attributes: hour, minute, second
    """

Create a new Time object and assign attributes for hours, minutes, and seconds:

time = Time()
time.hour = 11
time.minute = 59
time.second = 30

Pure Functions : This is called a pure function because it does not modify any of the objects passed to it as arguments and it has no effect, like displaying a value or getting user input, other than returning a value.
-- The function creates a new Time object, initializes its attributes, and returns a reference to the new object.

def add_time(t1, t2):
    sum = Time()
    sum.hour = t1.hour + t2.hour
    sum.minute = t1.minute + t2.minute
    sum.second = t1.second + t2.second
    return sum

-- Now use this function:

start = Time()
start.hour = 9
start.minute = 45
start.second = 0

duration = Time()
duration.hour = 1
duration.minute = 35
duration.second = 0

done = add_time(start, duration)
print(done)

-- Now carry the extra minutes and seconds (improved function):

def add_time(t1, t2):
    sum = Time()
    sum.hour = t1.hour + t2.hour
    sum.minute = t1.minute + t2.minute
    sum.second = t1.second + t2.second

    if sum.second >= 60:
        sum.second -= 60
        sum.minute += 1

    if sum.minute >= 60:
        sum.minute -= 60
        sum.hour += 1

    return sum

-- Modifiers : Sometimes it is useful for a function to modify the objects it gets as parameters. In that case, the changes are visible to the caller. Functions that work this way are called modifiers.

def increment(time, seconds):
    time.second += seconds

    if time.second >= 60:
        time.second -= 60
        time.minute += 1

    if time.minute >= 60:
        time.minute -= 60
        time.hour += 1

-- There is some evidence that programs that use pure functions are faster to develop and less error-prone than programs that use modifiers. But modifiers are convenient at times, and functional programs tend to be less efficient.

def time_to_int(time):
    minutes = time.hour * 60 + time.minute
    seconds = minutes * 60 + time.second
    return seconds

def int_to_time(seconds):
    time = Time()
    minutes, time.second = divmod(seconds, 60)
    time.hour, time.minute = divmod(minutes, 60)
    return time

-- Writing code to check invariants can help detect errors and find their causes.

def valid_time(time):
    if time.hour < 0 or time.minute < 0 or time.second < 0:
        return False
    if time.minute >= 60 or time.second >= 60:
        return False
    return True

-- At the beginning of each function you could check the arguments to make sure they are valid.
-- assert statements are useful because they distinguish code that deals with normal conditions from code that checks for errors.

def add_time(t1, t2):
    if not valid_time(t1) or not valid_time(t2):
        raise ValueError('invalid Time object in add_time')
    seconds = time_to_int(t1) + time_to_int(t2)
    return int_to_time(seconds)

prototype and patch: A development plan that involves writing a rough draft of a program, testing, and correcting errors as they are found.
designed development: A development plan that involves high-level insight into the problem and more planning than incremental development or prototype development.
pure function: A function that does not modify any of the objects it receives as arguments. Most pure functions are fruitful.
modifier: A function that changes one or more of the objects it receives as arguments. Most modifiers are void; that is, they return None.
functional programming style: A style of program design in which the majority of functions are pure.
invariant: A condition that should always be true during the execution of a program.
assert statement: A statement that checks a condition and raises an exception if it fails.

Chapter 17 : Classes and Methods
==================================
Python is an object-oriented programming language, which means that it provides features that support object-oriented programming, which has these defining characteristics:
- Programs include class and method definitions.
- Most of the computation is expressed in terms of operations on objects.
- Objects often represent things in the real world, and methods often correspond to the ways things in the real world interact.

-- A method is a function that is associated with a particular class. Methods are semantically the same as functions, but there are two syntactic differences:
- Methods are defined inside a class definition in order to make the relationship between the class and the method explicit.
- The syntax for invoking a method is different from the syntax for calling a function.

-- We will take the functions from the previous two chapters and transform them into methods.

class Time:
    """Represents the time of day."""

def print_time(time):
    print('%.2d:%.2d:%.2d' % (time.hour, time.minute, time.second))

start = Time()
start.hour = 9
start.minute = 45
start.second = 00

-- Call the function:
print_time(start)

-- To make print_time a method, all we have to do is move the function definition inside the class definition:

class Time:
    def print_time(time):
        print('%.2d:%.2d:%.2d' % (time.hour, time.minute, time.second))

Now there are two ways to call print_time. The first (and less common) way is to use function syntax:
Time.print_time(start)

-- The second (and more concise) way is to use method syntax:
start.print_time()

-- In this use of dot notation, print_time is the name of the method (again), and start is the object the method is invoked on, which is called the subject. Just as the subject of a sentence is what the sentence is about, the subject of a method invocation is what the method is about. Inside the method, the subject is assigned to the first parameter, so in this case start is assigned to time.
-- By convention, the first parameter of a method is called self, so it would be more common to write print_time like this:

class Time:
    def print_time(self):
        print('%.2d:%.2d:%.2d' % (self.hour, self.minute, self.second))

-- In object-oriented programming, the objects are the active agents. A method invocation like start.print_time() says "Hey start! Please print yourself."

class Time:
    def print_time(self):
        print('%.2d:%.2d:%.2d' % (self.hour, self.minute, self.second))

    def increment(self, seconds):
        seconds += self.time_to_int()
        return int_to_time(seconds)

start.print_time()
end = start.increment(1337)
end.print_time()

-- Use this code for the examples: http://thinkpython2.com/code/Time2.py
-- This version assumes that time_to_int is written as a method. Also, note that increment is a pure function, not a modifier.
-- The subject, start, gets assigned to the first parameter, self. The argument, 1337, gets assigned to the second parameter, seconds.
-- This mechanism can be confusing, especially if you make an error.
For example, if you invoke increment with two arguments, you get:

end = start.increment(1337, 460)
-- TypeError: increment() takes 2 positional arguments but 3 were given

The error message is initially confusing, because there are only two arguments in parentheses. But the subject is also considered an argument, so all together that's three.
By the way, a positional argument is an argument that doesn't have a parameter name; that is, it is not a keyword argument. In this function call:

sketch(parrot, cage, dead=True)

parrot and cage are positional, and dead is a keyword argument.
--------------
A more complicated example: rewriting is_after (from Section 16.1) is slightly more complicated because it takes two Time objects as parameters. In this case it is conventional to name the first parameter self and the second parameter other:

# inside class Time:
    def is_after(self, other):
        return self.time_to_int() > other.time_to_int()

To use this method, you have to invoke it on one object and pass the other as an argument:

>>> end.is_after(start)
True

One nice thing about this syntax is that it almost reads like English: "end is after start?"
-------------
The init method
======================
The init method (short for "initialization") is a special method that gets invoked when an object is instantiated. Its full name is __init__.

    def __init__(self, hour=0, minute=0, second=0):
        self.hour = hour
        self.minute = minute
        self.second = second

The parameters are optional, so if you call Time with no arguments, you get the default values:

time = Time()
time.print_time()
00:00:00

If you provide one argument, it overrides hour:
>>> time = Time(9)
>>> time.print_time()
09:00:00

If you provide two arguments, they override hour and minute:
>>> time = Time(9, 45)
>>> time.print_time()
09:45:00

And if you provide three arguments, they override all three default values.
=================
The __str__ method
=================
__str__ is a special method, like __init__, that is supposed to return a string representation of an object.

    def __str__(self):
        return '%.2d:%.2d:%.2d' % (self.hour, self.minute, self.second)

When you print an object, Python invokes the __str__ method:
>>> time = Time(9, 45)
>>> print(time)
09:45:00

When I write a new class, I almost always start by writing __init__, which makes it easier to instantiate objects, and __str__, which is useful for debugging.

Operator Overloading
====================
By defining __add__ for the Time class, you can use the + operator on Time objects:

    def __add__(self, other):
        seconds = self.time_to_int() + other.time_to_int()
        return int_to_time(seconds)

When you apply the + operator to Time objects, Python invokes __add__. When you print the result, Python invokes __str__. So there is a lot happening behind the scenes!
Changing the behavior of an operator so that it works with programmer-defined types is called operator overloading.
===================
Type-based dispatch
===================
In the previous section we added two Time objects, but you also might want to add an integer to a Time object.
The following is a version of __add__ that checks the type of other and invokes either add_time or increment:

# inside class Time:
    def __add__(self, other):
        if isinstance(other, Time):
            return self.add_time(other)
        else:
            return self.increment(other)

    def add_time(self, other):
        seconds = self.time_to_int() + other.time_to_int()
        return int_to_time(seconds)

    def increment(self, seconds):
        seconds += self.time_to_int()
        return int_to_time(seconds)

The built-in function isinstance takes a value and a class object, and returns True if the value is an instance of the class.
If other is a Time object, __add__ invokes add_time. Otherwise it assumes that the parameter is a number and invokes increment. This operation is called a type-based dispatch because it dispatches the computation to different methods based on the type of the arguments.

>>> start = Time(9, 45)
>>> duration = Time(1, 35)
>>> print(start + duration)
11:20:00
>>> print(start + 1337)
10:07:17
===========
# inside class Time:
    def __radd__(self, other):
        return self.__add__(other)

And here's how it's used:
>>> print(1337 + start)
10:07:17
==========
Polymorphism
==============
Functions that work with several types are called polymorphic. Polymorphism can facilitate code reuse. For example, the built-in function sum, which adds the elements of a sequence, works as long as the elements of the sequence support addition. Since Time objects provide an __add__ method, they work with sum:

>>> t1 = Time(7, 43)
>>> t2 = Time(7, 41)
>>> t3 = Time(7, 37)
>>> total = sum([t1, t2, t3])
>>> print(total)
23:01:00

object-oriented language: A language that provides features, such as programmer-defined types and methods, that facilitate object-oriented programming.
object-oriented programming: A style of programming in which data and the operations that manipulate it are organized into classes and methods.
method: A function that is defined inside a class definition and is invoked on instances of that class.
subject: The object a method is invoked on.
positional argument: An argument that does not include a parameter name, so it is not a keyword argument.
operator overloading: Changing the behavior of an operator like + so it works with a programmer-defined type.
type-based dispatch: A programming pattern that checks the type of an operand and invokes different functions for different types.
polymorphic: Pertaining to a function that can work with more than one type.
information hiding: The principle that the interface provided by an object should not depend on its implementation, in particular the representation of its attributes.

Chapter 18 : Inheritance
================
Inheritance is the ability to define a new class that is a modified version of an existing class. The class definition for Card looks like this:

class Card:
    """Represents a standard playing card."""

    def __init__(self, suit=0, rank=2):
        self.suit = suit
        self.rank = rank

As usual, the init method takes an optional parameter for each attribute. The default card is the 2 of Clubs.
To create a Card, you call Card with the suit and rank of the card you want.
queen_of_diamonds = Card(1, 12)

We assign these lists to class attributes:

# inside class Card:
    suit_names = ['Clubs', 'Diamonds', 'Hearts', 'Spades']
    rank_names = [None, 'Ace', '2', '3', '4', '5', '6', '7',
                  '8', '9', '10', 'Jack', 'Queen', 'King']

    def __str__(self):
        return '%s of %s' % (Card.rank_names[self.rank],
                             Card.suit_names[self.suit])

Variables like suit_names and rank_names, which are defined inside a class but outside of any method, are called class attributes because they are associated with the class object Card.
This term distinguishes them from variables like suit and rank, which are called instance attributes because they are associated with a particular instance. Every card has its own suit and rank, but there is only one copy of suit_names and rank_names.
Putting it all together, the expression Card.rank_names[self.rank] means "use the attribute rank from the object self as an index into the list rank_names from the class Card, and select the appropriate string."
The first element of rank_names is None because there is no card with rank zero. By including None as a place-keeper, we get a mapping with the nice property that the index 2 maps to the string '2', and so on. To avoid this tweak, we could have used a dictionary instead of a list.
With the methods we have so far, we can create and print cards:

>>> card1 = Card(2, 11)
>>> print(card1)
Jack of Hearts

Comparing cards
===============
For built-in types, there are relational operators (<, >, ==, etc.) that compare values and determine when one is greater than, less than, or equal to another. For programmer-defined types, we can override the behavior of the built-in operators by providing a method named __lt__, which stands for "less than".
__lt__ takes two parameters, self and other, and returns True if self is strictly less than other.
The answer might depend on what game you are playing, but to keep things simple, we'll make the arbitrary choice that suit is more important, so all of the Spades outrank all of the Diamonds, and so on. With that decided, we can write __lt__:

# inside class Card:
    def __lt__(self, other):
        # check the suits
        if self.suit < other.suit:
            return True
        if self.suit > other.suit:
            return False
        # suits are the same... check ranks
        return self.rank < other.rank

You can write this more concisely using tuple comparison:

# inside class Card:
    def __lt__(self, other):
        t1 = self.suit, self.rank
        t2 = other.suit, other.rank
        return t1 < t2

As an exercise, write an __lt__ method for Time objects. You can use tuple comparison, but you also might consider comparing integers.

Decks
=========
Here is a __str__ method for Deck:

# inside class Deck:
    def __str__(self):
        res = []
        for card in self.cards:
            res.append(str(card))
        return '\n'.join(res)

This method demonstrates an efficient way to accumulate a large string: building a list of strings and then using the string method join. The built-in function str invokes the __str__ method on each card and returns the string representation.
Since we join on a newline character, the cards are separated by newlines. Here's what the result looks like:

>>> deck = Deck()
>>> print(deck)
Ace of Clubs
2 of Clubs
3 of Clubs
...
10 of Spades
Jack of Spades
Queen of Spades
King of Spades

Even though the result appears on 52 lines, it is one long string that contains newlines.

Complete Chapter 18 and 19 yourself.
============================
Panda : Data Structure
=======================

1. To get started, import numpy and load pandas into your namespace:

import numpy as np
import pandas as pd

2. Series : a one-dimensional labeled array.

Syntax:
s = pd.Series(data, index=index)
Here, data can be many different things:

- a Python dict
- an ndarray
- a scalar value (like 5)
The passed index is a list of axis labels.
3. Creating Series using ndarray. 
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

Example : 
s = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])


Getting only indexes from Series
s.index

Creating Series without indexes.
pd.Series(np.random.randn(5))


4. Creating a Series from a dict : If data is a dict and an index is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the sorted keys of the dict, if possible.

d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)

pd.Series(d, index=['b', 'c', 'd', 'a'])

NaN (not a number) is the standard missing data marker used in pandas

5. Creating Series from Scalar values 
 pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
 
6. Series as  ndarray like
s[0]
s[:3]
s[s > s.median()]
s[[4, 3, 1]]
np.exp(s)

6. Series like dict
s['a']
s['e'] = 12.
s
'e' in s
'f' in s

-- If a label is not contained, an exception is raised:
s['f']

-- Using the get method, a missing label will return None or specified default:
s.get('f')
s.get('f', np.nan)

7. Vectorized operations and label alignment with Series
-- When doing data analysis, as with raw NumPy arrays, looping through a Series value by value is usually not necessary. A Series can also be passed into most NumPy methods expecting an ndarray.

s + s
s * 2
np.exp(s)

-- A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

- As shown below, the first Series will not have the label 'a' and the second will not have the label 'e'; hence the respective positions in the result will be null.
s[1:] + s[:-1]

-- The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing NaN.
-- Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the dropna function
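
A quick sketch of dropping those missing labels with dropna, as mentioned above:

result = s[1:] + s[:-1]
result.dropna()    # keeps only the labels present in both operands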

8. Name attribute : a Series can also have a name attribute
s = pd.Series(np.random.randn(5), name='something')
s
s2 = s.rename("different")
s2

-- Note that s and s2 refer to two different objects (rename returns a copy).

9. Data frames : DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 
===========================================

10. DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
11. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

12. If axis labels are not passed, they will be constructed from the input data based on common sense rules.

13. From dict of Series or dicts : 
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   
df = pd.DataFrame(d)

df

So by looking at the resulting DataFrame, you can see that the dict keys become the column names, and the labels of the Series become the row index.
pd.DataFrame(d, index=['d', 'b', 'a'])

result : 
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Result : 
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

-- The row and column labels can be accessed respectively by accessing the index and columns attributes.
df.index
df.columns

14. DataFrame using dict of ndarrays / lists
============================================

d = {'one' : [1., 2., 3., 4.],
   'two' : [4., 3., 2., 1.]}

pd.DataFrame(d)
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

15.  numpy.zeros(shape, dtype=float, order='C') : Return a new array of given shape and type, filled with zeros.
order : {‘C’, ‘F’}, optional Whether to store multidimensional data in C- or Fortran-contiguous (row- or column-wise) order in memory.
Returns: : ndarray

np.zeros(5)
np.zeros((5,), dtype=int)
np.zeros((2, 1))

s = (2,2)
np.zeros(s)

np.zeros((2,), dtype=[('x', 'i4'), ('y', 'i4')]) # custom dtype

16. From structured or record array
==================================

Create a structured array with two records and three fields (A, B, C):
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
data[:] = [(1,2.,'Hello'), (2,3.,"World")]

pd.DataFrame(data)

Result
   A    B      C
0  1  2.0  Hello
1  2  3.0  World

pd.DataFrame(data, index=['first', 'second'])

Result: 
        A    B      C
first   1  2.0  Hello
second  2  3.0  World

pd.DataFrame(data, columns=['C', 'A', 'B'])
       C  A    B
0  Hello  1  2.0
1  World  2  3.0

-- Note DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

17. Now create from list of dict

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2, index=['first', 'second'])
pd.DataFrame(data2, columns=['a', 'b'])


18. From a dict of tuples : You can automatically create a multi-indexed frame by passing a tuples dictionary

pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

19. Creating DataFrame using Series.

-- The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name provided).
-- Missing Data  : Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing data, use np.nan for those values which are missing. Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.
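
A minimal sketch of constructing a DataFrame from a single Series (my own example):

ser = pd.Series([1., 2., 3.], index=['a', 'b', 'c'], name='one')
pd.DataFrame(ser)                    # a single column named 'one'
pd.DataFrame(ser, columns=['one'])   # equivalent, naming the column explicitly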

20. DataFrame.from_dict : 
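
The notes stop here; as a placeholder, a small hedged sketch of DataFrame.from_dict (orient is a real parameter; the data below is made up):

d = {'row1': [1, 2, 3], 'row2': [4, 5, 6]}
pd.DataFrame.from_dict(d)                  # keys become columns by default
pd.DataFrame.from_dict(d, orient='index')  # keys become the row index instead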

















Chapter 11 : Apache Flume Introductions


Chapter 12 : Where to go from Here


Appendix : All material provided by www.HadoopExam.com

Appendix  : We are looking for Authors/Trainers






=================