For starters, what exactly is unit testing, and why is it needed?

A unit test is a way of testing a unit - the smallest piece of code that can be logically isolated in a system. Essentially, a unit test is a method that instantiates a small portion of our application and verifies its behavior independently from other parts.

Setting up the environment:

First, run the following commands to install pytest and pandas, which are required:

```
pip install pytest
pip install pandas
```


Heading towards the practical side now, let's see how exactly we should write unit tests.

We have simple data where each person's name and age are available, but we need a new column containing their birth year. Here is a function to achieve that:

```
# extract_birth_year.py
import datetime


def get_birth_year(df):
    '''
    Calculate the birth year and create a new column called
    birth_year in the dataframe.

    parameters:
        df: dataframe having an Age column
    returns:
        dataframe with a birth_year column.
    '''
    year = datetime.date.today().year
    replace_boolean_values = [True, False]
    if 'Age' in df.columns:
        df['Age'] = df['Age'].replace(replace_boolean_values, 0)
        df['birth_year'] = year - df['Age']
    else:
        raise NotImplementedError('unsupported dataframe')
    return df
```

You might be wondering why we are replacing the boolean values. Spot on, here's the scenario: if a boolean value is present in the Age column, it produces a wrong birth_year. How? Boolean values are treated as 0 and 1 for False and True respectively, so with the current year, say 2022, the calculation would be 2022 - 1 = 2021 or 2022 - 0 = 2022, and those results would be wrong. Instead, booleans are replaced with 0, so the birth year becomes the current year and we can build our next steps around that.
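You can see this bool-as-int behavior directly in plain Python (a quick sketch, not part of the tutorial's files):

```
import datetime

# bool is a subclass of int in Python: True == 1 and False == 0,
# so subtracting a boolean "age" silently produces a wrong birth year.
year = datetime.date.today().year
print(year - True)            # one less than the current year
print(year - False)           # the current year itself
print(issubclass(bool, int))  # True
```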

Now we are going to write test cases for the above function. We will mainly learn the framework for writing test cases, as well as how exceptions are tested using pytest, one of the most widely used Python testing libraries.

- We are writing code to check whether the function behaves as expected.
- Is our function capable of providing user-friendly error messages that are easy to understand and debug?
- Last but not least, test cases are also a key aid to understanding the functionality of the code, alongside the usual docstrings and type hints.

The first step in writing a test case is creating a test file. In this case, our function lives in a file called extract_birth_year.py.

So, by convention, we will create a new Python file named test_extract_birth_year.py.

Inside the file we just created, we will import our function from extract_birth_year.py. Please keep both files in the same folder for now; we will discuss ways to call functions from other folders separately, and I'll add a link for that here.

```
from extract_birth_year import get_birth_year
```

This will import the intended function. Let's start with the cases now.

There is a simple framework recommended by the pytest documentation, which says to use the following four steps to write your test cases:

- Arrange
- Act
- Assert
- Cleanup

You can explore more about it here

```
import pandas as pd

from extract_birth_year import get_birth_year


def test_get_birth_year():
    # Arrange (the expected birth years assume the current year is 2022)
    expected_data = [['tom', 10, 2012], ['nick', 15, 2007], ['juli', 14, 2008]]
    df_expected_output = pd.DataFrame(expected_data,
                                      columns=['Name', 'Age', 'birth_year'])
    # Act
    data = [['tom', 10], ['nick', 15], ['juli', 14]]
    df_actual = pd.DataFrame(data, columns=['Name', 'Age'])
    df_actual_output = get_birth_year(df_actual)
    # Assert
    assert df_actual_output.equals(df_expected_output)
```

To run the test cases, use the following command:

```
pytest test_extract_birth_year.py
```

In Arrange, I have set up the expected output of the function; Act is about calling the function under test and saving its output in a variable; and in the end, Assert checks whether the actual and expected values match.

Here we are using .equals, a handy method provided by pandas that checks whether two dataframes are identical.
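A tiny illustration of how strict `.equals` is (toy dataframes, made up for this example):

```
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [1, 2]})
df3 = pd.DataFrame({'a': [1.0, 2.0]})  # same values, different dtype

print(df1.equals(df2))  # True: same values and dtypes
print(df1.equals(df3))  # False: .equals is dtype-sensitive
```

pandas also provides pd.testing.assert_frame_equal, which raises an error with a detailed diff when two frames differ; that can make failing tests easier to read.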

if you run the above test, you will see a screen somewhat like this:

green means your test passed!!!

if everything is red, that means the test failed

if you just add -v to the command, like this:

```
pytest test_extract_birth_year.py -v
```

it will show a summary, kind of like this:

if you want to run one specific test, there is a way for that too:

```
pytest test_extract_birth_year.py::test_get_birth_year
```

This will select only the specified test function and execute it; the output will look somewhat like this:

Cool, enough with the pytest commands; let's get back to our function and test cases.

Now we have the second part to the function as well, which is if provided dataframe is not the correct one then our function should raise an exception.. let's write it for that now...

Before writing this test, we need to check that the function raises specifically NotImplementedError. We cannot just check that a general Exception is raised, because the Age column might contain string data, in which case the function would throw a different error, such as:

*TypeError: unsupported operand type(s) for -: 'int' and 'str'*

which is different from our use case. If we assert on the generic Exception, the test would still pass, when in fact the difference should get noticed.

Coming back to exception testing: pytest provides a pretty useful syntax to check whether the expected exception actually occurred.

```
# test_extract_birth_year.py
import pandas as pd
import pytest

from extract_birth_year import get_birth_year


def test_get_birth_year_unsupported_exception():
    # Arrange & Act
    data = [['krish', 1], ['jack', 50], ['elon', 100]]
    df_input = pd.DataFrame(data, columns=['Name', 'Amount'])
    with pytest.raises(NotImplementedError) as exc_info:
        df_actual_output = get_birth_year(df_input)
    # Assert
    assert exc_info.type is NotImplementedError
    assert exc_info.value.args[0] == 'unsupported dataframe'
```

Here you will observe that Arrange and Act are clubbed together; we can do that when it improves the readability of the code.

Finally, one edge case: if a boolean value is present in the Age column, the function should return the current year as the birth year.

```
# test_extract_birth_year.py
import datetime

import pandas as pd

from extract_birth_year import get_birth_year


def test_get_birth_year_unclean_data():
    # Arrange (2007 and 2008 assume the current year is 2022)
    year = datetime.date.today().year
    expected_data = [['tom', 0, year], ['nick', 15, 2007], ['juli', 14, 2008]]
    df_expected_output = pd.DataFrame(expected_data,
                                      columns=['Name', 'Age', 'birth_year'])
    # Act
    data = [['tom', True], ['nick', 15], ['juli', 14]]
    df_actual = pd.DataFrame(data, columns=['Name', 'Age'])
    df_actual_output = get_birth_year(df_actual)
    # Assert
    assert df_actual_output.equals(df_expected_output)
```

Conclusion: we learned how to write test cases for:

- Checking the behavior of the function when the data provided is correct.
- Checking the behavior of the function when the data provided is not clean.
- Checking that exception handling works as intended when the data provided is unsupported.

Git-repo link to access the folder.

Cheers till the next one!!!

Feel free to contact me. Happy Learning :)

Type hints are literally what the name suggests: you get a hint of the data types of a function's inputs and outputs. Type hints help with understanding and debugging code, and as a standard way of shipping code, using type hints can add significant value.

Python is a dynamically typed language, which means you never have to explicitly indicate what type a variable has.

But in some cases, dynamic typing can lead to bugs that are very difficult to debug, and in those cases type hints (static typing) can be convenient.

Type hints help document your code. Traditionally, you would use docstrings if you wanted to document the expected types of a function's arguments. This still works, but since there is no standard for docstrings, they can't easily be used for automatic checks.

You need to consider using type hints to help others and yourself. Type hints increase the readability with self-explanatory code.

Type hints also help you to build and maintain a cleaner code architecture as you need to consider types when annotating them in your code.

Type hints were introduced as a new feature in Python 3.5.

Let's look at the following function as an example:

```
# function without type hints
def addition(a, b):
    return a + b


print(addition(3, 5))  # 8
```

**Here's how to add type hints to the above function:**

- Add a colon and a data type after each function parameter
- Add an arrow (->) and a data type after the function to specify the return data type

Code-wise, it looks more like this:

```
# function with type hints
def addition(a: int, b: int) -> int:
    return a + b


print(addition(3, 5))  # 8
```
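One caveat worth knowing: Python does not enforce these hints at runtime; they are metadata for readers and for tools like mypy. A quick sketch:

```
def addition(a: int, b: int) -> int:
    return a + b


# Hints are not checked at runtime: passing strings still "works"
print(addition('3', '5'))  # '35' -- string concatenation, no error

# The hints are stored on the function for tools to inspect
print(addition.__annotations__)
```

A static checker such as mypy would flag the string call above as a type error.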

Python's typing module can make data type annotations even more precise. For example, you can specifically declare a list of strings, a tuple containing three integers, or a dictionary with strings as keys and integers as values.

Here's how:

```
from typing import List, Tuple, Dict
my_list: List[str] = ['a', 'b', 'c']
my_tuple: Tuple[int, int, int] = (1, 2, 3)
my_dictionary: Dict[str, int] = {'a': 1, 'b': 2}
```
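As a side note (assuming you are on Python 3.9 or later), the built-in container types can be used as generics directly, so the typing imports above become optional:

```
# Python 3.9+: built-in generics, no `typing` import needed
my_list: list[str] = ['a', 'b', 'c']
my_tuple: tuple[int, int, int] = (1, 2, 3)
my_dictionary: dict[str, int] = {'a': 1, 'b': 2}
print(my_list, my_tuple, my_dictionary)
```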

Let's get to the fun part now. Say you have a lot of functions in your Python file and you want to add type hints to all of them; a library called MonkeyType can help.

let's see how to use it.

for installation:

```
pip install MonkeyType
```

for simplicity, I'm using simple functions you can follow the same process on your own files

Here's the mango_without_typehints.py file containing all the functions I'm using (it's mango season here, so I named it mango)

and test.py is the file where I'm calling those functions

Open your preferred command prompt, activate your working environment (if any), navigate to the directory where your files are present, and run the following command:

```
monkeytype run test.py
```

After that you'll get output somewhat like this. Behind the scenes, MonkeyType dumps the call traces into a SQLite database in the file monkeytype.sqlite3.

Then you apply those traces to your mango_without_typehints.py file where your functions are defined; make sure to pass the module name without the .py extension:

```
monkeytype apply mango_without_typehints
```

you'll get somewhat like this output

Now, go check your mango_without_typehints.py file; type hints will have been added there for each function.

Here are the files used here for you to download and try: Download via GitHub

Until next time, chill, relax and keep learning!

The above image will certainly help us understand what exactly is happening with true and false in the actual-vs-predicted table.

We want more True Positive and True Negative because these are the results that are correct in the actual scenario as well as in the test result.

Similarly, we want to avoid False Negative and False Positive, because these are the false results given by our test where actual and test results don't match.

Still, no test is 100% accurate, so we don't have the choice to avoid both of them entirely, but we can certainly reduce the occurrences of these errors.

More technically, these are referred to as **Type I and Type II Errors.**

We commit a **Type I Error** when we reject the null hypothesis when, in fact, it was true.

A **Type II Error** occurs when we accept the null hypothesis when, in fact, it was false.

False Positive = Type I Error

False Negative = Type II Error
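These four cells can be tallied from paired actual/predicted labels; here is a minimal sketch with made-up binary labels (1 = positive):

```
# Counting TP/TN/FP/FN for a binary outcome (1 = positive)
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # Type I error
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # Type II error

print(tp, tn, fp, fn)  # 3 3 1 1
```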

A curious question might be: which error is worse?

Well, everything is not so black and white when it comes to this concept; it may depend on the situation where these concepts are applied.

For example, when you take a pregnancy test, you have your null hypothesis:

I am not pregnant

Rejecting the hypothesis gives you a plus sign: Congratulations! You are pregnant! Accepting the hypothesis gives you a minus sign: Sorry, better luck next time!

Tests can malfunction, and false positives do occur. In this case, a false positive would be that little plus sign when you are, in fact, not pregnant. A false negative, of course, would be the minus sign when you've got a little baby growing inside you.

Imagine someone has been trying for a child for a long time, and then by some miracle their pregnancy test comes back positive. They mentally prepare themselves for having a baby, and after a short period of ecstasy they somehow find out that they are, in fact, not pregnant!

This is a terrible outcome!

Similarly, a false negative for someone who really does not want a child and is not ready for one, who, reassured by the negative result, proceeds to drink and smoke, can be incredibly damaging for her, her family, and her baby.

So, it is quite challenging to generalize which is the worst error.

**Note**: *However, pregnancy tests have improved a lot, and the chances of a false negative are minimal.*

As you can see from the example above, each error has its own set of issues depending on the problem. There is no worst error to commit because each problem brings its own set of complications to the table. Errors will happen throughout experiments. It is up to the project designer, test maker, or data scientist to determine which error needs to be reduced the most.

Because each problem has its own context and obstacles, you will need to take them all into consideration when designing your experiments and projects to know which error is the worst one to commit.

Similar cases were reported with COVID tests; there's an insightful article from the BBC where false positives are discussed.

You may find many more examples where these concepts are discussed; I feel it is one of the most fascinating concepts in statistics.

That's all for this week, something interesting next week, and you can always connect with me Here

When we look at the predicted values or outcomes provided by our machine learning model, in many cases it's not enough to just go with those predictions. We need to understand what is happening behind the curtains: what parameters the model is taking into account, and whether the model contains any bias.

At times, with machine learning models, data goes in and predictions (output) come out, but we don't know what exactly happens inside. All you can say is that nobody knows; it's like a black box.

That's where explainability comes in: being able to explain what happens in your model from input to output. It makes models transparent and solves the black-box problem.

Simply, Explainability is being able to quite literally explain what is happening.

- **Accountability**: necessary to avoid similar problems in the future.
- **Trust**: for humans, it's hard to take in something they don't fully understand.
- **Compliance**: to ensure compliance with company policies, industry standards, and government regulations.
- **Performance**: explainability helps to understand where exactly to fine-tune the model.
- **Enhanced control**: self-explanatory.

Explainability is even more important in high-risk domains like healthcare or finance, where tiny mistakes can lead to the death of patients or the loss of millions.

**Globally**: the overall explanation of model behavior.

**Locally**: explains each instance and feature in the data individually.

Similarly, another important aspect I learned about is that different countries have different laws and regulations regarding how companies process user data, and different types of consent are explicitly required from users or the respective governments.

I came across a site that helps us understand data-related laws in different countries, with simple visuals as well as text:

DATA PROTECTION LAWS OF THE WORLD

You can easily notice that some countries have heavily defined laws and regulations around data, while some don't have any at all; most fall somewhere between limited and moderate. Moreover, you can even compare your country with others and gain some insights to satisfy your curiosity.

That's all for this week, something interesting next week, and you can always connect with me Here

```
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_df = df.select_dtypes(include=numerics)
len(numeric_df.columns)
```

Here, the first line creates a list of all the data types having numeric values and saves them into a variable called `numerics`

then the `numeric_df` variable stores the columns whose data types are listed in the `numerics` variable.

Finally, we use the simple `len` function to count the number of columns.
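To make the snippet self-contained, here is a toy dataframe to run it on (the column names and values are made up for illustration):

```
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'LA'],     # object dtype, excluded
    'accidents': [100, 80],   # integer dtype, included
    'severity': [2.5, 1.8],   # float dtype, included
})

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_df = df.select_dtypes(include=numerics)
print(len(numeric_df.columns))  # 2
```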

Tasks Performed under this project:

- Importing Dataset
- Data Preparation and Cleaning
- Exploratory Analysis and Visualization
- Answering Relevant Questions
- Summary and Conclusion

The dataset contains countrywide car accident records covering 49 states of the USA. The accident data were collected from February 2016 to December 2020; currently, there are about 1.5 million accident records in the dataset.

Imported the accidents data from Kaggle into Google Colab: more than 1.5 million accident records across the US. Read the dataset with pandas and performed data exploration and visualization using Python.

Used pandas to analyze the cities with the most accidents and figured out the top 5. Converted strings to timestamps and analyzed the hourly pattern of when accidents are most frequent. Used Seaborn to visualize yearly data and found that considerable data is missing for the year 2016.

Prepared clear inferences and conclusions and answered relevant questions using Pandas methods and attributes.
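The string-to-timestamp step can be sketched like this (the Start_Time column name and the sample rows are assumptions for illustration, not the actual Kaggle schema):

```
import pandas as pd

# Hypothetical accident timestamps stored as strings
df = pd.DataFrame({'Start_Time': ['2020-01-01 08:15:00',
                                  '2020-01-01 17:40:00',
                                  '2020-01-02 08:50:00']})

# Convert string to timestamp, then extract the hour for the hourly pattern
df['Start_Time'] = pd.to_datetime(df['Start_Time'])
hourly = df['Start_Time'].dt.hour.value_counts()
print(hourly)  # hour 8 appears twice, hour 17 once
```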

**Insights:**

- No data from New York.
- Less than 3% of cities have more than 1,000 yearly accidents.
- Over 1,100 cities have reported just one accident (needs investigation).
- Data points are missing for the year 2016.

**Areas of future work:**

Accident analysis according to weather (related columns 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)'); accident analysis according to road condition (related columns 'Bump', 'Crossing').

Check out scripts here:

Design of data model in Power BI

As a data analyst working at a news company, you are asked to visualize data that will help readers understand how countries have performed historically in the summer Olympic Games.

You also know that there is interest in details about the competitors, so if you find anything interesting, don't hesitate to bring that in as well.

The main task is still to show historical performance for different countries, with the possibility to select your own country.

The data was in .bak format (dataset link), which was restored into a SQL database and afterward transformed using the transformations you can see below.

```
/****** Script for SelectTopNRows command from SSMS ******/
SELECT TOP (100) *
FROM [olympic_games].[dbo].[athletes_event_results]

SELECT ID
    ,Name AS 'Competitor Name'  -- Renamed column
    ,CASE WHEN Sex = 'M' THEN 'Male'
          ELSE 'Female'
     END AS Sex                 -- Better name for filters and visualisations
    ,Age
    ,CASE WHEN Age < 18 THEN 'Under 18'
          WHEN Age BETWEEN 18 AND 25 THEN '18-25'
          WHEN Age BETWEEN 25 AND 30 THEN '25-30'
          WHEN Age > 30 THEN 'Over 30'
     END AS 'Age Grouping'
    ,Height
    ,Weight
    ,Sport
    ,City
    ,NOC AS 'Nation Code'       -- Expanded abbreviation
    ,LEFT(Games, CHARINDEX(' ', Games) - 1) AS 'Year'  -- Split column to isolate Year, based on space
    ,CASE WHEN Medal = 'NA' THEN 'Not Registered'
          ELSE Medal
     END AS Medal               -- Replaced NA with Not Registered
FROM [olympic_games].[dbo].[athletes_event_results]
WHERE RIGHT(Games, CHARINDEX(' ', REVERSE(Games)) - 1) = 'Summer'  -- WHERE clause to isolate the Summer season
```

As this is a view where dimensions and measures have been combined, the data model created in Power BI is built from the table having these dimensions and measures. The query from the previous step was loaded directly into Power BI.

The following calculations were created in the Power BI reports using DAX (Data Analysis Expressions).

```
-- Number of competitors
No of Competitors = DISTINCTCOUNT(olympic_games[ID])

No of Medals = COUNTROWS(olympic_games)

No of Medals (registered) =
CALCULATE (
    [No of Medals],
    FILTER (
        'olympic_games',
        olympic_games[Medal] = "Bronze"
            || olympic_games[Medal] = "Gold"
            || olympic_games[Medal] = "Silver"
    )
)
```

The finished dashboard consists of visualizations and filters that give an easy option for the end-users to navigate the summer games through history. Some possibilities are to filter by period using year, nation code to focus on one country or look into either a competitor or specific sports over time.

Click here to download the dashboard and try it out!


**DataFrame**: just a table with rows and columns (pandas.core.frame.DataFrame)

**Series**: each column in a DataFrame is called a Series (pandas.core.series.Series)

here's an example:

```
df.head()
df.describe()
```

df is any DataFrame in pandas.

head() and describe() take parentheses,

but let's look at the following commands

```
df.shape
df.dtypes
```

these commands don't take parentheses,

so when to use parentheses and when not to use them?

the simple reason is that shape and dtypes are attributes of the dataframe, not methods. To understand the difference, visualize methods as actions and attributes as descriptions.

let's take the example of a random boy in the group, named Sam:

```
# these are the actions called methods that require parentheses
sam.eat()
sam.talk()
```

```
# these are the descriptions that don't require parentheses
sam.weight
sam.height
```

in simple words, we can conclude that actions (methods) require parentheses, and attributes (descriptions) don't.

it's not a hard and fast rule, but the example makes sense.
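The Sam analogy maps directly onto a small class (a toy sketch, made up for illustration):

```
class Person:
    def __init__(self, weight, height):
        # attributes: plain descriptions, accessed without parentheses
        self.weight = weight
        self.height = height

    def eat(self):
        # method: an action, called with parentheses
        self.weight += 1
        return 'eating'


sam = Person(weight=70, height=175)
print(sam.weight)  # attribute access: 70
print(sam.eat())   # method call: 'eating'
print(sam.weight)  # 71 -- the action changed the state
```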

Thanks for reading!

Let's make this clear first: I assume you have a basic understanding of SQL before reading further, and a refresher will be a plus. Check out my SQL essentials, which is a 10-minute read, and come back here.

What I am using here is the legendary Titanic dataset; download it here.

After landing on the page, right-click, choose Save As, and select a proper location on your PC. I encourage you to try it out on your own dataset.

I am querying the data using pandasql inside a Jupyter notebook, via the pandas library in Python. Considering the hard ways of importing a CSV into SQL, I found this easy way; all syntax and clauses will work fine in your databases. Needless to say, it's a great opportunity to get hands-on with pandasql.

Let's start with WHERE first. What the WHERE clause does is filter the data according to a given condition.

```
import pandas as pd
import pandasql as ps
df = pd.read_csv("titanic.csv")
#Selecting everything from dataset
ps.sqldf("""
SELECT * FROM df
LIMIT 5
""")
#Filtering with where from dataset
ps.sqldf("""
SELECT * FROM df
WHERE Age = 25
LIMIT 5
""")
```

Note: ps.sqldf(""" """) is pandasql syntax; you can ignore it if you are using another database.

Look closely at the results of the WHERE clause output: you will get all the rows where Age is 25.

Here we can conclude that WHERE filters our dataset and produces the output.

So here I would like to pose a problem: if I create another column with a SELECT statement, then WHERE cannot filter on the values of that newly created column.

```
#creating new columns with aggregate functions.
ps.sqldf("""
SELECT
Name,
Age,
AVG(Age) AS avg_age,
SUM(Age) AS tot_age,
MIN(Age) AS min_age,
MAX(Age) AS max_age
FROM df
GROUP BY Embarked
""")
```

In this case, we can use HAVING to filter the output of the above query, like this:

```
#Using HAVING to filter out the output
ps.sqldf("""
SELECT
Name,
Age,
AVG(Age) AS avg_age,
SUM(Age) AS tot_age,
MIN(Age) AS min_age,
MAX(Age) AS max_age
FROM df
GROUP BY Embarked
HAVING Age > 30
""")
```

This will simply give you the filtered results on the output of the query.

we can conclude that

**WHERE clause**: filters the data from a table, before any grouping or aggregation.

**HAVING clause**: filters the output of the query, after grouping and aggregation.
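The same distinction can be reproduced with Python's built-in sqlite3 module (a toy table, not the Titanic data):

```
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE passengers (name TEXT, port TEXT, age REAL)")
conn.executemany("INSERT INTO passengers VALUES (?, ?, ?)",
                 [('tom', 'S', 22), ('nick', 'S', 40), ('juli', 'C', 30)])

# WHERE filters rows BEFORE grouping: the 22-year-old is dropped first
rows = conn.execute(
    "SELECT port, AVG(age) FROM passengers WHERE age > 25 GROUP BY port"
).fetchall()
print(sorted(rows))  # [('C', 30.0), ('S', 40.0)]

# HAVING filters groups AFTER aggregation: only groups whose average passes
groups = conn.execute(
    "SELECT port, AVG(age) AS avg_age FROM passengers "
    "GROUP BY port HAVING avg_age > 30.5"
).fetchall()
print(groups)  # [('S', 31.0)]
```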

You will get a better understanding of what's happening here once you try it out on your own.

Till then thanks for reading, catch you in the next one.

So in simple terms:

"Statistics is the grammar of science." (Karl Pearson)

Let's get into this,

Suppose you have a dataset of a company having 56k employees:
when you take out 10k random rows from that data, that is a **sample**;
when you consider all 56k rows, that's the **population**.

Descriptive statistics (summarizes or describes the characteristics of a data set.)

Inferential statistics (you take out sample from data(also called population) to describe and make inferences about the population.)

measures of central tendency (Mean, Median, Mode)

measures of variability or spread (Standard Deviation, variance, range)

Where Measures of central tendency describe the center of a data set.

Measures of variability or spread describe the dispersion of data within the set.

Inferential statistics is all about taking a sample and conducting appropriate tests. Key tests in inferential statistics are hypothesis testing, the Z-test, the T-test, the chi-square test, and ANOVA.

In descriptive statistics, you take the data (or population) and analyze, visualize, and summarize it in the form of numbers and graphs.

On the other side, in inferential statistics, we take a sample of the population and run some tests to come up with inferences and conclusions about that population.

A random variable is a numerical description of the outcome of a statistical experiment.

Consider the experiment of tossing two coins. We can define X to be a random variable that measures the number of heads observed in the experiment. For the experiment, the sample space is shown below:

```
S = {(H, H), (H, T), (T, H),(T, T)}
```

There are 4 possible outcomes for the experiment, and this is the domain of X. The random variable X takes these 4 outcomes/events and processes them to give different real values. For each outcome, the associated value is shown as:

```
X(H, H) = 2 (two heads)
X(H, T) = 1 (one head)
X(T, H) = 1 (one head)
X(T, T) = 0 (no heads)
```
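The mapping above can be written out in a few lines of Python:

```
from itertools import product

# Sample space of tossing two coins
sample_space = [''.join(p) for p in product('HT', repeat=2)]
print(sample_space)  # ['HH', 'HT', 'TH', 'TT']

# Random variable X: the number of heads in each outcome
X = {outcome: outcome.count('H') for outcome in sample_space}
print(X)  # {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}
```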

There are three types of random variables- discrete random variables, continuous random variables, and mixed random variables.

Discrete- Discrete random variables are random variables, whose range is a countable set. A countable set can be either a finite set or a countably infinite set. Eg: Bank Account number in a random group.

Continuous- Continuous random variables, on the contrary, have a range in the forms of some interval, bounded or unbounded, of the real line. E.g., Let Y be a random variable that is equal to the height of different people in a given population set.

Mixed Random Variables: Lastly, mixed random variables are ones that are a mixture of both continuous and discrete variables.

**Mean**: the sum of all observations divided by the number of observations.
The sample mean is denoted by x̄ (pronounced "x bar"); the population mean is denoted by μ (pronounced "mu").

**Median**: The value of the middlemost observation, obtained after arranging the data in ascending order, is called the median of the data.

**Why/where the median is used:**
if there are outliers in your data, the mean misrepresents the distribution, which is harmful to the analysis.

```
import numpy as np

height = [10, 20, 30, 40, 50, 10000]
print(np.mean(height))    # 1691.6666666666667
print(np.median(height))  # 35.0
```

For instance, in the above example, there is a significant difference in mean and median with only a single outlier. so median is used in such cases to find central tendency.

**Mode**: The value which appears most often in the given data i.e. the observation with the highest frequency is called a mode of data.

```
# calculating mean median and mode in python
import numpy as np
from scipy import stats
# sample of height in cms
sample_of_height= [145,170,160,182,142,175,149,143,161,148,155,158,145,145]
print(np.mean(sample_of_height))
print(np.median(sample_of_height))
print(stats.mode(sample_of_height))
```

**Range**: Range, stated simply, is the difference between the largest (L) and smallest (S) value of the data in a data set. It is the simplest measure of dispersion.

**Quartiles**: are special percentiles, which divide the data into quarters.

The first quartile, Q1, is the same as the 25th percentile,

The median is called both the second quartile, Q2, and the 50th percentile.

and the third quartile, Q3, is the same as the 75th percentile.

**Interquartile Range (IQR)**: a number that indicates how spread out the middle half (i.e. the middle 50%) of the dataset is, and can help determine outliers. It is the difference between Q3 and Q1.

Understanding Interquartile Range (IQR) with boxplot

Generally speaking, outliers are those data points that fall outside of the lower whisker and upper whisker.
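A small numpy sketch of the IQR and the usual 1.5 × IQR whisker fences, reusing the height data with an outlier from the mean/median example:

```
import numpy as np

data = [10, 20, 30, 40, 50, 10000]  # one obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Boxplot whiskers typically extend 1.5 * IQR beyond Q1 and Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)  # 22.5 47.5 25.0
print(outliers)     # [10000]
```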

**Standard Deviation**: measures the dispersion of a dataset relative to its mean; denoted by the symbol σ (sigma).
The standard deviation is the square root of the variance.

**Variance**: a measurement of the spread between numbers in a data set; denoted by the symbol σ² (the square of the standard deviation).

variance is used to see how individual numbers relate to each other within a data set.

In a normal distribution, values are symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. Normal distributions are symmetrical, but not all symmetrical distributions are normal.

The empirical rule, also referred to as the three-sigma rule or the 68-95-99.7 rule, states that for a normal distribution, almost all observed data will fall within three standard deviations of the mean.

In particular, the empirical rule predicts that

68% of observations fall within the first standard deviation (μ ± σ),

95% within the first two standard deviations (μ ± 2σ), and

99.7% within the first three standard deviations (μ ± 3σ).
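You can check the rule empirically by simulating a normal sample (the fractions below are approximate by the nature of random sampling):

```
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=0, scale=1, size=100_000)

# Fraction of observations within k standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(samples) <= k)
    print(f"within {k} sigma: {frac:.3f}")  # roughly 0.683, 0.954, 0.997
```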

The central limit theorem states that the sampling distribution of the mean approaches a normal distribution, as the sample size increases. This fact holds especially true for sample sizes over 30.

for better understanding watch this.

Here n= number of observations, as n increases distribution starts looking like a normal distribution.

The central limit theorem tells us that no matter what the distribution of the population is, the shape of the sampling distribution will approach normality as the sample size (N) increases.

If the dataset does not follow a normal distribution, Chebyshev's inequality theorem lets us find what percentage of data points will fall within a given number of standard deviations.

for better understanding watch this

Skewness, in statistics, is the degree of asymmetry observed in a probability distribution.

Distributions can exhibit right (positive) skewness or left (negative) skewness to varying degrees. A normal distribution (bell curve) exhibits zero skewness.

Kurtosis is a statistical measure used to describe the degree to which values concentrate in the tails or the peak of a frequency distribution.

There are three types of kurtosis: mesokurtic, leptokurtic, and platykurtic.

**Mesokurtic**: Distributions that are moderate in breadth and curves with a medium peaked height.

**Leptokurtic**: More values in the distribution tails and more values close to the mean (i.e. sharply peaked with heavy tails)

**Platykurtic**: Fewer values in the tails and fewer values close to the mean (i.e. the curve has a flat peak and has more dispersed scores with lighter tails).

This is the most foundational part of statistics; apart from the above, two other key concepts are listed below, considering the perspective of data preprocessing and analysis.

In simple terms, covariance is quantifying the relationship between two variables or columns in a particular data set.

- relationship with negative trends
- relationship with no trends
- relationship with positive trends

These positive and negative trends help us identify the direction of the relationship. But even though covariance can be positive or negative, it can't tell you how strong that positive or negative trend is.
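A minimal sketch of computing covariance by hand, the average product of deviations from the means (the height/weight/price numbers are made up for illustration):

```python
def covariance(xs, ys):
    # average product of deviations from the two means
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

height = [150, 160, 170, 180, 190]
weight = [50, 58, 65, 74, 82]   # rises with height
price = [90, 80, 72, 60, 55]    # falls as height rises

print(round(covariance(height, weight), 2))  # positive -> positive trend
print(round(covariance(height, price), 2))   # negative -> negative trend
```

The sign tells you the direction of the trend; the magnitude, however, depends on the units of the variables, which is exactly the limitation discussed above.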

The limitation of covariance is that it can't tell you how strong the positive or negative trend is. To overcome this we have the Pearson correlation coefficient, which, based on the variance of the variables in your data, tells us both how strongly two variables are correlated and the direction of the relationship.

When you calculate the Pearson correlation coefficient, its value will always lie between -1 and +1.

ρ = 1 : all data points fall on a straight line; as X increases, Y increases

ρ = -1 : all data points fall on a straight line; as X increases, Y decreases

ρ = 0 : there is no linear relationship

-1 < ρ < 0 : data points don't fall on a straight line; as X increases, Y tends to decrease

Covariance and correlation are closely related but distinct terms; both are used in statistics and regression analysis. Covariance shows us how two variables vary together, whereas correlation additionally tells us the strength and direction of that relationship on a standardized scale.
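Correlation is just covariance rescaled by the two standard deviations, which pins it between -1 and +1. A hand-rolled sketch (the example values are made up):

```python
def pearson(xs, ys):
    # covariance divided by the product of the standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(round(pearson(x, [2, 4, 6, 8, 10]), 2))  # 1.0  : perfect positive line
print(round(pearson(x, [10, 8, 6, 4, 2]), 2))  # -1.0 : perfect negative line
print(round(pearson(x, [2, 1, 4, 3, 5]), 2))   # 0.8  : strong but imperfect positive
```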

**Nominal**: Data at this level is categorized using names, labels, or qualities. eg: Brand Name, ZipCode, Gender.

**Ordinal**: Data at this level can be arranged in order or ranked and can be compared. eg: Grades, Star Reviews, Position in Race, Date

**Interval**: Data at this level can be ordered as it is in a range of values and meaningful differences between the data points can be calculated. eg: Temperature in Celsius, Year of Birth

**Ratio**: Data at this level is similar to interval level with the added property of an inherent zero. Mathematical calculations can be performed on these data points. eg: Height, Age, Weight

let us understand Hypothesis Testing by using a simple example.

**Example**: Class 8th has a mean score of 40 marks out of 100. The principal of the school decided that extra classes are necessary in order to improve the performance of the class. The class scored an average of 45 marks out of 100 after taking extra classes. Can we be sure whether the increase in marks is a result of extra classes or is it just random?

Hypothesis testing lets us identify that. It lets a sample statistic be checked against a population statistic, or against the statistic of another sample, to study an intervention. Extra classes are the intervention in the above example.

Every hypothesis test involves two statements: the Null Hypothesis and the Alternate Hypothesis.

The **Null Hypothesis** states that the sample statistic is equal to the population statistic. For example, the null hypothesis for the above example would be that the average marks after extra classes are the same as those before the classes.

An **Alternate Hypothesis** for this example would be that the marks after extra class are significantly different from those before the class.

Hypothesis testing is done at different levels of confidence and makes use of a z-score to calculate the probability. So at a 95% confidence level, a z-score beyond the 95% threshold would lead us to reject the null hypothesis.

We never accept the null hypothesis; we only reject it or fail to reject it. As a practical tip, the null hypothesis is generally the statement we want to disprove. For example, if you want to prove that students performed better on their exams after taking extra classes, the null hypothesis would be that the marks obtained after the classes are the same as before.
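Putting the pieces together, here is a sketch of a one-sample z-test for the extra-classes example. Note that the population standard deviation (12) and sample size (36) are hypothetical numbers assumed purely for illustration; the example only stated the two means:

```python
from statistics import NormalDist

# H0: mean marks after extra classes = 40 (no change)
# H1: mean marks after extra classes != 40
pop_mean = 40     # mean before the intervention
sample_mean = 45  # mean after extra classes
pop_sd = 12       # ASSUMED population standard deviation (hypothetical)
n = 36            # ASSUMED sample size (hypothetical)

# z-score of the observed sample mean under H0
z = (sample_mean - pop_mean) / (pop_sd / n ** 0.5)

# two-sided p-value from the standard normal distribution
p = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # 95% confidence level
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if p < alpha else "fail to reject H0")
```

With these assumed numbers, z = 2.5 and p ≈ 0.012, so at the 95% level we would reject the null hypothesis: the improvement looks unlikely to be pure chance.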

Now we have defined a basic Hypothesis Testing framework. It is important to look into some of the mistakes that are committed while performing Hypothesis Testing and try to classify those mistakes if possible.

Now, look at the Null Hypothesis definition above. What we notice at the first look is that it is a statement subjective to the tester like you and me and not a fact. That means there is a possibility that the Null Hypothesis can be true or false and we may end up committing some mistakes along the same lines.

There are two types of errors that are generally encountered while conducting Hypothesis Testing.

**Type I error**: Look at the following scenario A male human tested positive for being pregnant. Is it even possible? This surely looks like a case of False Positive. More formally, it is defined as the incorrect rejection of a True Null Hypothesis. The Null Hypothesis, in this case, would be Male Human is not pregnant.

**Type II error**: Look at another scenario where our Null Hypothesis is A male human is pregnant and the test supports the Null Hypothesis. This looks like a case of False Negative. More formally, it is defined as the failure to reject a false Null Hypothesis.

We have covered some basic yet fundamental statistical concepts. If you are working or plan to work in the field of data science, you are likely to encounter these concepts.

There is, of course, much more to learn about statistics once you understand the basics.

Thanks for reading, catch you in the next one.

However, the steps I came across were not exact and not really documented in a proper way.

To solve that, I am going to document those steps here so they are easy to follow and get done quickly.

To generate-config file run this in Anaconda prompt or CMD whichever you are using:

```
jupyter notebook --generate-config
```

Locate the generated configuration file. In my case the path was:

C:\Users\Shrey\.jupyter\jupyter_notebook_config.py

Open it with Notepad or any text editor, then find the following lines:

```
## Specify what command to use to invoke a web browser when opening the notebook.
# If not specified, the default browser will be determined by the `webbrowser`
# standard library module, which allows setting of the BROWSER environment
# variable to override it.
# Default: ''
# c.NotebookApp.browser = ''
```

Here you care about the last line, which is the one you need to change.

I am using chrome to do so but you can do it for any browser of your wish :)

Just right-click on the browser icon in the taskbar, then right-click again on Chrome (or a new tab, whichever works for you), go to Properties, and copy the Target path,

and paste that path as stated below.

Note: the value must follow this syntax: u'path %s'

```
c.NotebookApp.browser = u'C:\Program Files\Google\Chrome\Application\chrome.exe %s'
```

don't forget to remove the "#" at the start of that particular line

Save the file and close it, close the Anaconda prompt and launch it again, then open Jupyter Notebook the usual way. This time the notebook will open in your desired browser.

For JupyterLab the procedure is the same; just the config key is different. I am listing the required changes below, please follow the same steps for JupyterLab.

To generate-config file run this in Anaconda prompt or CMD whichever you are using:

```
jupyter-lab --generate-config
```

this time the change is needed here:

c.ServerApp.browser =

In my case the path is as follows, just for your reference:

```
c.ServerApp.browser = u'C:/Program Files/Google/Chrome/Application/chrome.exe %s'
```

copy the browser target path and paste it in the required format, don't forget to remove the # at the start of that particular line, and save the file.

Done! Next time, JupyterLab will also open in your desired browser.

Thanks for reading, catch you in the next one.

this is going to be short, concise, and to the point, because I don't want to waste your time, and it will be enough SQL for you to feel confident.

if you have the habit of reading and experimenting with music on, I would suggest you put your headphones in and let's start with SQL.

let's start with very basic,

SQL (Structured Query Language) is the language used by relational databases. It allows you to store information in tables and to query those tables.

1970: Dr. Edgar Frank Codd, who worked for IBM, invented the relational database model. The query language built on it was initially called SEQUEL but, due to a trademark problem, was renamed SQL. However, many people still pronounce it SEQUEL.

1974 Structured Query Language appeared.

1978 IBM worked to develop Codd's ideas and released a product named System/R.

1986: SQL, whose first prototypes were developed at IBM, was standardized by ANSI. The first commercial relational database had been released by Relational Software, which later came to be known as Oracle.

The American National Standards Institute (ANSI) made SQL a standard in 1986, and the International Organization for Standardization (ISO) made it the database standard in 1987.

This means that if you learn SQL, you are almost ready to use all of these databases, because they all follow the ANSI standard. There are small differences in some functions, but apart from that, you are good to go.

What is a Database?

A database is an organized collection of structured information, or data, typically stored electronically in a computer system.

A database is usually controlled by a database management system (DBMS).

What is a DBMS (Database Management System)?

Database Management Systems (DBMS) are software systems used to store, retrieve, and run queries on data. A DBMS serves as an interface between an end-user and a database, allowing users to create, read, update, and delete data in the database.

- Hierarchical databases
- Network databases
- Object-oriented databases
- Relational databases
- NoSQL databases

Hierarchical Databases : Just as in any hierarchy, this database follows the progression of data being categorized in ranks or levels, wherein data is categorized based on a common point of linkage. As a result, two entities of data will be lower in rank and the commonality would assume a higher rank.

Network Databases : In layman's terms, a network database is a hierarchical database, but with a major tweak: child records are free to associate with multiple parent records. As a result, a network, or net, of database files linked by multiple threads is observed. For example, Student, Faculty, and Resources elements could each have two parent records, such as Departments and Clubs.

Object-Oriented Databases : Those familiar with the Object-Oriented Programming Paradigm would be able to relate to this model of databases easily. Information stored in a database is capable of being represented as an object which responds as an instance of the database model. Therefore, the object can be referenced and called without any difficulty. As a result, the workload on the database is substantially reduced.

Relational Databases : Considered the most mature of all databases, these databases lead in the production line along with their management systems. In this database, every piece of information has a relationship with every other piece of information. This is on account of every data value in the database having a unique identity in the form of a record. Note that all data is tabulated in this model. Therefore, every row of data in the database is linked with another row using a primary key. Similarly, every table is linked with another table using a foreign key.

NoSQL Databases : A NoSQL database, the name originally referring to "non-SQL" or non-relational, provides a mechanism for the storage and retrieval of data modeled in means other than the tabular relations used in relational databases.

Top DBMS you should know about (not ordered chronologically):

- MySQL
- Oracle
- PostgreSQL
- Microsoft SQL Server
- MongoDB

Before moving ahead I would like to create a sample table. Please note that every SQL query written here works on this table. If some code doesn't work, check that the table still has these values, because queries may overlap each other, or you might simply have deleted the table. No worries: keep the script below handy in a second query tab and execute it again.

```
DROP DATABASE IF EXISTS `hashnode`;
create database hashnode;
use hashnode;
create table person
(
id int not null,
first_name varchar(50) not null,
location varchar(50) not null,
salary bigint not null
);
# insert values into table
insert into person values(100,'Raj','Chennai',50000);
insert into person values(101,'john','Delhi',40000);
insert into person values(102,'Hari','Mumbai',20000);
insert into person values(103,'Suresh','Banglore',25000);
insert into person values(104,'Kaif','Pune',30000);
```

There are five types of SQL commands: DDL, DML, DCL, TCL, and DQL.

DDL changes the structure of the table: creating a table, deleting a table, altering a table, etc. All DDL commands are auto-committed, which means they permanently save all changes to the database.

Here are commands that come under DDL:

```
# create command is used to create a new table or database
create table person
(
id int not null,
first_name varchar(50) not null,
location varchar(50) not null
);
# alter is used to alter the structure of the database.
alter table person ADD last_name varchar(10);
# drop is used to delete both the structure and record stored in the table.
drop table person;
# truncate is used to delete all the rows from the table and free the space containing the table.
truncate table person;
```

DML commands are used to modify the database. They are responsible for all forms of changes to the data. DML commands are not auto-committed, which means the changes are not saved permanently right away; they can be rolled back.

Here are commands that come under DML:

INSERT UPDATE DELETE

```
# The insert statement is a SQL query. It is used to insert data into the row of a table
insert into person values(100,'Raj','Chennai');
# update command is used to update or modify the value of a column in the table.
update person
set first_name = 'Rohini'
where id = 102;
# delete is used to remove one or more rows from a table.
delete from person
where id = 102;
```

DCL commands are used to grant and take back authority from any database user.

Here are commands that come under DCL:

Grant: It is used to give user access privileges to a database.

Revoke: It is used to take back permissions from the user.

TCL commands can only be used with DML commands like INSERT, DELETE, and UPDATE.

They cannot be used with operations like creating or dropping tables, because DDL statements are automatically committed to the database.

Here are some commands that come under TCL:

Commit: Commit command is used to save all the transactions to the database.

```
start transaction;
set sql_safe_updates = false;
update person set first_name = 'Pratik' where id = 101;
commit;
select * from person;
```

Rollback: Rollback command is used to undo transactions that have not already been saved to the database.

```
start transaction;
set sql_safe_updates = false;
update person set first_name = 'Pratik' where id = 101;
rollback;
select * from person;
```

Savepoint: It is used to roll the transaction back to a certain point without rolling back the entire transaction

```
SAVEPOINT SAVEPOINT_NAME;
```

DQL is used to fetch the data from the database.

It uses only one command:

select which is the most basic command in SQL

```
select * from person;
```

Each column in a database table is required to have a name and a data type.

An SQL developer must decide what type of data will be stored inside each column when creating a table. The data type is a guideline for SQL to understand what type of data is expected inside of each column, and it also identifies how SQL will interact with the stored data.

MySQL uses many different data types broken into three categories:

Numeric, Date and Time, and String types.

INT A normal-sized integer that can be signed or unsigned. If signed, the allowable range is from -2147483648 to 2147483647. If unsigned, the allowable range is from 0 to 4294967295. You can specify a width of up to 11 digits.

BIGINT A large integer that can be signed or unsigned. If signed, the allowable range is from -9223372036854775808 to 9223372036854775807. If unsigned, the allowable range is from 0 to 18446744073709551615. You can specify a width of up to 20 digits.

DATE A date in YYYY-MM-DD format, between 1000-01-01 and 9999-12-31. For example, December 30th, 1973 would be stored as 1973-12-30.

DATETIME A date and time combination in YYYY-MM-DD HH:MM:SS format, between 1000-01-01 00:00:00 and 9999-12-31 23:59:59. For example, 3:30 in the afternoon on December 30th, 1973 would be stored as 1973-12-30 15:30:00.

VARCHAR(M) A variable-length string between 1 and 255 characters in length. For example, VARCHAR(25). You must define a length when creating a VARCHAR field.

CHAR(M) A fixed-length string between 1 and 255 characters in length (for example CHAR(5)), right-padded with spaces to the specified length when stored. Defining a length is not required, but the default is 1

An SQL operator is a special word or character used to perform tasks. These tasks can be anything from complex comparisons, to basic arithmetic operations. Think of an SQL operator as similar to how the different buttons on a calculator function.

+ (Addition) The + symbol adds two numbers together.

```
SELECT 10 + 10;
```

- (Subtraction) The - symbol subtracts one number from another.

```
SELECT 10 - 10;
```

* (Multiplication) The * symbol multiplies two numbers together.

```
SELECT 10 * 10;
```

/ (Division) The / symbol divides one number by another

```
SELECT 10 / 10;
```

% (Remainder/Modulus) The % symbol (sometimes referred to as Modulus) returns the remainder of one number divided by another.

```
SELECT 10 % 10;
```

& (Bitwise AND) | (Bitwise OR) ^ (Bitwise exclusive OR)

= (Equal to)

```
-- = (Equal to)
select first_name from person where salary = 25000;
```

!= (Not equal to)

```
-- != (Not equal to)
select first_name from person where salary != 25000;
```

> (Greater than)

```
-- > (Greater than)
select first_name from person where salary > 20000;
```

< (Less than)

```
-- < (Less than)
select first_name from person where salary < 40000;
```

>= (Greater than or equal to)

```
-- >= (Greater than or equal to)
select first_name from person where salary >= 25000;
```

<= (Less than or equal to)

```
-- <= (Less than or equal to)
select first_name from person where salary <= 40000;
```

<> (Not equal to)

```
-- <> (Not equal to)
select first_name from person where salary <> 50000;
```

The ALL operator returns TRUE if all of the subquery values meet the specified condition.

```
select first_name from person
where salary > all (select salary from person where id > 101)
```

The ANY operator returns TRUE if any of the subquery values meet the specified condition.

```
select first_name from person
where salary > any (select salary from person where id > 101)
```

The AND operator returns TRUE if all of the conditions separated by AND are true.

```
select first_name from person
where salary > 10000 and location = 'Delhi';
```

The BETWEEN operator filters your query to only return results that fit a specified range.

```
select first_name from person
where salary between 10000 AND 35000;
```

The EXISTS operator is used to filter data by looking for the presence of any record in a subquery.

```
select first_name, id from person
where exists (select salary from person where id = 104);
```

The IN operator includes multiple values set into the WHERE clause.

```
select * from person
where first_name in ('Raj','Hari');
```

LIKE operator searches for a specified pattern in a column.

```
select * from person
WHERE first_name like '%i%';
```

The NOT operator returns results if the condition or conditions are not true.

```
select * from person
WHERE first_name not in ('Raj','Hari');
```

The OR operator returns TRUE if any of the conditions separated by OR are true.

```
select first_name from person
where salary > 10000 or location = 'Delhi';
```

The IS NULL operator returns rows where the column value is empty (NULL).

```
select first_name from person
where location is null;
# here we don't have any null data so it will return 0 rows.
```

Something more on the writing SQL queries in a more effective way:

Here I am listing the query-writing sequence. Please note this is not the execution sequence; it is simply the order in which the clauses are written:

- SELECT
- FROM
- JOIN
- ON
- WHERE
- GROUP BY
- HAVING
- ORDER BY
- LIMIT

I know we did not cover some of the above clauses here, but we have covered the essentials, the backbone concepts of SQL. For those just getting their hands dirty with SQL, this essentials guide will help them get started with this in-demand yet fairly approachable skill.

On the ending note, I want to list some simple tips that can deepen your understanding.

SQL keywords are not case-sensitive; however, some organizations do follow code-style conventions, so you may need to follow their casing.

You can write your queries on a single line as well as across multiple lines. It's recommended practice to write queries across multiple lines because they are easier to read and make more sense.

It is best practice to use comments wherever necessary, because they help you understand scripts better, and if you look at the code again after some time, they help you recall what it does.

Now I feel those who have reached the end of this page are ready to take on beginner-level challenges and explore more statements and combinations.

I would like to stop here, it was a pretty lengthy article but it's worth it!

Thanks for reading...

To follow along with me, you will need the data set, which you can create by simply opening and running the file in MySQL. Download the file here.

Download the above file and open it in MySQL, select all, and run it; then open a new SQL tab... and here starts the real stuff.

Understanding the data

Here we are getting to know our dataset. The query below returns the list of tables available in our data set; here we have only three tables.

```
use school;
show tables;
# returns
# Tables_in_school
# subject
# teacher
# teacher_subject
```

let's look at fields and attributes available in the tables

```
desc teacher;
desc subject;
desc teacher_subject;
```

Run each of the above queries individually, you will understand the fields and primary keys and data types of different columns as well as connecting factors between these three tables.

Here you can easily conclude that there are three columns in the teacher table, where tid is the primary key; the subject table has two columns, where sid is the primary key; and lastly, in the third table we have two columns, tid and sid, which together form a composite key.

with this basic understanding, we can conclude that the teacher_subject table is a junction table for the other two tables.

Now that we understand the data and table structure, let's start writing subqueries to answer the following question:

write a subquery to retrieve all subject names taken by teacher Arun

Disclaimer: I know it's very easy to retrieve the same result using an inner join, but here we care about subqueries.

Here we are going to approach this in steps: we will divide the query into small steps, and then we will merge them to form the subquery.

follow these steps

```
# step 1 Retrieve tid for tutor Arun
select tid from teacher where tname ='Arun';
# step 2 Retrieve sid for tid received in step 1 query
select sid from teacher_subject where tid = 102;
# step 3 merge step 2 and step 1 queries
select sid from teacher_subject
where tid = (select tid from teacher where tname ='Arun');
# step 4 Retrieve sname for sid received in step 3 query
select sname from subject where sid in (12,14);
# Final step merge step 4 and step 3 queries
select sname from subject where sid in (select sid from teacher_subject
where tid = (select tid from teacher where tname ='Arun'));
```

This is how we write our subquery. What we ultimately want is just the final step, but while writing we might face some errors, and fixing them can be a real headache. With this approach we have a proper understanding of our subquery at each stage, so even if we get caught in some errors, it is very easy to trace back; most importantly, this approach brings clarity of mind.

Writing subqueries is a very important skill. Sometimes these queries can get quite long, so practicing them in a clear way makes them easy to document.

Here is the end of this small exercise. If you want something more to do on this same data set, I have a challenge for you: try to write the query below.

Write a query to retrieve all the tutors who are taking mathematics subject using a subquery

The answer key is here

I would love to hear from you if you find any errors or issues while doing so. Reach out here.

thanks for reading, see you in the next one.

on the other side, a **package** is a collection of Python modules, and it contains an additional `__init__.py` file.

let's get some clarity here: you can put your Python files in some folder, which is the usual way of saving your scripts, but if you want to make that folder a package you will need to add an `__init__.py` file, so that it gets identified as a package. Each module in that folder can then be called, as we will see further on.

so before jumping into creating our package, I would like to list the reasons why:

Advantages of using python packages:

- to break down large programs into smaller manageable and organized files.
- ensure reusability of code
- can be called in any python file, where the file name becomes the module name.

let's start the real stuff:

First, find the path where you will need to save your package, here I have listed the most common path but if you have installed your python or anaconda somewhere else then the path may differ.

```
# if you are using anaconda
C:\ProgramData\Anaconda3\Lib\site-packages
# if you are using core python
C:\Program Files\Python39\Lib\site-packages
```

Here you need to create a folder named geometry. You can name it whatever you want; for the sake of the demo I am giving you all the inputs.

```
# create folder called geometry
```

Now open notepad, and we will create our first module here, just copy-paste the following code in the notepad, and name it volumelib.py and save it in the geometry folder.

make sure there's no extension other than ".py"

```
# volumelib module

# calculate cube volume
def calculateCubeVolume(sideSize):
    volume = sideSize * sideSize * sideSize
    print('Cube volume :', volume)

# calculate cuboid volume
def calculateCuboidVolume(length, breadth, height):
    volume = length * breadth * height
    print('Cuboid volume :', volume)
```

now you are following me on this, here we will create our second module, follow the same procedure and again save it in the same folder here file name should be arealib.py

make sure there's no extension other than ".py"

```
# arealib module

# calculate area of square
def calculateSquareArea(sideSize):
    area = sideSize * sideSize
    print('Area of square :', area)

# calculate area of rectangle
def calculateRectangleArea(length, breadth):
    area = length * breadth
    print('Area of rectangle :', area)
```

now we are halfway done; the next step is to create an `__init__.py` file.

```
# save an empty file named as __init__.py
# underscore underscore init underscore underscore .py
```
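At this point the package folder is complete; the layout inside site-packages should look like this (using the names from this article):

```
geometry/
├── __init__.py
├── arealib.py
└── volumelib.py
```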

Now you are ready with your package. Let's go to the IDE; I prefer Jupyter here, but you are free to choose Spyder or PyCharm.

we will import our module first:

```
# import module
import geometry.arealib
import geometry.volumelib
```

if you are getting errors at this stage then there will be either of the cases

- check the path of the file you saved.
- check for spelling mistakes in saved module files if you haven't just copied and pasted.

and if you're not getting any errors, then congratulations, you made it: your package was imported successfully and is waiting for you to ask it to get something done.

```
# use function from module (arealib)
geometry.arealib.calculateSquareArea(10)
geometry.arealib.calculateRectangleArea(10,15)
```

you made it, your package is working at first you are calculating the area of the square and second you are calculating the rectangle area.

```
# use function from module (volumelib)
geometry.volumelib.calculateCubeVolume(10)
geometry.volumelib.calculateCuboidVolume(10,20,30)
```

here you are calculating the volume of cube and cuboid respectively. remember python is case sensitive so make sure you follow the syntax.

We made our package and it's time to celebrate: you are 1% better than you were 15 minutes ago, and I feel that's enough reason to celebrate.

So I would be glad if you treat yourself as an achiever. Until then, keep visiting here for more Python and data science content.

thanks for reading, happy coding, catch you in the next one!

- Inbuilt functions (R itself has a lot of inbuilt functions)
- User-defined functions (the user writes them for future use)

The R programming language treats these functions as objects and gives them temporary control of the interpreter at the time of execution. Once the function is done with the task, control reverts to the interpreter as it was before.

R has a rich library of inbuilt functions which make life easy for an analyst. However, if you find any certain task can be automated using a function, you are free to write a function of your own under the R environment.

function_name - specifies the name of the function; the function definition is stored as an object under this name.

argument_list - specifies the list of argument/s we use under a function.

expressions - specifies the expressions which get executed under a function body to have the required task done.

return - allows you to return a value from the defined function. If we don't specify return while defining a function, the value of the last evaluated expression inside the function is returned.

```
add <- function(a, b, c, d){
  result = a + b + c + d
  print(paste("addition is", result))
}
add(1, 2, 3, 4)
# Returns "addition is 10"
```

- here we have defined our function with the name add, which takes 4 arguments.
- inside the function body we have created a variable called result, which holds the sum of the 4 values.
- at the end we use print and paste to display the result.
- we give inputs to the function and, after execution, we get the result.

This can make your defined function more user-friendly. The function below asks the user to input values for the variables a, b, and c; the function becomes more generalized, since the user can provide whatever values are of interest.

```
# Function in R with user input
product <- function(a, b, c)
{
  a <- readline("Please enter value for a: ")
  b <- readline("Please enter value for b: ")
  c <- readline("Please enter value for c: ")
  # convert characters into integers
  a <- as.integer(a)
  b <- as.integer(b)
  c <- as.integer(c)
  product = a * b * c
  print(paste("Product of given values is", product))
}
product(a, b, c)
# Returns:
# product(a,b,c)
# Please enter value for a: 5
# Please enter value for b: 2
# Please enter value for c: 3
# "Product of given values is 30"
```

So we have seen how to write a function to automate some simple tasks; a calculator could do the same in no time, but to understand a bit about functions we should always start with a familiar example.

Hope you are enjoying it. I am planning to write more often, because writing makes things clearer, so let's make something interesting in the next post; till then, stay subscribed.

R is primarily used in its dedicated IDE called RStudio. It's open-source and free to use, yet tech giants like Microsoft and IBM provide commercial support for R to their customers. With 9000+ packages, R is easy to use and widely used for statistical computation, machine learning, and data analysis.

R is platform-independent, AKA it works on Windows, Linux, and macOS without a hassle. RStudio Server is also available, which lets users work through a web browser.

R has six basic data types:

- Logical
- Numeric
- Integer
- Complex
- Character
- Raw data type

Logical data type

The logical data type contains the values TRUE and FALSE.

```
var1 <- TRUE
var2 <- FALSE
print(typeof(var1)) # returns "logical"
print(typeof(var2)) # returns "logical"
```

Numerical data type

The numeric data type covers all numbers, whole numbers as well as fractions.

```
var3 <- 10
print(var3)
class(var3) # returns "numeric" (typeof() returns "double" here)
var4 <- 10.4
print(var4)
class(var4) # returns "numeric"
```

Integer data type

For the integer data type we append an L to the number, as shown in the example below; the L suffix tells R to store the value as an integer.

```
data <- 10L
print(data)
typeof(data) #returns "integer"
```

Complex data type

A complex number has a real part plus an imaginary part.

```
data <- 3 + 6i
print(data)
typeof(data) #returns "complex"
```

Character (string) data type

The character data type holds alphanumeric values in quotation marks.

```
data <- "Bangalore"
print(data)
typeof(data) #returns "character"
```

Raw data type (less used in data science)

```
x <- "jsgfuyewygfiubb"
y <- charToRaw(x)
typeof(x) #returns "character"
typeof(y) #returns "raw"
```

If you are from a Python background, R will be easy to relate to. One difference: R treats all numbers, whole and fractional, under one umbrella as numeric, while in Python whole numbers are integers and fractions are floats.
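For comparison, here is that distinction on the Python side (a minimal check, assuming any standard Python 3 interpreter):

```python
# In Python, whole numbers and fractions are two distinct types,
# unlike R, which groups both under "numeric".
whole = 10
fraction = 10.4

print(type(whole).__name__)     # prints int
print(type(fraction).__name__)  # prints float
```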

A valid variable name in R can start with a letter or a dot (but not a dot followed by a digit); it can never start with a number.

There are three ways we can assign a value to a variable:

```
# equal to operator
var1 = 23:30
print(var1)
cat("var1 is ", var1)
# leftward assignment
var2 <- c("learn","R")
print(var2)
# rightward assignment
c("learn","R") -> var3
print(var3)
```

Usually, programmers who write R code use the leftward assignment, not necessarily for any specific reason, but because it is easy to read.

Unlike in other programming languages, R's print function is fairly limited, so there is the cat function to fill the gap:

```
var1 <- c(23,25,26,74,52,62)
print(var1) # returns 23 25 26 74 52 62
# if you want to print anything other than the variable itself,
# print() can't do that, so in that case you need the cat() function
cat("var1 is ", var1) # returns var1 is  23 25 26 74 52 62
```

If you have been reading closely, you will have noticed a c in some statements, and you may be wondering why that c is sitting there. Let me tell you its story.

That c stands for combine: when you need to assign more than one element to a variable, you wrap the values in c(), which combines them into a single vector. R is built around vectors; in R, even a single number is just a vector of length one.

For fun, try writing the same code without c(): R will not accept that input, because R is just a bit different, and that is part of what makes it interesting.

So that's it for now, folks. I will be writing more informative posts on R from here on; let's see where it goes. If you find this interesting and want more content, you can subscribe to the newsletter to get insightful posts right in your inbox. Till then, have a great whatever you wish to be great!

I am going to list some handy shortcuts that can make your code-writing process faster. That being said, let's get to the main content.

Basically, we need to understand the two different states of the Jupyter environment.

**Command mode:** in simple words, command mode is when the cell is not active; you can't write or edit anything while in this mode.

**Edit mode:** edit mode simply lets you edit a cell and write your code.

You can switch between these two modes with ESC (for command mode) and ENTER (for edit mode).

All of the shortcuts below build on this foundation: none of them work in edit mode, so you have to be in command mode to use them.

a = insert cell above

b = insert cell below

dd = delete cell (you need to press 'd' two times)

z = to undo your previous action.

c = copy the particular cell

v = paste the copied cell

These alone are enough to make you productive. There will be a bit of a struggle until your subconscious gets used to the keys.

Let's look at another simple yet time-saving keyboard shortcut. To run a cell, people usually use CTRL + ENTER or SHIFT + ENTER, but there is a difference between the two.

when we use:

CTRL + ENTER: execution of code gets completed, and you stay at the same cell.

SHIFT + ENTER: execution of code gets completed, and you move to the next cell.

This little distinction can save seconds daily, which eventually adds up to hours in the long run.

When you are in command mode and want to convert a cell to code or markdown, simply press:

y = turn markdown into code

m = turn code into markdown

ii = interrupt the kernel; you need to press 'i' two times (useful when code is stuck in some process)

Alright, now you know a bunch of shortcuts that form the foundation for writing code in the most effective and fast manner. That is it for now.

I hope you had fun reading this article, and you found everything easy to understand. If you need further information on any topic, let me know in the response.

Thanks for reading this small introductory piece on the Jupyter environment. If you are interested in more, you will be added to the mailing list once you subscribe to the newsletter...

happy coding!

NumPy is one of the most important libraries for data science projects in Python.

If you're learning Python, the next thing you will need to learn is NumPy.

Other important libraries such as pandas and scikit-learn are built on top of NumPy; in short, NumPy makes execution easy and fast while using less memory.

Below are the most important NumPy methods, which will be enough to get you started.


NumPy (Numerical Python)

NumPy is a basic Python package that provides an alternative to the regular Python list: the NumPy n-dimensional, homogeneous array.
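Homogeneous means every element shares one data type. A small sketch of what NumPy does when you hand it mixed types (the specific values here are just illustrative):

```python
import numpy as np

# Mixing an integer and a float: NumPy upcasts every element to float64
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# Mixing numbers with a string: every element becomes a string
stringy = np.array([1, 2, "three"])
print(stringy.dtype.kind)  # U (a fixed-width unicode string)
```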

Installing it is one of the easy-peasy tasks... using cmd:

```
pip install numpy
```

It will start installing the NumPy library. If you want to install it from inside a Jupyter notebook:

```
!pip install numpy
```

Just add "!" and it will do the rest.

importing numpy as np (by convention)

```
import numpy as np
```

np is the conventional alias used when writing code; it is useful because we don't need to type the whole word numpy every time, just np will do the work.

Hold on. Why use NumPy when you already have lists in Python?

The short answer: speed. NumPy arrays use less memory and are much faster to operate on than vanilla Python lists.
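A rough way to see the memory difference for yourself (exact byte counts vary by platform and Python version, so treat this as a sketch):

```python
import sys
import numpy as np

n = 100_000
py_list = list(range(n))
np_arr = np.arange(n)

# A Python list stores pointers to separate int objects;
# a NumPy array stores the raw values in one contiguous block.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = np_arr.nbytes

print(f"list : ~{list_bytes:,} bytes")
print(f"array: ~{array_bytes:,} bytes")
```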

Let's see how we can use Numpy.

np.array() : It can be used to define any kind of array of any dimension and is the most common method. dtype argument can be added while defining the array for the output to be of our preferred data type.

```
import numpy as np
# Array with floating-point elements
a = np.array([1,2,3,4,5,6,7,8], dtype = float)
print(a)
```

np.zeros() : Using np.zeros() would produce an array of zeroes only. You just need to define the dimension of the zero array.

```
import numpy as np
b = np.zeros(5)
print(b)
```

np.ones() : Using np.ones() would produce an array of ones only. Again, you just need to define the dimension of the array.

```
import numpy as np
c = np.ones(5)
print(c)
```

Here the argument 5 produces a one-dimensional array with 5 elements.
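To get a 2-D array instead, pass the shape as a tuple; this works for both np.zeros() and np.ones():

```python
import numpy as np

# Passing a tuple as the shape gives a 2-D array: (rows, columns)
m = np.ones((2, 3))
print(m)
print(m.shape)  # (2, 3)
```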

np.arange() : When you want an array with values at a regular interval within a given range, you can use np.arange() to achieve so. It takes in three values, the first two for the range, and the third for the skip value. Note that the second value of the range is non-inclusive.

```
import numpy as np
#with 2 parameters It prints all the values in given range
d = np.arange(4,12)
print(d)
#with third parameter it prints values in steps
e = np.arange(4,12,3)
print(e)
```
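One caveat worth knowing: with a fractional step, floating-point rounding can make the number of elements np.arange() produces surprising, so np.linspace() is often safer when you care about an exact point count. A quick sketch:

```python
import numpy as np

# Integer steps are exact and predictable
print(np.arange(4, 12, 3))  # [ 4  7 10]

# With fractional steps, rounding can make the element count surprising;
# np.linspace() lets you ask for an exact number of points instead
pts = np.linspace(0.0, 1.0, 5)
print(pts)  # [0.   0.25 0.5  0.75 1.  ]
```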

1. size : It returns the number of elements in the array, no matter its dimension (used as a.size, an attribute rather than a method).

```
import numpy as np
a = np.arange(12)
print(a)
print(a.size)
```

2. shape : It returns the dimensions of the array as a tuple; for a 2-D array that is (rows, columns).

```
a.shape
```

3. reshape() : It lets you change the dimensions of the given array to the dimensions of your choice; it returns the reshaped array and leaves the original unchanged.

```
print(a.reshape(3,4))
```

4. resize() : It is the same as reshape() in its operation, the only difference being that resize() changes the original array in place.

```
print(a.resize(3,4))
# prints None, but the change has taken place on the original array
```
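A side-by-side sketch of the difference: reshape() hands back a new array and leaves the original alone, while resize() changes the array in place and returns None:

```python
import numpy as np

a = np.arange(12)
b = a.reshape(3, 4)  # returns a reshaped array; a itself is unchanged
print(a.shape)       # (12,)
print(b.shape)       # (3, 4)

c = np.arange(12)
c.resize(3, 4)       # modifies c in place and returns None
print(c.shape)       # (3, 4)
```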

np.array() can also convert an existing Python list into a NumPy array:

```
import numpy as np
a = [1,2,3,4,5,6]
b = np.array(a)
# the Python list is converted to a numpy array
print(b)
```
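When converting, NumPy infers the dtype from the list's contents; you can also force one explicitly (a small sketch):

```python
import numpy as np

# dtype is inferred from the list contents
b = np.array([1, 2, 3, 4, 5, 6])
print(b.dtype)  # an integer dtype (int64 on most 64-bit platforms)

# or force a dtype explicitly at conversion time
c = np.array([1, 2, 3], dtype=float)
print(c.dtype)  # float64
```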

np.eye() : It produces an identity matrix of the given size (note the name is eye, not eyes):

```
import numpy as np
f = np.eye(5)
print(f)
```

np.linspace() : It gives an array of evenly spaced elements within a range; the 3 here says we need 3 elements from 0 to 10, evenly spaced (both endpoints included by default).

```
import numpy as np
# array with evenly spaced elements
g = np.linspace(0,10,3)
print(g) # [ 0.  5. 10.]
```

Arithmetic on arrays is element-wise:

```
import numpy as np
a = np.arange(3)
print(a)     # [0 1 2]
print(a + 5) # [5 6 7]
print(a + a) # [0 2 4]
```

Summing works on the whole array or along an axis (the axis arguments need a 2-D array):

```
import numpy as np
a = np.arange(6).reshape(2,3)
print(a.sum())       # sum of all elements: 15
print(a.sum(axis=0)) # sum of each column: [3 5 7]
print(a.sum(axis=1)) # sum of each row: [ 3 12]
```

reshape() turns a 1-D array into a 3x3 matrix:

```
import numpy as np
arr1 = np.arange(1,10)
print(arr1.reshape(3,3))
```

np.mean() : the average of the elements; it can be called either as an array method or as a NumPy function:

```
import numpy as np
arr1 = np.arange(1,10)
print(arr1.mean())
# OR
print(np.mean(arr1))
```

np.median() : the median of the elements:

```
import numpy as np
arr1 = np.arange(1,10)
print(np.median(arr1))
```

np.var() : the variance of the elements:

```
import numpy as np
arr1 = np.arange(1,10)
print(np.var(arr1))
```

np.std() : the standard deviation of the elements:

```
import numpy as np
arr1 = np.arange(1,10)
print(np.std(arr1))
```

There are many other methods, like sum(), sort(), corrcoef(), etc., that come in handy while doing in-depth data analysis.
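For example, a quick sketch of sort() and corrcoef() (note the spelling is np.corrcoef; the example arrays are just illustrative):

```python
import numpy as np

arr = np.array([3, 1, 2])
print(np.sort(arr))  # [1 2 3]

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x            # y is perfectly correlated with x
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))   # 1.0
```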

Alright, now you know a bunch of tools that are the foundations of NumPy. That is it for now.

I hope you had fun reading this article, and you found everything easy to understand. If you need further information on any topic, let me know in the response.

Thanks for reading this small introductory piece on NumPy. If you are interested in more, you will be added to the mailing list once you subscribe to the newsletter..

happy coding!
