Python has a large number of libraries and functions that make data profiling, data analysis and data visualisation easy.
For Data Analysts and Data Architects in particular, analysing and profiling large volumes of data that sit on heterogeneous data sources is a common, day-to-day challenge. Some of the data is available in flat files, some has to be copied from the internet, and some has to be taken from a relational database; only after all of it is joined together can the analysis be performed.
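As a rough illustration of what "heterogeneous sources" can look like in practice, here is a minimal sketch that reads one piece of data from a flat file and another from a relational table, then joins them. It is only a sketch under assumptions: the `countries` table, its column names and the `life_expectancy_2001` column are made up for this example, an in-memory SQLite database stands in for a real database, and `io.StringIO` stands in for a CSV file on disk. The three rows are taken from the six-country sample listed later in this post.

```python
import io
import sqlite3

import pandas as pd

# Source 1: a "flat file". StringIO stands in for a CSV file on disk.
csv_text = """country_code,life_expectancy_2001
ABW,65.5693658536586
AFG,32.328512195122
AGO,32.9848292682927
"""
facts = pd.read_csv(io.StringIO(csv_text))

# Source 2: a relational table, here in an in-memory SQLite database.
# The table and column names are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [("ABW", "Aruba"), ("AFG", "Afghanistan"), ("AGO", "Angola")],
)
conn.commit()
names = pd.read_sql_query("SELECT code, name FROM countries", conn)
conn.close()

# Join the two sources on the country code.
combined = facts.merge(names, left_on="country_code", right_on="code")
print(combined[["name", "life_expectancy_2001"]])
```

Once both sources are DataFrames, it no longer matters where the rows came from; the join works the same way.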
Use case :
We have one dataset copied from the internet and another given as a CSV file. They need to be joined together to show "Life expectancy and fertility rate statistics by country".
Data sources :
Datasets in List format :
Dataset_1:
Country_Code = ["ABW","AFG","AGO","ALB","ARE","ARG"]
Life_Expectancy_At_Birth_2001 = [65.5693658536586,32.328512195122,32.9848292682927,62.2543658536585,52.2432195121951,65.2155365853659]
Dataset_2 :
Countries_2001_Dataset = ["Aruba","Afghanistan","Angola","Albania","United Arab Emirates","Argentina"]
Codes_2001_Dataset = ["ABW","AFG","AGO","ALB","ARE","ARG"]
Dataset_3 (CSV format) : File Link
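The lists above are enough to try the join on a small scale before working with the full data. The sketch below builds one DataFrame per dataset and merges them on the country code. The names `sample_facts`, `sample_dim` and `sample_joined` are made up for this example, and `validate="one_to_one"` is an extra safety check added here, not something the original code uses. Note that the main code in this post refers to lists for other years (1960, 2013, 2012) that are not reproduced here, so this sketch sticks to the 2001 sample.

```python
import pandas as pd

# Sample lists exactly as given in the post.
Country_Code = ["ABW", "AFG", "AGO", "ALB", "ARE", "ARG"]
Life_Expectancy_At_Birth_2001 = [
    65.5693658536586, 32.328512195122, 32.9848292682927,
    62.2543658536585, 52.2432195121951, 65.2155365853659,
]
Countries_2001_Dataset = [
    "Aruba", "Afghanistan", "Angola",
    "Albania", "United Arab Emirates", "Argentina",
]
Codes_2001_Dataset = ["ABW", "AFG", "AGO", "ALB", "ARE", "ARG"]

# One DataFrame per dataset: a dict maps column name -> list of values.
sample_facts = pd.DataFrame({
    "country_code": Country_Code,
    "Life_Expectancy_At_Birth_2001": Life_Expectancy_At_Birth_2001,
})
sample_dim = pd.DataFrame({
    "Countries_2001_Dataset": Countries_2001_Dataset,
    "Codes_2001_Dataset": Codes_2001_Dataset,
})

# Join on the country code; validate raises if a code repeats on either side.
sample_joined = pd.merge(
    sample_facts,
    sample_dim,
    left_on="country_code",
    right_on="Codes_2001_Dataset",
    validate="one_to_one",
)
print(sample_joined)
```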
Code :
import pandas as pd
import numpy as np
# Replace this with the local path where you downloaded the .csv file
demographic = pd.read_csv("/Users/perx/desktop/vikas/learning/source data/P4-Demographic-Data.csv")
# Validate data
demographic.head(4)
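`head()` only shows the first few rows. A slightly broader check before joining might look like the sketch below, which counts rows, missing values and duplicate join keys. It runs on the six-country sample from this post because the CSV itself is not reproduced here, and the `profile` helper is a hypothetical name introduced for this example, not a pandas function.

```python
import pandas as pd

# Six-country sample from the post.
sample = pd.DataFrame({
    "country_code": ["ABW", "AFG", "AGO", "ALB", "ARE", "ARG"],
    "Life_Expectancy_At_Birth_2001": [
        65.5693658536586, 32.328512195122, 32.9848292682927,
        62.2543658536585, 52.2432195121951, 65.2155365853659,
    ],
})


def profile(df, key):
    """Return a few basic facts about a DataFrame and its join key."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_keys": int(df[key].duplicated().sum()),
    }


report = profile(sample, "country_code")
print(report)
```

Missing values or duplicate keys found at this stage are much easier to deal with than surprises after the join.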
# Collect the lists into dictionaries, one key per column.
# Note: the 1960, 2013 and 2012 lists used here must be defined first;
# only a 2001 sample is listed in the data sources above.
matrix_facts = {}
matrix_facts["country_code"] = Country_Code
matrix_facts["Life_Expectancy_At_Birth_1960"] = Life_Expectancy_At_Birth_1960
matrix_facts["Life_Expectancy_At_Birth_2013"] = Life_Expectancy_At_Birth_2013
matrix_dim = {}
matrix_dim["Countries_2012_Dataset"] = Countries_2012_Dataset
matrix_dim["Codes_2012_Dataset"] = Codes_2012_Dataset
matrix_dim["Regions_2012_Dataset"] = Regions_2012_Dataset
# Convert the dictionaries into tables (DataFrames)
dataset_facts = pd.DataFrame(matrix_facts)
dataset_dim = pd.DataFrame(matrix_dim)
# Validate the two list-based datasets
dataset_facts
dataset_dim
# Join the two list-based datasets on the country code
joined_dataset_1 = pd.merge(dataset_facts, dataset_dim, left_on="country_code", right_on="Codes_2012_Dataset")
# Joined result of the list-based datasets
joined_dataset_1.head(10)
# Dataset loaded from the CSV file
demographic.head(4)
# Join the file dataset with the list-based dataset (outer join keeps unmatched rows from both sides)
final_dataset = pd.merge(demographic, joined_dataset_1, left_on="Country Code", right_on="country_code", how="outer")
final_dataset.head(5)
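Because the merge above uses `how="outer"`, rows that exist on only one side are kept and padded with missing values. One way to see which rows actually matched is the `indicator=True` option of `pd.merge`, sketched below. This is an addition for illustration, not part of the original code: the `left` and `right` frames are made up, using the six country codes from this post with two deliberately left out of one side.

```python
import pandas as pd

codes = ["ABW", "AFG", "AGO", "ALB", "ARE", "ARG"]

# Left side has all six codes; right side deliberately omits the last two.
left = pd.DataFrame({"Country Code": codes})
right = pd.DataFrame({"country_code": codes[:4]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
checked = pd.merge(
    left,
    right,
    left_on="Country Code",
    right_on="country_code",
    how="outer",
    indicator=True,
)
print(checked["_merge"].value_counts())

# Rows that did not find a partner on the other side.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

If `unmatched` is not empty, it is worth checking whether the codes differ in spelling or case before moving on to the visualisation.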
# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
viz1 = sns.lmplot(data=final_dataset, x="Birth rate", y="Life_Expectancy_At_Birth_1960", hue="Regions_2012_Dataset", scatter_kws={"s": 50})
Now we are ready to visualise our data.
As the example above shows, you can combine any number of datasets and analyse them with just Python.