MODULE 1: DATA SCIENCE PRACTICUM

Outline:

  • Introduction to Pandas and a review of data frames
  • Matplotlib review
# import the required modules and packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Introduction to Pandas

Pandas is one of Python's main packages for data analysis. In this practicum, we introduce several important pandas functions that are used frequently.

This module uses the Fuel Consumption of CO2 CSV dataset (FuelConsumptionCo2.csv), which can be downloaded from:

# load the data frame from the .csv file using pd.read_csv
df = pd.read_csv('FuelConsumptionCo2.csv')
# view the number of rows and columns of the data frame
df.shape
(1067, 13)
# view a summary of the data frame: the number of rows and columns, and the data type of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067 entries, 0 to 1066
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   MODELYEAR                 1067 non-null   int64  
 1   MAKE                      1067 non-null   object 
 2   MODEL                     1067 non-null   object 
 3   VEHICLECLASS              1067 non-null   object 
 4   ENGINESIZE                1067 non-null   float64
 5   CYLINDERS                 1067 non-null   int64  
 6   TRANSMISSION              1067 non-null   object 
 7   FUELTYPE                  1067 non-null   object 
 8   FUELCONSUMPTION_CITY      1067 non-null   float64
 9   FUELCONSUMPTION_HWY       1067 non-null   float64
 10  FUELCONSUMPTION_COMB      1067 non-null   float64
 11  FUELCONSUMPTION_COMB_MPG  1067 non-null   int64  
 12  CO2EMISSIONS              1067 non-null   int64  
dtypes: float64(4), int64(4), object(5)
memory usage: 108.5+ KB
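
For readers who do not have the CSV file at hand, the same inspection steps can be tried on a small stand-in data frame. The sketch below is only an illustration: the name `sample` is hypothetical, and its three rows are copied from the df.head() output in this module, restricted to four columns.

```python
import pandas as pd

# Stand-in for pd.read_csv('FuelConsumptionCo2.csv'): three rows copied from
# the df.head() output in this module, restricted to four columns.
sample = pd.DataFrame({
    "MAKE": ["ACURA", "ACURA", "ACURA"],
    "MODEL": ["ILX", "ILX", "ILX HYBRID"],
    "ENGINESIZE": [2.0, 2.4, 1.5],
    "CO2EMISSIONS": [196, 221, 136],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# dtypes lists one data type per column, the same types info() reports
print(sample.dtypes)
```
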
# view a preview of the data frame (the first 5 rows by default)
df.head()
   MODELYEAR  MAKE   MODEL       VEHICLECLASS  ENGINESIZE  CYLINDERS  TRANSMISSION  FUELTYPE  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS
0  2014       ACURA  ILX         COMPACT       2.0         4          AS5           Z         9.9                   6.7                  8.5                   33                        196
1  2014       ACURA  ILX         COMPACT       2.4         4          M6            Z         11.2                  7.7                  9.6                   29                        221
2  2014       ACURA  ILX HYBRID  COMPACT       1.5         4          AV7           Z         6.0                   5.8                  5.9                   48                        136
3  2014       ACURA  MDX 4WD     SUV - SMALL   3.5         6          AS6           Z         12.7                  9.1                  11.1                  25                        255
4  2014       ACURA  RDX AWD     SUV - SMALL   3.5         6          AS6           Z         12.1                  8.7                  10.6                  27                        244
# we can also choose how many rows to preview, for example the first 9 rows
df.head(9)
   MODELYEAR  MAKE   MODEL       VEHICLECLASS  ENGINESIZE  CYLINDERS  TRANSMISSION  FUELTYPE  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS
0  2014       ACURA  ILX         COMPACT       2.0         4          AS5           Z         9.9                   6.7                  8.5                   33                        196
1  2014       ACURA  ILX         COMPACT       2.4         4          M6            Z         11.2                  7.7                  9.6                   29                        221
2  2014       ACURA  ILX HYBRID  COMPACT       1.5         4          AV7           Z         6.0                   5.8                  5.9                   48                        136
3  2014       ACURA  MDX 4WD     SUV - SMALL   3.5         6          AS6           Z         12.7                  9.1                  11.1                  25                        255
4  2014       ACURA  RDX AWD     SUV - SMALL   3.5         6          AS6           Z         12.1                  8.7                  10.6                  27                        244
5  2014       ACURA  RLX         MID-SIZE      3.5         6          AS6           Z         11.9                  7.7                  10.0                  28                        230
6  2014       ACURA  TL          MID-SIZE      3.5         6          AS6           Z         11.8                  8.1                  10.1                  28                        232
7  2014       ACURA  TL AWD      MID-SIZE      3.7         6          AS6           Z         12.8                  9.0                  11.1                  25                        255
8  2014       ACURA  TL AWD      MID-SIZE      3.7         6          M6            Z         13.4                  9.5                  11.6                  24                        267
# print the entries of every row of the data frame as a 2D NumPy array
print(df.values)
[[2014 'ACURA' 'ILX' ... 8.5 33 196]
 [2014 'ACURA' 'ILX' ... 9.6 29 221]
 [2014 'ACURA' 'ILX HYBRID' ... 5.9 48 136]
 ...
 [2014 'VOLVO' 'XC70 AWD' ... 11.8 24 271]
 [2014 'VOLVO' 'XC70 AWD' ... 11.3 25 260]
 [2014 'VOLVO' 'XC90 AWD' ... 12.8 22 294]]
# print the column names of the data frame
print(df.columns)
Index(['MODELYEAR', 'MAKE', 'MODEL', 'VEHICLECLASS', 'ENGINESIZE', 'CYLINDERS',
       'TRANSMISSION', 'FUELTYPE', 'FUELCONSUMPTION_CITY',
       'FUELCONSUMPTION_HWY', 'FUELCONSUMPTION_COMB',
       'FUELCONSUMPTION_COMB_MPG', 'CO2EMISSIONS'],
      dtype='object')
# print the index (row labels) of the data frame
print(df.index)
RangeIndex(start=0, stop=1067, step=1)
# next, we can perform exploratory data analysis (EDA) using pandas
df.describe()
       MODELYEAR  ENGINESIZE   CYLINDERS    FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS
count  1067.0     1067.000000  1067.000000  1067.000000           1067.000000          1067.000000           1067.000000               1067.000000
mean   2014.0     3.346298     5.794752     13.296532             9.474602             11.580881             26.441425                 256.228679
std    0.0        1.415895     1.797447     4.101253              2.794510             3.485595              7.468702                  63.372304
min    2014.0     1.000000     3.000000     4.600000              4.900000             4.700000              11.000000                 108.000000
25%    2014.0     2.000000     4.000000     10.250000             7.500000             9.000000              21.000000                 207.000000
50%    2014.0     3.400000     6.000000     12.600000             8.800000             10.900000             26.000000                 251.000000
75%    2014.0     4.300000     8.000000     15.550000             10.850000            13.350000             31.000000                 294.000000
max    2014.0     8.400000     12.000000    30.200000             20.500000            25.800000             60.000000                 488.000000

describe() displays descriptive statistics for the numeric columns of the data frame:

  • count (number of non-null values)
  • mean (average)
  • std (standard deviation)
  • min (minimum value)
  • 25% (first quartile)
  • 50% (second quartile, i.e. the median)
  • 75% (third quartile)
  • max (maximum value)
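
To see where each statistic comes from, one can compute them one by one and compare with describe(). The sketch below is a minimal illustration, assuming a hypothetical Series `engine` that holds the engine sizes of the first five rows shown by df.head() in this module.

```python
import pandas as pd

# Engine sizes of the first five rows shown by df.head() in this module.
engine = pd.Series([2.0, 2.4, 1.5, 3.5, 3.5], name="ENGINESIZE")

summary = engine.describe()

# The same statistics computed one by one with dedicated Series methods.
manual = pd.Series({
    "count": engine.count(),         # number of non-null values
    "mean": engine.mean(),
    "std": engine.std(),             # sample standard deviation (ddof=1)
    "min": engine.min(),
    "25%": engine.quantile(0.25),    # first quartile
    "50%": engine.median(),          # second quartile
    "75%": engine.quantile(0.75),    # third quartile
    "max": engine.max(),
})

# Show both versions side by side; the two columns should agree.
print(pd.DataFrame({"describe": summary, "manual": manual}))
```
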

# we can also select specific columns from the data frame, for example ENGINESIZE, CYLINDERS, FUELCONSUMPTION_CITY, FUELCONSUMPTION_HWY, FUELCONSUMPTION_COMB, and CO2EMISSIONS
new_df = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
new_df.head(9)
   ENGINESIZE  CYLINDERS  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  CO2EMISSIONS
0  2.0         4          9.9                   6.7                  8.5                   196
1  2.4         4          11.2                  7.7                  9.6                   221
2  1.5         4          6.0                   5.8                  5.9                   136
3  3.5         6          12.7                  9.1                  11.1                  255
4  3.5         6          12.1                  8.7                  10.6                  244
5  3.5         6          11.9                  7.7                  10.0                  230
6  3.5         6          11.8                  8.1                  10.1                  232
7  3.7         6          12.8                  9.0                  11.1                  255
8  3.7         6          13.4                  9.5                  11.6                  267
# we can also select rows that satisfy a condition, here a condition on a numeric value
big_engine = df[df["ENGINESIZE"]>4]
big_engine.head()
    MODELYEAR  MAKE          MODEL         VEHICLECLASS  ENGINESIZE  CYLINDERS  TRANSMISSION  FUELTYPE  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS
12  2014       ASTON MARTIN  DB9           MINICOMPACT   5.9         12         A6            Z         18.0                  12.6                 15.6                  18                        359
13  2014       ASTON MARTIN  RAPIDE        SUBCOMPACT    5.9         12         A6            Z         18.0                  12.6                 15.6                  18                        359
14  2014       ASTON MARTIN  V8 VANTAGE    TWO-SEATER    4.7         8          AM7           Z         17.4                  11.3                 14.7                  19                        338
15  2014       ASTON MARTIN  V8 VANTAGE    TWO-SEATER    4.7         8          M6            Z         18.1                  12.2                 15.4                  18                        354
16  2014       ASTON MARTIN  V8 VANTAGE S  TWO-SEATER    4.7         8          AM7           Z         17.4                  11.3                 14.7                  19                        338
# we can also select rows by a condition on non-numeric values, for example a text column
aston_martin_cars = df[df["MAKE"].isin(["ASTON MARTIN"])]
aston_martin_cars
    MODELYEAR  MAKE          MODEL         VEHICLECLASS  ENGINESIZE  CYLINDERS  TRANSMISSION  FUELTYPE  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS
12  2014       ASTON MARTIN  DB9           MINICOMPACT   5.9         12         A6            Z         18.0                  12.6                 15.6                  18                        359
13  2014       ASTON MARTIN  RAPIDE        SUBCOMPACT    5.9         12         A6            Z         18.0                  12.6                 15.6                  18                        359
14  2014       ASTON MARTIN  V8 VANTAGE    TWO-SEATER    4.7         8          AM7           Z         17.4                  11.3                 14.7                  19                        338
15  2014       ASTON MARTIN  V8 VANTAGE    TWO-SEATER    4.7         8          M6            Z         18.1                  12.2                 15.4                  18                        354
16  2014       ASTON MARTIN  V8 VANTAGE S  TWO-SEATER    4.7         8          AM7           Z         17.4                  11.3                 14.7                  19                        338
17  2014       ASTON MARTIN  V8 VANTAGE S  TWO-SEATER    4.7         8          M6            Z         18.1                  12.2                 15.4                  18                        354
18  2014       ASTON MARTIN  VANQUISH      MINICOMPACT   5.9         12         A6            Z         18.0                  12.6                 15.6                  18                        359
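
Row conditions like the ones used here can also be combined. The following sketch is an illustration under stated assumptions: `cars` is a hypothetical five-row data frame whose values are copied from rows 0, 2, 3, 12 and 14 of the outputs in this module, and the thresholds are arbitrary.

```python
import pandas as pd

# Values copied from rows 0, 2, 3, 12 and 14 of the outputs in this module.
cars = pd.DataFrame({
    "MAKE": ["ACURA", "ACURA", "ACURA", "ASTON MARTIN", "ASTON MARTIN"],
    "MODEL": ["ILX", "ILX HYBRID", "MDX 4WD", "DB9", "V8 VANTAGE"],
    "ENGINESIZE": [2.0, 1.5, 3.5, 5.9, 4.7],
    "CO2EMISSIONS": [196, 136, 255, 359, 338],
})

# For a single value, == selects the same rows as isin(["ASTON MARTIN"]).
aston = cars[cars["MAKE"] == "ASTON MARTIN"]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
big_and_high = cars[(cars["ENGINESIZE"] > 4) & (cars["CO2EMISSIONS"] > 350)]
efficient_acura = cars[(cars["MAKE"] == "ACURA") & (cars["CO2EMISSIONS"] < 200)]

print(aston)
print(big_and_high)
print(efficient_acura)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.
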
# sort the rows in ascending order (smallest to largest)
sorted_df = new_df.sort_values("ENGINESIZE")
sorted_df.head()
     ENGINESIZE  CYLINDERS  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  CO2EMISSIONS
938  1.0         3          6.9                   5.7                  6.4                   147
939  1.0         3          6.9                   5.7                  6.4                   147
394  1.0         4          7.5                   5.5                  6.6                   152
824  1.2         3          6.4                   5.4                  6.0                   138
260  1.2         4          7.6                   6.0                  6.9                   159
# sort the rows in descending order (largest to smallest)
sorted_df_des = new_df.sort_values("ENGINESIZE", ascending = False)
sorted_df_des.head()
     ENGINESIZE  CYLINDERS  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  CO2EMISSIONS
940  8.4         10         20.0                  13.0                 16.9                  389
941  8.4         10         20.0                  13.0                 16.9                  389
75   6.8         8          22.3                  14.9                 19.0                  437
349  6.8         10         23.9                  17.8                 21.2                  488
924  6.7         12         20.7                  12.8                 17.1                  393
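
Many cars share the same engine size, so it can help to break ties with a second column. The sketch below is one possible way to do that; `engines` is a hypothetical data frame built from rows 938, 394, 824 and 260 of the ascending sort output in this module.

```python
import pandas as pd

# Rows 938, 394, 824 and 260 from the ascending sort output in this module.
engines = pd.DataFrame(
    {
        "ENGINESIZE": [1.0, 1.0, 1.2, 1.2],
        "CYLINDERS": [3, 4, 3, 4],
        "CO2EMISSIONS": [147, 152, 138, 159],
    },
    index=[938, 394, 824, 260],
)

# Sort by engine size (ascending), breaking ties by CO2 emissions (descending).
ordered = engines.sort_values(
    ["ENGINESIZE", "CO2EMISSIONS"], ascending=[True, False]
)
print(ordered)

# sort_values keeps the original row labels; reset_index(drop=True)
# renumbers the rows from 0 if the old labels are no longer needed.
renumbered = ordered.reset_index(drop=True)
print(renumbered)
```
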
# add a new column; in this example, the column FUELCON_AVERAGE
df["FUELCON_AVERAGE"]=(df["FUELCONSUMPTION_CITY"]+df["FUELCONSUMPTION_HWY"]+df["FUELCONSUMPTION_COMB"])/3
df.head()
   MODELYEAR  MAKE   MODEL       VEHICLECLASS  ENGINESIZE  CYLINDERS  TRANSMISSION  FUELTYPE  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS  FUELCON_AVERAGE
0  2014       ACURA  ILX         COMPACT       2.0         4          AS5           Z         9.9                   6.7                  8.5                   33                        196           8.366667
1  2014       ACURA  ILX         COMPACT       2.4         4          M6            Z         11.2                  7.7                  9.6                   29                        221           9.500000
2  2014       ACURA  ILX HYBRID  COMPACT       1.5         4          AV7           Z         6.0                   5.8                  5.9                   48                        136           5.900000
3  2014       ACURA  MDX 4WD     SUV - SMALL   3.5         6          AS6           Z         12.7                  9.1                  11.1                  25                        255           10.966667
4  2014       ACURA  RDX AWD     SUV - SMALL   3.5         6          AS6           Z         12.1                  8.7                  10.6                  27                        244           10.466667
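
The same average can be written as a row-wise mean, which avoids typing the division by hand. This is a sketch rather than the module's own method: `fuel` and `fuel_cols` are hypothetical names, and the values are the first three rows of the output in this module.

```python
import pandas as pd

# Fuel consumption columns of rows 0-2 from the output in this module.
fuel = pd.DataFrame({
    "FUELCONSUMPTION_CITY": [9.9, 11.2, 6.0],
    "FUELCONSUMPTION_HWY": [6.7, 7.7, 5.8],
    "FUELCONSUMPTION_COMB": [8.5, 9.6, 5.9],
})

fuel_cols = ["FUELCONSUMPTION_CITY", "FUELCONSUMPTION_HWY", "FUELCONSUMPTION_COMB"]

# mean(axis=1) averages across the selected columns for each row.
# assign() returns a new data frame and leaves `fuel` unchanged.
with_avg = fuel.assign(FUELCON_AVERAGE=fuel[fuel_cols].mean(axis=1))
print(with_avg)
```

Unlike `df["FUELCON_AVERAGE"] = ...`, which modifies the data frame in place, assign() keeps the original intact, which can be convenient when a cell is re-run several times.
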

MATPLOTLIB REVIEW

Creating a Histogram

# create a histogram of each selected column
viz = new_df[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
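
A histogram of a single column can be customised through the Matplotlib axes interface. The sketch below is illustrative only: `co2` is a hypothetical Series holding the CO2EMISSIONS values of rows 0 to 8 from the df.head(9) output in this module, and the choice of 5 bins is arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd

# CO2EMISSIONS values of rows 0-8 from the df.head(9) output in this module.
co2 = pd.Series([196, 221, 136, 255, 244, 230, 232, 255, 267], name="CO2EMISSIONS")

fig, ax = plt.subplots()

# hist returns the count per bin and the bin edges (one more edge than bins).
counts, edges, _ = ax.hist(co2, bins=5, color="steelblue", edgecolor="black")

ax.set_xlabel("CO2 emissions")
ax.set_ylabel("Number of cars")
ax.set_title("Distribution of CO2 emissions (sample)")
plt.show()

print(counts)
print(edges)
```
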

Creating a line plot

# create a line plot of CO2 emissions against engine size (rows sorted by engine size)
x = sorted_df.ENGINESIZE
y = sorted_df.CO2EMISSIONS
plt.plot(x, y, 'g-')
plt.show()
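
Because plt.plot connects the points in row order, cars that share an engine size produce vertical segments in the line. One possible variation, shown here as a sketch and not as the module's method, is to average the emissions per engine size first. The data frame `cars` is hypothetical and uses rows 0 to 8 from the outputs in this module.

```python
import matplotlib.pyplot as plt
import pandas as pd

# ENGINESIZE and CO2EMISSIONS of rows 0-8 from the outputs in this module.
cars = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7],
    "CO2EMISSIONS": [196, 221, 136, 255, 244, 230, 232, 255, 267],
})

# One point per engine size: the mean emission of all cars sharing that size.
# groupby sorts the group keys, so the result is already ordered along x.
mean_co2 = cars.groupby("ENGINESIZE")["CO2EMISSIONS"].mean()

fig, ax = plt.subplots()
ax.plot(mean_co2.index, mean_co2.values, "g-o")
ax.set_xlabel("Engine size")
ax.set_ylabel("Mean emission")
plt.show()
```
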

Creating a scatter plot

# create a scatter plot of CO2 emissions against engine size
plt.scatter(new_df.ENGINESIZE, new_df.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
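
A scatter plot can also encode a third variable through colour. The sketch below is a minimal example under stated assumptions: `cars` is a hypothetical data frame with rows 0 to 4, 12 and 14 from the outputs in this module, and the "viridis" colour map is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rows 0-4, 12 and 14 from the outputs shown in this module.
cars = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 3.5, 5.9, 4.7],
    "CYLINDERS": [4, 4, 4, 6, 6, 12, 8],
    "CO2EMISSIONS": [196, 221, 136, 255, 244, 359, 338],
})

fig, ax = plt.subplots()

# c= maps each point's cylinder count to a colour from the chosen colour map.
points = ax.scatter(
    cars["ENGINESIZE"], cars["CO2EMISSIONS"],
    c=cars["CYLINDERS"], cmap="viridis", alpha=0.8,
)

# The colour bar acts as the legend for the colour scale.
fig.colorbar(points, ax=ax, label="Cylinders")
ax.set_xlabel("Engine size")
ax.set_ylabel("Emission")
plt.show()
```
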

For other scatter plot variations, see the Scatter Plot documentation.