Modul 1 Sains Data: Pengenalan Pandas, Transformasi Data

Modul 1 Sains Data: Pengenalan Pandas, Transformasi Data

Kembali ke Sains Data

Prerequisites

Pada module ini kita akan coba mememahami package pandas, yang merupakan package inti dalam sains-data. kita akan coba melakukan beberapa transformasi data menggunakan pandas.

sebelum itu, python module di bawah ini yang akan digunakan selama praktikum.

import numpy as np
import pandas as pd

Apabila ada yang belum terinstal, silakan instal terlebih dahulu menggunakan pip:

!pip install numpy
!pip install pandas

atau conda jika sedang menggunakan Anaconda:

conda install numpy
conda install pandas

Series

pandas.Series sangat mirip dengan array NumPy (bahkan dibangun di atas objek array NumPy). Yang membedakan array NumPy dari sebuah Series adalah bahwa sebuah Series dapat memiliki label index, yang berarti dapat diindeks dengan label, bukan hanya lokasi nomor saja. Selain itu, sebuah Series tidak perlu menyimpan data numerik, ia dapat menyimpan objek Python sembarang.

Membuat pd.Series dengan list

Paling mudah, ktia dapat membuat pd.Series dengan python list

my_index= ['a','b','c','d','e']
my_data= [1,2,3,4,5]
my_series= pd.Series(data=my_data, index=my_index)
print(my_series)
a    1
b    2
c    3
d    4
e    5
dtype: int64
print(my_series.__class__)
<class 'pandas.core.series.Series'>

Membuat pd.Series dengan dictionary

Kita juga dapat membuat pd.Series dengan dictionary

# creating a series from a dictionary
my_dict= {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
my_series_dict= pd.Series(my_dict)
print(my_series_dict)
a    1
b    2
c    3
d    4
e    5
dtype: int64
print(my_series_dict.__class__)
<class 'pandas.core.series.Series'>

Operasi pada Series

# Imaginary Sales Data for 1st and 2nd Quarters for Global Company
q1 = {'Japan': 80, 'China': 450, 'India': 200, 'USA': 250}
q2 = {'Brazil': 100,'China': 500, 'India': 210,'USA': 260}


# Creating a Series from a Dictionary q1 and q2
q1_series= pd.Series(q1)
q2_series= pd.Series(q2)
print(q1_series)
Japan     80
China    450
India    200
USA      250
dtype: int64

Kita dapat mengindeks dengan label

# call values of q1_series based on named index
print(q1_series['Japan'])
print(q1_series['China'])
print(q1_series['India'])
80
450
200

kita dapat tetap dapat mengindeks dengan integer

# u can also call values of q1_series based on positional index
print(q1_series[0])
print(q1_series[1])
print(q1_series[2])
80
450
200

hati-hati dalam melakukan indexing dengan label. bisa saja terjadi error jika label tidak ada di dalam pd.series

# remember named index is case sensitive
try:
    print(q1_series['japan'])
except:
    print('something went wrong')
something went wrong

Operasi aritmatik sederhana pada pd.Series bersifat broadcasting, yaitu diterapkan ke masing-masing elemen

# operations with arithmetic on series are broadcasted to all values
print(q1_series*2)
Japan    160
China    900
India    400
USA      500
dtype: int64
print(q1_series+1000)
Japan    1080
China    1450
India    1200
USA      1250
dtype: int64

Untuk penjumlahan antara dua pd.Series, apabila ada label yang hanya muncul di salah satu series, maka label tersebut akan muncul di hasil jumlah dengan data NaN (not a number, di sini artinya tidak ada data).

(Kebetulan, keterangan NaN hanya bisa muncul untuk tipe data float atau koma-komaan, sehingga tipe data terpaksa diubah menjadi float.)

# operation between series are also broadcasted
print(q1_series+q2_series)
Brazil      NaN
China     950.0
India     410.0
Japan       NaN
USA       510.0
dtype: float64

Mengapa tidak nol saja? Ketiadaan label pada salah satu series dianggap sebagai ketidaktahuan data untuk label tersebut, bukan dianggap nol.

Apabila diinginkan agar data yang tiada dianggap nol terlebih dahulu baru dijumlahkan, bisa seperti berikut:

print(q1_series.add(q2_series, fill_value=0))
Brazil    100.0
China     950.0
India     410.0
Japan      80.0
USA       510.0
dtype: float64

data frame

Sebuah pd.DataFrame terdiri dari beberapa pd.Series yang berbagi nilai indeks.

Misalkan kita punya data seperti berikut.

my_data = np.array([
    [25, 59, 18],
    [75, 54, 65],
    [29, 21,  7],
    [32, 68, 16]
])
my_data
array([[25, 59, 18],
       [75, 54, 65],
       [29, 21,  7],
       [32, 68, 16]])

Kita akan membuat pd.Dataframe melalui python list. Perhatikan bahwa kita dapat memberikan nama pada kolom dan baris

my_index= ["Toko A", "Toko B", "Toko C", "Toko D"]
my_columns= ["Apel", "Jeruk", "Pisang"]

df= pd.DataFrame(data=my_data, index=my_index, columns=my_columns)
df
Apel Jeruk Pisang
Toko A 25 59 18
Toko B 75 54 65
Toko C 29 21 7
Toko D 32 68 16
df_2 = pd.DataFrame(data=my_data)
df_2
0 1 2
0 25 59 18
1 75 54 65
2 29 21 7
3 32 68 16
df_3 = pd.DataFrame(data=my_data, columns=my_columns)
df_3
Apel Jeruk Pisang
0 25 59 18
1 75 54 65
2 29 21 7
3 32 68 16

membaca file csv sebagai pd.DataFrame

Jika berkas .py atau .ipynb Anda berada di lokasi folder yang sama persis dengan berkas .csv yang ingin Anda baca, cukup berikan nama berkas sebagai string, misalnya:

df = pd.read_csv('some_file.csv')

Berikan s berkas jika Anda berada di direktori yang berbeda. Jalur berkas harus 100% benar agar ini berfungsi. Misalnya:

df = pd.read_csv("C:\\Users\\myself\\files\\some_file.csv")

sebelum itu, kalian dapat mendownload dataset “Waiter’s Tips Dataset” melalui salah satu link berikut:

df_tips = pd.read_csv('./tips.csv')
df_tips
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
... ... ... ... ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404 Sat1766
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196 Sat3880
242 17.82 1.75 Male No Sat Dinner 2 8.91 Dennis Dixon 4375220550950 Sat17
243 18.78 3.00 Female No Thur Dinner 2 9.39 Michelle Hardin 3511451626698139 Thur672

244 rows × 11 columns

Operasi sederhana pada DataFrame

# mengecek nama kolom
df_tips.columns
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
      dtype='object')
# mengecek 
df_tips.index
RangeIndex(start=0, stop=244, step=1)
df_tips.head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
df_tips.head(10)
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
5 25.29 4.71 Male No Sun Dinner 4 6.32 Erik Smith 213140353657882 Sun9679
6 8.77 2.00 Male No Sun Dinner 2 4.38 Kristopher Johnson 2223727524230344 Sun5985
7 26.88 3.12 Male No Sun Dinner 4 6.72 Robert Buck 3514785077705092 Sun8157
8 15.04 1.96 Male No Sun Dinner 2 7.52 Joseph Mcdonald 3522866365840377 Sun6820
9 14.78 3.23 Male No Sun Dinner 2 7.39 Jerome Abbott 3532124519049786 Sun3775
df_tips.tail()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404 Sat1766
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196 Sat3880
242 17.82 1.75 Male No Sat Dinner 2 8.91 Dennis Dixon 4375220550950 Sat17
243 18.78 3.00 Female No Thur Dinner 2 9.39 Michelle Hardin 3511451626698139 Thur672
df_tips.tail(10)
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
234 15.53 3.00 Male Yes Sat Dinner 2 7.76 Tracy Douglas 4097938155941930 Sat7220
235 10.07 1.25 Male No Sat Dinner 2 5.04 Sean Gonzalez 3534021246117605 Sat4615
236 12.60 1.00 Male Yes Sat Dinner 2 6.30 Matthew Myers 3543676378973965 Sat5032
237 32.83 1.17 Male Yes Sat Dinner 2 16.42 Thomas Brown 4284722681265508 Sat2929
238 35.83 4.67 Female No Sat Dinner 3 11.94 Kimberly Crane 676184013727 Sat9777
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404 Sat1766
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196 Sat3880
242 17.82 1.75 Male No Sat Dinner 2 8.91 Dennis Dixon 4375220550950 Sat17
243 18.78 3.00 Female No Thur Dinner 2 9.39 Michelle Hardin 3511451626698139 Thur672
df_tips.describe()
total_bill tip size price_per_person CC Number
count 244.000000 244.000000 244.000000 244.000000 2.440000e+02
mean 19.785943 2.998279 2.569672 7.888197 2.563496e+15
std 8.902412 1.383638 0.951100 2.914234 2.369340e+15
min 3.070000 1.000000 1.000000 2.880000 6.040679e+10
25% 13.347500 2.000000 2.000000 5.800000 3.040731e+13
50% 17.795000 2.900000 2.000000 7.255000 3.525318e+15
75% 24.127500 3.562500 3.000000 9.390000 4.553675e+15
max 50.810000 10.000000 6.000000 20.270000 6.596454e+15
df_tips.describe().transpose()
count mean std min 25% 50% 75% max
total_bill 244.0 1.978594e+01 8.902412e+00 3.070000e+00 1.334750e+01 1.779500e+01 2.412750e+01 5.081000e+01
tip 244.0 2.998279e+00 1.383638e+00 1.000000e+00 2.000000e+00 2.900000e+00 3.562500e+00 1.000000e+01
size 244.0 2.569672e+00 9.510998e-01 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 6.000000e+00
price_per_person 244.0 7.888197e+00 2.914234e+00 2.880000e+00 5.800000e+00 7.255000e+00 9.390000e+00 2.027000e+01
CC Number 244.0 2.563496e+15 2.369340e+15 6.040679e+10 3.040731e+13 3.525318e+15 4.553675e+15 6.596454e+15

Transformasi data (row-wise)

filtering

df_tips.head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
print(df_tips["size"] == 3)
0      False
1       True
2       True
3      False
4      False
       ...  
239     True
240    False
241    False
242    False
243    False
Name: size, Length: 244, dtype: bool
conditional_size = (df_tips["size"] == 3)
df_tips[conditional_size]
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
16 10.33 1.67 Female No Sun Dinner 3 3.44 Elizabeth Foster 4240025044626033 Sun9715
17 16.29 3.71 Male No Sun Dinner 3 5.43 John Pittman 6521340257218708 Sun2998
18 16.97 3.50 Female No Sun Dinner 3 5.66 Laura Martinez 30422275171379 Sun2789
19 20.65 3.35 Male No Sat Dinner 3 6.88 Timothy Oneal 6568069240986485 Sat9213
35 24.06 3.60 Male No Sat Dinner 3 8.02 Joseph Mullins 5519770449260299 Sat632
36 16.31 2.00 Male No Sat Dinner 3 5.44 William Ford 3527691170179398 Sat9139
37 16.93 3.07 Female No Sat Dinner 3 5.64 Erin Lewis 5161695527390786 Sat6406
38 18.69 2.31 Male No Sat Dinner 3 6.23 Brandon Bradley 4427601595688633 Sat4056
39 31.27 5.00 Male No Sat Dinner 3 10.42 Mr. Brandon Berry 6011525851069856 Sat6373
40 16.04 2.24 Male No Sat Dinner 3 5.35 Adam Edwards 3544447755679420 Sat8549
48 28.55 2.05 Male No Sun Dinner 3 9.52 Austin Fisher 6011481668986587 Sun4142
64 17.59 2.64 Male No Sat Dinner 3 5.86 Michael Johnson 2222114458088108 Sat1667
65 20.08 3.15 Male No Sat Dinner 3 6.69 Justin Dixon 180021262464926 Sat6840
71 17.07 3.00 Female No Sat Dinner 3 5.69 Teresa Fisher 5442222963796367 Sat3469
102 44.30 2.50 Female Yes Sat Dinner 3 14.77 Heather Cohen 379771118886604 Sat6240
112 38.07 4.00 Male No Sun Dinner 3 12.69 Jeff Lopez 3572865915176463 Sun591
114 25.71 4.00 Female No Sun Dinner 3 8.57 Katie Smith 5400160161311292 Sun6492
129 22.82 2.18 Male No Thur Lunch 3 7.61 Raymond Torres 4855776744024 Thur9424
146 18.64 1.36 Female No Thur Lunch 3 6.21 Kelly Estrada 60463302327 Thur3941
152 17.26 2.74 Male No Sun Dinner 3 5.75 Gregory Smith 4292362333741 Sun5205
162 16.21 2.00 Female No Sun Dinner 3 5.40 Jennifer Baird 4227834176859693 Sun5521
165 24.52 3.48 Male No Sun Dinner 3 8.17 Jacob Hansen 4031116007387 Sun9043
170 50.81 10.00 Male Yes Sat Dinner 3 16.94 Gregory Clark 5473850968388236 Sat1954
182 45.35 3.50 Male Yes Sun Dinner 3 15.12 Jose Parsons 4112207559459910 Sun2337
186 20.90 3.50 Female Yes Sun Dinner 3 6.97 Heidi Atkinson 4422858423131187 Sun4254
188 18.15 3.50 Female Yes Sun Dinner 3 6.05 Glenda Wiggins 578329325307 Sun430
189 23.10 4.00 Male Yes Sun Dinner 3 7.70 Richard Stevens 3560193117506187 Sun1821
200 18.71 4.00 Male Yes Thur Lunch 3 6.24 Jason Conrad 4581233003487 Thur6048
205 16.47 3.23 Female Yes Thur Lunch 3 5.49 Carly Reyes 4787787236486 Thur8084
206 26.59 3.41 Male Yes Sat Dinner 3 8.86 Daniel Owens 38971087967574 Sat1
210 30.06 2.00 Male Yes Sat Dinner 3 10.02 Shawn Mendoza 30184049218122 Sat8361
214 28.17 6.50 Female Yes Sat Dinner 3 9.39 Marissa Jackson 4922302538691962 Sat3374
223 15.98 3.00 Female No Fri Lunch 3 5.33 Mary Rivera 5343428579353069 Fri6014
231 15.69 3.00 Male Yes Sat Dinner 3 5.23 Jason Parks 4812333796161 Sat6334
238 35.83 4.67 Female No Sat Dinner 3 11.94 Kimberly Crane 676184013727 Sat9777
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
conditional = (df_tips["size"] == 3) & (df_tips["total_bill"] > 20)
print(conditional)
0      False
1      False
2       True
3      False
4      False
       ...  
239     True
240    False
241    False
242    False
243    False
Length: 244, dtype: bool
df_tips[conditional]
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
19 20.65 3.35 Male No Sat Dinner 3 6.88 Timothy Oneal 6568069240986485 Sat9213
35 24.06 3.60 Male No Sat Dinner 3 8.02 Joseph Mullins 5519770449260299 Sat632
39 31.27 5.00 Male No Sat Dinner 3 10.42 Mr. Brandon Berry 6011525851069856 Sat6373
48 28.55 2.05 Male No Sun Dinner 3 9.52 Austin Fisher 6011481668986587 Sun4142
65 20.08 3.15 Male No Sat Dinner 3 6.69 Justin Dixon 180021262464926 Sat6840
102 44.30 2.50 Female Yes Sat Dinner 3 14.77 Heather Cohen 379771118886604 Sat6240
112 38.07 4.00 Male No Sun Dinner 3 12.69 Jeff Lopez 3572865915176463 Sun591
114 25.71 4.00 Female No Sun Dinner 3 8.57 Katie Smith 5400160161311292 Sun6492
129 22.82 2.18 Male No Thur Lunch 3 7.61 Raymond Torres 4855776744024 Thur9424
165 24.52 3.48 Male No Sun Dinner 3 8.17 Jacob Hansen 4031116007387 Sun9043
170 50.81 10.00 Male Yes Sat Dinner 3 16.94 Gregory Clark 5473850968388236 Sat1954
182 45.35 3.50 Male Yes Sun Dinner 3 15.12 Jose Parsons 4112207559459910 Sun2337
186 20.90 3.50 Female Yes Sun Dinner 3 6.97 Heidi Atkinson 4422858423131187 Sun4254
189 23.10 4.00 Male Yes Sun Dinner 3 7.70 Richard Stevens 3560193117506187 Sun1821
206 26.59 3.41 Male Yes Sat Dinner 3 8.86 Daniel Owens 38971087967574 Sat1
210 30.06 2.00 Male Yes Sat Dinner 3 10.02 Shawn Mendoza 30184049218122 Sat8361
214 28.17 6.50 Female Yes Sat Dinner 3 9.39 Marissa Jackson 4922302538691962 Sat3374
238 35.83 4.67 Female No Sat Dinner 3 11.94 Kimberly Crane 676184013727 Sat9777
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
df_tips[(df_tips["size"] == 3) & (df_tips["total_bill"] > 20)]
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
19 20.65 3.35 Male No Sat Dinner 3 6.88 Timothy Oneal 6568069240986485 Sat9213
35 24.06 3.60 Male No Sat Dinner 3 8.02 Joseph Mullins 5519770449260299 Sat632
39 31.27 5.00 Male No Sat Dinner 3 10.42 Mr. Brandon Berry 6011525851069856 Sat6373
48 28.55 2.05 Male No Sun Dinner 3 9.52 Austin Fisher 6011481668986587 Sun4142
65 20.08 3.15 Male No Sat Dinner 3 6.69 Justin Dixon 180021262464926 Sat6840
102 44.30 2.50 Female Yes Sat Dinner 3 14.77 Heather Cohen 379771118886604 Sat6240
112 38.07 4.00 Male No Sun Dinner 3 12.69 Jeff Lopez 3572865915176463 Sun591
114 25.71 4.00 Female No Sun Dinner 3 8.57 Katie Smith 5400160161311292 Sun6492
129 22.82 2.18 Male No Thur Lunch 3 7.61 Raymond Torres 4855776744024 Thur9424
165 24.52 3.48 Male No Sun Dinner 3 8.17 Jacob Hansen 4031116007387 Sun9043
170 50.81 10.00 Male Yes Sat Dinner 3 16.94 Gregory Clark 5473850968388236 Sat1954
182 45.35 3.50 Male Yes Sun Dinner 3 15.12 Jose Parsons 4112207559459910 Sun2337
186 20.90 3.50 Female Yes Sun Dinner 3 6.97 Heidi Atkinson 4422858423131187 Sun4254
189 23.10 4.00 Male Yes Sun Dinner 3 7.70 Richard Stevens 3560193117506187 Sun1821
206 26.59 3.41 Male Yes Sat Dinner 3 8.86 Daniel Owens 38971087967574 Sat1
210 30.06 2.00 Male Yes Sat Dinner 3 10.02 Shawn Mendoza 30184049218122 Sat8361
214 28.17 6.50 Female Yes Sat Dinner 3 9.39 Marissa Jackson 4922302538691962 Sat3374
238 35.83 4.67 Female No Sat Dinner 3 11.94 Kimberly Crane 676184013727 Sat9777
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
conditional_or = (df_tips["tip"] > 4) | (df_tips["total_bill"] > 20)
df_tips[conditional_or]
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
5 25.29 4.71 Male No Sun Dinner 4 6.32 Erik Smith 213140353657882 Sun9679
7 26.88 3.12 Male No Sun Dinner 4 6.72 Robert Buck 3514785077705092 Sun8157
... ... ... ... ... ... ... ... ... ... ... ...
237 32.83 1.17 Male Yes Sat Dinner 2 16.42 Thomas Brown 4284722681265508 Sat2929
238 35.83 4.67 Female No Sat Dinner 3 11.94 Kimberly Crane 676184013727 Sat9777
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404 Sat1766
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196 Sat3880

101 rows × 11 columns

weekend = ["Sun", "Sat"]
conditional_in = df_tips["day"].isin(weekend)
df_tips[conditional_in]
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
... ... ... ... ... ... ... ... ... ... ... ...
238 35.83 4.67 Female No Sat Dinner 3 11.94 Kimberly Crane 676184013727 Sat9777
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404 Sat1766
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196 Sat3880
242 17.82 1.75 Male No Sat Dinner 2 8.91 Dennis Dixon 4375220550950 Sat17

163 rows × 11 columns

df_tips.head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251

mencari nilai unik

df_tips["day"].unique()
array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)
df_tips[["day","time"]]
day time
0 Sun Dinner
1 Sun Dinner
2 Sun Dinner
3 Sun Dinner
4 Sun Dinner
... ... ...
239 Sat Dinner
240 Sat Dinner
241 Sat Dinner
242 Sat Dinner
243 Thur Dinner

244 rows × 2 columns

df_tips.drop_duplicates(["day","time"])[["day","time"]]
day time
0 Sun Dinner
19 Sat Dinner
77 Thur Lunch
90 Fri Dinner
220 Fri Lunch
243 Thur Dinner

Transforming Data (Column Wise)

Selecting Columns

print(df_tips["day"])
0       Sun
1       Sun
2       Sun
3       Sun
4       Sun
       ... 
239     Sat
240     Sat
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object
print(df_tips.day)
0       Sun
1       Sun
2       Sun
3       Sun
4       Sun
       ... 
239     Sat
240     Sat
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object
df_tips[["day","time"]]
day time
0 Sun Dinner
1 Sun Dinner
2 Sun Dinner
3 Sun Dinner
4 Sun Dinner
... ... ...
239 Sat Dinner
240 Sat Dinner
241 Sat Dinner
242 Sat Dinner
243 Thur Dinner

244 rows × 2 columns

Mutating (create new column)

df_tips["tips_percentage"]= df_tips["tip"]/df_tips["total_bill"]*100

df_tips.head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID tips_percentage
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959 5.944673
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608 16.054159
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458 16.658734
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260 13.978041
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251 14.680765

renaming column

df_tips.rename(columns={"tips_percentage": "tips_%"}, inplace=True)
df_tips.head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID tips_%
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959 5.944673
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608 16.054159
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458 16.658734
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260 13.978041
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251 14.680765

relocate columns

#relocate tips_percentage_% column to the rightmost
cols = list(df_tips.columns)
cols = [cols[-1]]+ cols[:-2]

df_tips = df_tips[cols]
df_tips
tips_% total_bill tip sex smoker day time size price_per_person Payer Name CC Number
0 5.944673 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410
1 16.054159 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230
2 16.658734 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322
3 13.978041 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994
4 14.680765 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221
... ... ... ... ... ... ... ... ... ... ... ...
239 20.392697 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842
240 7.358352 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404
241 8.822232 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196
242 9.820426 17.82 1.75 Male No Sat Dinner 2 8.91 Dennis Dixon 4375220550950
243 15.974441 18.78 3.00 Female No Thur Dinner 2 9.39 Michelle Hardin 3511451626698139

244 rows × 11 columns

Export DataFrame ke CSV

df_tips.to_csv("tips_modified.csv")