import numpy as np
import pandas as pd
Modul 1 Sains Data: Pengenalan Pandas, Transformasi Data
Modul 1 Sains Data: Pengenalan Pandas, Transformasi Data
Kembali ke Sains Data
Prerequisites
Pada module ini kita akan coba mememahami package pandas, yang merupakan package inti dalam sains-data. kita akan coba melakukan beberapa transformasi data menggunakan pandas.
sebelum itu, python module
di bawah ini yang akan digunakan selama praktikum.
Apabila ada yang belum terinstal, silakan instal terlebih dahulu menggunakan pip
:
!pip install numpy !pip install pandas
atau conda
jika sedang menggunakan Anaconda:
conda install numpy conda install pandas
Series
pandas.Series
sangat mirip dengan array NumPy (bahkan dibangun di atas objek array NumPy). Yang membedakan array NumPy dari sebuah Series adalah bahwa sebuah Series dapat memiliki label index, yang berarti dapat diindeks dengan label, bukan hanya lokasi nomor saja. Selain itu, sebuah Series tidak perlu menyimpan data numerik, ia dapat menyimpan objek Python sembarang.
Membuat pd.Series dengan list
Paling mudah, ktia dapat membuat pd.Series
dengan python list
= ['a','b','c','d','e']
my_index= [1,2,3,4,5]
my_data= pd.Series(data=my_data, index=my_index) my_series
print(my_series)
a 1
b 2
c 3
d 4
e 5
dtype: int64
print(my_series.__class__)
<class 'pandas.core.series.Series'>
Membuat pd.Series dengan dictionary
Kita juga dapat membuat pd.Series
dengan dictionary
# creating a series from a dictionary
= {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
my_dict= pd.Series(my_dict) my_series_dict
print(my_series_dict)
a 1
b 2
c 3
d 4
e 5
dtype: int64
print(my_series_dict.__class__)
<class 'pandas.core.series.Series'>
Operasi pada Series
# Imaginary Sales Data for 1st and 2nd Quarters for Global Company
= {'Japan': 80, 'China': 450, 'India': 200, 'USA': 250}
q1 = {'Brazil': 100,'China': 500, 'India': 210,'USA': 260}
q2
# Creating a Series from a Dictionary q1 and q2
= pd.Series(q1)
q1_series= pd.Series(q2) q2_series
print(q1_series)
Japan 80
China 450
India 200
USA 250
dtype: int64
Kita dapat mengindeks dengan label
# call values of q1_series based on named index
print(q1_series['Japan'])
print(q1_series['China'])
print(q1_series['India'])
80
450
200
kita dapat tetap dapat mengindeks dengan integer
# u can also call values of q1_series based on positional index
print(q1_series[0])
print(q1_series[1])
print(q1_series[2])
80
450
200
hati-hati dalam melakukan indexing dengan label. bisa saja terjadi error jika label tidak ada di dalam pd.series
# remember named index is case sensitive
try:
print(q1_series['japan'])
except:
print('something went wrong')
something went wrong
Operasi aritmatik sederhana pada pd.Series
bersifat broadcasting, yaitu diterapkan ke masing-masing elemen
# operations with arithmetic on series are broadcasted to all values
print(q1_series*2)
Japan 160
China 900
India 400
USA 500
dtype: int64
print(q1_series+1000)
Japan 1080
China 1450
India 1200
USA 1250
dtype: int64
Untuk penjumlahan antara dua pd.Series
, apabila ada label yang hanya muncul di salah satu series, maka label tersebut akan muncul di hasil jumlah dengan data NaN (not a number, di sini artinya tidak ada data).
(Kebetulan, keterangan NaN hanya bisa muncul untuk tipe data float
atau koma-komaan, sehingga tipe data terpaksa diubah menjadi float
.)
# operation between series are also broadcasted
print(q1_series+q2_series)
Brazil NaN
China 950.0
India 410.0
Japan NaN
USA 510.0
dtype: float64
Mengapa tidak nol saja? Ketiadaan label pada salah satu series dianggap sebagai ketidaktahuan data untuk label tersebut, bukan dianggap nol.
Apabila diinginkan agar data yang tiada dianggap nol terlebih dahulu baru dijumlahkan, bisa seperti berikut:
print(q1_series.add(q2_series, fill_value=0))
Brazil 100.0
China 950.0
India 410.0
Japan 80.0
USA 510.0
dtype: float64
data frame
Sebuah pd.DataFrame
terdiri dari beberapa pd.Series
yang berbagi nilai indeks.
Misalkan kita punya data seperti berikut.
= np.array([
my_data 25, 59, 18],
[75, 54, 65],
[29, 21, 7],
[32, 68, 16]
[ ])
my_data
array([[25, 59, 18],
[75, 54, 65],
[29, 21, 7],
[32, 68, 16]])
Kita akan membuat pd.Dataframe
melalui python list
. Perhatikan bahwa kita dapat memberikan nama pada kolom dan baris
= ["Toko A", "Toko B", "Toko C", "Toko D"]
my_index= ["Apel", "Jeruk", "Pisang"]
my_columns
= pd.DataFrame(data=my_data, index=my_index, columns=my_columns) df
df
Apel | Jeruk | Pisang | |
---|---|---|---|
Toko A | 25 | 59 | 18 |
Toko B | 75 | 54 | 65 |
Toko C | 29 | 21 | 7 |
Toko D | 32 | 68 | 16 |
= pd.DataFrame(data=my_data)
df_2 df_2
0 | 1 | 2 | |
---|---|---|---|
0 | 25 | 59 | 18 |
1 | 75 | 54 | 65 |
2 | 29 | 21 | 7 |
3 | 32 | 68 | 16 |
= pd.DataFrame(data=my_data, columns=my_columns)
df_3 df_3
Apel | Jeruk | Pisang | |
---|---|---|---|
0 | 25 | 59 | 18 |
1 | 75 | 54 | 65 |
2 | 29 | 21 | 7 |
3 | 32 | 68 | 16 |
membaca file csv sebagai pd.DataFrame
Jika berkas .py atau .ipynb Anda berada di lokasi folder yang sama persis dengan berkas .csv yang ingin Anda baca, cukup berikan nama berkas sebagai string, misalnya:
df = pd.read_csv('some_file.csv')
Berikan s berkas jika Anda berada di direktori yang berbeda. Jalur berkas harus 100% benar agar ini berfungsi. Misalnya:
df = pd.read_csv("C:\\Users\\myself\\files\\some_file.csv")
sebelum itu, kalian dapat mendownload dataset “Waiter’s Tips Dataset” melalui salah satu link berikut:
= pd.read_csv('./tips.csv') df_tips
df_tips
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 | Sat1766 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 | Sat3880 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 8.91 | Dennis Dixon | 4375220550950 | Sat17 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 9.39 | Michelle Hardin | 3511451626698139 | Thur672 |
244 rows × 11 columns
Operasi sederhana pada DataFrame
# mengecek nama kolom
df_tips.columns
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
dtype='object')
# mengecek
df_tips.index
RangeIndex(start=0, stop=244, step=1)
df_tips.head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
10) df_tips.head(
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 | 6.32 | Erik Smith | 213140353657882 | Sun9679 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 | 4.38 | Kristopher Johnson | 2223727524230344 | Sun5985 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 | 6.72 | Robert Buck | 3514785077705092 | Sun8157 |
8 | 15.04 | 1.96 | Male | No | Sun | Dinner | 2 | 7.52 | Joseph Mcdonald | 3522866365840377 | Sun6820 |
9 | 14.78 | 3.23 | Male | No | Sun | Dinner | 2 | 7.39 | Jerome Abbott | 3532124519049786 | Sun3775 |
df_tips.tail()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 | Sat1766 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 | Sat3880 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 8.91 | Dennis Dixon | 4375220550950 | Sat17 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 9.39 | Michelle Hardin | 3511451626698139 | Thur672 |
10) df_tips.tail(
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
234 | 15.53 | 3.00 | Male | Yes | Sat | Dinner | 2 | 7.76 | Tracy Douglas | 4097938155941930 | Sat7220 |
235 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 | 5.04 | Sean Gonzalez | 3534021246117605 | Sat4615 |
236 | 12.60 | 1.00 | Male | Yes | Sat | Dinner | 2 | 6.30 | Matthew Myers | 3543676378973965 | Sat5032 |
237 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 | 16.42 | Thomas Brown | 4284722681265508 | Sat2929 |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 | 11.94 | Kimberly Crane | 676184013727 | Sat9777 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 | Sat1766 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 | Sat3880 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 8.91 | Dennis Dixon | 4375220550950 | Sat17 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 9.39 | Michelle Hardin | 3511451626698139 | Thur672 |
df_tips.describe()
total_bill | tip | size | price_per_person | CC Number | |
---|---|---|---|---|---|
count | 244.000000 | 244.000000 | 244.000000 | 244.000000 | 2.440000e+02 |
mean | 19.785943 | 2.998279 | 2.569672 | 7.888197 | 2.563496e+15 |
std | 8.902412 | 1.383638 | 0.951100 | 2.914234 | 2.369340e+15 |
min | 3.070000 | 1.000000 | 1.000000 | 2.880000 | 6.040679e+10 |
25% | 13.347500 | 2.000000 | 2.000000 | 5.800000 | 3.040731e+13 |
50% | 17.795000 | 2.900000 | 2.000000 | 7.255000 | 3.525318e+15 |
75% | 24.127500 | 3.562500 | 3.000000 | 9.390000 | 4.553675e+15 |
max | 50.810000 | 10.000000 | 6.000000 | 20.270000 | 6.596454e+15 |
df_tips.describe().transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
total_bill | 244.0 | 1.978594e+01 | 8.902412e+00 | 3.070000e+00 | 1.334750e+01 | 1.779500e+01 | 2.412750e+01 | 5.081000e+01 |
tip | 244.0 | 2.998279e+00 | 1.383638e+00 | 1.000000e+00 | 2.000000e+00 | 2.900000e+00 | 3.562500e+00 | 1.000000e+01 |
size | 244.0 | 2.569672e+00 | 9.510998e-01 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
price_per_person | 244.0 | 7.888197e+00 | 2.914234e+00 | 2.880000e+00 | 5.800000e+00 | 7.255000e+00 | 9.390000e+00 | 2.027000e+01 |
CC Number | 244.0 | 2.563496e+15 | 2.369340e+15 | 6.040679e+10 | 3.040731e+13 | 3.525318e+15 | 4.553675e+15 | 6.596454e+15 |
Transformasi data (row-wise)
filtering
df_tips.head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
print(df_tips["size"] == 3)
0 False
1 True
2 True
3 False
4 False
...
239 True
240 False
241 False
242 False
243 False
Name: size, Length: 244, dtype: bool
= (df_tips["size"] == 3)
conditional_size df_tips[conditional_size]
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 | 3.44 | Elizabeth Foster | 4240025044626033 | Sun9715 |
17 | 16.29 | 3.71 | Male | No | Sun | Dinner | 3 | 5.43 | John Pittman | 6521340257218708 | Sun2998 |
18 | 16.97 | 3.50 | Female | No | Sun | Dinner | 3 | 5.66 | Laura Martinez | 30422275171379 | Sun2789 |
19 | 20.65 | 3.35 | Male | No | Sat | Dinner | 3 | 6.88 | Timothy Oneal | 6568069240986485 | Sat9213 |
35 | 24.06 | 3.60 | Male | No | Sat | Dinner | 3 | 8.02 | Joseph Mullins | 5519770449260299 | Sat632 |
36 | 16.31 | 2.00 | Male | No | Sat | Dinner | 3 | 5.44 | William Ford | 3527691170179398 | Sat9139 |
37 | 16.93 | 3.07 | Female | No | Sat | Dinner | 3 | 5.64 | Erin Lewis | 5161695527390786 | Sat6406 |
38 | 18.69 | 2.31 | Male | No | Sat | Dinner | 3 | 6.23 | Brandon Bradley | 4427601595688633 | Sat4056 |
39 | 31.27 | 5.00 | Male | No | Sat | Dinner | 3 | 10.42 | Mr. Brandon Berry | 6011525851069856 | Sat6373 |
40 | 16.04 | 2.24 | Male | No | Sat | Dinner | 3 | 5.35 | Adam Edwards | 3544447755679420 | Sat8549 |
48 | 28.55 | 2.05 | Male | No | Sun | Dinner | 3 | 9.52 | Austin Fisher | 6011481668986587 | Sun4142 |
64 | 17.59 | 2.64 | Male | No | Sat | Dinner | 3 | 5.86 | Michael Johnson | 2222114458088108 | Sat1667 |
65 | 20.08 | 3.15 | Male | No | Sat | Dinner | 3 | 6.69 | Justin Dixon | 180021262464926 | Sat6840 |
71 | 17.07 | 3.00 | Female | No | Sat | Dinner | 3 | 5.69 | Teresa Fisher | 5442222963796367 | Sat3469 |
102 | 44.30 | 2.50 | Female | Yes | Sat | Dinner | 3 | 14.77 | Heather Cohen | 379771118886604 | Sat6240 |
112 | 38.07 | 4.00 | Male | No | Sun | Dinner | 3 | 12.69 | Jeff Lopez | 3572865915176463 | Sun591 |
114 | 25.71 | 4.00 | Female | No | Sun | Dinner | 3 | 8.57 | Katie Smith | 5400160161311292 | Sun6492 |
129 | 22.82 | 2.18 | Male | No | Thur | Lunch | 3 | 7.61 | Raymond Torres | 4855776744024 | Thur9424 |
146 | 18.64 | 1.36 | Female | No | Thur | Lunch | 3 | 6.21 | Kelly Estrada | 60463302327 | Thur3941 |
152 | 17.26 | 2.74 | Male | No | Sun | Dinner | 3 | 5.75 | Gregory Smith | 4292362333741 | Sun5205 |
162 | 16.21 | 2.00 | Female | No | Sun | Dinner | 3 | 5.40 | Jennifer Baird | 4227834176859693 | Sun5521 |
165 | 24.52 | 3.48 | Male | No | Sun | Dinner | 3 | 8.17 | Jacob Hansen | 4031116007387 | Sun9043 |
170 | 50.81 | 10.00 | Male | Yes | Sat | Dinner | 3 | 16.94 | Gregory Clark | 5473850968388236 | Sat1954 |
182 | 45.35 | 3.50 | Male | Yes | Sun | Dinner | 3 | 15.12 | Jose Parsons | 4112207559459910 | Sun2337 |
186 | 20.90 | 3.50 | Female | Yes | Sun | Dinner | 3 | 6.97 | Heidi Atkinson | 4422858423131187 | Sun4254 |
188 | 18.15 | 3.50 | Female | Yes | Sun | Dinner | 3 | 6.05 | Glenda Wiggins | 578329325307 | Sun430 |
189 | 23.10 | 4.00 | Male | Yes | Sun | Dinner | 3 | 7.70 | Richard Stevens | 3560193117506187 | Sun1821 |
200 | 18.71 | 4.00 | Male | Yes | Thur | Lunch | 3 | 6.24 | Jason Conrad | 4581233003487 | Thur6048 |
205 | 16.47 | 3.23 | Female | Yes | Thur | Lunch | 3 | 5.49 | Carly Reyes | 4787787236486 | Thur8084 |
206 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 | 8.86 | Daniel Owens | 38971087967574 | Sat1 |
210 | 30.06 | 2.00 | Male | Yes | Sat | Dinner | 3 | 10.02 | Shawn Mendoza | 30184049218122 | Sat8361 |
214 | 28.17 | 6.50 | Female | Yes | Sat | Dinner | 3 | 9.39 | Marissa Jackson | 4922302538691962 | Sat3374 |
223 | 15.98 | 3.00 | Female | No | Fri | Lunch | 3 | 5.33 | Mary Rivera | 5343428579353069 | Fri6014 |
231 | 15.69 | 3.00 | Male | Yes | Sat | Dinner | 3 | 5.23 | Jason Parks | 4812333796161 | Sat6334 |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 | 11.94 | Kimberly Crane | 676184013727 | Sat9777 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
= (df_tips["size"] == 3) & (df_tips["total_bill"] > 20)
conditional print(conditional)
0 False
1 False
2 True
3 False
4 False
...
239 True
240 False
241 False
242 False
243 False
Length: 244, dtype: bool
df_tips[conditional]
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
19 | 20.65 | 3.35 | Male | No | Sat | Dinner | 3 | 6.88 | Timothy Oneal | 6568069240986485 | Sat9213 |
35 | 24.06 | 3.60 | Male | No | Sat | Dinner | 3 | 8.02 | Joseph Mullins | 5519770449260299 | Sat632 |
39 | 31.27 | 5.00 | Male | No | Sat | Dinner | 3 | 10.42 | Mr. Brandon Berry | 6011525851069856 | Sat6373 |
48 | 28.55 | 2.05 | Male | No | Sun | Dinner | 3 | 9.52 | Austin Fisher | 6011481668986587 | Sun4142 |
65 | 20.08 | 3.15 | Male | No | Sat | Dinner | 3 | 6.69 | Justin Dixon | 180021262464926 | Sat6840 |
102 | 44.30 | 2.50 | Female | Yes | Sat | Dinner | 3 | 14.77 | Heather Cohen | 379771118886604 | Sat6240 |
112 | 38.07 | 4.00 | Male | No | Sun | Dinner | 3 | 12.69 | Jeff Lopez | 3572865915176463 | Sun591 |
114 | 25.71 | 4.00 | Female | No | Sun | Dinner | 3 | 8.57 | Katie Smith | 5400160161311292 | Sun6492 |
129 | 22.82 | 2.18 | Male | No | Thur | Lunch | 3 | 7.61 | Raymond Torres | 4855776744024 | Thur9424 |
165 | 24.52 | 3.48 | Male | No | Sun | Dinner | 3 | 8.17 | Jacob Hansen | 4031116007387 | Sun9043 |
170 | 50.81 | 10.00 | Male | Yes | Sat | Dinner | 3 | 16.94 | Gregory Clark | 5473850968388236 | Sat1954 |
182 | 45.35 | 3.50 | Male | Yes | Sun | Dinner | 3 | 15.12 | Jose Parsons | 4112207559459910 | Sun2337 |
186 | 20.90 | 3.50 | Female | Yes | Sun | Dinner | 3 | 6.97 | Heidi Atkinson | 4422858423131187 | Sun4254 |
189 | 23.10 | 4.00 | Male | Yes | Sun | Dinner | 3 | 7.70 | Richard Stevens | 3560193117506187 | Sun1821 |
206 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 | 8.86 | Daniel Owens | 38971087967574 | Sat1 |
210 | 30.06 | 2.00 | Male | Yes | Sat | Dinner | 3 | 10.02 | Shawn Mendoza | 30184049218122 | Sat8361 |
214 | 28.17 | 6.50 | Female | Yes | Sat | Dinner | 3 | 9.39 | Marissa Jackson | 4922302538691962 | Sat3374 |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 | 11.94 | Kimberly Crane | 676184013727 | Sat9777 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
"size"] == 3) & (df_tips["total_bill"] > 20)] df_tips[(df_tips[
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
19 | 20.65 | 3.35 | Male | No | Sat | Dinner | 3 | 6.88 | Timothy Oneal | 6568069240986485 | Sat9213 |
35 | 24.06 | 3.60 | Male | No | Sat | Dinner | 3 | 8.02 | Joseph Mullins | 5519770449260299 | Sat632 |
39 | 31.27 | 5.00 | Male | No | Sat | Dinner | 3 | 10.42 | Mr. Brandon Berry | 6011525851069856 | Sat6373 |
48 | 28.55 | 2.05 | Male | No | Sun | Dinner | 3 | 9.52 | Austin Fisher | 6011481668986587 | Sun4142 |
65 | 20.08 | 3.15 | Male | No | Sat | Dinner | 3 | 6.69 | Justin Dixon | 180021262464926 | Sat6840 |
102 | 44.30 | 2.50 | Female | Yes | Sat | Dinner | 3 | 14.77 | Heather Cohen | 379771118886604 | Sat6240 |
112 | 38.07 | 4.00 | Male | No | Sun | Dinner | 3 | 12.69 | Jeff Lopez | 3572865915176463 | Sun591 |
114 | 25.71 | 4.00 | Female | No | Sun | Dinner | 3 | 8.57 | Katie Smith | 5400160161311292 | Sun6492 |
129 | 22.82 | 2.18 | Male | No | Thur | Lunch | 3 | 7.61 | Raymond Torres | 4855776744024 | Thur9424 |
165 | 24.52 | 3.48 | Male | No | Sun | Dinner | 3 | 8.17 | Jacob Hansen | 4031116007387 | Sun9043 |
170 | 50.81 | 10.00 | Male | Yes | Sat | Dinner | 3 | 16.94 | Gregory Clark | 5473850968388236 | Sat1954 |
182 | 45.35 | 3.50 | Male | Yes | Sun | Dinner | 3 | 15.12 | Jose Parsons | 4112207559459910 | Sun2337 |
186 | 20.90 | 3.50 | Female | Yes | Sun | Dinner | 3 | 6.97 | Heidi Atkinson | 4422858423131187 | Sun4254 |
189 | 23.10 | 4.00 | Male | Yes | Sun | Dinner | 3 | 7.70 | Richard Stevens | 3560193117506187 | Sun1821 |
206 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 | 8.86 | Daniel Owens | 38971087967574 | Sat1 |
210 | 30.06 | 2.00 | Male | Yes | Sat | Dinner | 3 | 10.02 | Shawn Mendoza | 30184049218122 | Sat8361 |
214 | 28.17 | 6.50 | Female | Yes | Sat | Dinner | 3 | 9.39 | Marissa Jackson | 4922302538691962 | Sat3374 |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 | 11.94 | Kimberly Crane | 676184013727 | Sat9777 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
= (df_tips["tip"] > 4) | (df_tips["total_bill"] > 20)
conditional_or df_tips[conditional_or]
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 | 6.32 | Erik Smith | 213140353657882 | Sun9679 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 | 6.72 | Robert Buck | 3514785077705092 | Sun8157 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
237 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 | 16.42 | Thomas Brown | 4284722681265508 | Sat2929 |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 | 11.94 | Kimberly Crane | 676184013727 | Sat9777 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 | Sat1766 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 | Sat3880 |
101 rows × 11 columns
= ["Sun", "Sat"]
weekend = df_tips["day"].isin(weekend)
conditional_in df_tips[conditional_in]
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 | 11.94 | Kimberly Crane | 676184013727 | Sat9777 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 | Sat1766 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 | Sat3880 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 8.91 | Dennis Dixon | 4375220550950 | Sat17 |
163 rows × 11 columns
df_tips.head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
mencari nilai unik
"day"].unique() df_tips[
array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)
"day","time"]] df_tips[[
day | time | |
---|---|---|
0 | Sun | Dinner |
1 | Sun | Dinner |
2 | Sun | Dinner |
3 | Sun | Dinner |
4 | Sun | Dinner |
... | ... | ... |
239 | Sat | Dinner |
240 | Sat | Dinner |
241 | Sat | Dinner |
242 | Sat | Dinner |
243 | Thur | Dinner |
244 rows × 2 columns
"day","time"])[["day","time"]] df_tips.drop_duplicates([
day | time | |
---|---|---|
0 | Sun | Dinner |
19 | Sat | Dinner |
77 | Thur | Lunch |
90 | Fri | Dinner |
220 | Fri | Lunch |
243 | Thur | Dinner |
Transforming Data (Column Wise)
Selecting Columns
print(df_tips["day"])
0 Sun
1 Sun
2 Sun
3 Sun
4 Sun
...
239 Sat
240 Sat
241 Sat
242 Sat
243 Thur
Name: day, Length: 244, dtype: object
print(df_tips.day)
0 Sun
1 Sun
2 Sun
3 Sun
4 Sun
...
239 Sat
240 Sat
241 Sat
242 Sat
243 Thur
Name: day, Length: 244, dtype: object
"day","time"]] df_tips[[
day | time | |
---|---|---|
0 | Sun | Dinner |
1 | Sun | Dinner |
2 | Sun | Dinner |
3 | Sun | Dinner |
4 | Sun | Dinner |
... | ... | ... |
239 | Sat | Dinner |
240 | Sat | Dinner |
241 | Sat | Dinner |
242 | Sat | Dinner |
243 | Thur | Dinner |
244 rows × 2 columns
Mutating (create new column)
"tips_percentage"]= df_tips["tip"]/df_tips["total_bill"]*100
df_tips[
df_tips.head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | tips_percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 | 5.944673 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 | 16.054159 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 | 16.658734 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 | 13.978041 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 | 14.680765 |
renaming column
={"tips_percentage": "tips_%"}, inplace=True)
df_tips.rename(columns df_tips.head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | tips_% | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 | 5.944673 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 | 16.054159 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 | 16.658734 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 | 13.978041 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 | 14.680765 |
relocate columns
#relocate tips_percentage_% column to the rightmost
= list(df_tips.columns)
cols = [cols[-1]]+ cols[:-2]
cols
= df_tips[cols] df_tips
df_tips
tips_% | total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.944673 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 |
1 | 16.054159 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 |
2 | 16.658734 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 |
3 | 13.978041 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 |
4 | 14.680765 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
239 | 20.392697 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 |
240 | 7.358352 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 |
241 | 8.822232 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 |
242 | 9.820426 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 8.91 | Dennis Dixon | 4375220550950 |
243 | 15.974441 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 9.39 | Michelle Hardin | 3511451626698139 |
244 rows × 11 columns
Export DataFrame ke CSV
"tips_modified.csv") df_tips.to_csv(