Session 5: Time Series Visualization

Handling time series data

Published

May 7, 2024

`datetime` object

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

df = pd.read_csv('https://raw.githubusercontent.com/farhanage/dataset-for-study/main/Electric_Production.csv', names=['date', 'electric_production'], header=0)

df.head(15)

	date	electric_production
0	1/1/1985	72.5052
1	2/1/1985	70.6720
2	3/1/1985	62.4502
3	4/1/1985	57.4714
4	5/1/1985	55.3151
5	6/1/1985	58.0904
6	7/1/1985	62.6202
7	8/1/1985	63.2485
8	9/1/1985	60.5846
9	10/1/1985	56.3154
10	11/1/1985	58.0005
11	12/1/1985	68.7145
12	1/1/1986	73.3057
13	2/1/1986	67.9869
14	3/1/1986	62.2221

# Check the data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date                 397 non-null    object 
 1   electric_production  397 non-null    float64
dtypes: float64(1), object(1)
memory usage: 6.3+ KB

Note that the date column has dtype object (string). In this session we introduce a new data type, datetime, which is used to manipulate time series data.

Converting `object` to `datetime`

The following code cell converts a column of dtype object to datetime.

# Convert `object` -> `datetime`
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

Date Format

The argument format='%m/%d/%Y' tells pandas how the dates are written in the column we want to convert. %d stands for the day of the month, %m for the month number, %y for a 2-digit year, and %Y for a 4-digit year.

Example: '2024-05-26' -> '%Y-%m-%d'
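As a minimal sketch of how the format string lines up with the text being parsed, the snippet below converts a few made-up date strings (they are illustrative and not taken from the dataset). It also shows errors='coerce', an optional pd.to_datetime argument that turns values not matching the format into NaT instead of raising an error.

```python
import pandas as pd

# Made-up sample strings for illustration only
us_style = pd.Series(['1/1/1985', '12/1/1985'])
iso_style = pd.Series(['2024-05-26', '2024-06-01'])
short_year = pd.Series(['26/05/24'])

parsed_us = pd.to_datetime(us_style, format='%m/%d/%Y')        # month/day/4-digit year
parsed_iso = pd.to_datetime(iso_style, format='%Y-%m-%d')      # 4-digit year-month-day
parsed_short = pd.to_datetime(short_year, format='%d/%m/%y')   # day/month/2-digit year

# Values that do not match the format become NaT when errors='coerce' is set
messy = pd.Series(['1/1/1985', 'not a date'])
parsed_messy = pd.to_datetime(messy, format='%m/%d/%Y', errors='coerce')

print(parsed_us.tolist())
print(parsed_iso.tolist())
print(parsed_short.tolist())
print(parsed_messy.isna().tolist())
```

Checking isna() after a coerced conversion is a quick way to find rows whose date text does not follow the expected format.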

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 397 non-null    datetime64[ns]
 1   electric_production  397 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 6.3 KB

The date column now has dtype datetime64[ns].

df.head()

	date	electric_production
0	1985-01-01	72.5052
1	1985-02-01	70.6720
2	1985-03-01	62.4502
3	1985-04-01	57.4714
4	1985-05-01	55.3151

After the conversion, the dates in the date column are also displayed differently: pandas now shows them in year-month-day order.

What is the difference between `string` and `datetime`?

A datetime column exposes the year, month, day, hour, minute, and even second of each observation as attributes, which we can access through the pandas .dt accessor.

# Get the year of each observation
df['date'].dt.year

0      1985
1      1985
2      1985
3      1985
4      1985
       ... 
392    2017
393    2017
394    2017
395    2017
396    2018
Name: date, Length: 397, dtype: int32

# Get the month of each observation
df['date'].dt.month

0       1
1       2
2       3
3       4
4       5
       ..
392     9
393    10
394    11
395    12
396     1
Name: date, Length: 397, dtype: int32

# Get the day of the month of each observation
df['date'].dt.day

0      1
1      1
2      1
3      1
4      1
      ..
392    1
393    1
394    1
395    1
396    1
Name: date, Length: 397, dtype: int32
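Year, month, and day are not the only attributes available. The sketch below, on three made-up dates rather than the dataset, shows a few other .dt attributes and methods that pandas provides (quarter, dayofyear, day_name, month_name).

```python
import pandas as pd

# Three made-up dates for illustration
dates = pd.Series(pd.to_datetime(['1985-01-01', '1985-08-01', '2018-01-01']))

quarters = dates.dt.quarter          # quarter of the year, 1 to 4
day_of_year = dates.dt.dayofyear     # position of the day within its year
weekdays = dates.dt.day_name()       # weekday name (English by default)
month_names = dates.dt.month_name()  # month name (English by default)

summary = pd.DataFrame({
    'date': dates,
    'quarter': quarters,
    'day_of_year': day_of_year,
    'weekday': weekdays,
    'month_name': month_names,
})
print(summary)
```

Any of these derived columns can serve as a grouping key in the same way as the year or the month.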

datetime values can be treated as numeric values in data visualization, so they can be placed directly on a plot axis.

sns.lineplot(x='date', y='electric_production', data=df)

plt.title('Monthly electric production (1985-2018)')

plt.show()
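The plot works because datetime values are ordered chronologically and support arithmetic, which plain strings do not. A small sketch on made-up date strings makes the contrast concrete: the strings sort character by character, while the parsed dates sort by time and can be subtracted.

```python
import pandas as pd

# Made-up date strings in the same month/day/year style as the dataset
raw = pd.Series(['10/1/1985', '2/1/1985', '1/1/1986'])
as_dates = pd.to_datetime(raw, format='%m/%d/%Y')

# Strings are compared character by character, so the order is not chronological
sorted_strings = raw.sort_values().tolist()

# datetime values are compared as points in time
sorted_dates = as_dates.sort_values().dt.strftime('%Y-%m-%d').tolist()

# Subtracting two datetime values gives a Timedelta
span_days = (as_dates.max() - as_dates.min()).days

print(sorted_strings)
print(sorted_dates)
print(span_days)
```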

Grouping

How do we make use of these datetime attributes? Use the grouping method (groupby) to get new insights from the data per unit of time.

Total electric production per year

df['electric_production'].groupby(df['date'].dt.year).sum() # sum of electric_production, grouped by year

date
1985     745.9880
1986     752.5187
1987     788.8833
1988     836.5963
1989     862.7420
1990     879.7612
1991     901.3422
1992     901.4509
1993     932.1479
1994     951.0607
1995     984.7304
1996    1012.5890
1997    1010.8343
1998    1038.5289
1999    1070.2907
2000    1101.4815
2001    1097.5295
2002    1129.2849
2003    1148.4013
2004    1165.4307
2005    1190.4989
2006    1186.4787
2007    1223.8100
2008    1219.9679
2009    1185.9574
2010    1231.1626
2011    1227.4857
2012    1199.9999
2013    1228.0482
2014    1244.7934
2015    1236.6349
2016    1233.1725
2017    1223.3381
2018     129.4048
Name: electric_production, dtype: float64

Average electric production for each calendar month over the 33 years of data

df['electric_production'].groupby(df['date'].dt.month).mean() # mean of electric_production, grouped by month

date
1     102.100241
2      94.095521
3      87.902691
4      79.134518
5      79.414367
6      87.415906
7      94.944915
8      95.169848
9      86.484391
10     80.209121
11     83.135879
12     95.757606
Name: electric_production, dtype: float64
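The same grouping idea extends to several statistics at once. The sketch below uses a made-up two-year monthly series (not the dataset) to compute sum, mean, and count per year with agg, and shows resample as an alternative that groups by calendar period when the dates are the index. The 'YS' (year start) and 'MS' (month start) frequency aliases are standard pandas aliases.

```python
import pandas as pd

# Made-up monthly series: values 1..24 over two years
dates = pd.date_range('2000-01-01', periods=24, freq='MS')
toy = pd.DataFrame({'date': dates, 'value': range(1, 25)})

# Several aggregates per year in a single groupby call
by_year = toy.groupby(toy['date'].dt.year)['value'].agg(['sum', 'mean', 'count'])

# Alternative: put the dates in the index and resample by year
yearly = toy.set_index('date')['value'].resample('YS').sum()

print(by_year)
print(yearly)
```

Including count next to sum is useful because it reveals periods with fewer observations than expected.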

Plots

sum_per_year = df['electric_production'].groupby(df['date'].dt.year).sum() # sum of electric_production, grouped by year

sns.lineplot(sum_per_year)

plt.title('Total electric production per year (1985-2018)')

plt.show()

Why does 2018 drop so sharply? Check all observations from 2018.

# Select rows where year == 2018
df[df['date'].dt.year == 2018]

	date	electric_production
396	2018-01-01	129.4048

It turns out that 2018 has only 1 observation (January), so its total cannot be compared with the totals of earlier years.
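Instead of inspecting one suspicious year by hand, incomplete years can be found by counting observations per year. This is a sketch on made-up monthly data, assuming that a complete year has 12 observations; the variable names are illustrative.

```python
import pandas as pd

# Made-up monthly data: 2016 and 2017 are complete, 2018 has only January
dates = pd.date_range('2016-01-01', '2018-01-01', freq='MS')
toy = pd.DataFrame({'date': dates, 'value': 100.0})

# Count observations per year
obs_per_year = toy.groupby(toy['date'].dt.year)['value'].count()

# Assumption: monthly data, so a complete year has 12 rows
incomplete_years = obs_per_year[obs_per_year < 12].index.tolist()

# Keep only rows that belong to complete years
complete = toy[~toy['date'].dt.year.isin(incomplete_years)]

print(obs_per_year)
print(incomplete_years)
print(len(complete))
```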

# Drop the 2018 data
df.drop(df[df['date'].dt.year == 2018].index, axis=0, inplace=True)

df[df['date'].dt.year == 2018]

	date	electric_production

The 2018 row has been removed, so the yearly totals can now be compared fairly in the visualization.

sum_per_year = df['electric_production'].groupby(df['date'].dt.year).sum() # sum of electric_production, grouped by year

sns.lineplot(sum_per_year)

plt.xlabel('year')

plt.ylabel('electric production')

plt.title('Total electric production per year (1985-2017)')

plt.show()

How would you interpret this plot?

avg_per_month = df['electric_production'].groupby(df['date'].dt.month).mean() # mean of electric_production, grouped by month

sns.barplot(x=avg_per_month.index, y=avg_per_month.values) # one bar per month

plt.xlabel('month')

plt.ylabel('average electric production')

plt.title('Average electric production per month (1985-2017)')

plt.show()

How would you interpret this plot? Can you relate it to the seasons?
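One way to start on the seasonal question is to map each month to a season and average again. The sketch below is only a rough starting point: it assumes Northern Hemisphere meteorological seasons (the source does not say where the data was recorded), and it reuses the monthly averages printed earlier in this section, rounded to two decimals, which still include the January 2018 observation.

```python
import pandas as pd

# Assumption: Northern Hemisphere meteorological seasons
season_of_month = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
}

# Monthly averages copied from the grouped output earlier in this section (rounded)
avg_per_month = pd.Series({
    1: 102.10, 2: 94.10, 3: 87.90, 4: 79.13, 5: 79.41, 6: 87.42,
    7: 94.94, 8: 95.17, 9: 86.48, 10: 80.21, 11: 83.14, 12: 95.76,
})

# Map each month number to its season, then average the monthly means per season
seasons = avg_per_month.index.map(season_of_month)
avg_per_season = avg_per_month.groupby(seasons).mean()

print(avg_per_season.sort_values(ascending=False))
```

Under this assumed mapping the highest seasonal average falls in winter and the lowest in spring, which may help frame an interpretation in terms of heating and cooling demand.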

Case Study

Download the dataset here: Energy Time Series

Metadata :

energy_reading.csv

building_id - Foreign key for the building metadata.
meter - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
timestamp - When the measurement was taken
meter_reading - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error. UPDATE: as discussed here, the site 0 electric meter readings are in kBTU.

building_metadata.csv

site_id - Foreign key for the weather files.
building_id - Foreign key for energy_reading.csv
primary_use - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
square_feet - Gross floor area of the building
year_built - Year building was opened
floor_count - Number of floors of the building

weather.csv

Weather data from a meteorological station as close as possible to the site.

site_id
air_temperature - Degrees Celsius
cloud_coverage - Portion of the sky covered in clouds, in oktas
dew_temperature - Degrees Celsius
precip_depth_1_hr - Millimeters
sea_level_pressure - Millibar/hectopascals
wind_direction - Compass direction (0-360)
wind_speed - Meters per second

Challenge: Extract as much information as you can from this data.
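A possible starting point, sketched on tiny made-up tables that follow the column names in the metadata above: parse the timestamp, join the readings to the building metadata on building_id, decode the meter codes, and group. With the real files you would load each table with pd.read_csv first; the timestamp text format used here is an assumption.

```python
import pandas as pd

# Made-up rows that mimic energy_reading.csv (values are illustrative)
readings = pd.DataFrame({
    'building_id': [0, 0, 1, 1],
    'meter': [0, 0, 0, 1],
    'timestamp': ['2016-01-01 00:00:00', '2016-01-01 01:00:00',
                  '2016-01-01 00:00:00', '2016-01-02 00:00:00'],
    'meter_reading': [10.0, 12.0, 5.0, 7.0],
})

# Made-up rows that mimic building_metadata.csv
buildings = pd.DataFrame({
    'site_id': [0, 1],
    'building_id': [0, 1],
    'primary_use': ['Education', 'Office'],
    'square_feet': [1000, 2000],
})

# Convert the timestamp text to datetime (the text format is an assumption)
readings['timestamp'] = pd.to_datetime(readings['timestamp'])

# Attach building attributes to every reading
merged = readings.merge(buildings, on='building_id', how='left')

# Decode the meter codes listed in the metadata
meter_names = {0: 'electricity', 1: 'chilledwater', 2: 'steam', 3: 'hotwater'}
merged['meter_name'] = merged['meter'].map(meter_names)

# Two example questions: total reading per building type, mean reading per hour of day
total_by_use = merged.groupby('primary_use')['meter_reading'].sum()
mean_by_hour = merged.groupby(merged['timestamp'].dt.hour)['meter_reading'].mean()

print(total_by_use)
print(mean_by_hour)
```

Because readings from different meter types are not directly comparable, it is usually safer to filter or group by meter_name before aggregating.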
