DEVSTORY #15: RDBMS data cleansing: Using pandas/modin instead of SQL

FTR Analysts source a significant amount of nodal data from the ISOs across the US.

A large FTR market participant used automated scrapes to assemble data collected at 15-minute, hourly, and daily intervals across the following ISOs: NEISO, NYISO, PJM, ERCOT, MISO, and CAISO.

The extraction process was not very efficient. Zip files were downloaded from the various ISOs and opened, and SQL stored procedures would then act on the underlying data to

  • expand the compressed data,
  • extract the information,
  • assemble the information into a dataset,
  • transform the extracted data,
  • and store the data in a wide format

in one or more Oracle database tables.

The Problem

FTR Analysts would then extract this tabular data into spreadsheets, where opportunities to profit from congestion would be identified.

During the SQL ingestion process, the inefficient ETL procedures produced multiple rows for each node and date combination, resulting in bad data. The FTR analysts then had to clean the data manually in their spreadsheets to prepare it for analysis. Errors were a daily occurrence and had a direct impact on the analysts' effectiveness in identifying trading opportunities.

The tabular data was stored in a wide format with the following layout:

<Create date> <Node identifier> <Observation Date> <1><2><3><4><5><6><7><8><9><10><11><12><13><14><15><16><17><18><19><20><21><22><23><24><25>

where each node and date combination was expected to be unique. Each numeral corresponds to an hour ending of the day; 25 hours are needed to cater for DST wherever it is observed.

The problem with the dataset was that there would be several rows with the same node identifier + date combination, which broke the FTR analysis model. These errors persisted in the database for over two years. Repeated efforts to clean up the data with SQL led to more issues being introduced, due to the complexity of the SQL procedures required for data cleansing.

A typical scenario would be three rows for one node id + date combination when only one row was required:

<Create date> <Node identifier> <Observation Date> <1><2><3><4><5><6><7><8><9><10><11><12><13><14><15><16><17><18><19><20><21><22><23><24><25>
12/2/2020, QWD12331, 12/1/2020, 1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,
12/2/2020, QWD12331, 12/1/2020, ,,,,,,,,9,10,11,12,13,14,15,16,17,18,19,20,,,,,
12/2/2020, QWD12331, 12/1/2020, ,,,,,,,,,,,,,,,,,,,,21,22,23,24,25

The model expected one row for each node id + date combination. The expected cleansed row was as follows:

<Create date> <Node identifier> <Observation Date> <1><2><3><4><5><6><7><8><9><10><11><12><13><14><15><16><17><18><19><20><21><22><23><24><25>
12/2/2020, QWD12331, 12/1/2020, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25

Every analyst spent an inordinate amount of time cleaning this data daily, because the errors recurred as long as the source data was bad.

ENTER PANDAS!

pandas (a Python data analysis library) can read large amounts of tabular data into a DataFrame.

The Solution

To clean this dataset, the data was first read into a DataFrame as is, still in wide format.
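A minimal sketch of this first step, using an in-memory SQLite table as a stand-in for the Oracle source (the table and column names here are hypothetical, not the participant's actual schema):

```python
import sqlite3

import pandas as pd

# Stand-in for the Oracle wide table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodal_wide (create_date TEXT, node_id TEXT, obs_date TEXT, h1 REAL)"
)
conn.execute("INSERT INTO nodal_wide VALUES ('12/2/2020', 'QWD12331', '12/1/2020', 1.0)")

# Read the wide table as-is into a DataFrame.
wide = pd.read_sql("SELECT * FROM nodal_wide", conn)
print(wide.shape)  # (1, 4)
```

In production, `pd.read_sql` would point at the Oracle connection instead, with the query filtered by observation date range to process the table in chunks.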

The DataFrame was then converted from wide to long format, keyed on the create date, node identifier, observation date, and hour ending (previously a column name), while still allowing duplicates.
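The wide-to-long conversion can be sketched with `DataFrame.melt`; the column names below are assumptions standing in for the real schema:

```python
import pandas as pd

# Hour-ending columns 1..25 (25 covers the long DST day).
hour_cols = [str(h) for h in range(1, 26)]

# One wide row: create date, node id, observation date, then 25 hourly values.
wide = pd.DataFrame(
    [["12/2/2020", "QWD12331", "12/1/2020"] + list(range(1, 26))],
    columns=["create_date", "node_id", "obs_date"] + hour_cols,
)

# melt() turns the 25 hour columns into a single "hour_ending" key column,
# producing one row per (create_date, node_id, obs_date, hour_ending).
long_df = wide.melt(
    id_vars=["create_date", "node_id", "obs_date"],
    value_vars=hour_cols,
    var_name="hour_ending",
    value_name="price",
)
print(long_df.shape)  # (25, 5)
```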

The long format DataFrame was then sorted on the date fields <Create date>, <Observation Date>, and <hour ending>, with the data processed in chunks filtered by observation date ranges.

This long format DataFrame contained duplicates, of which only the most recent observation was relevant. The frame was therefore grouped by <node identifier>, <observation date>, and <hour ending>, and the record with the maximum create date in each group was extracted into a separate long DataFrame, removing the duplicates.
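The de-duplication step can be sketched as follows, assuming the column names used above: two observations exist for the same (node, date, hour) key, and only the row with the latest create date should survive.

```python
import pandas as pd

# Two observations for the same (node_id, obs_date, hour_ending) key.
long_df = pd.DataFrame(
    {
        "create_date": pd.to_datetime(["2020-12-02", "2020-12-03"]),
        "node_id": ["QWD12331", "QWD12331"],
        "obs_date": ["12/1/2020", "12/1/2020"],
        "hour_ending": ["1", "1"],
        "price": [10.0, 11.0],
    }
)

# idxmax() returns, for each group, the index of the row with the
# latest create_date; .loc then keeps only those rows.
idx = long_df.groupby(["node_id", "obs_date", "hour_ending"])["create_date"].idxmax()
deduped = long_df.loc[idx]
print(deduped["price"].tolist())  # [11.0]
```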

This long format DataFrame was then converted back to wide format and stored in a temp table. Once this was completed for all chunks, the source table was renamed and the temp table replaced the old table.
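Putting the steps together, a sketch of one full pass over the three fragmented example rows: wide to long, drop empty hour cells, keep the latest create date per key, then pivot back to a single wide row. Column names are assumptions, and the final temp-table swap on the database side is omitted.

```python
import numpy as np
import pandas as pd

hour_cols = [str(h) for h in range(1, 26)]

# Rebuild the three fragmented rows from the example: hours 1-8, 9-20, 21-25.
rows = []
for lo, hi in [(1, 8), (9, 20), (21, 25)]:
    vals = [float(h) if lo <= h <= hi else np.nan for h in range(1, 26)]
    rows.append(["12/2/2020", "QWD12331", "12/1/2020"] + vals)
wide = pd.DataFrame(rows, columns=["create_date", "node_id", "obs_date"] + hour_cols)

# Wide -> long, dropping the empty hour cells.
long_df = wide.melt(
    id_vars=["create_date", "node_id", "obs_date"],
    value_vars=hour_cols,
    var_name="hour_ending",
    value_name="price",
).dropna(subset=["price"])
long_df["create_date"] = pd.to_datetime(long_df["create_date"])

# Keep only the most recent create_date per (node, date, hour) key...
idx = long_df.groupby(["node_id", "obs_date", "hour_ending"])["create_date"].idxmax()

# ...and pivot back to one wide row per (node, date).
clean = (
    long_df.loc[idx]
    .pivot(index=["node_id", "obs_date"], columns="hour_ending", values="price")
    .reindex(columns=hour_cols)
)
print(clean.shape)  # (1, 25)
```

The three fragments collapse into a single row carrying all 25 hourly values, which is exactly the cleansed shape the analysis model expected.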

The Result

An enormous amount of analysis time was saved once the data was cleansed. An immediate increase in throughput was realized, with analysts spending more time on analysis and less on data cleansing. Since the window of time to submit bids is limited, the extra time afforded multiple network simulation and analysis passes to identify opportunities.

Clean data makes data scientists more efficient. Using simple data wrangling tools like pandas/Modin, Dask, and Vaex instead of traditional SQL procedures can result in cleaner data and better analytics results.