Using Python To Transform Data From Multiple CSV Files
Have you tried to perform an exploratory analysis, but your data is fragmented across multiple CSV files and you're struggling to join them?
Hi, I’m Da Data Guy! Before I begin, if you are interested in Power BI, Python, Data visualization and SQL, please give me a follow on Medium and on my LinkedIn. I focus on writing quality articles breaking down each step in the process so you can follow along and learn too. At the bottom of this article will be my Github link and the embedded Python notebooks for you to use.
To follow along, I’m using data from Maven Analytics called “Pizza Place Sales”. This is part two of three where I provide a step-by-step tutorial of my Exploratory Data Analysis (EDA) using SQL, Python and Power BI. For this article, I’ll be using Python to create a new data frame that has the data correctly transformed and then exported to a CSV file.
To start, I import several libraries into my Jupyter Notebook, set the Pandas library to display the maximum number of columns, and change the percent format into the common (0.00%) format. I prefer to see all the columns rather than have Pandas condense it.
At the same time, I start the timer to time how long it takes to load my four CSV files. I always use this method because sometimes the CSV files are very large and it’s important to know how long it takes to import.
# importing libraries
import pandas as pd
import numpy as np
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
#pd.set_option('display.max_rows', None)
I then review the first two rows of each data frame.
od.head(2)
o.head(2)
pt.head(2)
p.head(2)
Normally, I would also check for any null values, but I cheated and looked at them in Excel prior. In Python, you would write:
df.isnull().sum()
Splitting Strings Into Its Own Column
Splitting strings can be tricky and resource intensive but luckily this is a small file and it’s ordered in such a way that we can use the comma as a delimiter.
The code block below describes how this process is done;
- Splits the text in the ingredients column by each comma.
- Next, it takes each split text and places it into a list and I added a prefix called “items_”. This prefix helps me organize each of the new columns rather than calling them 0–7.
- After this is performed, I place it into a new data frame called pt_model.
- The final step is to concatenate pt and pt_model to create a “new” pt data frame.
# Using the built-in string split function.
pt['ingredients']=pt['ingredients'].str.split(pat = ',')pt_model = pd.DataFrame(pt['ingredients'].tolist()).fillna('None').add_prefix('items_')pt = pd.concat([pt, pt_model], axis=1)pt.head(2)
Next, I will remove the word “the” at the beginning of each name of the pizza. There is a preference, not a requirement.
pt['name']=pt['name'].str[3:]# double checking my work
pt[pt['pizza_type_id'] == 'hawaiian']
Merging Multiple Data Frames Together Using a Primary Key
An important step when trying to merge multiple data frames together is finding a primary key or creating a primary key. To learn about a primary key, click this link to learn more about it.
Essentially, a primary key is a distinct/unique value for at least one of the data frames. They can have different names, but the primary key will always be the same. In the code block below, I use “order_id”, “pizza_id” and “pizza_type_id” as my primary keys to join the multiple data frames on. I also use a left join because I want to use the original data that also includes the new data but only on new columns. Merging can be confusing and there are several articles that describe this process. I think this article (link) does a good job at explaining it.
# Joining the first data frame
df=od.merge(o, how='left', on='order_id')# Joining the second data frame
df=df.merge(p, how='left', on='pizza_id')# Joining the third data frame
df=df.merge(pt, how='left', on='pizza_type_id')# Reviewing the new data frame called df
df.head(2)
Export to CSV
# I labeled it CleanedDF, but you can call it whatever you'd likedf.to_csv('CleanedDF.csv', index = False)
Summary
This completes my process for merging multiple CSV files into one. I hope you were able to follow along and learn a thing or two. The link to my Github is here and has all the files needed to follow along. I also posted my notebook below.
Don’t forget to click follow and stay tuned for my next article that shows how to create python visualizations using Seaborn!