Modifying files to match a template
I had three files of data, containing data retrieved from a database. The first file had a layout I was happy with. It contained 61 columns of data. The other two files contained an extra 100+ columns of unnecessary data and had a different column order different to the first file.
Working in Python, I imported the three files into pandas DataFrames, extracted the column names that I wanted from the first DataFrame and used the filter function to select only those columns from the other two DataFrames.
I then exported each consistently-formatted dataset into its own worksheet in a single Excel file.
Combining data from multiple files
A similar problem I've encountered is combining multiple files containing similar data into a single file. Sometimes column orders are different in each file, column names do not necessarily match and some files might contain extra columns of data.
Tasks like this can also be completed with just a few lines of Python code.
Import the datasets into pandas DataFrames and combine the data:
The datasets look like this:
The fill.na() method can be used to merge data from columns that were named differently in the different files. Duplicate columns can then be deleted.
With data coming from multiple files, I found that some data entries (rows) had been duplicated. Due to data entry errors and omissions, the duplicates did not match exactly. I was able to easily extract duplicated rows (based on values in some of the columns) to examine them so I could decide which ones to keep and which ones to delete.
Reusing scripts and automation
The code above can now be reused whenever needed. Working in Python, we also have the ability to adapt the code to automate these processes for multiple and variable numbers of files.
Turning the code into a function makes it easy to reuse and modify for different files. In the example below, the function modify_files modifies all Excel files in a folder, to match an existing template. The user can specify the folder, the template file and the name of the Excel file that should be created to store the cleaned datasets.
To learn more about how you can use Python to improve the way you and your staff work, take a look at our Python training courses.