Data Frames vs DataFrames
The data frame (or DataFrame) is an essential tool for anyone using R or Python to analyse data. Data frames are rectangular, or tabular, data structures that hold rows and columns of data, similar to tables in a database or an Excel spreadsheet.
Data frames allow users to store related data together in a familiar format that simplifies manipulation of the data as a whole. In R, many functions assume or require that data are available in a data frame, whilst in Python, pandas DataFrames provide numerous methods for applying common operations to the data held within them.
Products that integrate Python and R scripts or send data between software products, for example SAS, Power BI and Azure Machine Learning, also often rely on data frames to store or transfer data.
Data frames in R
The simplest data structure available in R is the vector. Each element in a vector stores a single value, and all values must have the same data type: for example, all character or all integer.
In R, data frames are essentially a collection of vectors, where each vector represents a column of data. Each column is limited to storing one type of data only but different columns may store data of different types. In order to maintain the rectangular shape of the data structure, each vector must have the same number of values.
Referring to columns in data frames by name is simplified by the use of the dollar-sign notation, e.g. dataframe$column_name.
Knowledge of data frames is essential for anyone importing data into R. Importing data via the command line or through the GUI in RStudio creates data frames. If tidyverse import functions are used, the data frames will be a special kind of modern, simplified data frame known as a tibble.
DataFrames in Python
DataFrames are available in Python through the pandas library. pandas DataFrames are conceptually very similar to data frames in R. They share a similar constraint: each column has a single data type, but data types can differ between columns.
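As a minimal sketch of this structure, assuming pandas is installed, a DataFrame can be built from a dictionary of equal-length columns. The column names and values here are invented purely for illustration.

```python
import pandas as pd

# Hypothetical example data: each key is a column name and each list
# holds that column's values. All lists must have the same length.
df = pd.DataFrame({
    "product": ["A", "B", "C"],        # text column
    "units": [10, 20, 30],             # integer column
    "in_stock": [True, False, True],   # boolean column
})

# Each column has one data type; types may differ between columns.
print(df.dtypes)

# Select a single column by name, comparable to dataframe$column_name in R.
units = df["units"]
print(units)
```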
Just as in R, data subsets can be specified by column or row names or positions within the DataFrame. Within Python, pandas greatly enhances the ease and efficiency with which data can be manipulated. This is largely because pandas supports vectorisation, which the built-in Python list does not. Vectorisation (which is built into many R functions and operations) applies an operation or function to every element of a column in a single expression, with no explicit loop. This makes it very simple to create a new column in a DataFrame calculated from other column values or, for example, to divide each value in a column by 1000.
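The following sketch illustrates subsetting and vectorised arithmetic in pandas. The data, row labels and column names are made up for the example; `.loc` selects by label and `.iloc` by integer position.

```python
import pandas as pd

# Hypothetical measurements with row labels "a", "b" and "c".
df = pd.DataFrame(
    {"length_mm": [1500, 2500, 4000], "width_mm": [1000, 2000, 500]},
    index=["a", "b", "c"],
)

# Subset by row and column names (.loc) or by positions (.iloc).
by_label = df.loc["b", "length_mm"]
by_position = df.iloc[1, 0]

# Vectorised arithmetic: divide every value in a column by 1000
# without writing a loop.
df["length_m"] = df["length_mm"] / 1000

# Create a new column calculated from other column values.
df["area_mm2"] = df["length_mm"] * df["width_mm"]

print(df)
```

With a plain Python list, the same division would need a loop or a list comprehension over the elements.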
pandas provides many import functions that read data into DataFrames as well as methods to export data from DataFrames to a range of file types and formats.
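As a small, self-contained sketch, the example below reads CSV text into a DataFrame with `pd.read_csv` and writes it back out with `DataFrame.to_csv`. An in-memory string stands in for a file so the example runs anywhere; in practice a file path would normally be passed instead. The data are invented.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO makes the string behave like a file.
csv_text = "name,score\nAnn,82\nBob,75\n"

# Import: read the CSV data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Export: with no path given, to_csv returns the CSV as a string.
# index=False leaves the row index out of the output.
csv_out = df.to_csv(index=False)

print(df)
print(csv_out)
```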
pandas DataFrames are also equipped with numerous methods for carrying out common transformation, reshaping, aggregation, merging, joining and plotting operations on the data they hold.
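Two of these operations, aggregation and merging, might look like the sketch below. The tables, column names and values are hypothetical and exist only to show the method calls.

```python
import pandas as pd

# Hypothetical sales records and a lookup table of regional managers.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 20, 30, 40],
})
managers = pd.DataFrame({
    "region": ["North", "South"],
    "manager": ["Asha", "Ben"],
})

# Aggregation: total units per region, keeping "region" as a column.
totals = sales.groupby("region", as_index=False)["units"].sum()

# Merging: attach each region's manager, similar to a SQL left join.
report = totals.merge(managers, on="region", how="left")

print(report)
```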