Pandas is built around two data structures, Series and DataFrames. Data for these collections can be imported from various file formats and data sources, such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.
=== Series ===
A Series is a one-dimensional array-like object that stores a sequence of values together with an associated set of labels, called an index. It is built on top of NumPy's array and offers much of the same functionality, but instead of using implicit integer positions, a Series allows explicit index labels of many data types. A Series can be created from Python lists, dictionaries, or NumPy arrays. If no index is provided, pandas automatically assigns a default integer index ranging from 0 to n-1, where n is the number of items in the Series. Customized labels can be supplied through the index argument of the constructor.
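A minimal sketch of Series construction with customized labels, with the default integer index, and from a dictionary. The values and labels here are illustrative assumptions, not taken from the source:

```python
import pandas as pd

# Values and labels are illustrative assumptions.
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Without an explicit index, pandas assigns the default integer index 0..n-1.
default_ser = pd.Series([10, 20, 30])

print(ser["b"])                  # label-based lookup -> 20
print(list(default_ser.index))   # [0, 1, 2]

# When a Series is built from a dictionary, the keys become the index labels.
from_dict = pd.Series({"x": 1.5, "y": 2.5})
print(list(from_dict.index))     # ['x', 'y']
```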
=== DataFrame ===
A DataFrame is a two-dimensional, tabular data structure with labeled rows and columns. Each column is stored internally as a Series and may hold a different data type (numeric, string, boolean, etc.). DataFrames can be created by a variety of means, including dictionaries of lists, NumPy arrays, and external files such as CSV or Excel spreadsheets:

 import pandas as pd
 df1 = pd.Series(['A', 'B', 'C']).to_frame()  # from an existing Series
 df2 = pd.DataFrame({"grade": ["A", "B", "C"], "score": [100, 80, 60]})  # from a dictionary of lists
 df3 = pd.read_csv('path/classgrades.csv')  # from a CSV file

To retrieve a DataFrame column as a Series, use either 1) index notation with the column name, as in df2["grade"] (dict-like notation), or 2) the name of the column as an attribute, as in df2.grade, if the name is a valid Python identifier (attribute-like access). DataFrames support operations such as column assignment, row and column deletion, label-based indexing with loc, position-based indexing with iloc, reshaping, grouping, and joining.
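The column access and indexing operations described above can be sketched as follows. The data and the row labels "s1" to "s3" are assumptions chosen for illustration:

```python
import pandas as pd

# Illustrative data; the row labels "s1".."s3" are assumptions for this sketch.
df = pd.DataFrame(
    {"grade": ["A", "B", "C"], "score": [100, 80, 60]},
    index=["s1", "s2", "s3"],
)

by_key = df["score"]     # dict-like notation, returns a Series
by_attr = df.score       # attribute-like access to the same column

# Column assignment: derive a new column from an existing one.
df["passed"] = df["score"] >= 70

row_by_label = df.loc["s2"]      # label-based indexing
row_by_position = df.iloc[1]     # position-based indexing (second row)

# drop returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(columns=["passed"])

print(row_by_label["score"])     # 80
print(list(trimmed.columns))     # ['grade', 'score']
```

Attribute-like access is convenient for reading, but dict-like notation is the safer choice when a column name could collide with an existing DataFrame attribute or method.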
Merge operations implement a subset of relational algebra and allow one-to-one, many-to-one, and many-to-many joins.
Some common attributes of a DataFrame include dtypes (the data type of each column), shape (the dimensions of the DataFrame, returned as a tuple of the form (number of rows, number of columns)), index and columns (the labels of the DataFrame's rows and columns, respectively, each returned as an Index object), values (the data in the DataFrame, returned as a 2D array), and empty (True if the DataFrame is empty).
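As a rough illustration of a many-to-one merge and of the attributes listed above, consider the sketch below. The tables, column names, and values are hypothetical:

```python
import pandas as pd

# Hypothetical tables: several students can share one class (many-to-one).
students = pd.DataFrame(
    {"student": ["Ann", "Bob", "Cy"], "class_id": [1, 1, 2]}
)
classes = pd.DataFrame(
    {"class_id": [1, 2], "teacher": ["Lee", "Kim"]}
)

# Join on the shared key column; validate raises an error if the
# relationship between the two tables is not many-to-one.
merged = students.merge(
    classes, on="class_id", how="inner", validate="many_to_one"
)

print(merged.shape)          # (3, 3): (number of rows, number of columns)
print(list(merged.columns))  # ['student', 'class_id', 'teacher']
print(merged.dtypes)         # data type of each column
print(merged.empty)          # False
```

The validate argument is optional; it is included here only to make the intended join cardinality explicit.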
=== Index ===
Index objects hold metadata for Series and DataFrame objects, such as axis labels and names, and are automatically created from input data. By default, a pandas index is a series of integers ascending from 0, similar to the indices of Python arrays. However, indices can also use any NumPy data type, including floating point, timestamps, or strings. Indices are also immutable, which allows them to be safely shared across multiple objects.

pandas' syntax for mapping index values to relevant data is the same syntax Python uses to map dictionary keys to values. For example, if s is a Series, s['a'] will return the data point at index a. Unlike dictionary keys, index values are not guaranteed to be unique. If a Series uses the index value a for multiple data points, then s['a'] will instead return a new Series containing all matching values.

A DataFrame's column names are stored and implemented identically to an index. As such, a DataFrame can be thought of as having two indices: one column-based and one row-based. Because column names are stored as an index, these are not required to be unique.

If data is a Series, then data['a'] returns all values with the index value of a. However, if data is a DataFrame, then data['a'] returns all values in the column(s) named a. To avoid this ambiguity, pandas supports the syntax data.loc['a'] as an alternative way to filter using the index. pandas also supports the syntax data.iloc[n], which always takes an integer n and returns the nth value, counting from 0. This allows a user to act as though the index is an array-like sequence of integers, regardless of how it is actually defined.

pandas also supports hierarchical indices with multiple values per data point through the "MultiIndex" class. MultiIndex objects allow a single DataFrame to represent multiple dimensions, similar to a pivot table in Microsoft Excel, where each level can optionally carry its own unique name. In practice, data with more than 2 dimensions is often represented using DataFrames with hierarchical indices, instead of the higher-dimension Panel and Panel4D data structures.

== Functionality ==