DataFrames are the central data structure in the pandas API. A DataFrame is like a spreadsheet, with numbered rows and named columns.
To make the example code easier to run, first import the required modules.
import pandas as pd
import numpy as np  # used by the array-based examples below
1. Create, access and modify.
Read a .csv file into a pandas DataFrame:
chicago_taxi_dataset = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv")
The basic argument of read_csv():
- filepath_or_buffer: various
Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), or any object with a read() method (such as an open file or StringIO).
The following code instantiates the pd.DataFrame class to generate a DataFrame.
1 # Create and populate a 5x2 NumPy array.
2 my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])
3
4 # Create a Python list that holds the names of the two columns.
5 my_column_names = ['temperature', 'activity']
6
7 # Create a DataFrame.
8 my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)
9
10 # Print the entire DataFrame
11 print(my_dataframe)
Inspect its .index attribute; the result is
RangeIndex(start=0, stop=5, step=1)
If the index argument is passed at construction time, say by modifying line 8 of the above code to
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names, index=[10, 20, 30, 40, 50])
the index attribute is
Index([10, 20, 30, 40, 50], dtype='int64')
len(my_dataframe.index) counts the number of rows of the DataFrame.
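For the 5x2 DataFrame created above:
print(len(my_dataframe.index))  # 5, the number of rows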
DataFrame.reset_index(level=None, *, drop=False, inplace=False, col_level=0, col_fill='', allow_duplicates=_NoDefault.no_default, names=None) method resets the index, where
- level: int, str, tuple, or list
Only remove the given levels from the index. Removes all levels by default.
- drop: bool; if False, try to insert the original index into the DataFrame columns.
- inplace: bool, whether to modify the DataFrame rather than creating a new one.
- names: int, str or 1-dimensional list
Using the given string, rename the DataFrame column which contains the original index data. If the DataFrame has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.
It returns a new DataFrame or None if inplace=True.
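As an illustration, here is a minimal sketch on my_dataframe; the column name old_index (passed through the names parameter) is only chosen for this example:
# Move the current index into an ordinary column and restore a RangeIndex.
reset_df = my_dataframe.reset_index(names='old_index')
print(reset_df.index)    # RangeIndex(start=0, stop=5, step=1)
print(reset_df.columns)  # 'old_index' now appears as a regular column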
DataFrame.set_index(keys, *, drop=True, append=False, inplace=False, verify_integrity=False) does the reverse, where
- keys: label or array-like or list of labels/arrays. This parameter can be either a single column key or one of the other types described in the pandas documentation.
- drop: bool, whether to delete the columns to be used as the new index.
- append: bool, whether to append the columns to the existing index or simply replace the original index. See the pandas documentation for more information.
- inplace: bool, the same as the one above.
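Conversely, a minimal sketch that promotes the temperature column of my_dataframe to be the index (with drop=True, the default, the column is removed from the data part):
indexed_df = my_dataframe.set_index('temperature')
print(indexed_df.index)    # the temperature values, now serving as row labels
print(indexed_df.columns)  # 'temperature' no longer appears among the columns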
You may add a new column to an existing pandas DataFrame just by assigning values to a new column name.
# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame
print(my_dataframe)
Pandas provides multiple ways to isolate specific rows, columns, slices or cells in a DataFrame.
print(my_dataframe.activity)  # Equal to the corresponding column selection above
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')  # The type of the result is DataFrame.
print("Row #2:")
print(my_dataframe.iloc[2], '\n')    # The type of the result is Series.
print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')       # Note the slice starts from the second row, not the first.

print("Column 'temperature':")
print(my_dataframe['temperature'])

training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]
training_df.head(200)
To get random samples of a DataFrame, use DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False) method, where
- n: int, optional
Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
- frac: float, optional
Fraction of axis items to return. Cannot be used with n.
- replace: bool, whether to allow sampling of the same row more than once. It must be set to True if frac > 1.
- random_state: If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
- axis: {0 or ‘index’, 1 or ‘columns’, None}, default is stat axis for given data type.
- ignore_index: bool, deciding if the resulting index will be labeled 0, 1, …, n-1.
It returns a new object of the same type as the caller, containing n items.
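For instance, a small sketch on training_df; random_state=42 is an arbitrary seed chosen only to make the draw reproducible:
# Draw 5 random rows reproducibly.
print(training_df.sample(n=5, random_state=42))

# Draw 1% of the rows instead of a fixed count.
print(training_df.sample(frac=0.01))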
Q: What's the difference between Series and DataFrame?
A: A Series is a single column of a DataFrame. (Selecting a single row, for example with .iloc[2] above, also returns a Series, which is probably why Google Gemini describes it as a row.)
How to index a particular cell of the DataFrame?
# Create a Python list that holds the names of the four columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a 3x4 numpy array, each cell populated with a random integer.
my_data = np.random.randint(low=0, high=101, size=(3, 4))

# Create a DataFrame.
df = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(df)

# Print the value in row #1 of the Eleanor column.
print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])  # Chained indexing
Q: How to convert a Series to ndarray?
A: The Series.values property returns the Series as an ndarray or ndarray-like, depending on the dtype. However, it is recommended to use Series.array or Series.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.
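A quick sketch of the three options on the activity column of my_dataframe:
activity_series = my_dataframe['activity']
print(activity_series.values)      # ndarray (or ndarray-like, depending on the dtype)
print(activity_series.array)       # the underlying pandas array (a reference to the data)
print(activity_series.to_numpy())  # always a plain NumPy ndarray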
The following code shows how to add a new column to an existing DataFrame through row-by-row calculation between or among columns:
# Create a column named Janet whose contents are the sum
# of two other columns.
df['Janet'] = df['Tahani'] + df['Jason']

# Print the enhanced DataFrame
print(df)
Pandas provides two different ways to duplicate a DataFrame:
- Referencing: the new name and the original still point to the same data, so a change to one shows up in the other.
- Copying: the two objects are fully independent.
# Create a reference by assigning df to a new variable.
print("Experiment with a reference:")
reference_to_df = df

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % df['Jason'][1])
print("  Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

# Modify a cell in df.
df.at[1, 'Jason'] = df['Jason'][1] + 5  # Why not use chained indexing for DataFrame assignment?
print("  Updated df: %d" % df['Jason'][1])
print("  Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])
Confusion Solutions: What's the difference among .at, .loc, .iloc, .iat and chained indexing?

There are quite a few differences among them. .iloc is primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array. Allowed inputs are:
- An integer.
- A list or array of integers.
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A tuple of row and column indexes. The tuple elements consist of one of the above inputs, e.g. (0, 1).

Note: Deprecated since version 2.2.0: Returning a tuple from a callable is deprecated for pandas.DataFrame.iloc, so it is better not to pass a callable function.
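A minimal read/write sketch on the df defined earlier (Eleanor/Chidi/Tahani/Jason); here row label 1 and row position 1 refer to the same row only because df uses the default RangeIndex:
print(df.loc[1, 'Eleanor'])   # label-based lookup: row label 1, column 'Eleanor'
print(df.iloc[1, 0])          # position-based lookup: second row, first column
print(df.at[1, 'Eleanor'])    # fast scalar access by label
print(df.iat[1, 0])           # fast scalar access by position
df.at[1, 'Eleanor'] += 1      # single-cell assignment; unlike chained indexing, this reliably writes to df itself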
To make a copy of a DataFrame, use the DataFrame.copy(deep=True) method. When deep=True (the default), a new object is created with a copy of the calling object’s data and indices; modifications to the data or indices of the copy will not be reflected in the original object. Things might be a little more complicated for deep=False after pandas 3.0 (see the pandas documentation for details).
The following line makes the copy; the rest of the experiment follows below.
copy_of_my_dataframe = my_dataframe.copy()
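The experiment itself might then continue like the reference experiment above (a sketch; positional access via .iloc is used so it also works if a custom index was set):
# Print the starting value of a particular cell (second row of 'activity').
print("  Starting value of my_dataframe: %d" % my_dataframe['activity'].iloc[1])
print("  Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['activity'].iloc[1])

# Modify the cell in the original; the deep copy keeps its old value.
my_dataframe.iloc[1, my_dataframe.columns.get_loc('activity')] += 3
print("  Updated my_dataframe: %d" % my_dataframe['activity'].iloc[1])
print("  Unchanged copy_of_my_dataframe: %d" % copy_of_my_dataframe['activity'].iloc[1])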
DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') drops specified labels from rows or columns, where
- labels: single label or list-like
Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
- axis: {0 or ‘index’, 1 or ‘columns’}, whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
- index: single label or list-like
Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
- columns: single label or list-like
Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
- level: int or level name, optional
For MultiIndex, the level from which the labels will be removed.
- inplace: bool, whether to do the operation in place and return None, or return a copy.
- errors: {‘ignore’, ‘raise’}. If ‘ignore’, suppress the error and only drop labels that exist.
It returns a DataFrame with the specified index or column labels removed, or None if inplace=True.
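As a quick sketch on training_df (the variable names no_company and trimmed are only chosen for this illustration):
# Drop one column by name; training_df itself is unchanged because inplace defaults to False.
no_company = training_df.drop(columns='COMPANY')
print(no_company.columns)

# Drop the first two rows by their index labels.
trimmed = training_df.drop(index=[0, 1])
print(len(trimmed.index))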
2. Data exploration.
To preview the first n rows of a large DataFrame, use DataFrame.head(n=5); it returns the same type as the caller.
Use the DataFrame.describe(percentiles=None, include=None, exclude=None) method to view descriptive statistics about the dataset, where
- percentiles: list-like of numbers, optional. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
- include: ‘all’, list-like of dtypes or None, optional. A white list of data types to include in the result. Ignored for Series.
  - ‘all’: all columns of the input will be included in the output.
training_df.describe(include='all') results in
| | TRIP_MILES | TRIP_SECONDS | FARE | COMPANY | PAYMENT_TYPE | TIP_RATE |
|---|---|---|---|---|---|---|
| count | 31694.000000 | 31694.000000 | 31694.000000 | 31694 | 31694 | 31694.000000 |
| unique | NaN | NaN | NaN | 31 | 7 | NaN |
| top | NaN | NaN | NaN | Flash Cab | Credit Card | NaN |
| freq | NaN | NaN | NaN | 7887 | 14142 | NaN |
| mean | 8.289463 | 1319.796397 | 23.905210 | NaN | NaN | 12.965785 |
| std | 7.265672 | 928.932873 | 16.970022 | NaN | NaN | 15.517765 |
| min | 0.500000 | 60.000000 | 3.250000 | NaN | NaN | 0.000000 |
| 25% | 1.720000 | 548.000000 | 9.000000 | NaN | NaN | 0.000000 |
| 50% | 5.920000 | 1081.000000 | 18.750000 | NaN | NaN | 12.200000 |
| 75% | 14.500000 | 1888.000000 | 38.750000 | NaN | NaN | 20.800000 |
| max | 68.120000 | 7140.000000 | 159.250000 | NaN | NaN | 648.600000 |
The DataFrame.nunique(axis=0, dropna=True) method counts the number of distinct elements along the specified axis, where
- axis: {0 or ‘index’, 1 or ‘columns’}
- dropna: bool, whether to exclude NaN in the counts.
It returns Series. Series.nunique(dropna=True) method returns an int.
num_unique_companies = training_df['COMPANY'].nunique()
DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True) counts unique rows, where
- subset: label or list of labels of columns to use when counting unique combinations, optional.
- normalize: bool, whether to return proportions rather than frequencies.
- sort: bool. Sort by frequencies when True, otherwise by DataFrame column values (original order).
- ascending: bool.
- dropna: bool, whether to exclude rows containing NaN.
It returns a Series. Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True) method returns a Series containing counts of unique values, where
- bins: int, optional. Rather than count values, group them into half-open bins, a convenience for pd.cut; only works with numeric data.
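For example, a small sketch on the taxi data, first counting payment types as proportions and then binning the numeric FARE column (pandas picks the bin edges here):
# Proportion of each payment type.
print(training_df['PAYMENT_TYPE'].value_counts(normalize=True))

# Group FARE into 5 half-open bins instead of counting each distinct fare.
print(training_df['FARE'].value_counts(bins=5))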
Series.idxmax(axis=0, skipna=True, *args, **kwargs) method, where
- axis: {0 or ‘index’} isn't used. It's needed for compatibility with DataFrame.
- skipna: bool, whether to exclude NA/null values. If the entire Series is NA, the result will be NA.
- *args, **kwargs: Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
It returns the label of the maximum value.
most_freq_payment_type = training_df['PAYMENT_TYPE'].value_counts().idxmax()
To select columns by data type, use DataFrame.select_dtypes(include=None, exclude=None), where
- include, exclude: scalar or list-like. For example, to select all numeric types, use np.number or 'number'.
It returns the subset of the DataFrame with the matching dtypes.
Series.max(axis=0, skipna=True, numeric_only=False, **kwargs) returns the maximum of the values, where
- axis: {index (0)} isn't used either; it's needed for compatibility with DataFrame.
- skipna: bool, same as the above one.
- numeric_only: not implemented.
It returns scalar.
max_fare = training_df['FARE'].max()
A similar aggregation method for DataFrame is DataFrame.mean(axis=0, skipna=True, numeric_only=False, **kwargs), where
- axis: {index (0), columns (1)}
For DataFrames, specifying axis=None will apply the aggregation across both axes.
- numeric_only: bool, whether to include only float, int or boolean data.
- ddof: int, Delta Degrees of Freedom (this parameter belongs to the related std method rather than mean). The divisor used in calculations is N - ddof, where N represents the number of elements. To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).
Series.mean(axis=0, skipna=True, numeric_only=False, **kwargs) has the same return type and parameters as above.
mean_distance = training_df['TRIP_MILES'].mean()
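To aggregate every numeric column at once, a small sketch using the DataFrame version (numeric_only=True skips the string columns such as COMPANY and PAYMENT_TYPE):
# Column-wise means of the numeric columns only.
print(training_df.mean(numeric_only=True))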
To count missing values, first flag them with the isnull() method:
missing_values = training_df.isnull()
It returns a boolean same-sized DataFrame indicating if the values are NA.
Second, count the number of True values with the DataFrame.sum(axis=0, skipna=True, numeric_only=False, min_count=0, **kwargs) method, where
- axis: {index (0), columns (1)}
- skipna: bool, whether to exclude NA/null values when computing the result.
- numeric_only: bool, whether to include only float, int, boolean columns.
- min_count: int, required number of valid values to perform the operation. If fewer than min_count non-NA values are present, the result will be NA.
It returns a Series or scalar. The same method of Series is Series.sum(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs), where
- axis, numeric_only: the same as in Series.max.
missing_values = training_df.isnull().sum().sum()
To view the correlation matrix among features (possibly including the label), try the DataFrame.corr(method='pearson', min_periods=1, numeric_only=False) method, where
- method: method of correlation:
  - pearson: standard correlation coefficient
  - kendall: Kendall Tau correlation coefficient
  - spearman: Spearman rank correlation
  - callable: callable with input two 1d ndarrays and returning a float.
- numeric_only: Same as the one in DataFrame.mean().
It returns a DataFrame of correlation matrix.
training_df.corr(numeric_only=True)
gives the following result:
| | TRIP_MILES | TRIP_SECONDS | FARE | TIP_RATE |
|---|---|---|---|---|
| TRIP_MILES | 1.000000 | 0.800855 | 0.975344 | -0.049594 |
| TRIP_SECONDS | 0.800855 | 1.000000 | 0.830292 | -0.084294 |
| FARE | 0.975344 | 0.830292 | 1.000000 | -0.070979 |
| TIP_RATE | -0.049594 | -0.084294 | -0.070979 | 1.000000 |
4. Options and settings
pd.options.display.max_rows = 10
If max_rows is exceeded, switch to truncate view. Depending on `large_repr`, objects are either centrally truncated or printed as a summary view. 'None' value means unlimited.
display.float_format: callable. The callable should accept a floating point number and return a string with the desired format of the number. For example, using Python's Format Specification Mini-Language:
pd.options.display.float_format = "{:.1f}".format