您现在的位置是：网站首页> 编程资料编程资料

Python中的pandas表格模块、文件模块和数据库模块_python_

2023-05-26 376人已围观

简介 Python中的pandas表格模块、文件模块和数据库模块_python_

支持文件存取操作，支持数据库(sql)、html、json、pickle、csv(txt、excel)、sas、stata、hdf等。
支持增删改查、切片、高阶函数、分组聚合等单表操作，以及和dict、list的互相转换。
支持多表拼接合并操作。
支持简单的绘图操作。
支持简单的统计分析操作。

一、Series数据结构

Series是一种类似于一维数组的对象，由一组数据和一组与之相关的数据标签（索引）组成。

Series比较像列表（数组）和字典的结合体

import numpy as np import pandas as pd df = pd.Series(0, index=['a', 'b', 'c', 'd']) print(df) # a 0 # b 0 # c 0 # d 0 # dtype: int64 print(df.values) # 值 # [0 0 0 0] print(df.index) # 索引 # Index(['a', 'b', 'c', 'd'], dtype='object')

1、Series的创建

import numpy as np import pandas as pd df = pd.Series(np.array([1, 2, 3, 4, np.nan]), index=['a', 'b', 'c', 'd', 'e']) # 1、从ndarray创建Series print(df) # a 1.0 # b 2.0 # c 3.0 # d 4.0 # e NaN # dtype: float64 df = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': np.nan}) # 2、也可以从字典创建Series dates = pd.date_range('20190101', periods=6, freq='M') print(type(dates)) # print(dates) # DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30', # '2019-05-31', '2019-06-30'], # dtype='datetime64[ns]', freq='M') df=pd.Series(0,index=dates) # 3、时间序列索引 print(df) # 2019-01-31 0 # 2019-02-28 0 # 2019-03-31 0 # 2019-04-30 0 # 2019-05-31 0 # 2019-06-30 0 # Freq: M, dtype: int64

产生时间对象数组：date_range参数详解：

start:开始时间
end:结束时间
periods:时间长度
freq:时间频率，默认为'D'，可选H(our),W(eek),B(usiness),S(emi-)M(onth),(min)T(es), S(econd), A(year),…

2、Series属性

print(df ** 2) # 3、与标量运算 # a 1.0 # b 4.0 # c 9.0 # d 16.0 # e NaN # dtype: float64 print(df + df) # 4、两个Series运算 # a 2.0 # b 4.0 # c 6.0 # d 8.0 # e NaN # dtype: float64 print(df[0] ) # 5、数字索引； 1.0 print(df[[0, 1, 2]]) # 行索引 # a 1.0 # b 2.0 # c 3.0 # dtype: float64 print(df['a'] ) # 6、键索引（行标签） ；1.0 print(df[['b','c']]) print('a' in df) # 7、in运算；True print(df[0:2] ) # 8、切片 # a 1.0 # b 2.0 # dtype: float64 print(np.sin(df)) # 9、通用函数 # a 0.841471 # b 0.909297 # c 0.141120 # d -0.756802 # e NaN # dtype: float64 print(df[df > 1] ) # 10、布尔值过滤 # b 2.0 # c 3.0 # d 4.0 # dtype: float64

2、Series缺失数据处理

df = pd.Series([1, 2, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e']) print(df) # a 1.0 # b 2.0 # c 3.0 # d 4.0 # e NaN # dtype: float64 print(df.dropna() ) # 1、过滤掉值为NaN的行 # a 1.0 # b 2.0 # c 3.0 # d 4.0 # dtype: float64 print(df.fillna(5) ) # 2、用指定值填充缺失数据 # a 1.0 # b 2.0 # c 3.0 # d 4.0 # e 5.0 # dtype: float64 print(df.isnull() ) # 3、返回布尔数组，缺失值对应为True # a False # b False # c False # d False # e True # dtype: bool print(df.notnull() ) # 4、返回布尔数组，缺失值对应为False # a True # b True # c True # d True # e False # dtype: bool

二、DataFrame数据结构

DataFrame是一个表格型的数据结构，含有一组有序的列。

DataFrame可以被看做是由Series组成的字典，并且共用一个索引。

1、DataFrame的创建

import numpy as np import pandas as pd df1 = pd.DataFrame(np.zeros((3, 4))) # 创建一个三行四列的DataFrame print(df1) # 0 1 2 3 # 0 0.0 0.0 0.0 0.0 # 1 0.0 0.0 0.0 0.0 # 2 0.0 0.0 0.0 0.0 dates = pd.date_range('20190101', periods=6, freq='M') np.random.seed(1) arr = 10 * np.random.randn(6, 4) print(arr) # [[ 16.24345364 -6.11756414 -5.28171752 -10.72968622] # [ 8.65407629 -23.01538697 17.44811764 -7.61206901] # [ 3.19039096 -2.49370375 14.62107937 -20.60140709] # [ -3.22417204 -3.84054355 11.33769442 -10.99891267] # [ -1.72428208 -8.77858418 0.42213747 5.82815214] # [-11.00619177 11.4472371 9.01590721 5.02494339]] df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4']) # 自定义index和column print(df) # c1 c2 c3 c4 # 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 # 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 # 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 # 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 # 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 # 2019-06-30 -11.006192 11.447237 9.015907 5.024943

2、DataFrame属性

print(df.dtypes) # 1、查看数据类型 # 0 float64 # 1 float64 # 2 float64 # 3 float64 # dtype: object print(df.index) # 2、查看行索引 # DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30', # '2019-05-31', '2019-06-30'], # dtype='datetime64[ns]', freq='M') print(df.columns) # 3、查看各列的标签 # Index(['c1', 'c2', 'c3', 'c4'], dtype='object') print(df.values) # 4、查看数据框内的数据，也即不含行标签和列头的数据 # [[ 16.24345364 -6.11756414 -5.28171752 -10.72968622] # [ 8.65407629 -23.01538697 17.44811764 -7.61206901] # [ 3.19039096 -2.49370375 14.62107937 -20.60140709] # [ -3.22417204 -3.84054355 11.33769442 -10.99891267] # [ -1.72428208 -8.77858418 0.42213747 5.82815214] # [-11.00619177 11.4472371 9.01590721 5.02494339]] print(df.describe()) # 5、查看数据每一列的极值，均值，中位数，只可用于数值型数据 # c1 c2 c3 c4 # count 6.000000 6.000000 6.000000 6.000000 # mean 2.022213 -5.466424 7.927203 -6.514830 # std 9.580084 11.107772 8.707171 10.227641 # min -11.006192 -23.015387 -5.281718 -20.601407 # 25% -2.849200 -8.113329 2.570580 -10.931606 # 50% 0.733054 -4.979054 10.176801 -9.170878 # 75% 7.288155 -2.830414 13.800233 1.865690 # max 16.243454 11.447237 17.448118 5.828152 print(df.T) # 6、transpose转置，也可用Ｔ来操作 # 2019-01-31 2019-02-28 2019-03-31 2019-04-30 2019-05-31 2019-06-30 # c1 16.243454 8.654076 3.190391 -3.224172 -1.724282 -11.006192 # c2 -6.117564 -23.015387 -2.493704 -3.840544 -8.778584 11.447237 # c3 -5.281718 17.448118 14.621079 11.337694 0.422137 9.015907 # c4 -10.729686 -7.612069 -20.601407 -10.998913 5.828152 5.024943 print(df.sort_index(axis=0)) # 7、排序，axis=0 可按行标签排序输出； 按行标签][2019-01-01, 2019-01-02...]从大到小排序 # c1 c2 c3 c4 # 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 # 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 # 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 # 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 # 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 # 2019-06-30 -11.006192 11.447237 9.015907 5.024943 print(df.sort_index(axis=1)) # 7、排序，axis=1 可按列头标签排序输出；按列标签[c1, c2, c3, c4从大到小排序 # c1 c2 c3 c4 # 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 # 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 # 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 # 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 # 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 # 2019-06-30 -11.006192 11.447237 9.015907 5.024943 print(df.sort_values(by='c2')) # 8、按数据值来排序 ；按c2列的值从大到小排序 # c1 c2 c3 c4 # 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 # 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 # 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 # 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 # 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 # 2019-06-30 -11.006192 11.447237 9.015907 5.024943

3、DataFrame取值

print(df['c2']) # 1、通过columns标签取值 # 2019-01-31 -6.117564 # 2019-02-28 -23.015387 # 2019-03-31 -2.493704 # 2019-04-30 -3.840544 # 2019-05-31 -8.778584 # 2019-06-30 11.447237 # Freq: M, Name: c2, dtype: float64 print(df[['c2', 'c3']]) # c2 c3 # 2019-01-31 -6.117564 -5.281718 # 2019-02-28 -23.015387 17.448118 # 2019-03-31 -2.493704 14.621079 # 2019-04-30 -3.840544 11.337694 # 2019-05-31 -8.778584 0.422137 # 2019-06-30 11.447237 9.015907 print(df[0:3]) # 2、通过columns索引取值 # c1 c2 c3 c4 # 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 # 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 # 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 print(df.loc['20200228':'20200430']) # 3、loc 通过行标签取值： # c1 c2 c3 c3 # 2020-02-29 8.654076 -23.015387 17.448118 -7.612069 # 2020-03-31 3.190391 -2.493704 14.621079 -20.601407 # 2020-04-30 -3.224172 -3.840544 11.337694 -10.998913 print(df.iloc[1:3]) # 4、iloc 通过行索引选择数据,取第二行到三行。 # c1 c2 c3 c3 # 2020-02-29 8.654076 -23.015387 17.448118 -7.612069 # 2020-03-31 3.190391 -2.493704 14.621079 -20.601407 print(df.iloc[2, 1]) # 第三行第二列值：-2.493703754774101 print(df.iloc[1:4, 1:4]) # 第 2-4行与第2-4列： # c2 c3 c4 # 2019-02-28 -23.015387 17.448118 -7.612069 # 2019-03-31 -2.493704 14.621

上一篇：numpy中的converters和usecols用法详解_python_

下一篇：Python实现奇数列与偶数列调换的方法详解_python_