「Python」用mysql處理excel數據
1 前言
這是一個python的英文題目,原文如下:
The UK Department of Health and Social Care provides a multitude of medical services to people. The UK government is considering developing a new information system to get faster and better decision-support information. The government has limited experience with information system development, so it engages a consulting company to plan and implement it. You need to take the role of an employee of the consulting company and are in charge of this project. The first business process the government will implement is the hospital admission process.
The government looks after many public hospitals with a range of medical officers(MO)within different medical fields(Specialty). Whenever a patient is admitted, a record is created with an admission number(EpisodeNo), date and time, urgency and admission, as well as hospital, patient no., and other patient data. During the stay,all the procedures performed and costs are recorded. When patients are discharged,again the date and time as well as the status get recorded.
At this stage, the government is providing you with test data only. The spreadsheet will be provided with different worksheets in which data are explained further.
Your project consists of a range of tasks,beginning with designing the databaseimplementinc it,creating queries,documenting the design and implementation stepsand answering clients questions. You are required to answer each of the following tasks in sequence:
You need to set up the database operating environment in Python and then use Python to
answer and discuss the following questions:
- Admission count per hospital, by specialty,in a certain period of time;
- Average length of stay by hospital, by class, by medical officer;
- Average cost by hospital, by class, by age, by age group(age groups are 0-1617-3839-6061 and above);
簡單說就是有一組數量比較多的患者就醫情況數據和醫院相關信息,實際上是一份excel表格,包括多個sheet,要求用python腳本對數據庫進行操作,實現三個查詢問題。
2 分析
2.1 數據庫建庫方案
使用mysql建立數據庫,并將excel表格導入。在此之前,需要先構建一個數據庫及其中的各個表。建庫方案如下:

灰色格子為表名,其他為字段名,箭頭為外鍵。
2.2 程序分析
- python使用第三方庫pymysql實現對數據庫的操作;
- python使用第三方庫pandas將excel表格數據導入數據庫表;
- 程序開始時判斷數據庫是否已經存在,若不存在則建庫建表并導入數據,若存在則進行查詢操作。
3 代碼
3.1 判斷是否存在相應的庫
import os
try:
import pymysql
import pandas as pd
import openpyxl # pandas讀取excel的.xlsx以來此庫
except:
os.system("pip install pandas pymysql openpyxl")
import pymysql
import pandas as pd
import openpyxl
3.2 基本框架
class myDatabase():
def __init__(self) -> None:
self.createTable()
self.insertValue()
def createTables(self): # 建表
pass
def insertValue(self): # 導入excel表格數據
pass
def CountbyClass(self): # 實現查詢1.Admission count per hospital, by specialty,in a certain period of time;
pass
def averageAdmtime(self): # 實現查詢2.Average length of stay by hospital, by class, by medical officer;
pass
def averageCost(self): #實現查詢3.Average cost by hospital, by class, by age, by age group(age groups are 0-1617-3839-6061 and above);
pass
if __name__ == "__main__":
mysql = myDatabase() # 實例化一個myDatabase對象
mysql.CountbyClass() # 查詢1
mysql.averageAdmtime() # 查詢2
mysql.averageCost() # 查詢3
3.3 pymysql庫的使用
首先,使用pymysql需要連接到mysql或mysql下的一個數據庫,pymysql提供了這樣一個connect函數:
db=pymysql.connect(host,user,password,charset)
# host是mysql所在的ip,本機的可以不傳入這個參數或者host="localhost"
# user和password是mysql用戶和密碼
# charset是字符編碼
其次,連接上mysql后,需要用在指針進行操作,所以連接成功后需要有一個變量存儲指針:
cursor=db.cursor()
最后,對mysql的操作還是需要使用sql語句進行的,也就是說python將sql命令傳給mysql執行。用到的函數:
cursor.execute(sql)
# sql是sql語句,可以不帶分號;
3.4 sql語句
/*創建數據庫*/
create database databaseName;
/*創建數據表*/
create table tableName(word1 class, word2 class, word3 class……); -- 直接創建,如果已存在這個表會報錯
create table if not exists tableName(word1 class, word2 class, word3 class……); -- 判斷如果不存在則創建表
-- word是字段名,class是數據類型(如int,char,double,date等)
/*插入數據*/
insert into tableName(word1,word2,word3……) value (value1,value2,value3……);
/*刪除表*/
drop table tableName;
/*刪除數據庫*/
drop database databaseName;
/*查詢*/
select word1,word2 from tableName; -- 從表tableName中查詢word1和word2字段的記錄
select * from tableName; -- 顯示tableName表的全部記錄
select count(*) from tableName; -- 統計tableName的全部記錄數量
select count(*) from tableName group by word1; -- 以word1為條件進行分組統計
select avg(word1) from tableName; -- 計算tableName的word1字段的平均值
select word11,word21 from (select word11,word12,word21,word22 from table1,table2 where table1.word1n=table2.word2m) -- 嵌套查詢
運用以上命令即可完成開頭提到的三個查詢問題:
- Admission count per hospital, by specialty,in a certain period of time;
- Average length of stay by hospital, by class, by medical officer;
- Average cost by hospital, by class, by age, by age group(age groups are 0-1617-3839-6061 and above);
以1為例,查詢按醫院、類別、醫務人員分列的平均住院時間。查詢語句如下:
select Hospital,
avg(timestampdiff(day,AdmDT,DischDT)) -- avg函數計算平均值,timestampdiff函數計算時間差
from(
select Hospital,
str_to_date(concat(AdmDate," ",AdmTime),"%Y-%m-%d %H:%i:%s") as AdmDT, -- concat函數將原本分為兩個字段的入院日期(AdmDate)和入院時間(AdmTime)合為一個字符串,str_to_date函數將字符串轉為時間格式,as AdmDT給這個數據取一個別名
str_to_date(concat(DischDate," ",DischTime),"%Y-%m-%d %H:%i:%s") as DischDT
from Patientinfo,AdmDisch -- 醫院數據Hospital來自表Patientinfo,出入院時間來自表AdmDisch
where Patientinfo.EpisodeNo=AdmDisch.EpisodeNo) as hospital -- 將Patientinfo表和AdmDisch表按EpisodeNo進行合并
group by Hospital; -- 按醫院分組求平均住院時間
浙公網安備 33010602011771號