[Introduction to Python Data Analysis] Chapter 5. Applications

5.1 Finding a Set of Items in a Large Collection of Files

import csv
import glob
import os
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple

item_numbers_file = sys.argv[1]
path_to_folder = sys.argv[2]
output_file = sys.argv[3]

item_numbers_to_find = []
with open(item_numbers_file, 'r', newline='') as item_numbers_csv_file:
    filereader = csv.reader(item_numbers_csv_file)
    for row in filereader:
        item_numbers_to_find.append(row[0])  # item numbers are in the first column
#print(item_numbers_to_find)

filewriter = csv.writer(open(output_file, 'a', newline=''))

file_counter = 0
line_counter = 0
count_of_item_numbers = 0

for input_file in glob.glob(os.path.join(path_to_folder, '*.*')):
    file_counter += 1
    if input_file.split('.')[1] == 'csv':
        with open(input_file, 'r', newline='') as csv_in_file:
            filereader = csv.reader(csv_in_file)
            header = next(filereader)  # skip the header row

            for row in filereader:
                row_of_output = []
                for column in range(len(header)):
                    if column < 3:
                        cell_value = str(row[column]).strip()
                        row_of_output.append(cell_value)
                    elif column == 3:  # cost column: drop '$', commas, and decimals
                        cell_value = str(row[column]).lstrip('$').replace(',', '').split('.')[0].strip()
                        row_of_output.append(cell_value)
                    else:
                        cell_value = str(row[column]).strip()
                        row_of_output.append(cell_value)
                row_of_output.append(os.path.basename(input_file))
                if row[0] in item_numbers_to_find:
                    filewriter.writerow(row_of_output)
                    count_of_item_numbers += 1

                line_counter += 1
    elif input_file.split('.')[1] == 'xls' or input_file.split('.')[1] == 'xlsx':  # note: xlrd 2.0+ reads only .xls
        workbook = open_workbook(input_file)
        for worksheet in workbook.sheets():
            try:
                header = worksheet.row_values(0)
            except IndexError:
                pass
            for row in range(1, worksheet.nrows):
                row_of_output = []
                for column in range(len(header)):
                    if column < 3:
                        cell_value = str(worksheet.cell_value(row, column)).strip()
                        row_of_output.append(cell_value)
                    elif column == 3:  # cost column: keep the part before the decimal point
                        cell_value = str(worksheet.cell_value(row, column)).split('.')[0].strip()
                        row_of_output.append(cell_value)
                    else:  # date column: convert the Excel serial number to a date
                        cell_value = xldate_as_tuple(worksheet.cell(row, column).value, workbook.datemode)
                        cell_value = str(date(*cell_value[0:3])).strip()
                        row_of_output.append(cell_value)
                row_of_output.append(os.path.basename(input_file))
                row_of_output.append(worksheet.name)
                if str(worksheet.cell(row, 0).value).split('.')[0].strip() in item_numbers_to_find:
                    filewriter.writerow(row_of_output)
                    count_of_item_numbers += 1
                line_counter += 1
print('Number of files : {}'.format(file_counter))
print('Number of lines : {}'.format(line_counter))
print('Number of item numbers : {}'.format(count_of_item_numbers))

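item_numbers_file : path and name of the input CSV file that holds the item numbers to find

path_to_folder : path of the folder that holds the historical files to search

output_file : path and name of the output CSV file, which will collect the records from the historical files that match the item numbers

lines 12 ~ 16 : read the item numbers to find into a list so the rest of the code can use them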
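
As a small illustration of this step, the sketch below reads the first column of a CSV into a list. The in-memory sample data and the name item_numbers_set are assumptions made for the example; the script itself reads from the file named by item_numbers_file.

```python
import csv
import io

# Hypothetical stand-in for the item-numbers CSV: one item number per row.
sample_csv = io.StringIO("1234\n2345\n4567\n")

item_numbers_to_find = []
for row in csv.reader(sample_csv):
    # Only the first column of each row is kept.
    item_numbers_to_find.append(row[0])

print(item_numbers_to_find)  # ['1234', '2345', '4567']

# Optional variation (not in the script): a set makes membership checks faster
# when the list of item numbers is long.
item_numbers_set = set(item_numbers_to_find)
print('2345' in item_numbers_set)  # True
```

Note that csv.reader returns every value as a string, so the later comparison against row[0] is a string comparison.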

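file_counter : number of historical files the script reads

line_counter : number of rows read across all files

count_of_item_numbers : tracks the number of rows that contain one of the item numbers being searched for

line 25 : for loop that iterates over each file in the historical-files folder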

os.path.join() : joins the folder path and the '*.*' pattern into a single path. glob.glob() then returns every path that matches that pattern.
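
A minimal sketch of how the two calls work together, using a temporary folder with made-up file names (sales_a.csv, sales_b.xls, README) so it can run anywhere:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as path_to_folder:
    # Create three empty files; the names are hypothetical.
    for name in ('sales_a.csv', 'sales_b.xls', 'README'):
        open(os.path.join(path_to_folder, name), 'w').close()

    # Join the folder and the pattern, then let glob expand the pattern.
    pattern = os.path.join(path_to_folder, '*.*')
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)  # ['sales_a.csv', 'sales_b.xls']
```

README is not returned because '*.*' only matches names that contain a period. The result is sorted here because glob.glob() does not guarantee an order.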

line 27 : an if statement restricts this branch to CSV files. Splitting the file path on '.' puts the text after the first period at index [1], and the code checks whether that value equals 'csv'.
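
The split('.')[1] check works when the path contains exactly one period. The sketch below compares it with os.path.splitext(), offered here as an alternative rather than as what the script does. The helper names and sample paths are made up for the example.

```python
import os

def extension_by_split(path):
    # The approach used in the script: the text after the first '.'.
    return path.split('.')[1]

def extension_by_splitext(path):
    # Alternative: os.path.splitext() splits at the last '.' of the file name.
    return os.path.splitext(path)[1].lstrip('.').lower()

print(extension_by_split('historical_files/sales.csv'))     # csv
print(extension_by_splitext('historical_files/sales.csv'))  # csv

# A path with more than one '.' shows where the two approaches differ.
print(extension_by_split('./historical_files/sales.2014.csv'))     # /historical_files/sales
print(extension_by_splitext('./historical_files/sales.2014.csv'))  # csv
```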

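line 32 : for loop that iterates over the remaining data rows of the CSV file. Each row is assembled into row_of_output so that it can be written to the output file if the row contains one of the item numbers being searched for.

lines 34 ~ 43 : the first if branch and the final else branch handle every column other than the cost column.

The elif branch handles the cost column. lstrip('$') removes the dollar sign, and replace() removes commas by replacing them with an empty string. split('.')[0] keeps only the part before the decimal point, and strip() removes whitespace from both ends of the string. The result is then appended to the row_of_output list.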
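
The cost-column chain can be tried on its own. The function name clean_cost and the sample values are assumptions for this sketch; the chain of string methods is the one used in the script.

```python
def clean_cost(raw_value):
    # Same chain as the script: drop a leading '$', remove commas,
    # keep the part before the decimal point, trim whitespace.
    return str(raw_value).lstrip('$').replace(',', '').split('.')[0].strip()

print(clean_cost('$1,234.56'))  # 1234
print(clean_cost('$615.00'))    # 615
print(clean_cost('$20 '))       # 20
print(clean_cost('$9.99'))      # 9
```

The last call shows that the decimals are truncated, not rounded.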

line 45 : checks whether the item number in the row is one of the items being searched for

lines 50 ~ 75 : code that processes the Excel files
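
Two conversions in the Excel branch can be shown without opening a workbook. The sketch assumes that a numeric cell is read as a float and that xldate_as_tuple() returns a (year, month, day, hour, minute, second) tuple. The sample values are made up.

```python
from datetime import date

# Assumption: a numeric item-number cell is read as a float such as 1234.0.
item_number_cell = 1234.0
item_number = str(item_number_cell).split('.')[0].strip()
print(item_number)  # 1234

# Assumption: a tuple shaped like the result of xldate_as_tuple().
date_tuple = (2014, 1, 15, 0, 0, 0)
# The first three items (year, month, day) are unpacked into date().
purchase_date = str(date(*date_tuple[0:3])).strip()
print(purchase_date)  # 2014-01-15
```

This is why the Excel branch applies split('.')[0] to the item number before comparing it with the strings in item_numbers_to_find.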

5.2 Calculating Statistics per Category from a CSV File

import csv
import sys
from datetime import date, datetime
def date_diff(date1, date2):
    try:
        diff = str(datetime.strptime(date1, '%m/%d/%Y') - datetime.strptime(date2, '%m/%d/%Y')).split()[0]
    except ValueError:  # e.g. the initial 'N/A' placeholder is not a date
        diff = 0
    if diff == '0:00:00': diff = 0  # a zero-day difference prints without a "days" part
    return diff

input_file = sys.argv[1]
output_file = sys.argv[2]

packages = {}
previous_name = 'N/A'
previous_package = 'N/A'
previous_package_date = 'N/A'
first_row = True
today = date.today().strftime('%m/%d/%Y')

with open(input_file, 'r', newline='') as input_csv_file:
    filereader = csv.reader(input_csv_file)
    header = next(filereader)

    for row in filereader:
        current_name = row[0]
        current_package = row[1]
        current_package_date = row[3]

        if current_name not in packages:
            packages[current_name] = {}
        if current_package not in packages[current_name]:
            packages[current_name][current_package] = 0
        if current_name != previous_name:
            if first_row: first_row = False
            else:
                diff = date_diff(today, previous_package_date)
                if previous_package not in packages[previous_name]:
                    packages[previous_name][previous_package] = int(diff)
                else:
                    packages[previous_name][previous_package] += int(diff)
        else:
            diff = date_diff(current_package_date, previous_package_date)
            packages[previous_name][previous_package] += int(diff)

        previous_name = current_name
        previous_package = current_package
        previous_package_date = current_package_date

header = ['Customer Name', 'Category', 'Total Time (in Days)']

with open(output_file, 'w', newline='') as output_csv_file:
    filewriter = csv.writer(output_csv_file)
    filewriter.writerow(header)

    for customer_name, customer_name_value in packages.items():
        for package_category, package_category_value in packages[customer_name].items():
            row_of_output = []
            print(customer_name, package_category, package_category_value)
            row_of_output.append(customer_name)
            row_of_output.append(package_category)
            row_of_output.append(package_category_value)
            filewriter.writerow(row_of_output)

👩‍💻 strptime() in the datetime module

: a function that converts a time or date stored as a string into a datetime object
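
A short sketch of strptime() with the '%m/%d/%Y' format used in this section. The sample date is made up.

```python
from datetime import datetime

# Parse a month/day/year string into a datetime object.
parsed = datetime.strptime('02/25/2014', '%m/%d/%Y')
print(parsed)                                 # 2014-02-25 00:00:00
print(parsed.year, parsed.month, parsed.day)  # 2014 2 25

# A string that does not match the format raises ValueError.
try:
    datetime.strptime('N/A', '%m/%d/%Y')
    parse_failed = False
except ValueError:
    parse_failed = True
print(parse_failed)  # True
```

The ValueError case matters in this script because previous_package_date starts out as 'N/A'.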

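lines 4 ~ 10 : computes the difference between two dates and returns the number of days as a string (or 0 when the dates are equal or cannot be parsed).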
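
The string handling in date_diff() follows from how a timedelta prints. The sketch below shows that, plus a variation that reads the .days attribute directly. date_diff_days is a hypothetical name and is not the function from the listing.

```python
from datetime import datetime, timedelta

# Subtracting two datetime objects gives a timedelta, which prints like this:
print(str(timedelta(days=5)))  # 5 days, 0:00:00
print(str(timedelta(days=0)))  # 0:00:00

# So str(...).split()[0] yields '5' in the first case and '0:00:00' in the second,
# which is why the listing maps '0:00:00' back to 0.

def date_diff_days(date1, date2, fmt='%m/%d/%Y'):
    """Variation: return the day difference as an int via timedelta.days."""
    try:
        return (datetime.strptime(date1, fmt) - datetime.strptime(date2, fmt)).days
    except ValueError:
        # Unparseable input, such as the 'N/A' placeholder, counts as 0 days.
        return 0

print(date_diff_days('03/10/2014', '03/05/2014'))  # 5
print(date_diff_days('03/10/2014', '03/10/2014'))  # 0
print(date_diff_days('03/10/2014', 'N/A'))         # 0
```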

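line 20 : stores today's date in the today variable

lines 22 ~ 24 : opens the input file, then reads the first row from the reader object and assigns it to the header variable

line 31 : an if statement checks whether current_name is missing from the packages dictionary. If it is, line 32 adds the value as a key whose value is an empty dictionary.

line 33 : checks whether the value of current_package is missing from the inner dictionary for that customer name. If it is, line 34 adds the value as a key in the inner dictionary with an initial value of 0.

lines 47 ~ 49 : assign the current row's data to the previous_name, previous_package, and previous_package_date variables. On the first pass, this is the first row's data.

-> That completes the processing of the first data row, so control returns to the for loop at line 26 and the loop runs again for the next row.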
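
To see how the previous/current comparison accumulates days, here is a condensed sketch of the same loop over four made-up rows. The sample data, the fixed today value, and the helper days_between are assumptions for the example, and setdefault() replaces the two "not in" checks.

```python
from datetime import datetime

def days_between(later, earlier, fmt='%m/%d/%Y'):
    """Whole days from earlier to later; 0 if either string is not a date."""
    try:
        return (datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)).days
    except ValueError:
        return 0

# Hypothetical rows (name, package, price, purchase date), sorted by customer and date.
rows = [
    ['John Smith', 'Bronze', '$20.00', '1/22/2014'],
    ['John Smith', 'Silver', '$40.00', '2/5/2014'],
    ['Mary Yu', 'Gold', '$60.00', '2/1/2014'],
    ['Mary Yu', 'Silver', '$40.00', '2/11/2014'],
]
today = '3/1/2014'  # fixed so the result is reproducible; the script uses date.today()

packages = {}
previous_name = previous_package = previous_package_date = 'N/A'
first_row = True

for current_name, current_package, _price, current_package_date in rows:
    # Make sure both dictionary levels exist, starting each total at 0.
    packages.setdefault(current_name, {}).setdefault(current_package, 0)
    if current_name != previous_name:
        if first_row:
            first_row = False
        else:
            # New customer: close out the previous customer's last package up to today.
            packages[previous_name][previous_package] += days_between(today, previous_package_date)
    else:
        # Same customer: the previous package lasted until this purchase date.
        packages[previous_name][previous_package] += days_between(current_package_date, previous_package_date)
    previous_name, previous_package, previous_package_date = (
        current_name, current_package, current_package_date)

print(packages)
# {'John Smith': {'Bronze': 14, 'Silver': 24}, 'Mary Yu': {'Gold': 10, 'Silver': 0}}
```

The final row's package stays at 0 in this sketch, because the loop only closes out a package when it reads the row that follows it.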

5.3 Calculating Statistics per Category from a Text File

import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

messages = {}
notes = []

with open(input_file, 'r', newline='') as text_file:
    for row in text_file:
        if '[Note]' in row:
            row_list = row.split(' ', 4)  # date, time, number, [Note], message
            day = row_list[0].strip()
            note = row_list[4].strip('\n').strip()

            if note not in notes: notes.append(note)
            if day not in messages: messages[day] = {}
            if note not in messages[day]: messages[day][note] = 1
            else: messages[day][note] += 1

filewriter = open(output_file, 'w', newline='')
header = ['Date']
header.extend(notes)
header = ','.join(map(str, header)) + '\n'
print(header)
filewriter.write(header)

for day, day_value in messages.items():
    row_of_output = []
    row_of_output.append(day)

    for index in range(len(notes)):
        if notes[index] in day_value.keys():
            row_of_output.append(day_value[notes[index]])
        else:
            row_of_output.append(0)  # this message did not occur on this day

    output = ','.join(map(str, row_of_output)) + '\n'
    print(output)
    filewriter.write(output)
filewriter.close()

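line 6 : assigns an empty dictionary to the messages variable, which becomes a nested dictionary. Key = date on which the log entry occurred, value = inner dictionary

Inner dictionary key = log message, value = number of times that message occurred on that date

line 7 : assigns an empty list to notes. It will hold every distinct log message in the input file.

line 11 : checks whether each row contains the string [Note]. If it does, the row contains a log message.

lines 12 ~ 19 : parse the rows that contain the string [Note] and place specific parts of the data into the list and the dictionary.

line 12 : split(' ', 4) breaks the row into five strings and assigns them to row_list: date, time, number, [Note], and log message.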
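
A quick sketch of what split(' ', 4) produces. The sample log line is made up to match the five-part layout described above.

```python
# Hypothetical log line: date, time, number, [Note], message.
row = '2014-02-03 10:59:26 5240 [Note] InnoDB: Compressed tables use zlib 1.2.3\n'

# maxsplit=4 splits on the first four spaces only, so the message keeps its spaces.
row_list = row.split(' ', 4)
print(len(row_list))  # 5

day = row_list[0].strip()
note = row_list[4].strip('\n').strip()
print(day)   # 2014-02-03
print(note)  # InnoDB: Compressed tables use zlib 1.2.3
```

Without the maxsplit argument, the message itself would be split into separate words.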

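line 16 : checks whether the log message in the note variable is missing from the notes list. If it is, the same line appends the message to the list.

line 17 : checks whether the date in the day variable is missing from the keys of the messages dictionary. If it is, the date is added as a key with an empty dictionary as its value.

line 19 : handles a log message that occurs more than once on the same day by adding 1 to that message's count.

Once every row has been processed, the messages dictionary is filled with key-value pairs. Keys = the unique dates on which log entries occurred, values = inner dictionaries

Inner dictionary key = a unique log message that occurred on that date, value = number of times that message occurred on that date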
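
The counting and the final table can be sketched on a few made-up log lines. The sample lines and the name table are assumptions for the example, and setdefault() and get() are used as a compact variation on the if/else checks in the listing.

```python
# Hypothetical log lines; only the ones containing [Note] are counted.
log_lines = [
    '2014-02-03 10:59:26 5240 [Note] Plugin FEDERATED is disabled.',
    '2014-02-03 10:59:27 5240 [Note] InnoDB: Started.',
    '2014-02-03 11:02:00 5240 [Warning] Disk is almost full.',
    '2014-02-04 09:00:01 6120 [Note] InnoDB: Started.',
    '2014-02-04 09:05:12 6120 [Note] InnoDB: Started.',
]

messages = {}  # {date: {log message: count}}
notes = []     # distinct log messages, in order of first appearance

for row in log_lines:
    if '[Note]' in row:
        row_list = row.split(' ', 4)
        day = row_list[0].strip()
        note = row_list[4].strip()
        if note not in notes:
            notes.append(note)
        messages.setdefault(day, {})
        messages[day][note] = messages[day].get(note, 0) + 1

# One row per date, one column per distinct message, 0 where a message is absent.
table = [['Date'] + notes]
for day, day_value in messages.items():
    table.append([day] + [day_value.get(note, 0) for note in notes])

for table_row in table:
    print(','.join(map(str, table_row)))
# Date,Plugin FEDERATED is disabled.,InnoDB: Started.
# 2014-02-03,1,1
# 2014-02-04,0,2
```

Joining values with ',' by hand, as the listing does, assumes that no log message contains a comma. The csv module would quote such values.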

newoceanwave
